Today we covered Google Cloud Storage, a RESTful service for storing and accessing your data objects on Google’s infrastructure. A few important features:
- Multiple layers of redundancy, with all data replicated to multiple data centers.
- Objects can be terabytes in size, with resumable uploads and downloads and range-read support.
In my session, I covered:
- Migrating Data to Cloud Storage
- Object Composition
- Durable Reduced Availability Storage
Many factors affect how fast you can copy files into Google Cloud Storage. A few of the main ones: network bandwidth, CPU speed, available memory, the access path to the local disk, contention and error rates along the path between gsutil and Google, operating system buffering configuration, and firewalls or other network elements that might slow things down along the way. This is the main reason you might wish to use the perfdiag command: it runs a known measurement suite, which helps when troubleshooting performance problems.
The first step is to install the command line tool (gsutil) that will let you try the API. Here is a short way to upload a file:
- First, let’s make sure we have the latest version: gsutil update (By the way, you might also wish to run: gcloud components update, which makes sure all your Google Cloud Platform SDK components are up to date.)
- Set Up Credentials to Access Protected Data: gsutil config
- Now let’s create a new bucket: cloud.google.com/console/project/Your-ID/storage
- Upload a file: gsutil cp rand_10m.txt gs://paris1
Of course, you can use any bucket name you wish instead of paris1.
- List the bucket: gsutil ls gs://paris1
- You can run: gsutil perfdiag or gsutil perfdiag gs://paris1
to see the measured write throughput. It will help you estimate how long it will take to move data in and out.
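To turn a measured throughput into a rough time estimate, a small calculation is enough. This is a minimal sketch, not part of gsutil: the function name is hypothetical, and it assumes the throughput figure is in Mbit/s and stays constant for the whole transfer.

```python
def estimate_transfer_seconds(size_bytes, throughput_mbit_per_s):
    """Rough time to move size_bytes at a sustained throughput in Mbit/s.

    Assumption: throughput is constant and expressed in megabits per second.
    """
    if throughput_mbit_per_s <= 0:
        raise ValueError("throughput must be positive")
    bits = size_bytes * 8
    return bits / (throughput_mbit_per_s * 1_000_000)


if __name__ == "__main__":
    # Hypothetical example: 1 TB (10**12 bytes) at a measured 100 Mbit/s.
    seconds = estimate_transfer_seconds(10**12, 100)
    print(f"{seconds / 3600:.1f} hours")
```

Real transfers will vary with the factors listed earlier, so treat the result as a ballpark figure only.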
When you need to copy a lot of data, it can be good to do it in parallel. A parallel copy from a single host may not bring a great benefit, depending on where your bottleneck is, but it should help. If you are able to upload from several sources, that will improve performance further. Use the -m option for parallel copying:
gsutil -m cp <file1> <file2> <file3> gs://<bucket>
Another option is Offline copy. It saves bandwidth charges and is good for customers with lots of data and limited bandwidth. It’s still in limited preview for customers with a return address in the United States, and its price is a flat fee of $80 per HDD, irrespective of the drive capacity or data size.
Object composition is another way to allow parallel uploads. We can combine up to 32 objects into one: gsutil compose <file1> .. <file32> <final_object>
or we can append to an existing object: gsutil compose <final_object> <file_to_append> <final_object>
To upload in parallel, split your file into smaller pieces, upload them using “gsutil -m cp”, compose the results, and delete the pieces:
$ split -b 1000000 rand-splity.txt rand-s-part-
$ gsutil -m cp rand-s-part-* gs://bucket/dir/
$ rm rand-s-part-*
$ gsutil compose gs://bucket/dir/rand-s-part-* gs://bucket/big-file
$ gsutil -m rm gs://bucket/dir/rand-s-part-*
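When splitting a file for a single compose call, the part size has to be large enough that the number of pieces stays within the limit of 32 mentioned in this post. The following is a minimal sketch of that arithmetic; the function names are hypothetical helpers, not part of gsutil.

```python
MAX_COMPOSE_SOURCES = 32  # per-compose limit as stated in this post


def plan_part_size(file_size, max_parts=MAX_COMPOSE_SOURCES):
    """Smallest part size in bytes that yields at most max_parts pieces."""
    if file_size <= 0 or max_parts <= 0:
        raise ValueError("file_size and max_parts must be positive")
    # Integer ceiling division avoids float rounding on very large files.
    return (file_size + max_parts - 1) // max_parts


def part_count(file_size, part_size):
    """Number of pieces produced when splitting file_size by part_size."""
    return (file_size + part_size - 1) // part_size


if __name__ == "__main__":
    size = 10_000_000  # hypothetical 10 MB file
    part = plan_part_size(size)
    print(part, part_count(size, part))
```

The resulting part size can be passed to split -b in place of the fixed 1000000 used above.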
You can use gsutil ls -L to examine the metadata of the objects: gsutil ls -L gs://<bucket> | grep -v ACL
Examine the Hash and ETag fields, and view the content of your file (if it’s a text file!) with: gsutil cat gs://<bucket>/<object>
A few things to remember:
- A single compose operation accepts at most 32 source objects.
- For a composite object, the ETag value is not the MD5 hash of the object.
- For non-composite objects, Google Cloud Storage uses MD5 to construct the ETag value. This is not true for composite objects; client code should make no assumptions about composite object ETags except that they will change whenever the underlying object changes, per the IETF specification for HTTP/1.1. Why? Because given two objects and their MD5s, we can’t efficiently calculate the MD5 of the two combined. The only way to do it would be to read all of the data over again. CRC solves that: given CRC(A), CRC(B), and len(B), we can quickly calculate CRC(A + B), even if B is a huge object.
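To illustrate the CRC property described above, here is a sketch that combines two checksums without re-reading the data. It is a Python port of the well-known crc32_combine routine from the zlib C library, and it uses the plain CRC-32 from Python's zlib module purely for illustration; the checksum variant Google Cloud Storage uses may differ, so treat this as a demonstration of the principle rather than the service's implementation.

```python
import zlib


def _matrix_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (list of 32 ints) by a 32-bit vector."""
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total


def _matrix_square(mat):
    """Square a GF(2) matrix: the operator applied twice."""
    return [_matrix_times(mat, row) for row in mat]


def crc32_combine(crc_a, crc_b, len_b):
    """Return CRC-32 of A + B given CRC-32(A), CRC-32(B) and len(B) in bytes."""
    if len_b <= 0:
        return crc_a
    # Operator that advances a CRC over one zero bit (reflected polynomial).
    odd = [0xEDB88320] + [1 << n for n in range(31)]
    even = _matrix_square(odd)   # two zero bits
    odd = _matrix_square(even)   # four zero bits
    # Apply len_b zero bytes to crc_a by repeated squaring: O(log len_b) steps.
    while True:
        even = _matrix_square(odd)
        if len_b & 1:
            crc_a = _matrix_times(even, crc_a)
        len_b >>= 1
        if len_b == 0:
            break
        odd = _matrix_square(even)
        if len_b & 1:
            crc_a = _matrix_times(odd, crc_a)
        len_b >>= 1
        if len_b == 0:
            break
    return crc_a ^ crc_b


if __name__ == "__main__":
    a = b"hello, "
    b = b"cloud storage" * 1000
    combined = crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b))
    print(combined == zlib.crc32(a + b))
```

The work grows with the number of bits in len(B), not with the size of B itself, which is why this is practical for huge objects while recomputing an MD5 is not.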
Durable Reduced Availability Storage
Durable Reduced Availability (DRA) storage enables you to store data at lower cost than standard storage. DRA achieves the cost savings by keeping fewer redundant replicas of the data. Unlike some other reduced redundancy cloud storage offerings, DRA is designed to maintain data durability at the same level as the Standard storage class of Google Cloud Storage. The important point is that the data is still replicated, just less so than in a standard bucket. In the very unlikely event of an outage in the data centers that hold the data, the data will become unavailable.
It has the following characteristics compared to standard buckets:
- Lower costs and lower availability
- Same durability - data is guaranteed to persist.
- Same performance!
You can create a DRA bucket with: gsutil mb -c DRA gs://<bucketname>/
Regional buckets (experimental) allow you to put your DRA buckets in the same region as your Compute Engine instances. Note that you cannot just change the metadata of a bucket to switch it between DRA and standard. You must download the data and upload it again; that is, the data is physically copied through the machine that runs gsutil. If you have lots and lots of objects, you can use the data migration strategies that we covered previously. gsutil provides a daisy-chain mode that hooks up a download to an upload, so the files do not have to be saved to the local disk first. It uses resumable uploads.
gsutil cp -D -R gs://<standard_bucket>/* gs://<durable_reduced_availability_bucket>
The daisy-chain mode can potentially be slower and more expensive, for the following reason. MD5 is used to verify that the files are copied correctly. This works fine when the copy runs to completion without failure, but if it fails partway through, the resumable upload handler will re-read the source data up to the point of failure so it can recompute the MD5. This makes restarting slow, even though it doesn’t retransmit the bytes to the destination object.
To conclude: for backup, archiving, or non-time-sensitive batch processing, DRA is a good option.
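The restart cost of daisy-chain mode described above comes from the fact that an MD5 is built incrementally from the first byte. The sketch below is not gsutil's code; it is a hypothetical illustration, using an in-memory stream, of why a resumed copy has to read the already-sent prefix again before it can continue hashing.

```python
import hashlib
import io


def rebuild_md5(source, bytes_already_sent, chunk_size=1024 * 1024):
    """Re-read the prefix sent before the failure to restore the MD5 state."""
    digest = hashlib.md5()
    remaining = bytes_already_sent
    while remaining > 0:
        chunk = source.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("source ended before the resume offset")
        digest.update(chunk)  # nothing is retransmitted, only re-hashed
        remaining -= len(chunk)
    return digest


def hash_rest(source, digest, chunk_size=1024 * 1024):
    """Continue hashing the bytes that still have to be sent."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return digest.hexdigest()
        digest.update(chunk)


if __name__ == "__main__":
    data = b"0123456789" * 100_000           # hypothetical source object
    stream = io.BytesIO(data)
    state = rebuild_md5(stream, 400_000)      # failure happened at byte 400,000
    print(hash_rest(stream, state) == hashlib.md5(data).hexdigest())
```

The further into the object the failure happens, the more source data has to be read again, which is where the extra time and read cost come from.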
Here are the slides I’ve used during the talk:
The service combines the performance and scalability of Google’s cloud with advanced security and sharing capabilities. Main highlights: a fast, scalable, highly available object store. Give it a try.