When you’re using S3, an object store with unlimited storage capacity and a maximum object size of 5TB (the maximum for a single PUT request is 5GB), you might be tempted to start uploading some pretty big files.
So today’s focus is on using the multipart upload capabilities of S3 to reduce the time it takes for a large object to land in your bucket.
The “managed” way
The AWS CLI has a number of commands that help you upload large files by automatically making use of multipart uploads, so chances are that if you have used the CLI to upload documents into your buckets you have come across them. Those commands are cp, mv and sync (the latter operates on directories rather than single files) and they can be used as follows.
> aws s3 cp your_large_file s3://your-bucket/
> aws s3 mv your_large_file s3://your-bucket/
> aws s3 sync your_large_file_directory s3://your-bucket/
The differences between the three are out of scope for this post; however, I’ll finish by saying that you can still tune their configuration in order to make better use of your bandwidth. You can set the new configuration values through the CLI or directly in your AWS profile. A list of all possible configuration values can be found in the AWS CLI S3 configuration documentation.
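For instance, here is a minimal sketch of tuning two of those values from the command line (the numbers are illustrative assumptions, not recommendations):
> aws configure set default.s3.multipart_chunksize 64MB
> aws configure set default.s3.max_concurrent_requests 20
The same settings can also live under the s3 key of the relevant profile section in ~/.aws/config.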
The “unmanaged” way
AWS recommends you use those commands when possible (and with good reason!), but there are cases in which they don’t fit the bill and you have to do a bit of the plumbing yourself. Luckily, you are not left alone; the AWS CLI still provides you with the necessary commands to achieve the same result.
So let’s go ahead and upload a large file in parts into our bucket. In my case, I’ll create a 100MB test file from the command line like this
> truncate -s 100M large_file
Now, I’ll use the split command to get four 25MB parts. Split is available on both Linux and OSX (however, the OSX version might be out of date and you might need to install the GNU core utilities).
> split -b 25M large_file
If you list the files in your directory, it should look something like this
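Assuming split’s default naming, the four parts will be called xaa, xab, xac and xad:
> ls
large_file  xaa  xab  xac  xad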
We are now ready to start interacting with S3!
The first step in the process is to actually create a multipart upload:
> aws s3api create-multipart-upload --bucket your-bucket-name --key your_file_name
The response from the API contains only three values, two of which were provided by you. The third is the UploadId and, as you can imagine, it will be our reference to this multipart upload operation, so go ahead and save it.
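For reference, the response looks something like this (the UploadId shown is just an illustrative placeholder; yours will be a much longer token):
{
    "Bucket": "your-bucket-name",
    "Key": "your_file_name",
    "UploadId": "ExampleUploadId"
}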
It is time to start uploading our parts. The following command uploads a single part; you’ll have to repeat it N times, where N is the number of parts you’ve split your file into (in my case N=4, and the command shown is for the first part). The values for part-number and body need to be updated accordingly for every part you upload.
> aws s3api upload-part --bucket your-bucket-name --key your_file_name --part-number 1 --body xaa --upload-id UploadId
The ETag value that each upload-part returns will be used to complete the upload.
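If you’d rather not type that command once per part, here is a minimal shell sketch that loops over the four parts and prints each ETag as it goes (bucket, key and UploadId are the same placeholders as above):
PART=1
for file in xaa xab xac xad; do
    # Upload one part and extract just its ETag from the JSON response
    ETAG=$(aws s3api upload-part \
        --bucket your-bucket-name \
        --key your_file_name \
        --part-number $PART \
        --body $file \
        --upload-id UploadId \
        --query ETag --output text)
    echo "Part $PART: $ETAG"
    PART=$((PART + 1))
done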
Once all parts are uploaded, you need to instruct S3 that the upload is complete. Remember, S3 has no knowledge of how many parts there should be or what their references are, so passing that information back to it is what completes the process. In order to do so, we need to compile a JSON array of all our parts and their respective ETag values.
You can use the ETag values that you have been collecting, or retrieve them again by listing all parts in the upload:
> aws s3api list-parts --bucket your-bucket-name --key your_file_name --upload-id UploadId
Save the output of the “Parts” array into a new file (I’ll call mine parts.json) and make sure not to include the LastModified and Size keys in the final file. Once you’re done, the file should look something like this; remember that in my case I was only dealing with four parts.
{ "Parts": [ { "PartNumber": 1, "ETag": "" }, { "PartNumber": 2, "ETag": "" }, { "PartNumber": 3, "ETag": "" }, { "PartNumber": 4, "ETag": "" } ] }
Now let’s use that to complete the upload with one final API call.
> aws s3api complete-multipart-upload --multipart-upload file://parts.json --bucket your-bucket-name --key your_file_name --upload-id UploadId
And we’re done! The response will contain the location of your newly uploaded file. We can call the list objects API or check the console if we want to double-check that our file is there.
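For a quick check from the command line, either of these will do (same placeholder bucket and key as before):
> aws s3 ls s3://your-bucket-name/
> aws s3api head-object --bucket your-bucket-name --key your_file_name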