Friday, September 18, 2015

s3 ETag and puppet

Hello faithful readers!

Today, I ponder a problem... how to know if I have the latest version of a file from an s3 bucket, and how to get that logic into puppet?

Amazon s3 buckets can contain objects, which contain an ETag... which is sometimes the md5, and sometimes not..

In trying to figure out how to know if I should download an object or not, I was inspired by the s3file module out on the puppet forge.

https://github.com/branan/puppet-module-s3file/blob/master/manifests/init.pp

Notably, this slick shell command to pull out the ETag and compare it to the md5sum of the file..
"[ -e ${name} ] && curl -I ${real_source} | grep ETag | grep `md5sum ${name} | cut -c1-32`"
However, I saw that in my use case, we used aws s3 cp /path/to/file s3://bucketname/key/path/ and our uploads happen as multipart uploads at the low level.  this makes our ETag break into multiple parts for files much smaller than the rest of the internet (I think at either 5mb or 15mb boundaries), which makes it not so keen for the above situation.

also, we don't have a publicly available bucket to curl, so we have to replace the curl -I call with a aws s3api head-object --bucket bucketname --key key/path/file.

While playing with the s3api command, I looked at a file I had uploaded previously.  When I uploaded it, I had used s3cmd's sync option.  I saw the following output.

$ aws s3api head-object --bucket bucketname --key noarch/epel-release-6-8.noarch.rpm {    "AcceptRanges": "bytes",     "ContentType": "binary/octet-stream",     "LastModified": "Sat, 19 Sep 2015 03:27:25 GMT",     "ContentLength": 14540,     "ETag": "\"2cd0ae668a585a14e07c2ea4f264d79b\"",     "Metadata": {        "s3cmd-attrs": "uid:502/gname:staff/uname:~~~~/gid:20/mode:33188/mtime:1352129496/atime:1441758431/md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"    }}


When I manually uploaded the file using aws s3 cp, I saw the following header info on the s3 object
$ aws s3api head-object --bucket bucketname --key epel-release-6-8.noarch.rpm {    "AcceptRanges": "bytes",     "ContentType": "binary/octet-stream",     "LastModified": "Sat, 19 Sep 2015 03:39:13 GMT",     "ContentLength": 14540,     "ETag": "\"2cd0ae668a585a14e07c2ea4f264d79b\"",     "Metadata": {}}

So, the MD5 IS in there... IF I use a program/command that sets it in the metadata.  


Back to my problem at hand... how would I know if I already downloaded the file?  What are my options?
  1. Figure out some sort of script to calculate the md5 sum of the file.  This would require me assuming the number of multi-part chunks
  2. Always make our upload process either not use multi-part uploads (keeping the ETag equal to the MD5sum) or use some tool or manually set the metadata to contain an md5sum
  3. Pull out the ETag value when I download the file, and store an additional file with my downloaded s3 file (e.g. if I have s3://bucket/foo.tar.gz, pull out it's ETag and write it to foo.tar.gz.etag after a successful download)
All three have disadvantages.

#1 - This script has to assume or calculate multi-part upload chunk size.  It is explored in some of the answers at this StackOverflow post

#2 - This requires our uploads to have happened a certain way.  If someone circumvents the standard process for uploading (uses a multipart upload, or uploads with the aws s3 cp instead of aws s3api put-object), then we would always download the item. 

# 3 - This requires us saving a separate metadata file alongside the file download.  If the application we are downloading the file for is sensitive to additional files hanging around, this may interfere.  there IS a big plus in this case though.  

aws s3api get-object has a helpgul argument.... 

--if-none-match (string) Return the  object  only  if  its  entity  tag
       (ETag) is different from the one specified, otherwise return a 304 (not
       modified).
If I use that, the s3 call would skip downloading the file if the ETag matches.  

Also, with option 3... how do I do it in puppet?  Do I have some clever chaining of exec resources?  Do I create a resource type that utilizes the aws-ruby sdk?


Hrm, another post with ideas and options but no solutions.... I'll have to ponder how I should move forward with this, and fuel a future blog post!

Thanks for reading!

1 comment:

  1. exec { "sync s3://${bucket}/${key} -> ${name}":
    path => ['/bin', '/usr/bin', '/usr/local/bin', ],
    command => "aws s3api get-object --bucket ${b} --key ${k} ${n} --if-none-match `[ -e ${n} ] && md5sum ${n} | cut -c1-32 || echo 0`",
    returns => [0, 255]
    }

    ReplyDelete