2021-16: ETag of Object Storage

Today a teammate asked about ETag-related issues in the group, which reminded me of some interesting details about ETags in object storage.

First, an ETag value must be wrapped in double quotes ("), as in S3's HeadObject response:

HTTP/1.1 200 OK
x-amz-id-2: ef8yU9AS1ed4OpIszj7UDNEHGran
x-amz-request-id: 318BC8BC143432E5
x-amz-version-id: 3HL4kqtJlcpXroDTDmjVBH40Nrjfkd
Date: Wed, 28 Oct 2009 22:32:00 GMT
Last-Modified: Sun, 1 Jan 2006 12:00:00 GMT
ETag: "fba9dede5f27731c9771645a39863328"
Content-Length: 434234
Content-Type: text/plain
Connection: close
Server: AmazonS3
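
Because of the quotes, a client usually has to unwrap the header value before comparing it with anything. Here is a minimal sketch of that step, assuming the header follows the HTTP entity-tag syntax (an optional `W/` weak prefix followed by a quoted string). `parse_etag` is a hypothetical helper written for this note, not part of any SDK.

```python
from typing import Tuple


def parse_etag(header_value: str) -> Tuple[str, bool]:
    """Return (opaque_tag, is_weak) for a raw ETag header value.

    Hypothetical helper: assumes the HTTP entity-tag syntax,
    i.e. an optional 'W/' prefix followed by a double-quoted string.
    """
    value = header_value.strip()
    is_weak = value.startswith("W/")
    if is_weak:
        value = value[2:]
    # A well-formed entity-tag is always wrapped in double quotes.
    if len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        raise ValueError(f"malformed ETag: {header_value!r}")
    return value[1:-1], is_weak


tag, weak = parse_etag('"fba9dede5f27731c9771645a39863328"')
print(tag, weak)
```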

Secondly, the purpose of an ETag is to distinguish different versions of a resource, so it is up to the implementation to decide how it is generated. It can be a hash of the content, a hash of the last modified time, or even a self-defined version number. In an object store, the ETag that gets generated depends on how the content was uploaded.

Common object storage services offer several ways to upload an object:

PostObject & PutObject
Append Object
Multipart Uploads (Specify the part number when uploading, and finally use the part number to assemble the object)
Block Uploads (The block ids are returned on upload and finally combined into an object using the block ids)
Page Uploads (Specify Range when uploading)

For PostObject & PutObject, most object storage services limit the object size to 5 GB, so the service can compute a hash of the whole content at upload time and write it to the object's metadata. Most services use MD5 for this, which means the ETag carries the same digest as the Content-MD5 header (though the semantics are different).
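
To make the relationship concrete, here is a small sketch, assuming a service that uses the plain MD5 of the content as the ETag for single-request uploads. Both values come from the same digest but are encoded differently: the ETag is usually the quoted hex form, while Content-MD5 is the base64 form. `md5_etag_and_content_md5` is a name made up for this example.

```python
import base64
import hashlib
from typing import Tuple


def md5_etag_and_content_md5(data: bytes) -> Tuple[str, str]:
    """Return (etag, content_md5) derived from one MD5 digest of `data`.

    Assumption: the service uses the content MD5 as the ETag
    for single-request uploads such as PutObject.
    """
    digest = hashlib.md5(data).digest()
    etag = '"' + digest.hex() + '"'  # quoted hex, as seen in the ETag header
    content_md5 = base64.b64encode(digest).decode("ascii")  # base64, as sent in Content-MD5
    return etag, content_md5


etag, content_md5 = md5_etag_and_content_md5(b"hello world")
print(etag)
print(content_md5)
```

So even when the ETag really is an MD5, comparing it with a Content-MD5 value means decoding one of the two first.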

For the other upload methods, the ETag is not necessarily an MD5. Since MD5 digests cannot be combined, we cannot compute the MD5 of A+B from the known MD5 values of A and B. Considering that the final object produced by these uploads can reach TB size, it is not worthwhile to read the whole object from beginning to end just to compute its MD5, so the service has to use another algorithm to generate the ETag. For example, a common practice is to combine the ETags of all parts and return the MD5 of the combination.
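
The "combine the ETag of each part" approach can be sketched as follows. This is an illustration, not any service's documented algorithm: it assumes the scheme commonly observed with S3 multipart uploads, where the binary MD5 digests of the parts are concatenated, hashed with MD5 again, and suffixed with `-<number of parts>`. Other services may combine parts differently.

```python
import hashlib
from typing import Iterable


def multipart_etag(parts: Iterable[bytes]) -> str:
    """Compute an S3-style multipart ETag from the part contents.

    Assumption: MD5 over the concatenated binary MD5 digests of
    the parts, followed by '-<part count>'.
    """
    part_digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f'"{combined}-{len(part_digests)}"'


parts = [b"a" * 5, b"b" * 3]
whole_md5 = hashlib.md5(b"".join(parts)).hexdigest()

print(multipart_etag(parts))  # ends with -2
print(whole_md5)              # MD5 of the full content, a different value
```

The service only needs the per-part digests it already has, so it never has to re-read the assembled object. The cost is that the result depends on how the object was split into parts, not only on its content.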

So you need to take this into account when developing your application, and not simply treat the ETag as the Content-MD5 of the object content. This is why go-storage proposes "Normalize content hash check" (blob/master/rfcs/14-normalize-content-hash-check.md).
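
In application code, one defensive pattern is to use an ETag as a content hash only when it at least looks like a plain MD5. The sketch below is a heuristic written for this note, not the go-storage API: `md5_from_etag` is a hypothetical name, and the 32-hex-character check is an assumption about what an MD5-style ETag looks like. A value that passes is still only a hint, because a service is free to return 32 hex characters that are not the MD5 of the content.

```python
import re
from typing import Optional

# 32 lowercase hex characters: the shape of a hex-encoded MD5 digest.
_MD5_HEX = re.compile(r"^[0-9a-f]{32}$")


def md5_from_etag(etag: str) -> Optional[str]:
    """Return the hex MD5 if the ETag looks like one, else None.

    Heuristic only: multipart-style values such as "<hex>-3" and
    weak ETags such as W/"..." are rejected.
    """
    tag = etag.strip().strip('"').lower()
    return tag if _MD5_HEX.match(tag) else None


print(md5_from_etag('"fba9dede5f27731c9771645a39863328"'))
print(md5_from_etag('"fba9dede5f27731c9771645a39863328-3"'))
```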

This week I'm working on getting AOS into the Open Source Software Supply Chain Illumination Program - Summer 2021. The organization review has been approved, and we are now preparing to submit the project. Students who are interested in our project are welcome to come to event-ospp-summer-2021 for a chat~

See you next week!

Translated via DeepL with a bit of modification.