@hssyoo hssyoo commented Sep 18, 2025

This PR implements full object checksum validation for multipart downloads when using the high-level s3 command.

Currently, checksum validation happens only in the low-level S3 client, via Botocore. That validation applies only when the object is retrieved in a single GET request, or when the requested range happens to align with a part boundary (if the object was uploaded via MPU).

At a high level:

  1. Each download task calls an initial HeadObject (existing behavior), where it retrieves the stored checksum value and algorithm, if present in the response.
  2. If there's a stored checksum, the checksum type is FULL_OBJECT, and the checksum algorithm is CRC-based, then it provides the checksum value and algorithm to the future meta.
  3. If a multipart download is required and the checksum properties were provided to the future meta, then it creates a FullObjectChecksum object that's responsible for storing part checksums, later combining them into a full object checksum, and validating the result.
  4. As parts are being downloaded, the response streams are wrapped by PartStreamingChecksumBody. This class calculates the checksum for each part, unless the underlying stream is already calculating the checksum, in which case the underlying stream's checksum is reused.
  5. Once all part download tasks are completed, a final validation task is invoked. FullObjectChecksum combines all part-level checksums and then validates the calculated checksum against the stored checksum.
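
To make step 4 concrete, here is a minimal sketch of a stream wrapper that tracks a CRC32 while a part is being read. `Crc32TrackingStream` and `read_part` are hypothetical names for illustration only; this is not the PR's `PartStreamingChecksumBody`, it assumes CRC32 via the standard library's `zlib`, and it does not cover the case where the underlying stream's checksum is reused.

```python
import io
import zlib


class Crc32TrackingStream:
    """Hypothetical wrapper: updates a running CRC32 as bytes are read."""

    def __init__(self, raw):
        self._raw = raw
        self.crc = 0     # running CRC32 of everything read so far
        self.length = 0  # bytes read; a later combine step needs this

    def read(self, amt=-1):
        chunk = self._raw.read(amt)
        # zlib.crc32 accepts the previous CRC as a starting value,
        # so the checksum can be built up chunk by chunk.
        self.crc = zlib.crc32(chunk, self.crc)
        self.length += len(chunk)
        return chunk


def read_part(stream, chunk_size=8192):
    """Drain one part and return (crc, length) for later combining."""
    wrapped = Crc32TrackingStream(stream)
    while wrapped.read(chunk_size):
        pass
    return wrapped.crc, wrapped.length
```

Recording the part length alongside the CRC matters because combining two CRCs needs the byte length of the second segment.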

Note that s3transfer.checksums.combine_crc32 will only be added to the S3Transfer library, not to AWS CLI v2's vended library, because AWS CLI v2 can just use CRT's future bindings. It's included in this PR for now so reviewers can play around with it.
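
For reviewers who want to see the idea behind combining part checksums, here is a self-contained sketch. It is a Python port of the classic zlib `crc32_combine` approach (GF(2) matrix squaring), not the code in this PR: the signature of `s3transfer.checksums.combine_crc32` isn't shown here, so `crc32_combine` and `validate_full_object` below are hypothetical names, and only CRC32 is handled.

```python
import base64
import zlib


def _gf2_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (list of column ints) by a vector."""
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total


def _gf2_square(mat):
    """Square a GF(2) matrix, doubling the number of zero bits it skips."""
    return [_gf2_times(mat, col) for col in mat]


def crc32_combine(crc1, crc2, len2):
    """Return crc32(A + B) given crc32(A), crc32(B), and len(B) in bytes."""
    if len2 <= 0:
        return crc1
    # Operator that advances the CRC register past one zero bit.
    odd = [0xEDB88320] + [1 << n for n in range(31)]
    even = _gf2_square(odd)  # two zero bits
    odd = _gf2_square(even)  # four zero bits
    while True:
        even = _gf2_square(odd)  # first pass: eight zero bits (one byte)
        if len2 & 1:
            crc1 = _gf2_times(even, crc1)
        len2 >>= 1
        if not len2:
            break
        odd = _gf2_square(even)
        if len2 & 1:
            crc1 = _gf2_times(odd, crc1)
        len2 >>= 1
        if not len2:
            break
    return crc1 ^ crc2


def validate_full_object(parts, stored_b64):
    """parts: (crc, length) pairs in ascending offset order."""
    combined = 0  # CRC32 of zero bytes
    for crc, length in parts:
        combined = crc32_combine(combined, crc, length)
    # CRC32-specific encoding: 4 big-endian bytes, then base64.
    calculated = base64.b64encode(combined.to_bytes(4, "big")).decode("ascii")
    return calculated == stored_b64
```

The cost is logarithmic in the part length rather than linear, which is why parts can be checksummed independently and in any order, as long as they are combined in offset order.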

Calculating full object checksums for multipart uploads will reuse some code from this PR. To get a sense for what that may look like, refer to this POC PR: #9660

To manually test:
Upload an object with CRC32 with a single PUT request:

aws s3api put-object --body myobject --bucket mybucket --key mykey --checksum-algorithm CRC32

Ensure checksum type is FULL_OBJECT and has CRC32 checksum value:

aws s3api head-object --bucket mybucket --key mykey --checksum-mode ENABLED
{
    "AcceptRanges": "bytes",
    "LastModified": "2025-09-15T14:13:18+00:00",
    "ContentLength": 39542919,
    "ChecksumCRC32": "mychecksum",
    "ChecksumType": "FULL_OBJECT",
    "ETag": "\"myetag\"",
    "ContentType": "application/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

Download object using multipart ranged GETs:

aws s3 cp s3://mybucket/mykey /tmp/ --debug

In the debug logs you should see something like:

s3transfer.checksums - DEBUG - Successfully validated stored checksum mychecksum against calculated checksum mychecksum

Comment on lines +111 to +113
self._calculated_checksum = base64.b64encode(
    combined.to_bytes(4, byteorder='big')
).decode('ascii')

This encoding is actually dependent on the checksum algorithm. Instead, I think I'll instantiate the appropriate checksum object, set its int_crc property to the combined value, and then call the digest() method.
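
To illustrate why the encoding depends on the algorithm: CRC32 and CRC32C produce 4-byte digests, while CRC64NVME produces 8 bytes. The sketch below uses a simple lookup table rather than the checksum-object approach described in the comment above; `CRC_DIGEST_SIZES` and `encode_crc` are hypothetical names, not part of the PR.

```python
import base64

# Digest widths in bytes for S3's CRC-based checksum algorithms.
CRC_DIGEST_SIZES = {"CRC32": 4, "CRC32C": 4, "CRC64NVME": 8}


def encode_crc(int_crc, algorithm):
    """Base64-encode an integer CRC using the algorithm's digest width."""
    size = CRC_DIGEST_SIZES[algorithm.upper()]
    raw = int_crc.to_bytes(size, byteorder="big")
    return base64.b64encode(raw).decode("ascii")
```

Hard-coding a width of 4 would raise OverflowError for any CRC64NVME value that doesn't fit in 32 bits, and would produce a digest of the wrong length for smaller ones.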
