Skip to content

Conversation

@ggershinsky
Copy link
Contributor

* Avro data encryption: enable encryption of data files in tables that use the Avro format.
* Tamper proofing of Iceberg data and metadata files.

## Technical approach
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the spec I don't think we need information about the reference implementation but that's just me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. We have a section with this name in a parquet spec, but I don't see it in the iceberg specs. I'll remove this section, and move the spec-related parts to the relevant sections.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think minus the technical section this looks good to me

where

- `nonce` is the AES GCM nonce, with a length of 12 bytes.
- `ciphertext` is the encrypted block. Its length is identical to the length of the block before encryption ("plaintext"). The length of all plaintext blocks, except the last, is `BlockLength` bytes. The last block keeps the data residue, with a length <= `BlockLength`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit awkward because streaming through the file will eventually hit a partial block and EOF. Then the reader would need to backtrack and separate the last 12 bytes read for the tag. The reader could also keep track of position in the file and stop before reading the tag and hitting EOF, but then that requires position tracking and logic.

I don't see a good way around this, but I thought I'd raise it in case anyone else does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AGS should not hit EOF. It gets the length of the encrypted file from the source InputFile object, then reads the block side, and knows from the start the number/size of the encrypted blocks.


The AES GCM cipher protects against byte replacement inside a ciphertext block - but, without an AAD, it can't prevent replacement of one ciphertext block with another (encrypted with the same key). AGS leverages AADs to protect against swapping ciphertext blocks inside a file or between files. AGS can also protect against swapping full files - for example, replacement of a metadata file with an old version. AADs are built to reflects the identity of a file and of the blocks inside the file.

AGS constructs a block AAD from two components: an AAD prefix - a string provided by Iceberg for the file (with the file ID), and an AAD suffix - the block sequence number in the file, as an int in a 4-byte little-endian form. The block AAD is a direct concatenation of the prefix and suffix parts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide more context on how the AAD prefix is managed for the metadata tree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this spec is narrowly focused on the low level encryption stream and the tools it provides, including a file ID (aka AAD prefix) that should be supplied by the upper layers. We don't have the details yet on how it will be constructed for an Iceberg metadata tree; TBD; but once we have these details, I guess we can/should put them in a separate higher-level spec doc? E.g. a new encryption/integrity section in the Iceberg spec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine. We'll need to solve that problem eventually though.

Cipher blocks have the following structure

| nonce | ciphertext | tag |
|-------|------------|-----|
Copy link
Contributor

@rdblue rdblue Aug 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be able to split formats like Avro that are splittable, we need to be able to go back and forth between offsets in the underlying data stream and in the AGS stream. I think that's possible since this has a strict format. The header structure is 8 bytes (4 magic + 4 block length) and then there is 24 bytes of overhead per CipherBlock (12 nonce + 12 tag).

For any given offset in the underlying stream, we can find the block where it will be found by dividing by the block length:

def plain_block_index(plain_offset, block_length):
    return offset // block_length

Then we can skip to a given block:

def cipher_block_offset(block_index, block_length):
    return 8 + block_index * (24 + block_length)

But the problem is that we don't get split offsets in the underlying file, we get split offsets in the AGS stream and need to figure out where in the underlying file those translate to, so that we don't have overlapping ranges that tasks are responsible for. We have a picture like this for a block size of 1024, 3 blocks, and a last block of size 553:

magic block length nonce-1 ciphertext-1 tag-1 nonce-2 ciphertext-2 tag-2 nonce-3 ciphertext-3 tag-3 EOF
AGS offset 0 4 8 20 (to 1043) 1044 1056 1068 (to 2091) 2092 2104 2116 (to 2668) 2669 2681
Stream offset 0 (to 1023) 1024 (to 2047) 2048 (to 2601) 2602

To convert an AGS offset into a stream offset, I think we need to produce the the block index and an offset in that block. Producing the block index is the opposite of the cipher_block_offset function:

def cipher_block_index(cipher_offset, block_length):
    if (cipher_offset < 8):
        raise ValueError()
    return (cipher_offset - 8) // (block_length + 24)

Then the offset in the plaintext block can be produced by subtracting the ciphertext offset from the start of the ciphertext block and some extra accounting:

def plaintext_offset(cipher_offset, block_length):
    block_index = cipher_block_index(cipher_offset, block_length)
    block_start = cipher_block_offset(block_index, block_length)
    local_offset = min(cipher_offset - block_start, block_length)
    return block_index * block_length + local_offset

The min with block_length ensures that the local block offsets never overlap. Offsets that are between the ciphertext blocks (like 1048 in tag-1 for the example above) are mapped to the end of the plaintext block. This gives us a stable way to recover plaintext offsets.

Using the plaintext offsets, I think we can reliably split AGS files that store Avro data:

  1. Convert AGS byte range splits into plaintext splits, removing any 0-byte splits (unlikely but possible)
  2. Implement an AGS stream that can seek to plaintext offsets (using the translation methods above)
  3. Pass the plaintext splits into the Avro reader with the plaintext seeking AGS stream

@ggershinsky, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue , thanks! I need to think deeper about this; currently on vacation; will respond the first week of September.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue I'm back in office, went through this carefully. I believe we are on the same page. The technical approach you describe is the same as used in the spec implementation PR

At a high level, this should be fully transparent to the current Avro readers. The API they use (file getLength; stream skip, seek, getPos) return the same values and work the same as before/without AGS encryption. This is the AGS code job to translate the plaintext<->ciphertext stuff behind the scenes. So, as long as the Avro split functionality depends on these APIs, it should continue to work as usual with the AGS encrypted files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should work.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky, this seems fine, but I don't think several of the changes I requested in the last round of review were ever pushed. I only see comments about them.

To get this in, I think this needs:

  1. To stop using the term AGS -- this is confusing and I don't think that we should create an acronym for this.
  2. Note that a goal is to be compatible with splittable formats like Avro.
  3. Define "residue" before it is used.
  4. Clarify how the ending "residue" block is handled to avoid the problem I noted

@ggershinsky
Copy link
Contributor Author

All suggestions SGTM, I'll update the doc accordingly.

@ggershinsky
Copy link
Contributor Author

@rdblue I've sent a commit that addresses the review comments.
Regarding the magic string - it's 4-byte long, so I've kept "AGS1" (only) there; but if it's a problem, I can replace it with any suggested alternative.

@rdblue rdblue merged commit 4b23ecb into apache:master Nov 27, 2022
@rdblue
Copy link
Contributor

rdblue commented Nov 27, 2022

Looks great. I merged this. Thanks, @ggershinsky!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants