AES GCM Stream Spec #5432

ggershinsky · 2022-08-04T06:32:20Z

@rdblue @RussellSpitzer @jackye1995

RussellSpitzer · 2022-08-04T15:46:53Z

format/gcm-stream-spec.md

+* Avro data encryption: enable encryption of data files in tables that use the Avro format.
+* Tamper proofing of Iceberg data and metadata files.
+
+## Technical approach


For the spec I don't think we need information about the reference implementation but that's just me.

Sounds good. We have a section with this name in a parquet spec, but I don't see it in the iceberg specs. I'll remove this section, and move the spec-related parts to the relevant sections.

RussellSpitzer

I think minus the technical section this looks good to me

format/gcm-stream-spec.md

rdblue · 2022-08-21T22:41:02Z

format/gcm-stream-spec.md

+where
+
+- `nonce` is the AES GCM nonce, with a length of 12 bytes.
+- `ciphertext` is the encrypted block. Its length is identical to the length of the block before encryption ("plaintext"). The length of all plaintext blocks, except the last, is `BlockLength` bytes. The last block keeps the data residue, with a length <= `BlockLength`.


This seems a bit awkward because streaming through the file will eventually hit a partial block and EOF. Then the reader would need to backtrack and separate the last 12 bytes read for the tag. The reader could also keep track of position in the file and stop before reading the tag and hitting EOF, but then that requires position tracking and logic.

I don't see a good way around this, but I thought I'd raise it in case anyone else does.

AGS should not hit EOF. It gets the length of the encrypted file from the source InputFile object, then reads the block side, and knows from the start the number/size of the encrypted blocks.

rdblue · 2022-08-21T22:43:48Z

format/gcm-stream-spec.md

+
+The AES GCM cipher protects against byte replacement inside a ciphertext block - but, without an AAD, it can't prevent replacement of one ciphertext block with another (encrypted with the same key). AGS leverages AADs to protect against swapping ciphertext blocks inside a file or between files. AGS can also protect against swapping full files - for example, replacement of a metadata file with an old version. AADs are built to reflects the identity of a file and of the blocks inside the file.
+
+AGS constructs a block AAD from two components: an AAD prefix - a string provided by Iceberg for the file (with the file ID), and an AAD suffix - the block sequence number in the file, as an int in a 4-byte little-endian form. The block AAD is a direct concatenation of the prefix and suffix parts.


Can you provide more context on how the AAD prefix is managed for the metadata tree?

Technically, this spec is narrowly focused on the low level encryption stream and the tools it provides, including a file ID (aka AAD prefix) that should be supplied by the upper layers. We don't have the details yet on how it will be constructed for an Iceberg metadata tree; TBD; but once we have these details, I guess we can/should put them in a separate higher-level spec doc? E.g. a new encryption/integrity section in the Iceberg spec?

Sounds fine. We'll need to solve that problem eventually though.

rdblue · 2022-08-21T23:34:20Z

format/gcm-stream-spec.md

+Cipher blocks have the following structure
+
+| nonce | ciphertext | tag |
+|-------|------------|-----|


To be able to split formats like Avro that are splittable, we need to be able to go back and forth between offsets in the underlying data stream and in the AGS stream. I think that's possible since this has a strict format. The header structure is 8 bytes (4 magic + 4 block length) and then there is 24 bytes of overhead per CipherBlock (12 nonce + 12 tag).

For any given offset in the underlying stream, we can find the block where it will be found by dividing by the block length:

def plain_block_index(plain_offset, block_length): return offset // block_length

Then we can skip to a given block:

def cipher_block_offset(block_index, block_length): return 8 + block_index * (24 + block_length)

But the problem is that we don't get split offsets in the underlying file, we get split offsets in the AGS stream and need to figure out where in the underlying file those translate to, so that we don't have overlapping ranges that tasks are responsible for. We have a picture like this for a block size of 1024, 3 blocks, and a last block of size 553:

magic block length nonce-1 ciphertext-1 tag-1 nonce-2 ciphertext-2 tag-2 nonce-3 ciphertext-3 tag-3 EOF

AGS offset 0 4 8 20 (to 1043) 1044 1056 1068 (to 2091) 2092 2104 2116 (to 2668) 2669 2681

Stream offset 0 (to 1023) 1024 (to 2047) 2048 (to 2601) 2602

To convert an AGS offset into a stream offset, I think we need to produce the the block index and an offset in that block. Producing the block index is the opposite of the cipher_block_offset function:

def cipher_block_index(cipher_offset, block_length): if (cipher_offset < 8): raise ValueError() return (cipher_offset - 8) // (block_length + 24)

Then the offset in the plaintext block can be produced by subtracting the ciphertext offset from the start of the ciphertext block and some extra accounting:

def plaintext_offset(cipher_offset, block_length): block_index = cipher_block_index(cipher_offset, block_length) block_start = cipher_block_offset(block_index, block_length) local_offset = min(cipher_offset - block_start, block_length) return block_index * block_length + local_offset

The min with block_length ensures that the local block offsets never overlap. Offsets that are between the ciphertext blocks (like 1048 in tag-1 for the example above) are mapped to the end of the plaintext block. This gives us a stable way to recover plaintext offsets.

Using the plaintext offsets, I think we can reliably split AGS files that store Avro data:

Convert AGS byte range splits into plaintext splits, removing any 0-byte splits (unlikely but possible)

Implement an AGS stream that can seek to plaintext offsets (using the translation methods above)

Pass the plaintext splits into the Avro reader with the plaintext seeking AGS stream

@ggershinsky, what do you think?

@rdblue , thanks! I need to think deeper about this; currently on vacation; will respond the first week of September.

@rdblue I'm back in office, went through this carefully. I believe we are on the same page. The technical approach you describe is the same as used in the spec implementation PR

At a high level, this should be fully transparent to the current Avro readers. The API they use (file getLength; stream skip, seek, getPos) return the same values and work the same as before/without AGS encryption. This is the AGS code job to translate the plaintext<->ciphertext stuff behind the scenes. So, as long as the Avro split functionality depends on these APIs, it should continue to work as usual with the AGS encrypted files.

@rdblue what do you think?

I think that this should work.

format/gcm-stream-spec.md

rdblue

@ggershinsky, this seems fine, but I don't think several of the changes I requested in the last round of review were ever pushed. I only see comments about them.

To get this in, I think this needs:

To stop using the term AGS -- this is confusing and I don't think that we should create an acronym for this.
Note that a goal is to be compatible with splittable formats like Avro.
Define "residue" before it is used.
Clarify how the ending "residue" block is handled to avoid the problem I noted

ggershinsky · 2022-11-09T12:09:25Z

All suggestions SGTM, I'll update the doc accordingly.

ggershinsky · 2022-11-10T11:41:03Z

@rdblue I've sent a commit that addresses the review comments.
Regarding the magic string - it's 4-byte long, so I've kept "AGS1" (only) there; but if it's a problem, I can replace it with any suggested alternative.

rdblue · 2022-11-27T22:08:48Z

Looks great. I merged this. Thanks, @ggershinsky!

ggershinsky added 2 commits August 4, 2022 09:23

initial commit

fc4413a

clean up

8decf43

ggershinsky mentioned this pull request Aug 4, 2022

GCM encryption stream #3231

Merged

RussellSpitzer reviewed Aug 4, 2022

View reviewed changes

RussellSpitzer approved these changes Aug 4, 2022

View reviewed changes

rm tech approach section

b0b121e

rdblue reviewed Aug 21, 2022

View reviewed changes

format/gcm-stream-spec.md Outdated Show resolved Hide resolved

rdblue reviewed Aug 21, 2022

View reviewed changes

format/gcm-stream-spec.md Outdated Show resolved Hide resolved

rdblue reviewed Aug 21, 2022

View reviewed changes

format/gcm-stream-spec.md Show resolved Hide resolved

rdblue reviewed Aug 21, 2022

View reviewed changes

rdblue reviewed Nov 7, 2022

View reviewed changes

format/gcm-stream-spec.md Show resolved Hide resolved

rdblue requested changes Nov 7, 2022

View reviewed changes

address review comments

87e7088

rdblue merged commit 4b23ecb into apache:master Nov 27, 2022


		The AES GCM cipher protects against byte replacement inside a ciphertext block - but, without an AAD, it can't prevent replacement of one ciphertext block with another (encrypted with the same key). AGS leverages AADs to protect against swapping ciphertext blocks inside a file or between files. AGS can also protect against swapping full files - for example, replacement of a metadata file with an old version. AADs are built to reflects the identity of a file and of the blocks inside the file.

		AGS constructs a block AAD from two components: an AAD prefix - a string provided by Iceberg for the file (with the file ID), and an AAD suffix - the block sequence number in the file, as an int in a 4-byte little-endian form. The block AAD is a direct concatenation of the prefix and suffix parts.

	magic	block length	nonce-1	ciphertext-1	tag-1	nonce-2	ciphertext-2	tag-2	nonce-3	ciphertext-3	tag-3	EOF
AGS offset	0	4	8	20 (to 1043)	1044	1056	1068 (to 2091)	2092	2104	2116 (to 2668)	2669	2681
Stream offset				0 (to 1023)			1024 (to 2047)			2048 (to 2601)		2602

AES GCM Stream Spec #5432

AES GCM Stream Spec #5432

Uh oh!

Conversation

ggershinsky commented Aug 4, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Aug 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rdblue left a comment

Choose a reason for hiding this comment

Uh oh!

ggershinsky commented Nov 9, 2022

Uh oh!

ggershinsky commented Nov 10, 2022

Uh oh!

rdblue commented Nov 27, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rdblue Aug 21, 2022 •

edited

Loading