Deliver key metadata for encryption of data files #9359
Conversation
Force-pushed 7df9db2 to 80ff9d5
The last round of comments in #6762 is addressed here.
```java
public static WriteBuilder write(EncryptedOutputFile file) {
  Preconditions.checkState(
      file.keyMetadata() == null || file.keyMetadata() == EncryptionKeyMetadata.EMPTY,
      "Currently, encryption of data files in Avro format is not supported");
```
Why is this not supported? The AES GCM stream could easily be used here.
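For context, a minimal sketch of the raw JDK AES/GCM primitive that AES GCM streams build on. This is an illustration only, not Iceberg's `AesGcmOutputStream` format (the real stream format splits a file into independently encrypted blocks with per-block IVs); all class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmSketch {
  // encrypt a block with AES/GCM; the 128-bit auth tag is appended to the ciphertext
  static byte[] seal(SecretKey key, byte[] iv, byte[] plain) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    return c.doFinal(plain);
  }

  // decrypt and authenticate; throws AEADBadTagException if the block was tampered with
  static byte[] open(SecretKey key, byte[] iv, byte[] sealed) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
    return c.doFinal(sealed);
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator gen = KeyGenerator.getInstance("AES");
    gen.init(128);
    SecretKey key = gen.generateKey();
    byte[] iv = new byte[12]; // 96-bit IV, the recommended size for GCM
    new SecureRandom().nextBytes(iv);

    byte[] sealed = seal(key, iv, "avro block".getBytes(StandardCharsets.UTF_8));
    byte[] plain = open(key, iv, sealed);
    System.out.println(new String(plain, StandardCharsets.UTF_8)); // prints "avro block"
  }
}
```

Because GCM is authenticated, wrapping Avro file bytes this way protects integrity as well as confidentiality, which is why no Avro-specific changes are needed.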
We should also run TestAvroFileSplit on Avro inside of AES GCM streams
SGTM
How nice. I expected Avro table encryption to work directly with AES GCM Streams, though not without some hiccups and fixes, since I had never run this use case before. It turns out it just works out of the box.
I now have a functioning e2e unit test that encrypts/decrypts an Iceberg table in the Avro data format. The unit test is based on Spark SQL and catalog clients, so it will go into the integration PR.
I'll add an encrypting version of TestAvroFileSplit to this PR.
Of course, I'll also verify where the file length comes from for the Avro reader.
```java
}

@Override
public FileAppender<InternalRow> newAppender(EncryptedOutputFile file, FileFormat fileFormat) {
```
Ah, I was expecting these changes in Spark 3.5. Feel free to move them there, or we can add them to Spark 3.4 and open follow up PRs.
Sounds good, I'll move them to 3.5.
```java
  }
}

private InputFile getSourceFile(EncryptedInputFile encryptedFile) {
```
Style: Iceberg method names should not include `get`. Instead, use a more helpful verb like `create` or `find`, or simply leave it out (e.g. `sourceFile`).
```java
    String tableKeyId,
    int dataKeyLength,
    KeyManagementClient kmsClient,
    boolean nativeDataEncryption) {
```
I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type.
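To illustrate the point, a hypothetical stand-in (not Iceberg's API; all names here are invented): the native-vs-stream decision is a function of each file's format, so a single constructor-level boolean cannot capture it.

```java
// Hypothetical sketch: the choice depends on the format of the individual file.
enum FileFormat { PARQUET, AVRO }

class EncryptionDispatchSketch {
  // Parquet carries its own native (modular) encryption; under the assumption in
  // this review thread, other formats such as Avro go through AES GCM streams.
  static boolean usesNativeEncryption(FileFormat format) {
    return format == FileFormat.PARQUET;
  }
}
```

A manager-wide flag would force every file in the table to take the same path, which breaks as soon as a table mixes Parquet and Avro data files.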
```diff
 public Iterable<InputFile> decrypt(Iterable<EncryptedInputFile> encrypted) {
   // Bulk decrypt is only applied to data files. Returning source input files for parquet.
-  return Iterables.transform(encrypted, this::decrypt);
+  if (nativeDataEncryption) {
```
There is no way to know the intended file format or whether it will use native encryption at this point. I think this needs to always return the result of decrypt.
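A generic stand-in sketch of that shape (hypothetical names, not Iceberg code): the bulk path simply maps every file through the single-file decrypt, and any native-vs-stream choice lives inside the per-file function, where the file's own key metadata is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BulkDecryptSketch {
  // always delegate to decryptOne; no manager-wide format flag is consulted here
  static <I, O> List<O> decryptAll(Iterable<I> encrypted, Function<I, O> decryptOne) {
    List<O> out = new ArrayList<>();
    for (I file : encrypted) {
      out.add(decryptOne.apply(file));
    }
    return out;
  }
}
```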
@ggershinsky, I'm still working through this review, so don't feel like you need to address or respond to my comments yet! Also, here are a few notes for myself when I pick this up tomorrow:
```diff
 import org.apache.iceberg.types.Types;

-class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
+public class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
```
We can't make this class public because it implements an Avro interface. If it needs to be public, we will have to extract StandardKeyMetadata as an interface.
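One possible shape for that extraction, as a hedged sketch (the interface and class names below are hypothetical, not the PR's code): expose a small public interface and keep the Avro-coupled implementation package-private, so `IndexedRecord` never leaks into the public API.

```java
import java.nio.ByteBuffer;

// Hypothetical public-facing interface for key metadata.
interface KeyMetadataView {
  ByteBuffer encryptionKey();
}

// The package-private implementation would also implement Avro's IndexedRecord
// internally; callers only ever see KeyMetadataView.
class StandardKeyMetadataImpl implements KeyMetadataView {
  private final ByteBuffer key;

  StandardKeyMetadataImpl(ByteBuffer key) {
    this.key = key;
  }

  @Override
  public ByteBuffer encryptionKey() {
    return key;
  }
}
```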
```java
protected CloseableIterable<ColumnarBatch> newBatchIterable(
    InputFile inputFile,
    ByteBuffer keyMetadata,
```
Looks like these changes can't be made to the corresponding BaseBatchReader in iceberg-arrow because it is public. I think that means that Arrow cannot support reading from encrypted tables in this PR even if it calls the EncryptionManager correctly.
Add native subclasses for InputFile and OutputFile to simplify.
Merged! Thanks, @ggershinsky for getting this done!
Moving #6762 to main branch base