
Conversation

@ggershinsky
Contributor

Moving #6762 to main branch base

@ggershinsky force-pushed the deliver-key-metadata2 branch from 7df9db2 to 80ff9d5 on December 21, 2023 at 17:07
@ggershinsky
Contributor Author

The last round of comments in #6762 is addressed here.

public static WriteBuilder write(EncryptedOutputFile file) {
  Preconditions.checkState(
      file.keyMetadata() == null || file.keyMetadata() == EncryptionKeyMetadata.EMPTY,
      "Currently, encryption of data files in Avro format is not supported");
Contributor

Why is this not supported? The AES GCM stream could easily be used here.
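For context, the JDK's built-in AES/GCM cipher provides everything a GCM stream wrapper fundamentally needs. A minimal round-trip sketch using only `javax.crypto` (class and method names here are illustrative, not Iceberg's GCM stream API):

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmBlockRoundTrip {
  private static final int NONCE_LEN = 12;  // 96-bit nonce
  private static final int TAG_BITS = 128;  // 16-byte authentication tag

  // Encrypt one block; a fresh random nonce is prepended to the ciphertext.
  public static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
    byte[] nonce = new byte[NONCE_LEN];
    new SecureRandom().nextBytes(nonce);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
    byte[] ct = cipher.doFinal(plain);
    byte[] out = new byte[NONCE_LEN + ct.length];
    System.arraycopy(nonce, 0, out, 0, NONCE_LEN);
    System.arraycopy(ct, 0, out, NONCE_LEN, ct.length);
    return out;
  }

  // Decrypt one block produced by encrypt(); throws AEADBadTagException on tampering.
  public static byte[] decrypt(SecretKey key, byte[] block) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, block, 0, NONCE_LEN));
    return cipher.doFinal(block, NONCE_LEN, block.length - NONCE_LEN);
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator gen = KeyGenerator.getInstance("AES");
    gen.init(128);
    SecretKey key = gen.generateKey();
    byte[] plain = "avro block".getBytes(StandardCharsets.UTF_8);
    byte[] roundTrip = decrypt(key, encrypt(key, plain));
    System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
  }
}
```

The real stream format additionally length-prefixes each block and binds AAD to the block position, but the cipher round trip is the core of it.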

Contributor

We should also run TestAvroFileSplit on Avro inside of AES GCM streams

Contributor Author

SGTM

Contributor Author

How nice. I expected Avro table encryption to work with AES GCM Streams, but not without some hiccups and fixes, since I had never run this use case before. Turns out it just works out of the box.
I now have a functioning e2e unit test that encrypts/decrypts an Iceberg table with the Avro data format. The unit test is based on Spark SQL and catalog clients, so it will go into the integration PR.
I'll add an encrypting version of TestAvroFileSplit to this PR.

Contributor Author

Of course, I'll verify where the file length comes from for the Avro reader.

@ggershinsky changed the title from "Deliver key metadata to parquet encryption" to "Deliver key metadata for encryption of data files" on Dec 26, 2023
}

@Override
public FileAppender<InternalRow> newAppender(EncryptedOutputFile file, FileFormat fileFormat) {
Contributor

Ah, I was expecting these changes in Spark 3.5. Feel free to move them there, or we can add them to Spark 3.4 and open follow up PRs.

Contributor Author

Sounds good, I'll move them to 3.5.

}
}

private InputFile getSourceFile(EncryptedInputFile encryptedFile) {
Contributor

Style: Iceberg method names should not include `get`. Instead, use a more helpful verb like `create` or `find`, or simply leave it out.

    String tableKeyId,
    int dataKeyLength,
    KeyManagementClient kmsClient,
    boolean nativeDataEncryption) {
Contributor

@rdblue Jan 3, 2024

I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type.
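The point here is that native (footer-level) encryption is a property of each file's format, not of the manager as a whole, so it cannot be a single constructor flag. A minimal sketch of deciding per file format instead (hypothetical names, not Iceberg's actual API):

```java
// Hypothetical sketch: the decision whether a file decrypts itself natively
// (Parquet modular encryption) or needs AES GCM stream wrapping (Avro) is
// made per file format, not per encryption manager. Names are illustrative.
public class FormatAwareDecryption {
  enum FileFormat { PARQUET, AVRO, ORC }

  // Parquet carries its own encryption metadata in the file footer, so the
  // reader should receive the raw source stream; other formats rely on
  // whole-file AES GCM stream wrapping, applied by the encryption manager.
  static boolean usesNativeEncryption(FileFormat format) {
    return format == FileFormat.PARQUET;
  }

  public static void main(String[] args) {
    for (FileFormat f : FileFormat.values()) {
      System.out.println(f + " native=" + usesNativeEncryption(f));
    }
  }
}
```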

public Iterable<InputFile> decrypt(Iterable<EncryptedInputFile> encrypted) {
  // Bulk decrypt is only applied to data files. Returning source input files for parquet.
  if (nativeDataEncryption) {
    return Iterables.transform(encrypted, this::decrypt);
Contributor

There is no way to know the intended file format or whether it will use native encryption at this point. I think this needs to always return the result of decrypt.

@rdblue
Contributor

rdblue commented Jan 3, 2024

@ggershinsky, I'm still working through this review, so don't feel like you need to address or respond to my comments yet! Also, here are a few notes for myself when I pick this up tomorrow:

  • How do we ensure that all files that are committed to a table are encrypted? I think we should have a validation in `FastAppend` and `MergingSnapshotProducer` that verifies the encryption key metadata is non-null if the standard encryption manager is used.
  • There are places like `BaseTaskWriter.openCurrent` that also call `EncryptedOutputFile.encryptingOutputFile()` to get the location. We should look into whether those should use the underlying file path or need to be updated.
  • I think the way to solve the problem with `StandardEncryptionManager.decrypt` is to make `StandardDecryptedInputFile` implement both `InputFile` (that runs AES GCM decryption) and `EncryptedInputFile` (to provide access to key metadata and the encrypted underlying `InputFile`). Then the read path continues to pass around `InputFile`, but can check whether the file can be used via `EncryptedInputFile` for native decryption.
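The dual-interface idea in the last note can be sketched as a class implementing both views. The interface shapes below are simplified stand-ins, not Iceberg's actual `InputFile`/`EncryptedInputFile` definitions:

```java
// Hypothetical sketch: a "decrypted" file view that still exposes its
// encrypted source and key metadata, so native readers can bypass the
// stream-level decryption. Names are simplified stand-ins.
interface InputFile {
  String location();
}

interface EncryptedInputFile {
  InputFile encryptedInputFile();
  byte[] keyMetadata();
}

class StandardDecryptedInputFile implements InputFile, EncryptedInputFile {
  private final InputFile source;
  private final byte[] keyMetadata;

  StandardDecryptedInputFile(InputFile source, byte[] keyMetadata) {
    this.source = source;
    this.keyMetadata = keyMetadata;
  }

  @Override
  public String location() {
    // Reads through this view would apply AES GCM stream decryption.
    return source.location();
  }

  @Override
  public InputFile encryptedInputFile() {
    // Native readers (Parquet) can go straight to the raw encrypted bytes.
    return source;
  }

  @Override
  public byte[] keyMetadata() {
    return keyMetadata;
  }
}
```

The read path keeps passing `InputFile` around, and a format-aware reader downcasts (`file instanceof EncryptedInputFile`) when it wants the native path.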

import org.apache.iceberg.types.Types;

-class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
+public class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
Contributor

We can't make this class public because it implements an Avro interface. If it needs to be public, we will have to extract StandardKeyMetadata as an interface.
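One way to do the suggested extraction, sketched with illustrative names: a public, Avro-free interface, with the Avro `IndexedRecord` machinery kept on a non-public implementation.

```java
// Hypothetical sketch of extracting an Avro-free public surface.
// The real class would also implement org.apache.avro.generic.IndexedRecord
// on the package-private side; names here are illustrative.
interface KeyMetadata {
  java.nio.ByteBuffer encryptionKey();
}

class StandardKeyMetadataImpl implements KeyMetadata /*, IndexedRecord */ {
  private final java.nio.ByteBuffer key;

  StandardKeyMetadataImpl(byte[] key) {
    this.key = java.nio.ByteBuffer.wrap(key);
  }

  @Override
  public java.nio.ByteBuffer encryptionKey() {
    // Hand out a read-only view so callers cannot mutate the key bytes.
    return key.asReadOnlyBuffer();
  }
}
```

Callers compile only against `KeyMetadata`, so the Avro dependency never leaks into the public API.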


protected CloseableIterable<ColumnarBatch> newBatchIterable(
    InputFile inputFile,
    ByteBuffer keyMetadata,
Contributor

Looks like these changes can't be made to the corresponding BaseBatchReader in iceberg-arrow because it is public. I think that means that Arrow cannot support reading from encrypted tables in this PR even if it calls the EncryptionManager correctly.

Add native subclasses for InputFile and OutputFile to simplify.
@github-actions github-actions bot added API and removed MR labels Jan 4, 2024
@rdblue rdblue merged commit 1288eb8 into apache:main Jan 4, 2024
@rdblue
Contributor

rdblue commented Jan 4, 2024

Merged! Thanks, @ggershinsky for getting this done!

@ggershinsky ggershinsky deleted the deliver-key-metadata2 branch January 5, 2024 07:36
geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
@stevenzwu stevenzwu moved this from In progress to Done in [Priority 2] Spec v3: Encryption Oct 27, 2025