Deliver key metadata for encryption of data files #9359
Conversation
Force-pushed 7df9db2 to 80ff9d5
The last round of comments in #6762 is addressed here.
```java
public static WriteBuilder write(EncryptedOutputFile file) {
  Preconditions.checkState(
      file.keyMetadata() == null || file.keyMetadata() == EncryptionKeyMetadata.EMPTY,
      "Currently, encryption of data files in Avro format is not supported");
```
Why is this not supported? The AES GCM stream could easily be used here.
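For context, a minimal sketch of the raw JDK AES/GCM primitive that AES GCM streams build on. This is an illustration only, not Iceberg's `AesGcmOutputStream` format (the real stream format splits a file into independently encrypted blocks with per-block IVs); all class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmSketch {
  // encrypt a block with AES/GCM; the 128-bit auth tag is appended to the ciphertext
  static byte[] seal(SecretKey key, byte[] iv, byte[] plain) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    return c.doFinal(plain);
  }

  // decrypt and authenticate; throws AEADBadTagException if the block was tampered with
  static byte[] open(SecretKey key, byte[] iv, byte[] sealed) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
    return c.doFinal(sealed);
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator gen = KeyGenerator.getInstance("AES");
    gen.init(128);
    SecretKey key = gen.generateKey();
    byte[] iv = new byte[12]; // 96-bit IV, the recommended size for GCM
    new SecureRandom().nextBytes(iv);

    byte[] sealed = seal(key, iv, "avro block".getBytes(StandardCharsets.UTF_8));
    byte[] plain = open(key, iv, sealed);
    System.out.println(new String(plain, StandardCharsets.UTF_8)); // prints "avro block"
  }
}
```

Because GCM is authenticated, wrapping Avro file bytes this way protects integrity as well as confidentiality, which is why no Avro-specific changes are needed.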
We should also run TestAvroFileSplit on Avro inside of AES GCM streams
SGTM
How nice. I expected Avro table encryption to work directly with AES GCM Streams, though not without some hiccups and fixes, since I had never run this use case before. It turns out it just works out of the box.
I now have a functioning e2e unit test that encrypts/decrypts an Iceberg table in the Avro data format. The unit test is based on Spark SQL and catalog clients, so it will go into the integration PR.
I'll add an encrypting version of TestAvroFileSplit to this PR.
Of course, I'll also verify where the file length comes from for the Avro reader.
```java
}

@Override
public FileAppender<InternalRow> newAppender(EncryptedOutputFile file, FileFormat fileFormat) {
```
Ah, I was expecting these changes in Spark 3.5. Feel free to move them there, or we can add them to Spark 3.4 and open follow up PRs.
Sounds good, I'll move them to 3.5.
```java
  }
}

private InputFile getSourceFile(EncryptedInputFile encryptedFile) {
```
Style: Iceberg method names should not include `get`. Instead, use a more helpful verb like `create` or `find`, or simply leave it out (e.g. `sourceFile`).
```java
    String tableKeyId,
    int dataKeyLength,
    KeyManagementClient kmsClient,
    boolean nativeDataEncryption) {
```
I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type.
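To illustrate the point, a hypothetical stand-in (not Iceberg's API; all names here are invented): the native-vs-stream decision is a function of each file's format, so a single constructor-level boolean cannot capture it.

```java
// Hypothetical sketch: the choice depends on the format of the individual file.
enum FileFormat { PARQUET, AVRO }

class EncryptionDispatchSketch {
  // Parquet carries its own native (modular) encryption; under the assumption in
  // this review thread, other formats such as Avro go through AES GCM streams.
  static boolean usesNativeEncryption(FileFormat format) {
    return format == FileFormat.PARQUET;
  }
}
```

A manager-wide flag would force every file in the table to take the same path, which breaks as soon as a table mixes Parquet and Avro data files.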
```diff
 public Iterable<InputFile> decrypt(Iterable<EncryptedInputFile> encrypted) {
   // Bulk decrypt is only applied to data files. Returning source input files for parquet.
-  return Iterables.transform(encrypted, this::decrypt);
+  if (nativeDataEncryption) {
```
There is no way to know the intended file format or whether it will use native encryption at this point. I think this needs to always return the result of decrypt.
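A generic stand-in sketch of that shape (hypothetical names, not Iceberg code): the bulk path simply maps every file through the single-file decrypt, and any native-vs-stream choice lives inside the per-file function, where the file's own key metadata is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BulkDecryptSketch {
  // always delegate to decryptOne; no manager-wide format flag is consulted here
  static <I, O> List<O> decryptAll(Iterable<I> encrypted, Function<I, O> decryptOne) {
    List<O> out = new ArrayList<>();
    for (I file : encrypted) {
      out.add(decryptOne.apply(file));
    }
    return out;
  }
}
```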
@ggershinsky, I'm still working through this review, so don't feel like you need to address or respond to my comments yet! Also, here are a few notes for myself when I pick this up tomorrow:
```diff
 import org.apache.iceberg.types.Types;

-class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
+public class StandardKeyMetadata implements EncryptionKeyMetadata, IndexedRecord {
```
We can't make this class public because it implements an Avro interface. If it needs to be public, we will have to extract StandardKeyMetadata as an interface.
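One possible shape for that extraction, as a hedged sketch (the interface and class names below are hypothetical, not the PR's code): expose a small public interface and keep the Avro-coupled implementation package-private, so `IndexedRecord` never leaks into the public API.

```java
import java.nio.ByteBuffer;

// Hypothetical public-facing interface for key metadata.
interface KeyMetadataView {
  ByteBuffer encryptionKey();
}

// The package-private implementation would also implement Avro's IndexedRecord
// internally; callers only ever see KeyMetadataView.
class StandardKeyMetadataImpl implements KeyMetadataView {
  private final ByteBuffer key;

  StandardKeyMetadataImpl(ByteBuffer key) {
    this.key = key;
  }

  @Override
  public ByteBuffer encryptionKey() {
    return key;
  }
}
```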
```java
protected CloseableIterable<ColumnarBatch> newBatchIterable(
    InputFile inputFile,
    ByteBuffer keyMetadata,
```
Looks like these changes can't be made to the corresponding BaseBatchReader in iceberg-arrow because it is public. I think that means that Arrow cannot support reading from encrypted tables in this PR even if it calls the EncryptionManager correctly.
Add native subclasses for InputFile and OutputFile to simplify.
Merged! Thanks, @ggershinsky for getting this done!
Moving #6762 to main branch base