Core: add key_metadata in ManifestFile #2520

jackye1995 · 2021-04-26T21:22:26Z

This PR adds key_metadata to the ManifestFile API and updates all the read and write methods that directly reference the constructor, up to ManifestFiles, to use EncryptionManager.

To avoid touching too many files in different engines in a single PR, there will be subsequent PRs to update each engine and the metadata tables to support encrypting manifest files.

@yyanyy @rdblue @RussellSpitzer @ggershinsky @flyrain

RussellSpitzer · 2021-04-27T02:16:27Z

core/src/main/java/org/apache/iceberg/GenericManifestFile.java

        .add("deleted_data_files_count", deletedFilesCount)
        .add("deleted_rows_count", deletedRowsCount)
        .add("partitions", partitions)
+        .add("key_metadata", keyMetadata == null ? "null" : "(redacted)")


Is this a place where we are leaking information? Sorry I'm still a crypto beginner, is it ok for us to reveal that a file is encrypted? I guess that's obvious if you can get the file ....?

yeah that's a good catch, I am also not sure if this is considered fine, I just followed what the BaseFile did: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseFile.java#L454

I think this is fine. The key metadata could be a 0-length buffer set by an encryption manager if someone cares about this. I think it's reasonable to show that it is null when using the plain text manager.
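
To illustrate the point about what the string representation reveals, here is a minimal sketch. The class and method names are hypothetical and only mirror the expression in the diff above; this is not the GenericManifestFile code itself.

```java
import java.nio.ByteBuffer;

public class RedactedKeyMetadataSketch {
  // Mirrors the ternary from the diff: only null vs. non-null is visible, never the bytes.
  static String describe(ByteBuffer keyMetadata) {
    return "key_metadata=" + (keyMetadata == null ? "null" : "(redacted)");
  }

  public static void main(String[] args) {
    // Plain text manager: no key metadata at all.
    System.out.println(describe(null));
    // A manager that wants to hide whether a file is encrypted could always set a 0-length buffer.
    System.out.println(describe(ByteBuffer.allocate(0)));
    // A populated buffer prints the same as the 0-length one.
    System.out.println(describe(ByteBuffer.wrap(new byte[] {1, 2, 3})));
  }
}
```

With this shape, a 0-length buffer and a populated buffer are indistinguishable in the output, which is why the null case is the only thing the string can leak.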

RussellSpitzer

The change from OutputFile to EncryptedOutputFile everywhere looks a little odd to me because I know sometimes our output will not be encrypted even though we use that class. I'm not sure if this is required but it just seems a bit odd to me.

My big request for this PR though is to add a few tests to make sure that the serialization/deserialization of GenericManifestFile works. I think you can do this without adding too much more, just to make sure that both a null and a populated column can be correctly read from a manifest file.

core/src/main/java/org/apache/iceberg/ManifestFiles.java

jackye1995 · 2021-04-27T17:53:59Z

The change from OutputFile to EncryptedOutputFile everywhere looks a little odd to me because I know sometimes our output will not be encrypted even though we use that class.

@RussellSpitzer I think this pattern is also used currently by data files, where all files are encrypted before being written and go through the OutputFileFactory to get the output file instances:

https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L106-L118

And we are not encrypting the file only because we are using the PlainTextEncryptionManager. The change here is to try achieving the same effect.

core/src/main/java/org/apache/iceberg/GenericManifestFile.java

core/src/main/java/org/apache/iceberg/encryption/EncryptionManagers.java

flyrain · 2021-04-28T22:31:59Z

api/src/main/java/org/apache/iceberg/ManifestFile.java

+  /**
+   * Returns metadata about how this manifest file is encrypted, or null if the file is stored in plain text.
+   */
+  default ByteBuffer keyMetadata() {


Does keyMetadata have substructure or is it a pure binary buffer? From the description, it looks like it will have substructure. Are we going to define it later or in this patch?

I was wondering about this too. Should we make this a struct with a name like encryptionContext, so that if we plan to add new things in the future (e.g. a KEK id for double wrapping?), we can collect them in a single struct? To work around the problem of having to unwrap two layers to reach this buffer from ManifestFile, we could return EncryptionKeyMetadata here and potentially extend EncryptionKeyMetadata with more fields when needed. Or will this binary buffer be free-form, able to contain whatever information is needed as long as the right encryption manager is used?

The intent here is that an encryption manager can decide what the key metadata holds. It could be an encrypted key or it could be a key reference. There are lots of possibilities and we did it this way to not constrain what the encryption manager can choose to do.

+1 to a byte array, serialized by an encryption manager from its structs. Btw, besides encryption keys, we have the AAD prefixes. We can keep them inside the key metadata (because it is convenient and flexible) - or we can add a separate manifest field/column for them (because technically AADs are not used for key retrieval). In both cases, the decision can be made later, when we get to handle end-to-end table integrity.
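
As a rough sketch of "a byte array, serialized by an encryption manager from its structs": the struct, its field names, and the byte layout below are all assumptions made for illustration, not anything defined by Iceberg or by this PR. The table format only sees the opaque buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyMetadataSketch {
  // Hypothetical struct owned by one encryption manager; not an Iceberg class.
  static final class WrappedKey {
    final String kekId;
    final byte[] wrappedDek;

    WrappedKey(String kekId, byte[] wrappedDek) {
      this.kekId = kekId;
      this.wrappedDek = wrappedDek.clone();
    }

    // Assumed layout: [int kekIdLength][kekId as UTF-8][wrapped DEK bytes].
    ByteBuffer toBuffer() {
      byte[] id = kekId.getBytes(StandardCharsets.UTF_8);
      ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + id.length + wrappedDek.length);
      buf.putInt(id.length).put(id).put(wrappedDek);
      buf.flip();
      return buf;
    }

    // Only the manager that wrote the buffer knows how to parse it.
    static WrappedKey fromBuffer(ByteBuffer keyMetadata) {
      ByteBuffer buf = keyMetadata.duplicate(); // leave the caller's position untouched
      byte[] id = new byte[buf.getInt()];
      buf.get(id);
      byte[] dek = new byte[buf.remaining()];
      buf.get(dek);
      return new WrappedKey(new String(id, StandardCharsets.UTF_8), dek);
    }
  }

  public static void main(String[] args) {
    ByteBuffer keyMetadata = new WrappedKey("kek-1", new byte[] {1, 2, 3}).toBuffer();
    WrappedKey parsed = WrappedKey.fromBuffer(keyMetadata);
    System.out.println(parsed.kekId + " " + Arrays.toString(parsed.wrappedDek));
  }
}
```

A different manager could store a key reference, or AAD prefixes, in the same buffer without any change to the manifest schema.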

flyrain · 2021-04-28T23:15:02Z

The change from OutputFile to EncryptedOutputFile looks weird to me as well. Logic is OK to me though. Just brainstorming ideas.
Option 1. We can rename EncryptedOutputFile to something else, like GenericOutputFile, which doesn’t confuse people from a naming perspective.
Option 2. Have a new interface (e.g. GenericOutputFile) inherited by both EncryptedOutputFile and OutputFile. The interface can be empty.

rdblue · 2021-05-02T21:52:47Z

core/src/main/java/org/apache/iceberg/ManifestFiles.java

   * @return a {@link ManifestReader}
   */
  public static ManifestReader<DataFile> read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) {
+    return read(manifest, io, EncryptionManagers.defaultManager(), specsById);


I'd like to have compatibility with the bulk-decryption and bulk-encryption methods in the encryption manager. For bulk decryption, the decrypt method is passed Iterable<EncryptedInputFile> and returns Iterable<InputFile>. Then those input files need to be used.

I think that means we should refactor this a bit differently and create a read(ManifestFile, InputFile, Map) method that both this and the version with EncryptionManager call. Then this one could simply call io.newInputFile(manifest.path()) and hand off to the InputFile reader. That avoids using the default manager where it isn't needed and allows us to use the bulk methods.
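
A minimal sketch of the refactoring described above, using simplified stand-in types. The classes here (InputFile, Manifest, EncryptedInputFile, EncryptionManager, and the method decryptAll) are assumptions for illustration and do not match the real Iceberg signatures; the "reader" is just a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ManifestReadSketch {
  // Simplified stand-ins for illustration; the real Iceberg types differ.
  static final class InputFile {
    final String location;
    InputFile(String location) { this.location = location; }
  }

  static final class Manifest {
    final String path;
    final byte[] keyMetadata; // null when the manifest is stored in plain text
    Manifest(String path, byte[] keyMetadata) { this.path = path; this.keyMetadata = keyMetadata; }
  }

  static final class EncryptedInputFile {
    final InputFile file;
    final byte[] keyMetadata;
    EncryptedInputFile(InputFile file, byte[] keyMetadata) { this.file = file; this.keyMetadata = keyMetadata; }
  }

  interface EncryptionManager {
    InputFile decrypt(EncryptedInputFile encrypted);

    // Bulk variant: a KMS-backed manager could override this to fetch all keys at once.
    default List<InputFile> decryptAll(List<EncryptedInputFile> encrypted) {
      List<InputFile> result = new ArrayList<>();
      for (EncryptedInputFile file : encrypted) {
        result.add(decrypt(file));
      }
      return result;
    }
  }

  // Core reader: every other entry point hands off to this one.
  static String read(Manifest manifest, InputFile file) {
    return "reader(" + manifest.path + " via " + file.location + ")";
  }

  // Plain-text entry point: no encryption manager is involved.
  static String read(Manifest manifest) {
    return read(manifest, new InputFile(manifest.path));
  }

  // Single-file encrypted entry point.
  static String read(Manifest manifest, EncryptionManager encryption) {
    InputFile raw = new InputFile(manifest.path);
    return read(manifest, encryption.decrypt(new EncryptedInputFile(raw, manifest.keyMetadata)));
  }

  // Bulk entry point: decrypt everything in one call, then reuse the core reader.
  static List<String> readAll(List<Manifest> manifests, EncryptionManager encryption) {
    List<EncryptedInputFile> encrypted = new ArrayList<>();
    for (Manifest manifest : manifests) {
      encrypted.add(new EncryptedInputFile(new InputFile(manifest.path), manifest.keyMetadata));
    }
    List<InputFile> decrypted = encryption.decryptAll(encrypted);
    List<String> readers = new ArrayList<>();
    for (int i = 0; i < manifests.size(); i++) {
      readers.add(read(manifests.get(i), decrypted.get(i)));
    }
    return readers;
  }

  public static void main(String[] args) {
    EncryptionManager passThrough = encrypted -> encrypted.file;
    List<Manifest> manifests =
        Arrays.asList(new Manifest("m1.avro", null), new Manifest("m2.avro", new byte[] {1}));
    System.out.println(read(manifests.get(0)));
    System.out.println(readAll(manifests, passThrough));
  }
}
```

The design point is that the InputFile-based method is the only place that actually opens a manifest, so the plain-text path never needs a default manager and the scan-planning path is free to use the bulk call.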

rdblue · 2021-05-02T21:55:33Z

core/src/main/java/org/apache/iceberg/encryption/EncryptionManagers.java

+
+package org.apache.iceberg.encryption;
+
+public class EncryptionManagers {


Encryption managers are plugged in through tables, so I don't think it is a good idea to have a global default. That doesn't seem to make sense with how they are used.

jackye1995 · 2021-05-14T18:39:46Z

Sorry I was a bit distracted and did not update this PR for a while.

@rdblue thanks for the suggestion for refactoring. I actually thought about the approach you suggested, and my major concern is that we would be creating 2 different code paths for using or not using encryption. On the other hand, for data file encryption, Iceberg currently has a single unified code path that always uses encryption, with the plaintext manager as the default encryption manager. I think it would be good to keep things consistent.

I completely agree that the use of a global default encryption manager class is not necessary, and I will remove that. What I hope to achieve is that the method ManifestFiles.read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) is no longer used and is deprecated, and that all callers use the method with encryption and supply table.encryption() as the encryption manager. I was trying to avoid too many single-line changes in this PR, but that left my final intention unclear, so I will update based on that and we can reevaluate the situation.

Regarding the concern from @flyrain and @RussellSpitzer, I think I can make it better by directly passing in EncryptedOutputFile.encryptingOutputFile() to minimize the interface change. I will notify you once I update.

jackye1995 · 2021-05-14T22:50:18Z

@rdblue @flyrain @RussellSpitzer I have updated this PR with all the changes I expected to make in several PRs, it is a lot of files but I think it makes the discussion much easier to show them in a single place.

The key thing we need to determine is: should we separate the code paths for manifest read/write with and without encryption? My stance is that we should not separate them, because:

data file read/write does not separate it
managing things in a single code path is much easier in long run

With this assumption, the key principles I followed for the changes are:

1. for every method in ManifestFiles, create a new version that takes an encryption manager, and deprecate the old methods.
2. change the callers of the old methods all the way to the top, and deprecate any related public methods.
3. use EncryptionManagers.plaintext() for all those deprecated methods so that we don't create a lot of redundant plain text managers.

Regarding the concern brought up by @RussellSpitzer and @flyrain , my current take is that we have to pass EncryptedOutputFile all the way down to the place where the encryptedOutputFile.keyMetadata() is written by the engine to the actual manifest list. This does make the naming a bit awkward as @flyrain suggested, but I think this is the same strategy taken by the data file write path, where we can notice that the FileAppenderFactory also have those methods containing EncryptedOutputFile for exactly the same purpose.

Regarding EncryptionManagers that @rdblue talked about removing, as I said in point 3 before, now I am treating it as the static factory to get a plain text manager singleton and that is only used for deprecated methods. I do expect to have a second usage of the factory, similar to LocationProviders, to add a method EncryptionManagers.load(...) to load a custom encryption manager and allow a Catalog to initialize a custom encryption manager for a TableOperations.

jackye1995 · 2021-05-15T02:18:21Z

restart test

rdblue · 2021-05-20T00:21:16Z

@jackye1995, I'm not sure it makes sense to use the plain text manager as you're suggesting. I know it mimics the choice we made for data files, but data files are a bit cleaner because they're not using similar methods to open the files. The main concern that I have is the ability to use the bulk decryption methods from the manager so planning a scan doesn't necessarily incur repeated RPCs to a key manager. As long as that is possible, then I'm flexible on the exact API here.

jackye1995 · 2021-05-25T18:33:59Z

The main concern that I have is the ability to use the bulk decryption methods from the manager so planning a scan doesn't necessarily incur repeated RPCs to a key manager

I suppose you are describing something similar to what is done for data files like here:

https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java#L74-L75

I wonder if there are any real use cases where this approach actually saves RPC calls to a KMS.

Because in the envelope encryption schemes we discussed:

if it is a single wrap system, then for each file we need to call KMS once to get a new data encryption key (DEK) based on the same key encryption key (KEK), so we always have N KMS calls for N files. (unless the KMS has a batch DEK generation API, which is not very common as far as I know)
if it is a double wrap system, then the first call gets a master encryption key (MEK) to generate a KEK, and all DEKs are generated locally based on that cached KEK. So we always have 1 KMS call for N files.

So the number of calls to KMS really depends on the algorithm, not that much on the encryption manager API, unless the KMS allows batch operations or it is a completely different encryption scheme. And the KMS would be embedded in the encryption manager implementation and perform the necessary caching of keys behind the scenes.

With that being said, I completely agree that we should use batch operations whenever possible. I will try to find the best way to update the manifest read and write path to be able to leverage batch operations. I will provide an updated PR tonight; meanwhile, please let me know what you think about the comment above.
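
The call counts argued above can be sketched with a fake KMS that only counts round trips. Everything here is hypothetical: CountingKms is not a real client, key wrapping itself is omitted, and the model assumes no batch API.

```java
import java.security.SecureRandom;

public class WrapModeSketch {
  // Hypothetical KMS client: each generateKey() stands for one network round trip.
  static final class CountingKms {
    private final SecureRandom random = new SecureRandom();
    int calls = 0;

    byte[] generateKey() {
      calls++;
      byte[] key = new byte[16];
      random.nextBytes(key);
      return key;
    }
  }

  // Single wrap: the DEK for every file comes from the KMS, so N files cost N calls.
  static int singleWrapCalls(int files) {
    CountingKms kms = new CountingKms();
    for (int i = 0; i < files; i++) {
      kms.generateKey();
    }
    return kms.calls;
  }

  // Double wrap: one KMS call yields the KEK, then every DEK is generated locally.
  static int doubleWrapCalls(int files) {
    CountingKms kms = new CountingKms();
    SecureRandom local = new SecureRandom();
    byte[] kek = kms.generateKey(); // the only round trip; the KEK is cached afterwards
    for (int i = 0; i < files; i++) {
      byte[] dek = new byte[kek.length];
      local.nextBytes(dek); // a real implementation would wrap dek with kek here
    }
    return kms.calls;
  }

  public static void main(String[] args) {
    System.out.println("single wrap, 100 files: " + singleWrapCalls(100) + " KMS calls");
    System.out.println("double wrap, 100 files: " + doubleWrapCalls(100) + " KMS calls");
  }
}
```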

rdblue · 2021-05-25T22:30:03Z

The situation where we want to use the bulk API is whenever we are going to make a call to the KMS per file. If we can batch up the files then we can make a single call to get all of the keys at once. You may be right that the strategies that you want to build don't require it, but the plugin system is generic enough that we don't necessarily know that batching is unnecessary.

ggershinsky · 2021-05-27T07:02:58Z

Double wrapping will be an Iceberg mode, always under our control, an assured way to avoid per-file KMS calls.
But certainly, if a KMS supports batch calls (some do; not all), this should be leveraged by Iceberg in either single or double wrapping mode. This might require an explicit addition in the Iceberg "kms client" plug-in interface (aka "key_provider"), because a typical per-key interface wouldn't know when the batch request is "full" and should be sent to the KMS server.
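
One possible shape for such an explicit addition, as a sketch only: the interface name KmsClient and the methods unwrapKey and unwrapKeys are assumptions, not the Iceberg plug-in interface, and "unwrapping" is faked by copying bytes. The caller passes the whole batch, so the client never has to guess when the batch is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchKmsSketch {
  // Hypothetical plug-in interface; the names are assumptions, not an Iceberg API.
  interface KmsClient {
    byte[] unwrapKey(byte[] wrappedKey);

    // Explicit batch method. The default falls back to one KMS call per key.
    default List<byte[]> unwrapKeys(List<byte[]> wrappedKeys) {
      List<byte[]> keys = new ArrayList<>();
      for (byte[] wrapped : wrappedKeys) {
        keys.add(unwrapKey(wrapped));
      }
      return keys;
    }
  }

  // Fake client for a KMS without batch support; counts round trips.
  static class PerKeyClient implements KmsClient {
    int roundTrips = 0;

    @Override
    public byte[] unwrapKey(byte[] wrappedKey) {
      roundTrips++;
      return wrappedKey.clone(); // placeholder for real unwrapping
    }
  }

  // Fake client for a KMS with batch support: one request carries the whole batch.
  static class BatchClient extends PerKeyClient {
    @Override
    public List<byte[]> unwrapKeys(List<byte[]> wrappedKeys) {
      roundTrips++;
      List<byte[]> keys = new ArrayList<>();
      for (byte[] wrapped : wrappedKeys) {
        keys.add(wrapped.clone()); // placeholder for real unwrapping
      }
      return keys;
    }
  }

  public static void main(String[] args) {
    List<byte[]> wrapped = Arrays.asList(new byte[] {1}, new byte[] {2}, new byte[] {3});
    PerKeyClient perKey = new PerKeyClient();
    perKey.unwrapKeys(wrapped);
    BatchClient batch = new BatchClient();
    batch.unwrapKeys(wrapped);
    System.out.println("per-key: " + perKey.roundTrips + ", batch: " + batch.roundTrips);
  }
}
```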

rdblue · 2021-05-31T22:41:08Z

api/src/main/java/org/apache/iceberg/ManifestFile.java

      "Summary for each partition");
-  // next ID to assign: 519
+  Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
+          "Encryption key metadata blob");


Added this to the spec in #2654.

flyrain · 2021-06-02T00:09:51Z

core/src/main/java/org/apache/iceberg/ManifestFiles.java

-    InputFile file = io.newInputFile(manifest.path());
+
+    EncryptedInputFile encryptedFile = EncryptedFiles.encryptedInput(
+            io.newInputFile(manifest.path()), manifest.keyMetadata());


A nit: 4 spaces indentation rather than 8 spaces.

flyrain · 2021-06-02T00:10:07Z

core/src/main/java/org/apache/iceberg/ManifestFiles.java

+  public static ManifestWriter<DataFile> write(int formatVersion, PartitionSpec spec, OutputFile outputFile,
+                                               Long snapshotId) {
+    return write(formatVersion, spec, EncryptedFiles.encryptedOutput(outputFile,
+            EncryptionKeyMetadata.EMPTY), snapshotId);


A nit: 4 spaces indentation rather than 8 spaces.

flyrain · 2021-06-02T00:10:58Z

core/src/main/java/org/apache/iceberg/ManifestFiles.java

+  public static ManifestWriter<DeleteFile> writeDeleteManifest(int formatVersion, PartitionSpec spec,
+                                                               OutputFile outputFile, Long snapshotId) {
+    return writeDeleteManifest(formatVersion, spec,
+            EncryptedFiles.encryptedOutput(outputFile, EncryptionKeyMetadata.EMPTY), snapshotId);


A nit: 4 spaces indentation rather than 8 spaces.

flyrain · 2021-06-02T00:25:49Z

core/src/main/java/org/apache/iceberg/StaticTableOperations.java


  /**
-   * Creates a StaticTableOperations tied to a specific static version of the TableMetadata
+   * @deprecated please use {@link #StaticTableOperations(String, FileIO, EncryptionManager)}


Do we need to deprecate it? Providing a method with a plain text EncryptionManager as the default value looks harmless.

flyrain · 2021-06-02T00:36:53Z

Regarding the concern brought up by @RussellSpitzer and @flyrain , my current take is that we have to pass EncryptedOutputFile all the way down to the place where the encryptedOutputFile.keyMetadata() is written by the engine to the actual manifest list. This does make the naming a bit awkward as @flyrain suggested, but I think this is the same strategy taken by the data file write path, where we can notice that the FileAppenderFactory also have those methods containing EncryptedOutputFile for exactly the same purpose.

I don't have a strong opinion on this. The logic looks good to me. But a more descriptive name provides benefits:

It doesn't confuse readers of the code: they won't mistake a plain text file for an encrypted file.
Better naming usually means better abstraction and is more future-proof. It can sometimes be over-engineering, though that is less likely, and it needs to be analyzed case by case.

jackye1995 · 2021-06-04T07:24:02Z

@flyrain thanks for the review! I am marking this as a draft. I spent a few days figuring out how to go forward with all the changes needed in the ManifestFiles API, and I think I have a plan. I will split this PR into a few different ones, mainly:

adding a keyMetadata byte array to the ManifestFile interface and implementation classes.
adding support in EncryptionManager to decrypt and encrypt by file type.
changes to use the encryption manager to supply key metadata information down to the manifest writer and reader. This will be split into multiple PRs for core and engine-related changes.

flyrain · 2021-06-04T22:15:33Z

@jackye1995 thanks for the update, looking forward to your new PRs.

github-actions · 2024-07-17T00:13:14Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-07-24T00:13:55Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot added API core flink labels Apr 26, 2021

RussellSpitzer reviewed Apr 27, 2021

core/src/main/java/org/apache/iceberg/GenericManifestFile.java (outdated)

core/src/main/java/org/apache/iceberg/GenericManifestFile.java

core/src/main/java/org/apache/iceberg/ManifestFiles.java (outdated)

flyrain reviewed Apr 28, 2021

core/src/main/java/org/apache/iceberg/GenericManifestFile.java

core/src/main/java/org/apache/iceberg/encryption/EncryptionManagers.java (outdated)

rdblue reviewed May 2, 2021

github-actions bot added hive spark AWS MR labels May 14, 2021

Jack Ye added 3 commits May 14, 2021 15:34

Core: add key_metadata in ManifestFile

f332a24

add missing changes

0cfe03b

fix checkstyle

a1dd9c8

add missing changes

8da6fd4

jackye1995 closed this May 15, 2021

jackye1995 reopened this May 15, 2021

rdblue mentioned this pull request May 31, 2021

Update spec for v2 changes #2654

Merged

rdblue reviewed May 31, 2021

flyrain reviewed Jun 2, 2021

jackye1995 marked this pull request as draft June 4, 2021 07:24

jackye1995 mentioned this pull request Jun 4, 2021

Core: add key_metadata to ManifestFile spec #2675

Merged

github-actions bot added the stale label Jul 17, 2024

github-actions bot closed this Jul 24, 2024

