Skip to content

Conversation

@jackye1995
Copy link
Contributor

This PR adds key_metadata to the ManifestFile API, and updates all the read and write methods that directly references the constructor up to ManifestFiles to use EncryptionManager.

To avoid touching too many files in different engines in a single PR, there will be subsequent PRs to update each engine and the metadata metadata tables to support encrypting manifest files.

@yyanyy @rdblue @RussellSpitzer @ggershinsky @flyrain

.add("deleted_data_files_count", deletedFilesCount)
.add("deleted_rows_count", deletedRowsCount)
.add("partitions", partitions)
.add("key_metadata", keyMetadata == null ? "null" : "(redacted)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a place where we are leaking information? Sorry I'm still a crypto beginner, is it ok for us to reveal that a file is encrypted? I guess that's obvious if you can get the file ....?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's a good catch, I am also not sure if this is considered fine, I just followed what the BaseFile did: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseFile.java#L454

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. The key metadata could be a 0-length buffer set an encryption manager if someone cares about this. I think it's reasonable to show that it is null when using the plain text manager.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from OutputFile to EncryptedOutputFile everywhere looks a little odd to me because I know sometimes our output will not be encrypted even though we use that class. I'm not sure if this is required but it just seems a bit odd to me.

My big request for this PR though is to add a few tests to just make sure that the serialization/deserialization of GenericManifestFile works. I think you can do this without adding too much more just to make sure that both a null and populated column can be correctly from a manifest file.

@jackye1995
Copy link
Contributor Author

The change from OutputFile to EncryptedOutputFile everywhere looks a little odd to me because I know sometimes our output will not be encrypted even though we use that class.

@RussellSpitzer I think this pattern is also used currently by data files, where all files are encrypted before written and go through the OutputFileFactory to get the output file instances:

https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L106-L118

And we are not encrypting the file only because we are using the PlainTextEncryptionManager. The change here is to try achieving the same effect.

/**
* Returns metadata about how this manifest file is encrypted, or null if the file is stored in plain text.
*/
default ByteBuffer keyMetadata() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does keyMetadata have substructure or is it a pure binary buffer? Looks like it will have substructures form the description. Are we going to to define it later or in this patch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about this too, should we make this a struct with name like encryptionContext or something so that if we only plan to add new things in future (e.g. KEK id for double wrapping?), we can collect them in a single struct; and to workaround the problem of having to unwrap two layers to reach this buffer from ManifestFile we may return EncryptionKeyMetadata here, and potentially extend EncryptionKeyMetadata to have more fields when needed in the future. Or will this binary buffer free formed and could contain whatever information needed if the right encryption manager is used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is that an encryption manager can decide what the key metadata holds. It could be an encrypted key or it could be a key reference. There are lots of possibilities and we did it this way to not constrain what the encryption manager can choose to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to a byte array, serialized by an encryption manager from its structs. Btw, besides encryption keys, we have the AAD prefixes. We can keep them inside the key metadata (because it is convenient and flexible) - or we can add a separate manifest field/column for them (because technically AADs are not used for key retrieval). In both cases, the decision can be made later, when we get to handle end-to-end table integrity.

@flyrain
Copy link
Contributor

flyrain commented Apr 28, 2021

The change from OutputFile to EncryptedOutputFile looks weird to me as well. Logic is OK to me though. Just brainstorming ideas.
Option 1. We can rename EncryptedOutputFile something else, like GenericOutputFile, which doesn’t confuse people from name perspective.
Option 2. Have a new interface(e.g. GenericOutputFile) inherited by both EncryptedOutputFile and Output file. The interface can be empty.

* @return a {@link ManifestReader}
*/
public static ManifestReader<DataFile> read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) {
return read(manifest, io, EncryptionManagers.defaultManager(), specsById);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have compatibility with the bulk-decryption and bulk-encryption methods in the encryption manager. For bulk decryption, the decrypt method is passed Iterable<EncryptedInputFile> and returns Iterable<InputFile>. Then those input files need to be used.

I think that means we should refactor this a bit differently and create a read(ManifestFile, InputFile, Map) method that both this and the version with EncryptionManager call. Then this one could simply call io.newInputFile(manifest.path()) and hand off to the InputFile reader. That avoids using the default manager where it isn't needed and allows us to use the bulk methods.


package org.apache.iceberg.encryption;

public class EncryptionManagers {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encryption managers are plugged in through tables, so I don't think it is a good idea to have a global default. That doesn't seem to make sense with how they are used.

@jackye1995
Copy link
Contributor Author

Sorry I was a bit distracted and did not update this PR for a while.

@rdblue thanks for the suggestion for refactoring, I actually thought about the approach you considered, and my major concern is that we are creating 2 different code paths for using or not using encryption. On the other hand, for data file encryption, Iceberg is currently having a single unified code path for always using encryption, with the default encryption manager to be the plaintext manager. I think it would be good to keep things consistent.

I completely agree that the use of a global default encryption manager class is not necessary, and I will remove that. What I hope to achieve is that the method ManifestFiles.read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) is completely not used and deprecated, and all the callers should use the method with encryption, and supply table.encryption() as the input to the encryption manager. I was trying to avoid too many single line changes in this PR, but my final intention became not clear, so I will update based on that and we can reevaluate the situation.

For what @flyrain and @RussellSpitzer concern, I think I can make it better by directly passing in EncryptedOutputFIle.encryptingOutputFile() to minimize interface change, will notify you guys once I update.

@jackye1995
Copy link
Contributor Author

jackye1995 commented May 14, 2021

@rdblue @flyrain @RussellSpitzer I have updated this PR with all the changes I expected to make in several PRs, it is a lot of files but I think it makes the discussion much easier to show them in a single place.

The key thing we need to determine is: should we separate the code path for manifest read/write with and without encryption. My stand is that we should not separate it, because:

  1. data file read/write does not separate it
  2. managing things in a single code path is much easier in long run

With this assumption, the key principles I followed for the changes are:

  1. for every method in ManifestFiles, create a new version that takes encryption manager, and deprecate the old methods
  2. change the callers of old methods all the way to the top, and deprecate any related public methods.
  3. use EncryptionManagers.plaintext() for all those deprecated methods so that we don't create a lot of redundant plain text manager.

Regarding the concern brought up by @RussellSpitzer and @flyrain , my current take is that we have to pass EncryptedOutputFile all the way down to the place where the encryptedOutputFile.keyMetadata() is written by the engine to the actual manifest list. This does make the naming a bit awkward as @flyrain suggested, but I think this is the same strategy taken by the data file write path, where we can notice that the FileAppenderFactory also have those methods containing EncryptedOutputFile for exactly the same purpose.

Regarding EncryptionManagers that @rdblue talked about removing, as I said in point 3 before, now I am treating it as the static factory to get a plain text manager singleton and that is only used for deprecated methods. I do expect to have a second usage of the factory, similar to LocationProviders, to add a method EncryptionManagers.load(...) to load a custom encryption manager and allow a Catalog to initialize a custom encryption manager for a TableOperations.

@jackye1995
Copy link
Contributor Author

restart test

@jackye1995 jackye1995 closed this May 15, 2021
@jackye1995 jackye1995 reopened this May 15, 2021
@rdblue
Copy link
Contributor

rdblue commented May 20, 2021

@jackye1995, I'm not sure it makes sense to use the plain text manager as you're suggesting. I know it mimics the choice we made for data files, but data files are a bit cleaner because they're not using similar methods to open the files. The main concern that I have is the ability to use the bulk decryption methods from the manager so planning a scan doesn't necessarily incur repeated RPCs to a key manager. As long as that is possible, then I'm flexible on the exact API here.

@jackye1995
Copy link
Contributor Author

The main concern that I have is the ability to use the bulk decryption methods from the manager so planning a scan doesn't necessarily incur repeated RPCs to a key manager

I suppose you are describing something similar to what is done for data files like here:

https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java#L74-L75

I wonder if there are any real use cases where this approach actually saves RPC calls to a KMS.

Because in the envelope encryption schemes we discussed:

  1. if it is a single wrap system, then for each file we need to call KMS once to get a new data encryption key (DEK) based on the same key encryption key (KEK), so we always have N KMS calls for N files. (unless the KMS has a batch DEK generation API, which is not very common as far as I know)
  2. if it is a double wrap system, then the first call gets a master encryption key (MEK) to generate a KEK, and all DEKs are generated locally based on that cached KEK. So we always have 1 KMS call for N files.

So the number of calls to KMS really depends on the algorithm, not that much on the encryption manager API, unless the KMS allows batch operation or it is a completely different encryption scheme. And the KMS would be embedded in the encryption manager implementation and perform necessary caching of keys behind the scene.

With that being said, I completely agree that we should use batch operation whenever possible, I will try to find a best way to update the manifest read and write path to be able to leverage batch operations. I will provide an updated PR tonight, meanwhile please let me know what you think about the comment above.

@rdblue
Copy link
Contributor

rdblue commented May 25, 2021

The situation where we want to use the bulk API is whenever we are going to make a call to the KMS per file. If we can batch up the files then we can make a single call to get all of the keys at once. You may be right that the strategies that you want to build don't require it, but the plugin system is generic enough that we don't necessarily know that batching is unnecessary.

@ggershinsky
Copy link
Contributor

Double wrapping will be an Iceberg mode, always under our control, an assured way to avoid per-file KMS calls.
But certainly, if a KMS supports batch calls (some do; not all), this should be leveraged by Iceberg in either single or double wrapping mode. This might require an explicit addition in the Iceberg "kms client" plug-in interface (aka "key_provider"), because a typical per-key interface wouldn't know when the batch request is "full" and should be sent to the KMS server.

"Summary for each partition");
// next ID to assign: 519
Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
"Encryption key metadata blob");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the spec in #2654.

InputFile file = io.newInputFile(manifest.path());

EncryptedInputFile encryptedFile = EncryptedFiles.encryptedInput(
io.newInputFile(manifest.path()), manifest.keyMetadata());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: 4 spaces indentation rather than 8 spaces.

public static ManifestWriter<DataFile> write(int formatVersion, PartitionSpec spec, OutputFile outputFile,
Long snapshotId) {
return write(formatVersion, spec, EncryptedFiles.encryptedOutput(outputFile,
EncryptionKeyMetadata.EMPTY), snapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: 4 spaces indentation rather than 8 spaces.

public static ManifestWriter<DeleteFile> writeDeleteManifest(int formatVersion, PartitionSpec spec,
OutputFile outputFile, Long snapshotId) {
return writeDeleteManifest(formatVersion, spec,
EncryptedFiles.encryptedOutput(outputFile, EncryptionKeyMetadata.EMPTY), snapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: 4 spaces indentation rather than 8 spaces.


/**
* Creates a StaticTableOperations tied to a specific static version of the TableMetadata
* @deprecated please use {@link #StaticTableOperations(String, FileIO, EncryptionManager)}
Copy link
Contributor

@flyrain flyrain Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to deprecate it? To provide a method with a plainText EncryptionManager as the default value looks harmless.

@flyrain
Copy link
Contributor

flyrain commented Jun 2, 2021

Regarding the concern brought up by @RussellSpitzer and @flyrain , my current take is that we have to pass EncryptedOutputFile all the way down to the place where the encryptedOutputFile.keyMetadata() is written by the engine to the actual manifest list. This does make the naming a bit awkward as @flyrain suggested, but I think this is the same strategy taken by the data file write path, where we can notice that the FileAppenderFactory also have those methods containing EncryptedOutputFile for exactly the same purpose.

I don't have a strong opinion on this. The logic looks good to me. But the more descriptive name provides benefits:

  1. Doesn't confuse code reader, reader won't confuse a plain text file with an encrypted file.
  2. Better naming usually means better abstraction, and more future-proof. However, sometimes, it is kind of over-engineered, but less likely, and need to be analyzed case by case.

@jackye1995
Copy link
Contributor Author

@flyrain thanks for the review! I am marking this as a draft. I spent a few days to figure out the way to go forward with all the changes needed in ManifestFiles API and I think I have a plan. I will separate this PR to a few different ones, mainly:

  1. adding keyMetadata byte array to ManifestFile interface and implementation classes.
  2. add support in EncryptionManager to decrypt and encrypt by file type
  3. changes to use encryption manager to supply key metadata information down to manifest writer and reader. This will be separated to multiple PRs for core, and engine related changes.

@jackye1995 jackye1995 marked this pull request as draft June 4, 2021 07:24
@flyrain
Copy link
Contributor

flyrain commented Jun 4, 2021

@jackye1995 thanks for the update, looking forward to your new PRs.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2024
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants