-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: add key_metadata in ManifestFile #2520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| .add("deleted_data_files_count", deletedFilesCount) | ||
| .add("deleted_rows_count", deletedRowsCount) | ||
| .add("partitions", partitions) | ||
| .add("key_metadata", keyMetadata == null ? "null" : "(redacted)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a place where we are leaking information? Sorry I'm still a crypto beginner, is it ok for us to reveal that a file is encrypted? I guess that's obvious if you can get the file ....?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's a good catch, I am also not sure if this is considered fine, I just followed what the BaseFile did: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseFile.java#L454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine. The key metadata could be a 0-length buffer set an encryption manager if someone cares about this. I think it's reasonable to show that it is null when using the plain text manager.
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change from OutputFile to EncryptedOutputFile everywhere looks a little odd to me because I know sometimes our output will not be encrypted even though we use that class. I'm not sure if this is required but it just seems a bit odd to me.
My big request for this PR though is to add a few tests to just make sure that the serialization/deserialization of GenericManifestFile works. I think you can do this without adding too much more just to make sure that both a null and populated column can be correctly from a manifest file.
@RussellSpitzer I think this pattern is also used currently by data files, where all files are encrypted before written and go through the OutputFileFactory to get the output file instances: And we are not encrypting the file only because we are using the PlainTextEncryptionManager. The change here is to try achieving the same effect. |
core/src/main/java/org/apache/iceberg/encryption/EncryptionManagers.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Returns metadata about how this manifest file is encrypted, or null if the file is stored in plain text. | ||
| */ | ||
| default ByteBuffer keyMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does keyMetadata have substructure or is it a pure binary buffer? Looks like it will have substructures form the description. Are we going to to define it later or in this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about this too, should we make this a struct with name like encryptionContext or something so that if we only plan to add new things in future (e.g. KEK id for double wrapping?), we can collect them in a single struct; and to workaround the problem of having to unwrap two layers to reach this buffer from ManifestFile we may return EncryptionKeyMetadata here, and potentially extend EncryptionKeyMetadata to have more fields when needed in the future. Or will this binary buffer free formed and could contain whatever information needed if the right encryption manager is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here is that an encryption manager can decide what the key metadata holds. It could be an encrypted key or it could be a key reference. There are lots of possibilities and we did it this way to not constrain what the encryption manager can choose to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to a byte array, serialized by an encryption manager from its structs. Btw, besides encryption keys, we have the AAD prefixes. We can keep them inside the key metadata (because it is convenient and flexible) - or we can add a separate manifest field/column for them (because technically AADs are not used for key retrieval). In both cases, the decision can be made later, when we get to handle end-to-end table integrity.
|
The change from OutputFile to EncryptedOutputFile looks weird to me as well. Logic is OK to me though. Just brainstorming ideas. |
| * @return a {@link ManifestReader} | ||
| */ | ||
| public static ManifestReader<DataFile> read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) { | ||
| return read(manifest, io, EncryptionManagers.defaultManager(), specsById); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to have compatibility with the bulk-decryption and bulk-encryption methods in the encryption manager. For bulk decryption, the decrypt method is passed Iterable<EncryptedInputFile> and returns Iterable<InputFile>. Then those input files need to be used.
I think that means we should refactor this a bit differently and create a read(ManifestFile, InputFile, Map) method that both this and the version with EncryptionManager call. Then this one could simply call io.newInputFile(manifest.path()) and hand off to the InputFile reader. That avoids using the default manager where it isn't needed and allows us to use the bulk methods.
|
|
||
| package org.apache.iceberg.encryption; | ||
|
|
||
| public class EncryptionManagers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encryption managers are plugged in through tables, so I don't think it is a good idea to have a global default. That doesn't seem to make sense with how they are used.
|
Sorry I was a bit distracted and did not update this PR for a while. @rdblue thanks for the suggestion for refactoring, I actually thought about the approach you considered, and my major concern is that we are creating 2 different code paths for using or not using encryption. On the other hand, for data file encryption, Iceberg is currently having a single unified code path for always using encryption, with the default encryption manager to be the plaintext manager. I think it would be good to keep things consistent. I completely agree that the use of a global default encryption manager class is not necessary, and I will remove that. What I hope to achieve is that the method For what @flyrain and @RussellSpitzer concern, I think I can make it better by directly passing in |
|
@rdblue @flyrain @RussellSpitzer I have updated this PR with all the changes I expected to make in several PRs, it is a lot of files but I think it makes the discussion much easier to show them in a single place. The key thing we need to determine is: should we separate the code path for manifest read/write with and without encryption. My stand is that we should not separate it, because:
With this assumption, the key principles I followed for the changes are:
Regarding the concern brought up by @RussellSpitzer and @flyrain , my current take is that we have to pass Regarding |
|
restart test |
|
@jackye1995, I'm not sure it makes sense to use the plain text manager as you're suggesting. I know it mimics the choice we made for data files, but data files are a bit cleaner because they're not using similar methods to open the files. The main concern that I have is the ability to use the bulk decryption methods from the manager so planning a scan doesn't necessarily incur repeated RPCs to a key manager. As long as that is possible, then I'm flexible on the exact API here. |
I suppose you are describing something similar to what is done for data files like here: I wonder if there are any real use cases where this approach actually saves RPC calls to a KMS. Because in the envelope encryption schemes we discussed:
So the number of calls to KMS really depends on the algorithm, not that much on the encryption manager API, unless the KMS allows batch operation or it is a completely different encryption scheme. And the KMS would be embedded in the encryption manager implementation and perform necessary caching of keys behind the scene. With that being said, I completely agree that we should use batch operation whenever possible, I will try to find a best way to update the manifest read and write path to be able to leverage batch operations. I will provide an updated PR tonight, meanwhile please let me know what you think about the comment above. |
|
The situation where we want to use the bulk API is whenever we are going to make a call to the KMS per file. If we can batch up the files then we can make a single call to get all of the keys at once. You may be right that the strategies that you want to build don't require it, but the plugin system is generic enough that we don't necessarily know that batching is unnecessary. |
|
Double wrapping will be an Iceberg mode, always under our control, an assured way to avoid per-file KMS calls. |
| "Summary for each partition"); | ||
| // next ID to assign: 519 | ||
| Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(), | ||
| "Encryption key metadata blob"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this to the spec in #2654.
| InputFile file = io.newInputFile(manifest.path()); | ||
|
|
||
| EncryptedInputFile encryptedFile = EncryptedFiles.encryptedInput( | ||
| io.newInputFile(manifest.path()), manifest.keyMetadata()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit: 4 spaces indentation rather than 8 spaces.
| public static ManifestWriter<DataFile> write(int formatVersion, PartitionSpec spec, OutputFile outputFile, | ||
| Long snapshotId) { | ||
| return write(formatVersion, spec, EncryptedFiles.encryptedOutput(outputFile, | ||
| EncryptionKeyMetadata.EMPTY), snapshotId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit: 4 spaces indentation rather than 8 spaces.
| public static ManifestWriter<DeleteFile> writeDeleteManifest(int formatVersion, PartitionSpec spec, | ||
| OutputFile outputFile, Long snapshotId) { | ||
| return writeDeleteManifest(formatVersion, spec, | ||
| EncryptedFiles.encryptedOutput(outputFile, EncryptionKeyMetadata.EMPTY), snapshotId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit: 4 spaces indentation rather than 8 spaces.
|
|
||
| /** | ||
| * Creates a StaticTableOperations tied to a specific static version of the TableMetadata | ||
| * @deprecated please use {@link #StaticTableOperations(String, FileIO, EncryptionManager)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to deprecate it? To provide a method with a plainText EncryptionManager as the default value looks harmless.
I don't have a strong opinion on this. The logic looks good to me. But the more descriptive name provides benefits:
|
|
@flyrain thanks for the review! I am marking this as a draft. I spent a few days to figure out the way to go forward with all the changes needed in
|
|
@jackye1995 thanks for the update, looking forward to your new PRs. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This PR adds key_metadata to the ManifestFile API, and updates all the read and write methods that directly references the constructor up to
ManifestFilesto useEncryptionManager.To avoid touching too many files in different engines in a single PR, there will be subsequent PRs to update each engine and the metadata metadata tables to support encrypting manifest files.
@yyanyy @rdblue @RussellSpitzer @ggershinsky @flyrain