Conversation
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
… constructor Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
This reverts commit a3ca59b
…hanges through. Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
…o examples. Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
…sionContext Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
# Conflicts: # crates/core/src/operations/mod.rs # crates/core/src/operations/update.rs
There was a problem hiding this comment.
Pull Request Overview
This pull request introduces file format options infrastructure to the Delta Lake Rust library, enabling support for Parquet encryption and other customizable file format behaviors. The changes refactor how writer properties are handled throughout the codebase, transitioning from direct WriterProperties usage to a more flexible WriterPropertiesFactory pattern.
- Adds new file format options abstraction and encryption support infrastructure
- Refactors all operations to use WriterPropertiesFactory instead of direct WriterProperties
- Integrates encryption capabilities into DataFusion operations and file scanning
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/core/src/table/file_format_options.rs | New module providing file format options trait and WriterPropertiesFactory abstraction |
| crates/core/src/operations/encryption.rs | New encryption configuration module for table-level encryption support |
| python/src/lib.rs | Updates Python bindings to use new constructor signature with None parameter |
| crates/core/src/operations/write/writer.rs | Major refactor replacing WriterProperties with WriterPropertiesFactory pattern |
| crates/core/src/operations/optimize.rs | Enhanced optimize operations with encryption support and new factory pattern |
| crates/deltalake/examples/ | New example files demonstrating encryption capabilities |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| ) | ||
| .await?; |
There was a problem hiding this comment.
The asynchronous call in the constructor pattern is unusual. Consider whether the factory creation should be synchronous or if this indicates a design issue where async operations are being performed during object construction.
There was a problem hiding this comment.
@adamreeve maybe copilot has a point here? I guess this is coming from create_writer_properties now being async. Do we need to change this?
There was a problem hiding this comment.
Hmm, I agree it's not ideal to have to change PartitionWriterConfig::try_new to be async, but git grep 'async fn try_' finds a bunch of other async constructors so I don't think this is that unusual. And I can't see a way to avoid this.
# Conflicts: # crates/core/src/delta_datafusion/mod.rs # crates/core/src/operations/mod.rs # crates/deltalake/Cargo.toml
adamreeve
left a comment
There was a problem hiding this comment.
This is looking good thanks Corwin. I've left some comments on possible improvements, but I think this is something we could create a draft PR from to try to solicit feedback and start a discussion.
| // We have to set the encryption factory on the ParquetSource based on the Parquet options, | ||
| // as this is usually handled by the ParquetFormat type in DataFusion, | ||
| // which is not used in delta-rs. |
There was a problem hiding this comment.
If we pass a FileFormatOptions instead of TableParquetOptions, we could maybe add a new method to FileFormatOptions to allow customizing the ParquetSource creation using and avoid needing to set the encryption factory here.
Although I don't feel like this code is too bad but it does leak some knowledge of Parquet encryption outside of the FileFormatOptions implementation.
| ) -> Result<Arc<dyn ExecutionPlan>> { | ||
| register_store(self.log_store(), session.runtime_env().clone()); | ||
| if let Some(format_options) = &self.file_format_options { | ||
| format_options.update_session(session)?; |
There was a problem hiding this comment.
It might be nice to consolidate this session update with the register_store method above.
| let meta = ObjectMeta::try_from(file).unwrap(); | ||
| let file_format_options = file_format_options.clone(); | ||
| async move { | ||
| let decrypt: Option<FileDecryptionProperties> = |
There was a problem hiding this comment.
We might be able to tidy this up by adding a method to get ArrowReaderOptions, or alternatively changing this to use DataFusion for reading like for Z-Order compaction.
|
|
||
| let writer_properties = this.writer_properties.unwrap_or_else(|| { | ||
| WriterProperties::builder() | ||
| .set_compression(Compression::ZSTD(ZstdLevel::try_new(4).unwrap())) |
There was a problem hiding this comment.
For completeness, we probably want a way for FileFormatOptions / WriterPropertiesFactory to use different compression or other properties for files written during compaction.
I guess users can just modify the options passed when the set up the DeltaOps, so maybe that's enough and we don't need to change the API to support this when reusing the same options for different ops.
| } | ||
|
|
||
| #[async_trait] | ||
| pub trait WriterPropertiesFactory: Send + Sync + std::fmt::Debug + 'static { |
There was a problem hiding this comment.
I wonder if we could move these methods into FileFormatOptions rather than need a separate trait.
I guess this might be needed for backwards compatibility with setting writer properties though.
| // More advanced factory with KMS support | ||
| #[cfg(feature = "datafusion")] | ||
| #[derive(Clone, Debug)] | ||
| pub struct KMSWriterPropertiesFactory { |
There was a problem hiding this comment.
It might make sense to move things around a bit so this lives in the encryption module alongside TableEncryption. encryption is also nested under operations but could maybe be next to file_format_options under the table module?
KmsFileFormatOptions is defined in the example, but maybe it should be in the same module as this type too.
| } | ||
|
|
||
| #[cfg(feature = "datafusion")] | ||
| pub fn state_with_parquet_options( |
There was a problem hiding this comment.
I still think we should avoid setting table specific options on the state if possible (registering encryption factory is different as the factory also needs to be specified by id in the table options). I will take a look at the cases where this is used and see if we can remove the need for this.
| Ok(()) | ||
| } | ||
|
|
||
| async fn round_trip_test() -> Result<(), deltalake::errors::DeltaTableError> { |
There was a problem hiding this comment.
As well as having an example, it would be good to have unit tests for each type of operation that check that the table files are all encrypted after the operation is complete.
Compare latest df-itegrate with main to see what a PR would look like