Skip to content

PR to compare with main#9

Draft
corwinjoy wants to merge 68 commits intomainfrom
df-integrate
Draft

PR to compare with main#9
corwinjoy wants to merge 68 commits intomainfrom
df-integrate

Conversation

@corwinjoy
Copy link
Owner

Compare latest df-itegrate with main to see what a PR would look like

corwinjoy and others added 30 commits August 13, 2025 15:27
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
… constructor

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
…hanges through.

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
…o examples.

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
…sionContext

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
# Conflicts:
#	crates/core/src/operations/mod.rs
#	crates/core/src/operations/update.rs
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request introduces file format options infrastructure to the Delta Lake Rust library, enabling support for Parquet encryption and other customizable file format behaviors. The changes refactor how writer properties are handled throughout the codebase, transitioning from direct WriterProperties usage to a more flexible WriterPropertiesFactory pattern.

  • Adds new file format options abstraction and encryption support infrastructure
  • Refactors all operations to use WriterPropertiesFactory instead of direct WriterProperties
  • Integrates encryption capabilities into DataFusion operations and file scanning

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
crates/core/src/table/file_format_options.rs New module providing file format options trait and WriterPropertiesFactory abstraction
crates/core/src/operations/encryption.rs New encryption configuration module for table-level encryption support
python/src/lib.rs Updates Python bindings to use new constructor signature with None parameter
crates/core/src/operations/write/writer.rs Major refactor replacing WriterProperties with WriterPropertiesFactory pattern
crates/core/src/operations/optimize.rs Enhanced optimize operations with encryption support and new factory pattern
crates/deltalake/examples/ New example files demonstrating encryption capabilities

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +226 to +227
)
.await?;
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The asynchronous call in the constructor pattern is unusual. Consider whether the factory creation should be synchronous or if this indicates a design issue where async operations are being performed during object construction.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamreeve maybe copilot has a point here? I guess this is coming from create_writer_properties now being async. Do we need to change this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I agree it's not ideal to have to change PartitionWriterConfig::try_new to be async, but git grep 'async fn try_' finds a bunch of other async constructors so I don't think this is that unusual. And I can't see a way to avoid this.

# Conflicts:
#	crates/core/src/delta_datafusion/mod.rs
#	crates/core/src/operations/mod.rs
#	crates/deltalake/Cargo.toml
Copy link
Collaborator

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good thanks Corwin. I've left some comments on possible improvements, but I think this is something we could create a draft PR from to try to solicit feedback and start a discussion.

Comment on lines +457 to +459
// We have to set the encryption factory on the ParquetSource based on the Parquet options,
// as this is usually handled by the ParquetFormat type in DataFusion,
// which is not used in delta-rs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass a FileFormatOptions instead of TableParquetOptions, we could maybe add a new method to FileFormatOptions to allow customizing the ParquetSource creation using and avoid needing to set the encryption factory here.

Although I don't feel like this code is too bad but it does leak some knowledge of Parquet encryption outside of the FileFormatOptions implementation.

) -> Result<Arc<dyn ExecutionPlan>> {
register_store(self.log_store(), session.runtime_env().clone());
if let Some(format_options) = &self.file_format_options {
format_options.update_session(session)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to consolidate this session update with the register_store method above.

let meta = ObjectMeta::try_from(file).unwrap();
let file_format_options = file_format_options.clone();
async move {
let decrypt: Option<FileDecryptionProperties> =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to tidy this up by adding a method to get ArrowReaderOptions, or alternatively changing this to use DataFusion for reading like for Z-Order compaction.


let writer_properties = this.writer_properties.unwrap_or_else(|| {
WriterProperties::builder()
.set_compression(Compression::ZSTD(ZstdLevel::try_new(4).unwrap()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, we probably want a way for FileFormatOptions / WriterPropertiesFactory to use different compression or other properties for files written during compaction.

I guess users can just modify the options passed when the set up the DeltaOps, so maybe that's enough and we don't need to change the API to support this when reusing the same options for different ops.

}

#[async_trait]
pub trait WriterPropertiesFactory: Send + Sync + std::fmt::Debug + 'static {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could move these methods into FileFormatOptions rather than need a separate trait.

I guess this might be needed for backwards compatibility with setting writer properties though.

// More advanced factory with KMS support
#[cfg(feature = "datafusion")]
#[derive(Clone, Debug)]
pub struct KMSWriterPropertiesFactory {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to move things around a bit so this lives in the encryption module alongside TableEncryption. encryption is also nested under operations but could maybe be next to file_format_options under the table module?

KmsFileFormatOptions is defined in the example, but maybe it should be in the same module as this type too.

}

#[cfg(feature = "datafusion")]
pub fn state_with_parquet_options(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should avoid setting table specific options on the state if possible (registering encryption factory is different as the factory also needs to be specified by id in the table options). I will take a look at the cases where this is used and see if we can remove the need for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a PR for this here: #11

Ok(())
}

async fn round_trip_test() -> Result<(), deltalake::errors::DeltaTableError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well as having an example, it would be good to have unit tests for each type of operation that check that the table files are all encrypted after the operation is complete.

@adamreeve adamreeve mentioned this pull request Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants