feat!: Add support for sparse transform expressions #1199
Conversation
Codecov Report
Patch coverage details and impacted files:

@@            Coverage Diff             @@
##             main     #1199     +/-   ##
==========================================
+ Coverage   83.47%    83.61%    +0.13%
==========================================
  Files         105       105
  Lines       24024     24444      +420
  Branches    24024     24444      +420
==========================================
+ Hits        20054     20438      +384
- Misses       2939      2966       +27
- Partials     1031      1040        +9

View full report in Codecov by Sentry.
Self-review
if transformed_cols.len() != input_col_count {
    return Err(Error::InternalError(format!(
        "Passed struct had {input_col_count} columns, but transformed column has {}",
        transformed_cols.len()
    )));
}
I think this was removed by mistake... but it's worrisome that no unit tests failed as a result?
(reverted the mistake, but unit test question remains open)
was going to open an issue but figured I might as well just fix it.. #1210
kernel/src/scan/mod.rs
Outdated
#[allow(unused)]
StaticReplace(String, ExpressionRef), // Replace physical_field_name with expression
#[allow(unused)]
StaticInsert(Option<String>, ExpressionRef), // Insert expression after physical_field_name (None = prepend)
I don't actually know of an immediate use for static insertion...
Unless e.g. the row tracking COALESCE above needs to land in a different position than the physical row id column it replaces; then we'd need to drop the physical row id column and insert its replacement in the correct location.
But it's really easy to support this case, and it seems nice to have for completeness.
    } // else no replacement => dropped
} else {
    // Field passes through unchanged
    expressions.push(Arc::new(Expression::column([field_name])));
Yikes! This is incorrect: Column references are paths from top-level, and this transform could be nested arbitrarily deeply in an expression tree. We need to just take the field name, not a column path.
and following up from before, the solution here is that this expression needs to know its name? and then we can do a full path? or do you mean that `Expression::column` should somehow change to just take the field name in a 'relative' mode or something?
I'll post the fix as soon as I've addressed your comments, but tl;dr is:
- `Expr::Transform` now tracks an `Option<ColumnName>`, which can be used to request Some pathing.
- When pathing is requested, the input struct is found by normal path resolution using that path, and then the transform iterates over its fields instead of top-level fields.
- A slightly enhanced `ProvidesColumnsByName` trait helps a lot here.
thanks @scovich - I think the changes look great; the transform expression seems to make sense and addresses that sparseness gap which structs fundamentally couldn't handle.
i left a few comments/nits/questions and will follow up after that bug is fixed!
@@ -482,6 +482,11 @@ fn visit_expression_impl(
    Expression::Opaque(OpaqueExpression { op, exprs }) => {
        visit_expression_opaque(visitor, op, exprs, sibling_list_id)
    }
    Expression::Transform(_) => {
        // Minimal FFI support: Transform expressions are treated as unknown
        // TODO: Implement full Transform FFI support in future version
started #1205
use super::*;
use crate::arrow::array::{Int32Array, StructArray};
use crate::arrow::datatypes::{
    DataType as ArrowDataType, Field as ArrowField, Schema as ArrowSchema,
};
use crate::schema::{DataType, StructField, StructType};
use std::sync::Arc;
nit: import ordering/blocks (aside: we should probably provide guidance since this is one of the places clippy doesn't help)
Suggested change:

use super::*;
use std::sync::Arc;
use crate::arrow::array::{Int32Array, StructArray};
use crate::arrow::datatypes::{
    DataType as ArrowDataType, Field as ArrowField, Schema as ArrowSchema,
};
use crate::schema::{DataType, StructField, StructType};
That's odd... I thought fmt did enforce import ordering? Or at least that it could enforce it?
I also thought stdlib was supposed to be the last thing imported, but 🤷
yea, I should be more specific: it enforces alphabetical ordering within contiguous blocks, but it doesn't do the nice std/3p/crate split we usually practice. Also, I've forgotten which way we typically prefer, haha, but we should probably just decide on one and stick to it.. how about std -> 3p -> internal crates -> this crate? (We can document this in CONTRIBUTING.md.)
/// Evaluates a transform expression by building expressions in input schema order
fn evaluate_transform_expression(
While it does seem like this general architecture is the Right Thing - I do feel we are marginally increasing the evaluation complexity. This comment is mostly an aside: maybe it underscores the utility of having another data-path microbenchmark to confirm the lack of data-size impact.
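To make that suggestion a bit more concrete, here is a rough sketch of what such a data-path microbenchmark could look like with criterion; the wide "batch" and the measured closure are placeholder workloads, not existing kernel code (a real benchmark would evaluate the new transform expression over a pre-built wide record batch, and would live under `benches/` with `harness = false`).

```rust
// Sketch of an additional data-path microbenchmark using criterion.
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_sparse_transform(c: &mut Criterion) {
    // Pre-build the input once so only the evaluation path is measured.
    let wide_batch: Vec<u64> = (0..1_000_000).collect();

    c.bench_function("sparse_transform_wide_schema", |b| {
        b.iter(|| {
            // Placeholder for "evaluate the sparse transform over the batch".
            wide_batch.iter().copied().sum::<u64>()
        })
    });
}

criterion_group!(benches, bench_sparse_transform);
criterion_main!(benches);
```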
As a side effect of the pathing bug fix, the output struct is now built up directly as part of the evaluation -- no intermediate vec of expressions (it's actually incorrect to use `Expr::Column` for nested relative pathing)
ah that's nice :)
Also as part of fixing the pathing bug, I extended the early-out optimization for identity transforms to work for paths as well. So basically, we have several levels of complexity now (simplest to most complex):
- `Expr::Transform` (identity, no path) -- just applies the output schema to the input data
- `Expr::Column` -- just paths to the requested column and applies the output schema
- `Expr::Transform` (identity, with path) -- equivalent to `Expr::Column`, but has an extra `Arc::new`
- `Expr::Transform` -- equivalent to `Expr::Struct`, but the input struct providing pass-thru fields is pathed to only once and its fields extracted directly.
- `Expr::Struct` -- has to create a new `StructArray` from the result of evaluating all field expressions, including many `Expr::Column` which all have to path independently even if the paths are similar.

tl;dr: I would expect `Expr::Transform` to be more efficient than the equivalent `Expr::Struct` under virtually all circumstances where the transformation to be expressed is actually sparse.
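To illustrate the sparsity argument concretely (toy types only; the kernel's real expression types are richer than this): for a 1000-column schema with a single injected partition value, a dense struct-style expression has to enumerate every output field, while a sparse transform-style plan only records the one insertion.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Placeholder expression type for illustration only.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(String),
}

fn main() {
    let field_names: Vec<String> = (0..1000).map(|i| format!("col_{i}")).collect();

    // Dense plan (Expr::Struct-like): one column reference per pass-through
    // field, plus the injected partition value => O(schema_width) expressions.
    let dense: Vec<Arc<Expr>> = field_names
        .iter()
        .map(|name| Arc::new(Expr::Column(name.clone())))
        .chain(std::iter::once(Arc::new(Expr::Literal("2024-01-01".to_string()))))
        .collect();

    // Sparse plan (Expr::Transform-like): only the single insertion is recorded;
    // the evaluator copies every other input column across untouched.
    let mut insertions: HashMap<Option<String>, Vec<Arc<Expr>>> = HashMap::new();
    insertions.insert(
        Some("col_999".to_string()),
        vec![Arc::new(Expr::Literal("2024-01-01".to_string()))],
    );

    println!("first dense entry: {:?}", dense[0]);
    println!("dense entries: {}, sparse entries: {}", dense.len(), insertions.len());
}
```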
/// fields inserted).
pub fn is_identity(&self) -> bool {
    self.field_replacements.is_empty() && self.field_insertions.is_empty()
}
after reading through these methods I'm now wondering if we could simplify the interface by hiding hashmaps internally and instead just exposing the ability to probe through methods here? haven't thought about this too much but maybe only worth it if we use a very constrained subset of the hashmaps?
Normally I would have said "not worth it" -- but this will become part of the public API so encapsulation by default makes a lot of sense. Done.
Hmm. I just noticed that (a) all the other expression types have pub fields; and (b) users cannot create expression transforms unless the expression types have pub fields. I might need to revert this change :(
ah okay, yup seems reasonable
Maybe a follow-up for us to not have everything be so public
kernel/src/expressions/transforms.rs
Outdated
/// Recursively transforms the children of a [`Transform`]. Returns `None` if all
/// children were removed, `Some(Cow::Owned)` if at least one child was changed or removed, and
/// `Some(Cow::Borrowed)` otherwise.
fn recurse_into_expr_transform(&mut self, t: &'a Transform) -> Option<Cow<'a, Transform>> {
untested? and seems there are some TODOs - should we perhaps make this unimplemented/no-op/something(?) for now and try to get it right in a follow up? considering it's a new expression anyways we don't have anyone trying to transform it already
Problem is, existing expression transforms that encounter a transform expression have to know what to do with it...
    self.fields.values()
}

#[allow(unused)] // Most uses can leverage ExactSizeIterator::len instead
perhaps we should just remove it then (separately)
nice! this looks great, thanks. Had a few small things but overall lgtm.
kernel/src/scan/mod.rs
Outdated
/// Insert the given expression after the given physical field name (None = prepend instead)
#[allow(unused)]
StaticInsert(Option<String>, ExpressionRef),
/// Insert the ith partition value after the given physical field name
nit: "ith" in what?
hmm, good question.. this is inherited from pre-existing code. It's ith in whatever numbering the creator of the transform spec is using, which hopefully matches the numbering the log replay scan logic is using?
Looks like it comes from `metaData.partitionColumns`, which is an array of partition column names. No idea how its order is decided, but it should be stable for any one snapshot.
No, sorry... it's the partition column's position in the input schema:
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/scan/mod.rs#L835-L844
It's basically a reference to the schema field, but by index to avoid lifetime issues.
Worth inlining this finding in comments? I also notice we have a very hard dependency on `schema_fields` ordering - I don't see anywhere that it would ever change, but wondering if we enforce/describe that sufficiently?
It's actually a dependency on the table's logical schema, which should be quite stable during the life of any given Snapshot. I updated the doc comment.
let partition_value = Arc::new(partition_value.into());
transform.with_inserted_field(insert_after.clone(), partition_value)
nit:
Suggested change:

let partition_expr = Arc::new(partition_value.into());
transform.with_inserted_field(insert_after.clone(), partition_expr)
I don't love having two names for basically the same thing, especially when one is moved-from and useless... and `partition_value_expr` would be more accurate than just `partition_expr`? Also: a partition value [literal] is a kind of expression?
let field_name = Some(Cow::Borrowed(field_name));
if let Some(insertion_exprs) = transform.field_insertions.get(&field_name) {
Could we just define a `Transform::get_field_insertions_for(field: &str)` and call that here? Likewise above - although it would save less code, it would make it easier if we want to make the fields not `pub` in the future.
heh. I had that code before I realized the fields needed to stay `pub` to not break expression transforms.
It can't take a `&str` tho... it's `Option<&str>` (None = no predecessor = prepend)
So far, this is the only call site and it's in private code. I'd prefer to not design this until we have more usage patterns to learn from. Taking the fields private will anyway be a breaking change, regardless of whether we define accessors now.
}

/// Specifies an expression to replace a field with.
pub fn with_replaced_field(mut self, name: impl Into<String>, expr: ExpressionRef) -> Self {
nit:
Suggested change:

pub fn with_replaced_field(mut self, name: impl Into<String>, replacement_expr: ExpressionRef) -> Self {
That will spill the method signature from one line to five lines, and I'm not sure it really adds any information?
The method clearly states that it's replacing a field?
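For context, here is a self-contained sketch of the builder-style surface being discussed, with stand-in types: the kernel's real `Transform` and `Expression` are richer than this (for example, field drops are omitted here), so treat the shapes below as illustrative only.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Toy stand-ins so the sketch compiles on its own.
#[derive(Debug)]
enum Expression {
    Literal(String),
}
type ExpressionRef = Arc<Expression>;

#[derive(Default)]
struct Transform {
    field_replacements: HashMap<String, ExpressionRef>,
    field_insertions: HashMap<Option<String>, Vec<ExpressionRef>>,
}

impl Transform {
    /// True if the transform makes no changes (no fields replaced, no fields inserted).
    fn is_identity(&self) -> bool {
        self.field_replacements.is_empty() && self.field_insertions.is_empty()
    }
    /// Specifies an expression to replace a field with.
    fn with_replaced_field(mut self, name: impl Into<String>, expr: ExpressionRef) -> Self {
        self.field_replacements.insert(name.into(), expr);
        self
    }
    /// Inserts an expression after `insert_after` (None = prepend).
    fn with_inserted_field(mut self, insert_after: Option<String>, expr: ExpressionRef) -> Self {
        self.field_insertions.entry(insert_after).or_default().push(expr);
        self
    }
}

fn main() {
    let partition_value: ExpressionRef = Arc::new(Expression::Literal("2024-01-01".to_string()));
    let transform = Transform::default()
        .with_inserted_field(Some("last_data_col".to_string()), partition_value)
        .with_replaced_field("row_id", Arc::new(Expression::Literal("coalesced".to_string())));
    assert!(!transform.is_identity());
    println!(
        "replacements: {}, insertions: {}",
        transform.field_replacements.len(),
        transform.field_insertions.len()
    );
}
```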
LGTM - thanks @scovich excited to have this land!!
kernel/src/scan/mod.rs
Outdated
#[allow(unused)]
StaticInsert(Option<String>, ExpressionRef),
/// Insert the ith partition value after the given physical field name
Partition(Option<String>, usize),
also - wonder if it might be useful to just have a quick alias like `type PartitionIndex = usize;` with docs saying it's a ref to the schema field as an index into input schema etc.?
I went ahead and turned all these tuple enum variants into struct variants instead.
That should improve readability a lot.
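For illustration, this is roughly the shape the struct-variant form could take. The enum and field names below are guesses for the sketch (not the kernel's actual definitions), and `PartitionIndex` is the alias suggested above.

```rust
use std::sync::Arc;

// Stand-in expression type; the kernel has its own Expression/ExpressionRef.
#[derive(Debug)]
struct Expression(String);
type ExpressionRef = Arc<Expression>;

/// Index of a partition column, interpreted as its position in the table's
/// logical schema (the alias suggested in the review).
type PartitionIndex = usize;

/// Sketch of the field-transform spec as struct variants (names are guesses).
#[derive(Debug)]
enum FieldTransformSpec {
    /// Replace the named physical field with an expression.
    StaticReplace { field_name: String, expr: ExpressionRef },
    /// Insert an expression after the named physical field (None = prepend).
    StaticInsert { insert_after: Option<String>, expr: ExpressionRef },
    /// Insert the partition value at `partition_index` after the named physical field.
    Partition { insert_after: Option<String>, partition_index: PartitionIndex },
}

fn main() {
    let specs = vec![
        FieldTransformSpec::StaticReplace {
            field_name: "row_id".to_string(),
            expr: Arc::new(Expression("COALESCE(...)".to_string())),
        },
        FieldTransformSpec::StaticInsert {
            insert_after: None, // prepend
            expr: Arc::new(Expression("some literal".to_string())),
        },
        FieldTransformSpec::Partition {
            insert_after: Some("last_physical_col".to_string()),
            partition_index: 0,
        },
    ];
    for spec in &specs {
        println!("{spec:?}");
    }
}
```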
LGTM!
fn transform_expr_transform(&mut self, transform: &'a Transform) -> Option<Cow<'a, Transform>> {
    self.recurse_into_expr_transform(transform)
    Some(Cow::Borrowed(transform))
should we track a follow-up here?
Anyone who implements the trait is welcome to recurse however they want. The `recurse_into_xxx` helpers are really just that -- helpers -- and implementations are not required to use them. If/when we have an implementation or two that actually do similar things, maybe then we should track a TODO to factor out a helper?
Referenced PR description:
What changes are proposed in this pull request? Discovered test gap in #1199 (see #1199 (comment)) - this fixes it.
How was this change tested? test-only
What changes are proposed in this pull request?
Log replay needs to define per-file transforms for each row of metadata that survives data skipping. Unfortunately, `Expression::Struct` is "dense" (it mentions every output field), and this produces excessive overhead when injecting the (usually very few) partition columns for tables with wide schemas (hundreds or thousands of columns). Column mapping tables are even worse, because they don't change the columns at all -- they just need to apply the output schema to the input data.

Solution: Define a new `Expression::Transform` that is a "sparse" representation of the changes to be made to a given top-level schema or nested struct. Input columns can be dropped or replaced, and new output columns can be injected after an input column of choice (or prepended to the output schema). The engine's expression evaluator does the actual work to transfer unchanged input columns across while building the output `EngineData`.

Update log replay to use the new transform capability, so that the cost is O(partition_columns) instead of O(schema_width). For non-partitioned tables with column mapping mode enabled, this translates to an empty (identity) transform, which the default engine expression evaluator has been updated to optimize as a special case (just `apply_schema` directly to the input and return).

Result: Scan times are cut by nearly half in the `metadata_bench` benchmark.

This PR affects the following public APIs
Added a new `Expression::Transform` enum variant.

How was this change tested?
New and existing unit tests, existing benchmarks.
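To close the loop on the identity fast path described above, here is a rough sketch of the control flow. All names and types below are stand-ins (the kernel's evaluator and its `apply_schema` step are far more involved); the point is only that an empty transform skips per-field work entirely.

```rust
// Stand-in types: a "batch" of rows and an output schema marker.
struct OutputSchema;
struct Batch {
    rows: usize,
}

/// Sketch of a sparse transform: only the change counts matter for this example.
struct Transform {
    replacements: usize,
    insertions: usize,
}

impl Transform {
    fn is_identity(&self) -> bool {
        self.replacements == 0 && self.insertions == 0
    }
}

/// Placeholder for applying the output schema (e.g. renaming columns for
/// column mapping) without touching the data.
fn apply_schema(batch: Batch, _schema: &OutputSchema) -> Batch {
    batch
}

fn evaluate_transform(transform: &Transform, batch: Batch, schema: &OutputSchema) -> Batch {
    if transform.is_identity() {
        // Identity fast path: no per-field work, just apply the output schema.
        return apply_schema(batch, schema);
    }
    // Non-identity case: walk the input fields once, replacing/dropping/inserting
    // as the transform dictates, then assemble the output struct (elided here).
    apply_schema(batch, schema)
}

fn main() {
    let identity = Transform { replacements: 0, insertions: 0 };
    let out = evaluate_transform(&identity, Batch { rows: 3 }, &OutputSchema);
    println!("rows out: {}", out.rows);
}
```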