feat!: Add row tracking writer feature #1239

lbhm · 2025-09-02T14:25:14Z

What changes are proposed in this pull request?

This PR adds row tracking as a writer feature. As part of this overall goal, it introduces the following changes:

- Modify the ADD_FILES_SCHEMA and nest numRecords in a stats struct. This simplifies the two-stage expression evaluation in Transaction::generate_adds for the moment.
  - This subchange is why the PR touches the default engine and FFI with small modifications.
- Add WriterFeature::DomainMetadata and WriterFeature::RowTracking to the supported writer features and extend the table properties, config, and deserialization logic accordingly.
- Add a row tracking RowVisitor that extracts the numRecords field out of add actions to compute a base row ID for each add action.
- Refactor and extend the commit path in Transaction::commit to enrich Add actions with row tracking metadata on demand.
- Add a new test suite for row tracking-related tests.
  - Tests that are not part of this PR because they require refactoring of existing test infra are tracked in Add more row tracking write path tests #1265.
- Modify the create_table function in test utils to accept a Vec of table features so that we don't have to touch it every time we introduce new table features.
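
For readers unfamiliar with row tracking, the base row ID computation mentioned above can be sketched as a running sum over numRecords. This is a minimal illustration, not the kernel's implementation: the function name `assign_base_row_ids` and the use of `Option<i64>` for "no high water mark yet" are assumptions made for the example.

```rust
/// Hypothetical helper (not part of the kernel API): assigns a base row ID to
/// each newly added file.
///
/// `high_water_mark` is the highest row ID assigned so far (`None` if no row
/// ID has been assigned yet). `num_records` holds the record count of each
/// file being added, in commit order.
///
/// Returns the per-file base row IDs and the updated high water mark.
fn assign_base_row_ids(
    high_water_mark: Option<i64>,
    num_records: &[i64],
) -> (Vec<i64>, Option<i64>) {
    // The first fresh row ID is one past the current high water mark.
    let mut next_row_id = high_water_mark.map_or(0, |hwm| hwm + 1);
    let mut base_row_ids = Vec::with_capacity(num_records.len());
    for &n in num_records {
        // Each file owns the contiguous range [next_row_id, next_row_id + n).
        base_row_ids.push(next_row_id);
        next_row_id += n;
    }
    // If no row ID has ever been handed out, there is still no high water mark.
    let new_high_water_mark = if next_row_id == 0 {
        None
    } else {
        Some(next_row_id - 1)
    };
    (base_row_ids, new_high_water_mark)
}
```

With a high water mark of 9 and files holding 3, 0, and 5 records, this sketch assigns base row IDs 10, 13, and 13 and moves the high water mark to 17.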

This PR affects the following public APIs

This PR updates the ADD_FILES_SCHEMA as described above.

How was this change tested?

Added a new row_tracking.rs test suite with write integration tests. Golden table tests will be added in a follow-up PR.

codecov · 2025-09-02T14:28:40Z

Codecov Report

❌ Patch coverage is 92.00000% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.67%. Comparing base (5552d21) to head (5faf2db).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/row_tracking.rs	93.16%	2 Missing and 9 partials ⚠️
kernel/src/table_configuration.rs	65.51%	9 Missing and 1 partial ⚠️
kernel/src/transaction/mod.rs	93.57%	1 Missing and 6 partials ⚠️
kernel/src/engine/default/parquet.rs	84.61%	1 Missing and 1 partial ⚠️
ffi/src/transaction/mod.rs	50.00%	0 Missing and 1 partial ⚠️
kernel/src/table_properties/deserialize.rs	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1239      +/-   ##
==========================================
+ Coverage   83.60%   83.67%   +0.06%     
==========================================
  Files         107      108       +1     
  Lines       25651    25926     +275     
  Branches    25651    25926     +275     
==========================================
+ Hits        21446    21694     +248     
- Misses       3135     3144       +9     
- Partials     1070     1088      +18


scovich

There are enough potential performance pitfalls in this new code (multiple collect passes over add actions on the write path, double expression eval, etc) that I worry we'll need to fully separate the non-row tracking code path from the row tracking path. Otherwise we risk regressing existing use cases by some unknown and potentially large amount.

We should also seriously consider adding a basic benchmark of some kind, to measure the impact of this new code. If the impact is negligible for non-rowtracking tables, we don't need to worry about getting fancy with performance isolation; if it's massive, at least we know what we're up against.
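
As a rough idea of what such a basic benchmark could look like, here is a sketch that uses only the standard library. Everything in it is an assumption for illustration: `time_it`, `single_pass`, and `collect_then_pass` are hypothetical stand-ins, not kernel functions, and a real measurement would exercise the actual write path (and would likely use a proper benchmarking harness).

```rust
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the last result together with the mean
/// wall-clock time per run.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> (T, Duration) {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    let mut last = f();
    for _ in 1..iters {
        last = f();
    }
    (last, start.elapsed() / iters)
}

/// Stand-in for a write path that consumes its input in one lazy pass.
fn single_pass(num_records: &[i64]) -> i64 {
    num_records.iter().sum()
}

/// Stand-in for a write path that materializes an intermediate Vec first.
fn collect_then_pass(num_records: &[i64]) -> i64 {
    let staged: Vec<i64> = num_records.iter().copied().collect();
    staged.iter().sum()
}
```

Calling `time_it(100, || single_pass(&data))` and `time_it(100, || collect_then_pass(&data))` on the same input gives two mean durations to compare. The results should be identical, and only the timings should differ.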

kernel/src/actions/mod.rs

kernel/src/row_tracking.rs

kernel/src/transaction/mod.rs

lbhm · 2025-09-03T10:04:45Z

Thank you for your thorough review, @scovich!

Based on your feedback, I was able to get rid of the collect()s in the row tracking path. Apart from that, the non-row tracking path should not be affected by this PR performance-wise, since the row tracking path is hidden behind an if-branch, and generate_adds() has not changed apart from:

- accepting a more generic input
- being moved into impl Transaction

kernel/tests/row_tracking.rs

kernel/src/actions/mod.rs

kernel/src/row_tracking.rs

kernel/src/table_configuration.rs

kernel/src/transaction/mod.rs

lbhm · 2025-09-08T07:30:59Z

kernel/tests/write.rs

+    // We can add shredding features as well as we are allowed to write unshredded variants
+    // into shredded tables and shredded reads are explicitly blocked in the default
+    // engine's parquet reader.
+    // TODO: (#1124) we don't actually support column mapping writes yet, but have some
+    // tests that do column mapping on writes. For now omit the writer feature to let tests
+    // run, but after actual support this should be enabled.


I moved these comments since they used to be in create_table().

lbhm · 2025-09-08T07:42:46Z

kernel/src/transaction/mod.rs

+}
+
+/// The static instance referenced by [`add_files_schema`].
+pub(crate) static ADD_FILES_SCHEMA: LazyLock<SchemaRef> = LazyLock::new(|| {


Note that I kept the name ADD_FILES_SCHEMA here since we discussed this in another PR.
I still think the naming is not super clear, but I would rather address this in a separate PR.

johanl-db

The change looks good

kernel/src/row_tracking.rs

scovich

LGTM. The review was a bit frazzled/interrupted tho, so hopefully somebody else can double check that I didn't miss anything important.

zachschuermann

flushing comments, reviewed everything except new integration test + want to take a look again at the generate_adds changes

kernel/src/actions/mod.rs

kernel/src/table_properties/deserialize.rs

kernel/src/transaction/mod.rs

kernel/src/row_tracking.rs

kernel/src/table_configuration.rs

kernel/src/row_tracking.rs

zachschuermann · 2025-09-11T15:21:47Z

kernel/src/transaction/mod.rs

+        // Generate add actions including row tracking metadata
+        let add_actions = self.generate_adds(
+            engine,
+            extended_add_files_metadata,
+            with_row_tracking_cols(add_files_schema()),
+            as_log_add_schema(with_row_tracking_cols(&with_stats_col(
+                mandatory_add_file_schema(),
+            ))),
        );
-        adds_evaluator.evaluate(add_files_batch.as_ref())
-    })
+
+        // Return a chained iterator with add and domain metadata actions
+        Ok(Box::new(
+            add_actions.chain(iter::once(domain_metadata_action)),
+        ))


instead of embedding generate_adds into this method would it be reasonable to have the flow above be something like:

let (domain_metadata, input_schema, output_schema) = if row_tracking {
    // ...
} else {
    // ...
};
// then just one
let adds = generate_adds(...);

I thought about that. I personally prefer the current approach because it reduces complexity in commit(). Looking forward, commit() will only become more complex as we add features, so I'd favor separating more logic in helper methods rather than inlining.

though in the future we will have more domain metadatas, etc. and seems more scalable to not embed them into the 'generate adds'

this may not be worth blocking this PR on but would prefer to track as a follow-up if we decide not to pursue here.

can take on in #1274 ?

I agree it's better to address as part of #1274. Since row tracking is an internal domain metadata, we might have to handle it differently from user-provided ones anyway.

sounds good, i'll take care of it in #1274.
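
For reference, the flow suggested in this thread can be sketched as follows. This is an illustration only: `Schema`, `base_schema`, and `plan_commit` are hypothetical stand-ins, the schemas are reduced to lists of column names, and the `with_row_tracking_cols` and `generate_adds` functions here are simplified re-implementations that do not match the kernel's signatures.

```rust
/// Hypothetical stand-in for a schema: just a list of column names.
type Schema = Vec<&'static str>;

/// Simplified base schema for add file metadata (illustrative columns only).
fn base_schema() -> Schema {
    vec!["path", "size", "stats"]
}

/// Appends the row tracking columns to a schema.
fn with_row_tracking_cols(mut schema: Schema) -> Schema {
    schema.push("baseRowId");
    schema.push("defaultRowCommitVersion");
    schema
}

/// The single code path that turns file names into rendered add actions.
fn generate_adds<'a>(
    files: impl Iterator<Item = &'a str> + 'a,
    output_schema: Schema,
) -> impl Iterator<Item = String> + 'a {
    files.map(move |file| format!("add({file}) -> [{}]", output_schema.join(", ")))
}

/// Picks the branch-specific pieces first, then calls `generate_adds` once.
fn plan_commit<'a>(
    files: impl Iterator<Item = &'a str> + 'a,
    row_tracking: bool,
) -> (Option<&'static str>, impl Iterator<Item = String> + 'a) {
    let (domain_metadata, output_schema) = if row_tracking {
        (
            Some("delta.rowTracking"),
            with_row_tracking_cols(base_schema()),
        )
    } else {
        (None, base_schema())
    };
    (domain_metadata, generate_adds(files, output_schema))
}
```

The trade-off debated above is visible here: the branch stays in the commit function and domain metadata is produced next to it, while `generate_adds` remains unaware of row tracking.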

zachschuermann

LGTM to merge - I'll get this in and let you follow up on comments (possibly in a little follow-up PR to fix stuff like the as, but I don't want to block the merge)

thanks @lbhm, awesome work! 🚢

zachschuermann · 2025-09-12T04:15:13Z

kernel/src/transaction/mod.rs

+        let actions = iter::once(commit_info_action)
+            .chain(set_transaction_actions)
+            .chain(add_actions);


No, I meant the order of set_transactions and adds - we always write commit info first, but previously we wrote adds and then set_transaction.
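
The ordering in question follows directly from the order of the `chain` calls. A minimal sketch, using a hypothetical `Action` enum rather than the kernel's action types:

```rust
use std::iter;

/// Hypothetical, simplified action type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    CommitInfo,
    SetTransaction(String),
    Add(String),
}

/// Mirrors the diff above: commit info first, then set-transaction actions,
/// then add actions. Swapping the two `chain` calls would emit adds before
/// set-transaction actions instead.
fn ordered_actions(
    set_transactions: impl Iterator<Item = Action>,
    adds: impl Iterator<Item = Action>,
) -> impl Iterator<Item = Action> {
    iter::once(Action::CommitInfo)
        .chain(set_transactions)
        .chain(adds)
}
```

Because `chain` is lazy, neither input iterator is consumed until the combined iterator is driven.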

kernel/src/actions/mod.rs

kernel/src/transaction/mod.rs

zachschuermann · 2025-09-12T04:22:09Z

kernel/src/transaction/mod.rs

+fn with_stats_col(schema: &SchemaRef) -> SchemaRef {
+    let fields = schema
+        .fields()
+        .cloned()


given this PR has run up against many of the existing Schema shortcomings, it would be great to add as much detail/examples/pointers/issues to #1284 as possible :)

zachschuermann · 2025-09-12T04:24:12Z

kernel/src/transaction/mod.rs

+    {
+        let evaluation_handler = engine.evaluation_handler();
+
+        Box::new(add_files_metadata.map(move |add_files_batch| {


I think previously we returned an impl Iterator and avoided the Box?

can possibly just take on in #1274

The problem here was that generate_adds() and generate_adds_with_row_tracking() return different kinds of iterators, so I had to use a trait object and box it. Maybe my Rust typing foo wasn't strong enough though. Do you see a better solution @zachschuermann?
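
One common way to return `impl Iterator` from two branches with different concrete iterator types, without boxing, is a small two-variant enum that forwards `next()`. The sketch below is an assumption about how that could look, not what the PR does: `EitherIter` and `adds_for` are hypothetical names, and the iterators are simplified to strings. The `itertools` crate offers an equivalent `Either` type.

```rust
/// Hypothetical helper: wraps one of two iterator types with the same Item.
enum EitherIter<L, R> {
    Left(L),
    Right(R),
}

impl<L, R> Iterator for EitherIter<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Forward to whichever variant is active.
        match self {
            EitherIter::Left(left) => left.next(),
            EitherIter::Right(right) => right.next(),
        }
    }
}

/// Both branches produce different concrete iterator types, yet the function
/// still returns `impl Iterator` with static dispatch and no heap allocation.
fn adds_for(files: Vec<String>, row_tracking: bool) -> impl Iterator<Item = String> {
    if row_tracking {
        EitherIter::Left(
            files
                .into_iter()
                .map(|file| format!("{file} (row tracking)")),
        )
    } else {
        EitherIter::Right(files.into_iter())
    }
}
```

The cost is a branch on every `next()` call instead of a dynamic dispatch, so whether this beats `Box<dyn Iterator>` in practice would need measuring.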

zachschuermann · 2025-09-12T04:27:48Z

kernel/src/transaction/mod.rs

+
+        // Read the current rowIdHighWaterMark from the snapshot's row tracking domain metadata
+        let row_id_high_water_mark =
+            RowTrackingDomainMetadata::get_high_water_mark(&self.read_snapshot, engine)?;


aside: does spark also do a separate log replay for this? wonder if we can eagerly track domain metadatas we care about during snapshot construction to avoid this? open a follow up to consider?

Tagging @johanl-db and @scovich here since they are more familiar with Spark.

Same discussion came up during the row tracking write path implementation in Java (delta-io/delta#3835 (comment), IIRC the discussion took place in the kernel slack channel)

I believe kernel java now collects domain metadata during the main log replay if possible, and also attempts to get them from the CRC file (delta-io/delta@a59495b)

I can double check what spark is doing, but I believe it is fairly similar: get from CRC file if available, or from snapshot assuming log replay already happened

That would anyway be better suited as a follow up

zachschuermann · 2025-09-12T04:31:57Z

kernel/src/transaction/mod.rs

+
+        // Create a row tracking visitor and visit all files to collect row tracking information
+        let mut row_tracking_visitor = RowTrackingVisitor::new(row_id_high_water_mark);
+        let mut base_row_id_batches = Vec::with_capacity(self.add_files_metadata.len());


couldn't the visitor track the base_row_id_batches internally? (could even pass in length hint if we want)

I refactored the RowTrackingVisitor accordingly. I agree that it clarifies the separation of concerns, but if we are playing code golf, it does not really help.
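
The shape of that refactoring, with the batches owned by the visitor rather than by the caller, might look roughly like this. `BaseRowIdCollector` and its methods are hypothetical names for illustration and do not reflect the actual `RowTrackingVisitor` API.

```rust
/// Hypothetical sketch of a visitor that keeps its per-batch results internally.
struct BaseRowIdCollector {
    /// Next row ID to hand out (one past the current high water mark).
    next_row_id: i64,
    /// One Vec of base row IDs per visited batch, in visit order.
    base_row_id_batches: Vec<Vec<i64>>,
}

impl BaseRowIdCollector {
    /// `num_batches_hint` only pre-sizes the internal Vec; it is not a limit.
    fn new(high_water_mark: Option<i64>, num_batches_hint: usize) -> Self {
        Self {
            next_row_id: high_water_mark.map_or(0, |hwm| hwm + 1),
            base_row_id_batches: Vec::with_capacity(num_batches_hint),
        }
    }

    /// Visits one batch of add file metadata, given its numRecords values.
    fn visit_batch(&mut self, num_records: &[i64]) {
        let mut batch = Vec::with_capacity(num_records.len());
        for &n in num_records {
            batch.push(self.next_row_id);
            self.next_row_id += n;
        }
        self.base_row_id_batches.push(batch);
    }

    /// Highest row ID assigned so far, if any.
    fn high_water_mark(&self) -> Option<i64> {
        if self.next_row_id > 0 {
            Some(self.next_row_id - 1)
        } else {
            None
        }
    }

    /// Consumes the collector and returns the collected batches.
    fn into_batches(self) -> Vec<Vec<i64>> {
        self.base_row_id_batches
    }
}
```

The caller then only creates the collector, feeds it batches, and reads the results back, which is the separation of concerns mentioned above.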

kernel/src/transaction/mod.rs

lbhm · 2025-09-12T08:58:17Z

Thank you @zachschuermann! I addressed the open comments in #1291 and the schema API-related changes in #1266.

github-actions bot assigned lbhm Sep 2, 2025

lbhm marked this pull request as draft September 2, 2025 14:25

github-actions bot added the breaking-change Change that require a major version bump label Sep 2, 2025

scovich reviewed Sep 2, 2025


lbhm mentioned this pull request Sep 3, 2025

feat!: Add numRecords to ADD_FILES_SCHEMA #1235

Merged

lbhm requested a review from scovich September 3, 2025 11:59

johanl-db reviewed Sep 3, 2025


lbhm force-pushed the row-tracking-write-path-new branch from e48bff7 to eaac1df Compare September 4, 2025 14:13

lbhm marked this pull request as ready for review September 4, 2025 14:37

lbhm requested review from nicklan, zachschuermann and OussamaSaoudi September 4, 2025 14:38

lbhm changed the title ~~[WIP] feat: Add row tracking writer feature~~ feat: Add row tracking writer feature Sep 4, 2025

lbhm mentioned this pull request Sep 5, 2025

feat!: Add try_append_columns to EngineData #1190

Merged

johanl-db reviewed Sep 5, 2025


lbhm force-pushed the row-tracking-write-path-new branch from 741ad22 to 439bed6 Compare September 7, 2025 10:56

lbhm mentioned this pull request Sep 8, 2025

Add more row tracking write path tests #1265

Open

lbhm commented Sep 8, 2025


johanl-db approved these changes Sep 8, 2025


scovich approved these changes Sep 8, 2025


lbhm mentioned this pull request Sep 9, 2025

[WIP][Old] Add row tracking writer feature #1156

Closed

2 tasks

feat: Add write support for row tracking

905f405

lbhm force-pushed the row-tracking-write-path-new branch from d31e51b to 905f405 Compare September 9, 2025 11:08

lbhm mentioned this pull request Sep 10, 2025

Support domain metadata writes #1270

Open

zachschuermann reviewed Sep 11, 2025


Address review comments

5fe0ecd

lbhm requested a review from zachschuermann September 11, 2025 16:58

zachschuermann mentioned this pull request Sep 11, 2025

<wip> feat: support writing domain metadata (1/2) #1274

Draft

zachschuermann approved these changes Sep 12, 2025


Merge branch 'main' into row-tracking-write-path-new

5faf2db

zachschuermann added breaking-change Change that require a major version bump and removed breaking-change Change that require a major version bump labels Sep 12, 2025

zachschuermann changed the title ~~feat: Add row tracking writer feature~~ feat!: Add row tracking writer feature Sep 12, 2025

zachschuermann merged commit 53d9ca1 into delta-io:main Sep 12, 2025
21 checks passed

lbhm mentioned this pull request Sep 12, 2025

refactor: Row tracking write cleanup #1291

Open
