feat!: Introduce metadata column API #1266
Conversation
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## main #1266 +/- ##
==========================================
+ Coverage 83.67% 83.68% +0.01%
==========================================
Files 108 108
Lines 25926 26391 +465
Branches 25926 26391 +465
==========================================
+ Hits 21694 22086 +392
- Misses 3144 3190 +46
- Partials 1088 1115 +27
☔ View full report in Codecov by Sentry.
Pull Request Overview
This PR introduces a metadata column API to kernel-rs, following the agreed-upon metadata column API design from kernel-java. The API allows creating and working with special metadata columns that provide additional information about rows in Delta tables.
- Adds support for metadata columns (row_index, row_id, row_commit_version) with validation
- Makes StructType creation fallible to prevent invalid metadata column usage
- Refactors existing code to use new fallible constructors with appropriate error handling (see the sketch after this list)
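To make the shape of the API concrete, here is a minimal, self-contained sketch using simplified stand-in types. The names StructType, StructField, try_new, and the row_index/row_id/row_commit_version specs come from the PR description, but the field layout, error type, and constructor helpers below are assumptions for illustration, not the kernel's actual signatures.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the kernel schema types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MetadataColumnSpec {
    RowIndex,
    RowId,
    RowCommitVersion,
}

#[derive(Debug, Clone)]
struct StructField {
    name: String,
    metadata_spec: Option<MetadataColumnSpec>,
}

impl StructField {
    fn data(name: &str) -> Self {
        Self { name: name.to_string(), metadata_spec: None }
    }
    fn metadata_column(name: &str, spec: MetadataColumnSpec) -> Self {
        Self { name: name.to_string(), metadata_spec: Some(spec) }
    }
}

#[derive(Debug)]
struct StructType {
    fields: Vec<StructField>,
}

impl StructType {
    // Fallible constructor: rejects duplicate field names. The PR's real
    // constructor also rejects duplicate metadata columns (discussed below).
    fn try_new(fields: Vec<StructField>) -> Result<Self, String> {
        let mut names = HashSet::new();
        for f in &fields {
            if !names.insert(f.name.clone()) {
                return Err(format!("duplicate field name: {}", f.name));
            }
        }
        Ok(Self { fields })
    }
}

fn main() {
    // A read schema asking kernel to materialize the row index next to the data.
    let schema = StructType::try_new(vec![
        StructField::data("id"),
        StructField::metadata_column("_row_index", MetadataColumnSpec::RowIndex),
    ])
    .expect("valid schema");
    println!("{schema:?}");
}
```

The real constructors operate on the kernel's typed fields and return DeltaResult, so treat this purely as a shape illustration.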
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.
Summary per file:
File | Description |
---|---|
kernel/src/schema/mod.rs | Core metadata column API implementation with validation |
kernel/src/actions/mod.rs | Adds validation to reject metadata columns in table schemas |
kernel/tests/write.rs | Updates test code to use new fallible StructType constructors |
kernel/tests/read.rs | Updates Schema creation to handle fallible constructor |
Multiple other files | Updates existing code to use new_unchecked() for internal schemas |
Force-pushed from 0a39b5d to 2736ced
@@ -209,6 +210,18 @@ impl Metadata {
        created_time: i64,
        configuration: HashMap<String, String>,
    ) -> DeltaResult<Self> {
        // Validate that the schema does not contain metadata columns
We need to enforce that we never leak metadata columns into the Delta log.
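A minimal sketch of that kind of guard, using placeholder types (Field and validate_no_metadata_columns are made up for illustration; the real check lives in Metadata::try_new and works on the kernel's schema types):

```rust
// Reject table schemas that contain metadata columns, so they can never be
// written into the Delta log.
struct Field {
    name: String,
    is_metadata_column: bool,
}

fn validate_no_metadata_columns(schema: &[Field]) -> Result<(), String> {
    if let Some(field) = schema.iter().find(|f| f.is_metadata_column) {
        return Err(format!(
            "table schema must not contain metadata column '{}'",
            field.name
        ));
    }
    Ok(())
}

fn main() {
    let schema = vec![Field { name: "id".into(), is_metadata_column: false }];
    assert!(validate_no_metadata_columns(&schema).is_ok());
}
```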
@@ -113,11 +116,91 @@ impl AsRef<str> for ColumnMetadataKey {
            Self::IdentityHighWaterMark => "delta.identity.highWaterMark",
            Self::IdentityStart => "delta.identity.start",
            Self::IdentityStep => "delta.identity.step",
            Self::InternalColumn => "delta.isInternalColumn",
This is a pre-factoring for deletion vector and row tracking reads. Kernel might need to add the row index to the read schema even though it was not requested by the user. In this case, we need to mark it accordingly and remove the column before returning to the user.
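A rough sketch of that marking scheme, assuming a simple per-field string metadata map. Only the delta.isInternalColumn key name is taken from the diff; the types and helpers are illustrative.

```rust
use std::collections::HashMap;

// Kernel-injected columns are tagged with the internal-column key so they can
// be stripped before results are handed back to the user.
const INTERNAL_COLUMN_KEY: &str = "delta.isInternalColumn";

#[derive(Debug, Clone)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

fn mark_internal(mut field: Field) -> Field {
    field.metadata.insert(INTERNAL_COLUMN_KEY.to_string(), "true".to_string());
    field
}

fn strip_internal(fields: Vec<Field>) -> Vec<Field> {
    fields
        .into_iter()
        .filter(|f| f.metadata.get(INTERNAL_COLUMN_KEY).map(String::as_str) != Some("true"))
        .collect()
}

fn main() {
    // Kernel needed the row index for deletion vectors, but the user never asked for it.
    let injected = mark_internal(Field { name: "_row_index".into(), metadata: HashMap::new() });
    let user_visible = strip_internal(vec![injected]);
    assert!(user_visible.is_empty());
}
```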
Force-pushed from 1f16cc5 to a274522
Made an initial pass. General direction looks reasonable, but we're materializing too many field lists all over the place.
Thank you for the thorough review @scovich! I resolved or responded to all of your comments. As we discussed offline, I will factor out the StructType constructor-related changes and open up a dedicated PR for them tomorrow. Afterwards, I'll rebase this PR.
Force-pushed from d9ddda8 to 5023fd6
Force-pushed from 5023fd6 to b84c360
@@ -180,6 +180,7 @@ impl TryFrom<Format> for Scalar {
)]
#[internal_api]
pub(crate) struct Metadata {
    // TODO: Make the struct fields private to force using the try_new function.
Metadata is too often constructed directly across the codebase, so I would rather address this in a follow-up PR than make this PR bigger.
Force-pushed from b84c360 to 8ba4de6
LGTM!
}

impl StructType {
    /// Creates a new [`StructType`] from the given fields.
    ///
    /// Returns an error if:
    /// - the schema contains duplicate field names
    /// - the schema contains duplicate metadata columns
ah, different from the first bullet because you could register the same metadata column twice with different names...
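A small illustration of why this is a separate check from plain duplicate-name detection, using tuples of (field name, optional metadata column tag) instead of the real kernel types; validate is a made-up helper.

```rust
use std::collections::HashSet;

// Two distinct validations: duplicate field names, and the same metadata
// column registered twice under *different* names.
fn validate(fields: &[(&str, Option<&str>)]) -> Result<(), String> {
    let mut names = HashSet::new();
    let mut specs = HashSet::new();
    for (name, spec) in fields {
        if !names.insert(*name) {
            return Err(format!("duplicate field name: {name}"));
        }
        if let Some(spec) = spec {
            if !specs.insert(*spec) {
                return Err(format!("metadata column '{spec}' registered more than once"));
            }
        }
    }
    Ok(())
}

fn main() {
    // The names are unique, but row_index is registered twice, so only the
    // second check catches it.
    let fields = [("idx_a", Some("row_index")), ("idx_b", Some("row_index"))];
    assert!(validate(&fields).is_err());
}
```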
kernel/src/schema/mod.rs (Outdated)
// Checks if the `StructType` contains a field with the specified name.
pub(crate) fn contains(&self, name: impl AsRef<str>) -> bool {
    self.fields.contains_key(name.as_ref())
}
nit: Any particular reason this was moved?
It felt cleaner to have all the index_of* and contains* methods co-located 🤷🏼‍♂️
Blocking merge until we sort out a potential bug I couldn't give a suggested fix for.
kernel/src/engine/arrow_utils.rs (Outdated)
if !found_fields.contains(field.name()) {
    if field.nullable {
        if let Some(metadata_spec) = field.get_metadata_column_spec() {
Yikes! While reviewing #1272 I realized that this is probably wrong?
The whole point of a metadata column being identified by metadata is to be immune to name collisions, no?
So shouldn't we process a metadata column as metadata, even if its name happens to match something in the underlying file schema?
If so, we probably need some kind of change around L413-418?
CC @nicklan since he's the one most likely to understand the intricacies of reorder indexes, and what it might mean to inject the metadata column there instead of here.
Hmm should this ever be able to happen given that we don't allow duplicate names across regular and metadata columns?
Thinking more about this, we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.
Do you know a good place for such a check @scovich? I'm not very familiar with this part of the code base yet.
AFAIK, the correct semantics are that a metadata column read should return the requested metadata, regardless of whether that metadata column's name happens to be present in the underlying parquet file. So there's no error to check for -- the field's metadata annotation just takes precedence over the field's name.
This is definitely true when reading by field id (the spec for field ids requires to ignore column names entirely).
Yeah, so the issue (i think) is that we'll mark the metadata column as "found" above if there so happens to be a physical column with the same name. And then we'll read that column instead of generating a metadata column.
I think the right approach here is that match_parquet_fields should ignore metadata columns in requested_schema, and then we're guaranteed that we never look at those fields in the big loop over all the parquet fields; we'll just skip over them like we do any other unselected fields.
we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.
Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?
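A minimal sketch of that matching strategy under stand-in types (RequestedField and match_fields are hypothetical; the real match_parquet_fields in kernel/src/engine/arrow_utils.rs tracks considerably more state):

```rust
// Fields tagged as metadata columns are skipped when pairing the requested
// schema against the physical parquet columns, so a parquet column that merely
// shares the name can never shadow them.
#[derive(Debug)]
struct RequestedField {
    name: String,
    is_metadata_column: bool,
}

// Returns, for each requested field, the index of the matching parquet column,
// or None for metadata columns (they are generated, not read).
fn match_fields(requested: &[RequestedField], parquet_columns: &[&str]) -> Vec<Option<usize>> {
    requested
        .iter()
        .map(|field| {
            if field.is_metadata_column {
                None
            } else {
                parquet_columns.iter().position(|c| *c == field.name)
            }
        })
        .collect()
}

fn main() {
    let requested = [
        RequestedField { name: "id".into(), is_metadata_column: false },
        // Even though the file has a physical "row_index" column, the metadata
        // column is never matched against it.
        RequestedField { name: "row_index".into(), is_metadata_column: true },
    ];
    let matches = match_fields(&requested, &["id", "row_index"]);
    assert_eq!(matches, vec![Some(0), None]);
}
```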
I extended match_parquet_fields() to filter out metadata columns. Could you verify that my understanding of the method and my change is correct @nicklan?
Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?
I agree that we might not want to check this in the Parquet reader, but I would argue that we should forbid this at the level of kernel. If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table. Not because kernel couldn't handle that (your suggested fix makes sure that we return the metadata column), but because this is semantically weird.
I extended StateInfo::try_new to ensure this.
If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table
CDC has this policy, but it has caused problems because apparently some tables do have column names that match the special CDC column names. Result: Users simply cannot enable CDC on such tables.
I actually had to do some very annoying and painful work in spark to handle metadata column name collisions as well, and would rather we didn't go back down that path in kernel.
Also, look at it this way: A column is a metadata column because it is annotated as such. If we see a metadata column, somebody had to do that on purpose, there's no mechanism that automatically turns an arbitrary column name into a metadata column.
So, if the metadata column does collide with a table schema name or a parquet schema name, we have one of two possibilities:
- The table/file has a very annoying column name (but it's legal... neither engine nor Delta spec forbids users to create a column called e.g. _metadata.row_index).
- The engine/connector has a really bad bug.
IMO, kernel already relies on engine not to have egregious correctness bugs, so we should let the engine decide how to resolve an arbitrary column name from a query into a kernel schema field. Most likely, name collisions will result in selecting the non-metadata version of the column, perhaps with some special syntax to force treating it as a metadata column name (and mapping that name to an actual metadata column spec in kernel).
Overall, I like @nicklan's suggestion -- handle metadata columns similarly to partition columns, and filter them out of the list of columns we even search the parquet schema for. If the row index metadata column was requested, just add the corresponding transform directly. Any other type returns an error.
Fair point. If the others agree that we should allow the engine to purposefully "overwrite" physical columns with metadata columns, I'll remove the checks I added to scan.rs.
See also #1272 (comment)
Left some comments. I think I understand the issue correctly, but lmk if I'm missing something.
Force-pushed from 3318bfe to 9de2423
kernel/src/scan/mod.rs (Outdated)
@@ -868,7 +868,9 @@ impl StateInfo {
     column_mapping_mode: ColumnMappingMode,
 ) -> DeltaResult<Self> {
     let mut have_partition_cols = false;
-    let mut read_fields = Vec::with_capacity(logical_schema.fields.len());
+    let mut read_fields = Vec::with_capacity(logical_schema.fields_len());
+    let mut read_field_names = HashSet::with_capacity(logical_schema.fields_len());
See #1266 (comment) why I think that we need the two additional checks in this method.
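For context, here is a rough sketch of the kind of collision check being discussed, with a placeholder Field type and a standalone helper rather than the actual StateInfo::try_new logic:

```rust
use std::collections::HashSet;

// A metadata column may not reuse the name of a regular data column that will
// actually be read from the file.
struct Field {
    name: String,
    is_metadata_column: bool,
}

fn check_metadata_name_collisions(logical: &[Field]) -> Result<(), String> {
    let data_names: HashSet<&str> = logical
        .iter()
        .filter(|f| !f.is_metadata_column)
        .map(|f| f.name.as_str())
        .collect();
    for field in logical.iter().filter(|f| f.is_metadata_column) {
        if data_names.contains(field.name.as_str()) {
            return Err(format!(
                "metadata column '{}' collides with a data column",
                field.name
            ));
        }
    }
    Ok(())
}

fn main() {
    let logical = [
        Field { name: "revenue".into(), is_metadata_column: false },
        Field { name: "revenue".into(), is_metadata_column: true },
    ];
    assert!(check_metadata_name_collisions(&logical).is_err());
}
```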
// We use indexmap to preserve the order of fields as they are defined in the schema
// while also allowing for fast lookup by name. The alternative, a linear search
// for each field by name, would be potentially quite expensive for large schemas.
-pub fields: IndexMap<String, StructField>,
+fields: IndexMap<String, StructField>,
I am making the struct fields private to prevent users from bypassing the constructor methods of StructType. This required me to implement a few additional helper methods below.
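A sketch of that encapsulation, assuming the indexmap crate as the diff does; the helper names mirror ones mentioned in this thread (fields_len, contains) but are not the exact kernel API.

```rust
use indexmap::IndexMap;

pub struct StructField {
    pub name: String,
}

pub struct StructType {
    // Private: callers must go through the validating constructors.
    fields: IndexMap<String, StructField>,
}

impl StructType {
    pub fn fields(&self) -> impl Iterator<Item = &StructField> {
        self.fields.values()
    }

    pub fn fields_len(&self) -> usize {
        self.fields.len()
    }

    pub fn contains(&self, name: impl AsRef<str>) -> bool {
        self.fields.contains_key(name.as_ref())
    }
}

fn main() {
    let mut fields = IndexMap::new();
    fields.insert("id".to_string(), StructField { name: "id".to_string() });
    // Direct construction works here only because this example is a single module.
    let schema = StructType { fields };
    assert!(schema.contains("id"));
    assert_eq!(schema.fields_len(), 1);
    assert_eq!(schema.fields().count(), 1);
}
```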
pub fn into_fields(self) -> impl ExactSizeIterator<Item = StructField> {
    self.fields.into_values()
}
This helps to fix the schema.fields().cloned() pattern described in #1284 whenever the StructType is not wrapped in an Arc.
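A short, self-contained illustration of why a consuming into_fields() helps when the caller owns the StructType; it again uses a simplified stand-in type and the indexmap crate rather than the kernel's actual schema type.

```rust
use indexmap::IndexMap;

pub struct StructField {
    pub name: String,
}

pub struct StructType {
    fields: IndexMap<String, StructField>,
}

impl StructType {
    // Consuming accessor: yields owned fields without cloning them.
    pub fn into_fields(self) -> impl ExactSizeIterator<Item = StructField> {
        self.fields.into_values()
    }
}

fn main() {
    let mut fields = IndexMap::new();
    fields.insert("id".to_string(), StructField { name: "id".to_string() });
    let schema = StructType { fields };
    // Ownership transfer instead of the schema.fields().cloned() pattern.
    let owned: Vec<StructField> = schema.into_fields().collect();
    assert_eq!(owned.len(), 1);
}
```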
Small bug in the way you added filtering to match_parquet_fields, but other than that I think we're about good to go.
If there's not a test that's failing right now though, could you add one that ensures that match_parquet_fields returns with kernel_field_info set to None for any metadata cols in the schema?
@nicklan I renamed
thanks! one small nit, but lgtm!
NOTE: The PR is currently stacked on #1278.
What changes are proposed in this pull request?
This PR introduces a metadata column API to kernel-rs. The API design aims to replicate the recently agreed-upon metadata column API in kernel-java.
This PR affects the following public APIs
This PR adds new methods to StructType.
How was this change tested?
New UT.