Skip to content

Conversation

lbhm
Copy link
Collaborator

@lbhm lbhm commented Sep 8, 2025

NOTE: The PR is currently stacked on #1278.

What changes are proposed in this pull request?

This PR introduces a metadata column API to kernel-rs. The API design aims to replicate the recently agreed-upon metadata column API in kernel-java.

This PR affects the following public APIs

This PR adds new methods to StructType.

How was this change tested?

New UT.

@lbhm lbhm marked this pull request as draft September 8, 2025 09:06
Copy link

codecov bot commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 90.06135% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.68%. Comparing base (53d9ca1) to head (6db7fd4).

Files with missing lines Patch % Lines
kernel/src/schema/mod.rs 88.81% 30 Missing and 20 partials ⚠️
kernel/src/scan/mod.rs 80.00% 10 Missing and 3 partials ⚠️
kernel/src/actions/mod.rs 87.50% 4 Missing and 1 partial ⚠️
kernel/src/engine/arrow_conversion.rs 33.33% 2 Missing and 2 partials ⚠️
kernel/src/engine/arrow_utils.rs 94.66% 3 Missing and 1 partial ⚠️
ffi/src/schema.rs 0.00% 1 Missing ⚠️
ffi/src/test_ffi.rs 0.00% 1 Missing ⚠️
ffi/src/transaction/mod.rs 92.30% 1 Missing ⚠️
kernel/src/table_changes/mod.rs 66.66% 1 Missing ⚠️
kernel/src/table_changes/scan.rs 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1266      +/-   ##
==========================================
+ Coverage   83.67%   83.68%   +0.01%     
==========================================
  Files         108      108              
  Lines       25926    26391     +465     
  Branches    25926    26391     +465     
==========================================
+ Hits        21694    22086     +392     
- Misses       3144     3190      +46     
- Partials     1088     1115      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 8, 2025
@lbhm lbhm changed the title [WIP] feat!: Introduce metadata column API feat!: Introduce metadata column API Sep 8, 2025
@lbhm lbhm requested a review from Copilot September 8, 2025 15:10
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a metadata column API to kernel-rs, following the agreed-upon metadata column API design from kernel-java. The API allows creating and working with special metadata columns that provide additional information about rows in Delta tables.

  • Adds support for metadata columns (row_index, row_id, row_commit_version) with validation
  • Makes StructType creation fallible to prevent invalid metadata column usage
  • Refactors existing code to use new fallible constructors with appropriate error handling

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
kernel/src/schema/mod.rs Core metadata column API implementation with validation
kernel/src/actions/mod.rs Adds validation to reject metadata columns in table schemas
kernel/tests/write.rs Updates test code to use new fallible StructType constructors
kernel/tests/read.rs Updates Schema creation to handle fallible constructor
Multiple other files Updates existing code to use new_unchecked() for internal schemas

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@lbhm lbhm force-pushed the metadata-column-api branch from 0a39b5d to 2736ced Compare September 8, 2025 15:57
@@ -209,6 +210,18 @@ impl Metadata {
created_time: i64,
configuration: HashMap<String, String>,
) -> DeltaResult<Self> {
// Validate that the schema does not contain metadata columns
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to enforce that we never leak metadata columns into the Delta log.

@@ -113,11 +116,91 @@ impl AsRef<str> for ColumnMetadataKey {
Self::IdentityHighWaterMark => "delta.identity.highWaterMark",
Self::IdentityStart => "delta.identity.start",
Self::IdentityStep => "delta.identity.step",
Self::InternalColumn => "delta.isInternalColumn",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-factoring for deletion vector and row tracking reads. Kernel might need to add the row index to the read schema even though it was not requested by the user. In this case, we need to mark it accordingly and remove the column before returning to the user.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an initial pass. General direction looks reasonable, but we're materializing too many field lists all over the place.

@lbhm
Copy link
Collaborator Author

lbhm commented Sep 9, 2025

Thank you for the thorough review @scovich! I resolved or responded to all of your comments. As we discussed offline, I will factor out the StructType constructor-related changes and open up a dedicated PR for them tomorrow. Afterwards, I'll rebase this PR.

@lbhm lbhm force-pushed the metadata-column-api branch 2 times, most recently from d9ddda8 to 5023fd6 Compare September 10, 2025 12:00
@lbhm lbhm requested a review from scovich September 10, 2025 12:00
@lbhm lbhm force-pushed the metadata-column-api branch from 5023fd6 to b84c360 Compare September 10, 2025 12:03
@lbhm lbhm marked this pull request as ready for review September 10, 2025 12:05
@@ -180,6 +180,7 @@ impl TryFrom<Format> for Scalar {
)]
#[internal_api]
pub(crate) struct Metadata {
// TODO: Make the struct fields private to force using the try_new function.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metadata is too often constructed directly across the codebase, so I would rather address this in a follow-up PR than make this PR bigger.

@lbhm lbhm force-pushed the metadata-column-api branch from b84c360 to 8ba4de6 Compare September 11, 2025 07:41
@lbhm lbhm requested a review from zachschuermann September 11, 2025 08:00
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}

impl StructType {
/// Creates a new [`StructType`] from the given fields.
///
/// Returns an error if:
/// - the schema contains duplicate field names
/// - the schema contains duplicate metadata columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, different from the first bullet because you could register the same metadata column twice with different names...

Comment on lines 456 to 429
// Checks if the `StructType` contains a field with the specified name.
pub(crate) fn contains(&self, name: impl AsRef<str>) -> bool {
self.fields.contains_key(name.as_ref())
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Any particular reason this was moved?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt cleaner to have all the index_of* and contains* methods co-located 🤷🏼‍♂️

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking merge until we sort out a potential bug I couldn't give a suggested fix for.

Comment on lines 513 to 514
if !found_fields.contains(field.name()) {
if field.nullable {
if let Some(metadata_spec) = field.get_metadata_column_spec() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes! While reviewing #1272 I realized that this is probably wrong?

The whole point of a metadata column being identified by metadata is to be immune to name collisions, no?
So shouldn't we process a metadata column as metadata, even if its name happens to match something in the underlying file schema?

If so, we probably need some kind of change around L413-418?

CC @nicklan since he's the one most likely to understand the intricacies of reorder indexes, and what it might mean to inject the metadata column there instead of here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm should this ever be able to happen given that we don't allow duplicate names across regular and metadata columns?

Thinking more about this, we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.

Do you know a good place for such a check @scovich? I'm not very familiar with this part of the code base yet.

Copy link
Collaborator

@scovich scovich Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the correct semantics are a metadata column read should return the requested metadata, regardless of whether that metadata column happens to have a name that happens to be present in the underlying parquet file. So there's no error to check for -- the field's metadata annotation just takes precedence over the field's name.

This is definitely true when reading by field id (the spec for field ids requires to ignore column names entirely).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the issue (i think) is that we'll mark the metadata column as "found" above if there so happens to be a physical column with the same name. And then we'll read that column instead of generating a metadata column.

I think the right approach here is that match_parquet_fields should ignore metadata columns in requested_schema, and then we're guaranteed that we never look at those fields in the big loop over all the parquet fields, we'll just skip over them like we do any other unselected fields.

we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.

Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?

Copy link
Collaborator Author

@lbhm lbhm Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended match_parquet_fields() to filter out metadata columns. Could you verify that my understanding of the method and my change is correct @nicklan?

Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?

I agree that we might not want to check this in the Parquet reader, but I would argue that we should forbid this at the level of kernel. If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table. Not because kernel couldn't handle that (your suggested fix makes sure that we return the metadata column), but because this is semantically weird.

I extended StateInfo::try_new to ensure this.

Copy link
Collaborator

@scovich scovich Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table

CDC has this policy, but it has caused problems because apparently some tables do have column names that match the special CDC column names. Result: Users simply cannot enable CDC on such tables.

I actually had to do some very annoying and painful work in spark to handle metadata column name collisions as well, and would rather we didn't go back down that path in kernel.

Also, look at it this way: A column is a metadata column because it is annotated as such. If we see a metadata column, somebody had to do that on purpose, there's no mechanism that automatically turns an arbitrary column name into a metadata column.

So, if the metadata column does collide with a table schema name or a parquet schema name, we have one of two possibilities:

  • The table/file has a very annoying column name (but it's legal... neither engine nor Delta spec forbids users to create a column called e.g. _metadata.row_index).
  • The engine/connector has a really bad bug.

IMO, kernel already relies on engine not to have egregious correctness bugs, so we should let the engine decide how to resolve an arbitrary column name from a query into a kernel schema field. Most likely, name collisions will result in selecting the non-metadata version of the column, perhaps with some special syntax to force treating it as a metadata column name (and mapping that name to an actual metadata column spec in kernel).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like @nicklan suggestion -- handle metadata columns similarly to partition columns, and filter them out of the list of columns we even search the parquet schema for. If the row index metadata column was requested, just add the corresponding transform directly. Any other type returns an error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. If the others agree that we should allow the engine to purposefully "overwrite" physical columns with metadata columns, I'll remove the checks I added to scan.rs.

Copy link
Collaborator

@scovich scovich Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also #1272 (comment)

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. I think I understand the issue correctly, but lmk if I'm missing something.

Comment on lines 513 to 514
if !found_fields.contains(field.name()) {
if field.nullable {
if let Some(metadata_spec) = field.get_metadata_column_spec() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the issue (i think) is that we'll mark the metadata column as "found" above if there so happens to be a physical column with the same name. And then we'll read that column instead of generating a metadata column.

I think the right approach here is that match_parquet_fields should ignore metadata columns in requested_schema, and then we're guaranteed that we never look at those fields in the big loop over all the parquet fields, we'll just skip over them like we do any other unselected fields.

we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.

Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?

@lbhm lbhm force-pushed the metadata-column-api branch from 3318bfe to 9de2423 Compare September 12, 2025 13:37
@@ -868,7 +868,9 @@ impl StateInfo {
column_mapping_mode: ColumnMappingMode,
) -> DeltaResult<Self> {
let mut have_partition_cols = false;
let mut read_fields = Vec::with_capacity(logical_schema.fields.len());
let mut read_fields = Vec::with_capacity(logical_schema.fields_len());
let mut read_field_names = HashSet::with_capacity(logical_schema.fields_len());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1266 (comment) why I think that we need the two additional checks in this method.

// We use indexmap to preserve the order of fields as they are defined in the schema
// while also allowing for fast lookup by name. The alternative is to do a linear search
// for each field by name would be potentially quite expensive for large schemas.
pub fields: IndexMap<String, StructField>,
fields: IndexMap<String, StructField>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am making the struct fields private to prevent users from bypassing the constructor methods of StructType. This required me to implement a few additional helper methods below.

Comment on lines +672 to +674
pub fn into_fields(self) -> impl ExactSizeIterator<Item = StructField> {
self.fields.into_values()
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps to fix the schema.fields().cloned() pattern described in #1284 whenever the StructType is not wrapped in an Arc.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small bug in the way you added filtering to match_parquet_fields, but other than that I think we're about good to go.

If there's not a test that's failing right now though, could you add one that ensures that match_parquet_fields returns with kernel_field_info set to None for any metadata cols in the schema

@lbhm
Copy link
Collaborator Author

lbhm commented Sep 12, 2025

@nicklan I renamed fields_len, fixed the bug in match_parquet_fields as you suggested, and added a new unit test. Could you take another look?

@lbhm lbhm requested review from nicklan and scovich September 12, 2025 22:06
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! one small nit, but lgtm!

@lbhm
Copy link
Collaborator Author

lbhm commented Sep 13, 2025

Thank you @nicklan! @scovich could you also take another look and unblock or request specific changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that require a major version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants