feat!: Introduce metadata column API #1266
Conversation
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## main #1266 +/- ##
==========================================
+ Coverage 83.67% 83.68% +0.01%
==========================================
Files 108 108
Lines 25926 26391 +465
Branches 25926 26391 +465
==========================================
+ Hits 21694 22086 +392
- Misses 3144 3190 +46
- Partials 1088 1115 +27
☔ View full report in Codecov by Sentry.
Pull Request Overview
This PR introduces a metadata column API to kernel-rs, following the agreed-upon metadata column API design from kernel-java. The API allows creating and working with special metadata columns that provide additional information about rows in Delta tables.
- Adds support for metadata columns (row_index, row_id, row_commit_version) with validation
- Makes StructType creation fallible to prevent invalid metadata column usage
- Refactors existing code to use new fallible constructors with appropriate error handling (see the sketch after this list)
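To make the shape of the API concrete, here is a minimal, self-contained sketch using simplified stand-in types. The names StructType, StructField, try_new, and the row_index/row_id/row_commit_version specs come from the PR description, but the field layout, error type, and constructor helpers below are assumptions for illustration, not the kernel's actual signatures.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the kernel schema types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MetadataColumnSpec {
    RowIndex,
    RowId,
    RowCommitVersion,
}

#[derive(Debug, Clone)]
struct StructField {
    name: String,
    metadata_spec: Option<MetadataColumnSpec>,
}

impl StructField {
    fn data(name: &str) -> Self {
        Self { name: name.to_string(), metadata_spec: None }
    }
    fn metadata_column(name: &str, spec: MetadataColumnSpec) -> Self {
        Self { name: name.to_string(), metadata_spec: Some(spec) }
    }
}

#[derive(Debug)]
struct StructType {
    fields: Vec<StructField>,
}

impl StructType {
    // Fallible constructor: rejects duplicate field names. The PR's real
    // constructor also rejects duplicate metadata columns (discussed below).
    fn try_new(fields: Vec<StructField>) -> Result<Self, String> {
        let mut names = HashSet::new();
        for f in &fields {
            if !names.insert(f.name.clone()) {
                return Err(format!("duplicate field name: {}", f.name));
            }
        }
        Ok(Self { fields })
    }
}

fn main() {
    // A read schema asking kernel to materialize the row index next to the data.
    let schema = StructType::try_new(vec![
        StructField::data("id"),
        StructField::metadata_column("_row_index", MetadataColumnSpec::RowIndex),
    ])
    .expect("valid schema");
    println!("{schema:?}");
}
```

The real constructors operate on the kernel's typed fields and return DeltaResult, so treat this purely as a shape illustration.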
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.
Summary per file:
File | Description |
---|---|
kernel/src/schema/mod.rs | Core metadata column API implementation with validation |
kernel/src/actions/mod.rs | Adds validation to reject metadata columns in table schemas |
kernel/tests/write.rs | Updates test code to use new fallible StructType constructors |
kernel/tests/read.rs | Updates Schema creation to handle fallible constructor |
Multiple other files | Updates existing code to use new_unchecked() for internal schemas |
Force-pushed from 0a39b5d to 2736ced
@@ -209,6 +210,18 @@ impl Metadata {
        created_time: i64,
        configuration: HashMap<String, String>,
    ) -> DeltaResult<Self> {
        // Validate that the schema does not contain metadata columns
We need to enforce that we never leak metadata columns into the Delta log.
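A minimal sketch of that kind of guard, using placeholder types (Field and validate_no_metadata_columns are made up for illustration; the real check lives in Metadata::try_new and works on the kernel's schema types):

```rust
// Reject table schemas that contain metadata columns, so they can never be
// written into the Delta log.
struct Field {
    name: String,
    is_metadata_column: bool,
}

fn validate_no_metadata_columns(schema: &[Field]) -> Result<(), String> {
    if let Some(field) = schema.iter().find(|f| f.is_metadata_column) {
        return Err(format!(
            "table schema must not contain metadata column '{}'",
            field.name
        ));
    }
    Ok(())
}

fn main() {
    let schema = vec![Field { name: "id".into(), is_metadata_column: false }];
    assert!(validate_no_metadata_columns(&schema).is_ok());
}
```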
@@ -113,11 +116,91 @@ impl AsRef<str> for ColumnMetadataKey {
            Self::IdentityHighWaterMark => "delta.identity.highWaterMark",
            Self::IdentityStart => "delta.identity.start",
            Self::IdentityStep => "delta.identity.step",
            Self::InternalColumn => "delta.isInternalColumn",
This is a pre-factoring for deletion vector and row tracking reads. Kernel might need to add the row index to the read schema even though it was not requested by the user. In this case, we need to mark it accordingly and remove the column before returning to the user.
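A rough sketch of that marking scheme, assuming a simple per-field string metadata map. Only the delta.isInternalColumn key name is taken from the diff; the types and helpers are illustrative.

```rust
use std::collections::HashMap;

// Kernel-injected columns are tagged with the internal-column key so they can
// be stripped before results are handed back to the user.
const INTERNAL_COLUMN_KEY: &str = "delta.isInternalColumn";

#[derive(Debug, Clone)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

fn mark_internal(mut field: Field) -> Field {
    field.metadata.insert(INTERNAL_COLUMN_KEY.to_string(), "true".to_string());
    field
}

fn strip_internal(fields: Vec<Field>) -> Vec<Field> {
    fields
        .into_iter()
        .filter(|f| f.metadata.get(INTERNAL_COLUMN_KEY).map(String::as_str) != Some("true"))
        .collect()
}

fn main() {
    // Kernel needed the row index for deletion vectors, but the user never asked for it.
    let injected = mark_internal(Field { name: "_row_index".into(), metadata: HashMap::new() });
    let user_visible = strip_internal(vec![injected]);
    assert!(user_visible.is_empty());
}
```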
Force-pushed from 1f16cc5 to a274522
Made an initial pass. General direction looks reasonable, but we're materializing too many field lists all over the place.
Thank you for the thorough review @scovich! I resolved or responded to all of your comments. As we discussed offline, I will factor out the StructType constructor-related changes and open up a dedicated PR for them tomorrow. Afterwards, I'll rebase this PR.
Force-pushed from d9ddda8 to 5023fd6
Force-pushed from 5023fd6 to b84c360
@@ -180,6 +180,7 @@ impl TryFrom<Format> for Scalar {
)]
#[internal_api]
pub(crate) struct Metadata {
    // TODO: Make the struct fields private to force using the try_new function.
Metadata is too often constructed directly across the codebase, so I would rather address this in a follow-up PR than make this PR bigger.
Force-pushed from b84c360 to 8ba4de6
LGTM!
}

impl StructType {
    /// Creates a new [`StructType`] from the given fields.
    ///
    /// Returns an error if:
    /// - the schema contains duplicate field names
    /// - the schema contains duplicate metadata columns
ah, different from the first bullet because you could register the same metadata column twice with different names...
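A small illustration of why this is a separate check from plain duplicate-name detection, using tuples of (field name, optional metadata column tag) instead of the real kernel types; validate is a made-up helper.

```rust
use std::collections::HashSet;

// Two distinct validations: duplicate field names, and the same metadata
// column registered twice under *different* names.
fn validate(fields: &[(&str, Option<&str>)]) -> Result<(), String> {
    let mut names = HashSet::new();
    let mut specs = HashSet::new();
    for (name, spec) in fields {
        if !names.insert(*name) {
            return Err(format!("duplicate field name: {name}"));
        }
        if let Some(spec) = spec {
            if !specs.insert(*spec) {
                return Err(format!("metadata column '{spec}' registered more than once"));
            }
        }
    }
    Ok(())
}

fn main() {
    // The names are unique, but row_index is registered twice, so only the
    // second check catches it.
    let fields = [("idx_a", Some("row_index")), ("idx_b", Some("row_index"))];
    assert!(validate(&fields).is_err());
}
```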
kernel/src/schema/mod.rs (Outdated)
// Checks if the `StructType` contains a field with the specified name.
pub(crate) fn contains(&self, name: impl AsRef<str>) -> bool {
    self.fields.contains_key(name.as_ref())
}
nit: Any particular reason this was moved?
It felt cleaner to have all the index_of* and contains* methods co-located 🤷🏼‍♂️
Blocking merge until we sort out a potential bug I couldn't give a suggested fix for.
kernel/src/engine/arrow_utils.rs (Outdated)
if !found_fields.contains(field.name()) {
    if field.nullable {
        if let Some(metadata_spec) = field.get_metadata_column_spec() {
Yikes! While reviewing #1272 I realized that this is probably wrong?
The whole point of a metadata column being identified by metadata is to be immune to name collisions, no?
So shouldn't we process a metadata column as metadata, even if its name happens to match something in the underlying file schema?
If so, we probably need some kind of change around L413-418?
CC @nicklan since he's the one most likely to understand the intricacies of reorder indexes, and what it might mean to inject the metadata column there instead of here.
Hmm should this ever be able to happen given that we don't allow duplicate names across regular and metadata columns?
Thinking more about this, we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.
Do you know a good place for such a check @scovich? I'm not very familiar with this part of the code base yet.
AFAIK, the correct semantics are that a metadata column read should return the requested metadata, regardless of whether that metadata column's name happens to be present in the underlying parquet file. So there's no error to check for -- the field's metadata annotation just takes precedence over the field's name.
This is definitely true when reading by field id (the spec for field ids requires to ignore column names entirely).
Yeah, so the issue (i think) is that we'll mark the metadata column as "found" above if there so happens to be a physical column with the same name. And then we'll read that column instead of generating a metadata column.
I think the right approach here is that match_parquet_fields should ignore metadata columns in requested_schema, and then we're guaranteed that we never look at those fields in the big loop over all the parquet fields; we'll just skip over them like we do any other unselected fields.
we might want to add an additional check somewhere in the parquet reader that goes over all columns and verifies that if a column is a metadata column, it doesn't show up in the actual data columns.
Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?
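A minimal sketch of that matching strategy under stand-in types (RequestedField and match_fields are hypothetical; the real match_parquet_fields in kernel/src/engine/arrow_utils.rs tracks considerably more state):

```rust
// Fields tagged as metadata columns are skipped when pairing the requested
// schema against the physical parquet columns, so a parquet column that merely
// shares the name can never shadow them.
#[derive(Debug)]
struct RequestedField {
    name: String,
    is_metadata_column: bool,
}

// Returns, for each requested field, the index of the matching parquet column,
// or None for metadata columns (they are generated, not read).
fn match_fields(requested: &[RequestedField], parquet_columns: &[&str]) -> Vec<Option<usize>> {
    requested
        .iter()
        .map(|field| {
            if field.is_metadata_column {
                None
            } else {
                parquet_columns.iter().position(|c| *c == field.name)
            }
        })
        .collect()
}

fn main() {
    let requested = [
        RequestedField { name: "id".into(), is_metadata_column: false },
        // Even though the file has a physical "row_index" column, the metadata
        // column is never matched against it.
        RequestedField { name: "row_index".into(), is_metadata_column: true },
    ];
    let matches = match_fields(&requested, &["id", "row_index"]);
    assert_eq!(matches, vec![Some(0), None]);
}
```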
I extended match_parquet_fields() to filter out metadata columns. Could you verify that my understanding of the method and my change is correct @nicklan?
Do we need to verify this? I mean, it probably means there's something funky with the parquet, but in general we should probably just ignore such columns?
I agree that we might not want to check this in the Parquet reader, but I would argue that we should forbid this at the level of kernel. If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table. Not because kernel couldn't handle that (your suggested fix makes sure that we return the metadata column), but because this is semantically weird.
I extended StateInfo::try_new to ensure this.
If a table has a column "revenue", the user should not be allowed to call a metadata column "revenue" and request it from the table
CDC has this policy, but it has caused problems because apparently some tables do have column names that match the special CDC column names. Result: Users simply cannot enable CDC on such tables.
I actually had to do some very annoying and painful work in spark to handle metadata column name collisions as well, and would rather we didn't go back down that path in kernel.
Also, look at it this way: A column is a metadata column because it is annotated as such. If we see a metadata column, somebody had to do that on purpose, there's no mechanism that automatically turns an arbitrary column name into a metadata column.
So, if the metadata column does collide with a table schema name or a parquet schema name, we have one of two possibilities:
- The table/file has a very annoying column name (but it's legal... neither engine nor Delta spec forbids users to create a column called e.g. _metadata.row_index).
- The engine/connector has a really bad bug.
IMO, kernel already relies on engine not to have egregious correctness bugs, so we should let the engine decide how to resolve an arbitrary column name from a query into a kernel schema field. Most likely, name collisions will result in selecting the non-metadata version of the column, perhaps with some special syntax to force treating it as a metadata column name (and mapping that name to an actual metadata column spec in kernel).
Overall, I like @nicklan's suggestion -- handle metadata columns similarly to partition columns, and filter them out of the list of columns we even search the parquet schema for. If the row index metadata column was requested, just add the corresponding transform directly. Any other type returns an error.
Fair point. If the others agree that we should allow the engine to purposefully "overwrite" physical columns with metadata columns, I'll remove the checks I added to scan.rs.
See also #1272 (comment)
Left some comments. I think I understand the issue correctly, but lmk if I'm missing something.
Force-pushed from 3318bfe to 9de2423
kernel/src/scan/mod.rs (Outdated)
@@ -868,7 +868,9 @@ impl StateInfo {
     column_mapping_mode: ColumnMappingMode,
 ) -> DeltaResult<Self> {
     let mut have_partition_cols = false;
-    let mut read_fields = Vec::with_capacity(logical_schema.fields.len());
+    let mut read_fields = Vec::with_capacity(logical_schema.fields_len());
+    let mut read_field_names = HashSet::with_capacity(logical_schema.fields_len());
See #1266 (comment) why I think that we need the two additional checks in this method.
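For context, here is a rough sketch of the kind of collision check being discussed, with a placeholder Field type and a standalone helper rather than the actual StateInfo::try_new logic:

```rust
use std::collections::HashSet;

// A metadata column may not reuse the name of a regular data column that will
// actually be read from the file.
struct Field {
    name: String,
    is_metadata_column: bool,
}

fn check_metadata_name_collisions(logical: &[Field]) -> Result<(), String> {
    let data_names: HashSet<&str> = logical
        .iter()
        .filter(|f| !f.is_metadata_column)
        .map(|f| f.name.as_str())
        .collect();
    for field in logical.iter().filter(|f| f.is_metadata_column) {
        if data_names.contains(field.name.as_str()) {
            return Err(format!(
                "metadata column '{}' collides with a data column",
                field.name
            ));
        }
    }
    Ok(())
}

fn main() {
    let logical = [
        Field { name: "revenue".into(), is_metadata_column: false },
        Field { name: "revenue".into(), is_metadata_column: true },
    ];
    assert!(check_metadata_name_collisions(&logical).is_err());
}
```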
// We use indexmap to preserve the order of fields as they are defined in the schema
// while also allowing for fast lookup by name. The alternative, a linear search
// for each field by name, would be potentially quite expensive for large schemas.
-pub fields: IndexMap<String, StructField>,
+fields: IndexMap<String, StructField>,
I am making the struct fields private to prevent users from bypassing the constructor methods of StructType. This required me to implement a few additional helper methods below.
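A sketch of that encapsulation, assuming the indexmap crate as the diff does; the helper names mirror ones mentioned in this thread (fields_len, contains) but are not the exact kernel API.

```rust
use indexmap::IndexMap;

pub struct StructField {
    pub name: String,
}

pub struct StructType {
    // Private: callers must go through the validating constructors.
    fields: IndexMap<String, StructField>,
}

impl StructType {
    pub fn fields(&self) -> impl Iterator<Item = &StructField> {
        self.fields.values()
    }

    pub fn fields_len(&self) -> usize {
        self.fields.len()
    }

    pub fn contains(&self, name: impl AsRef<str>) -> bool {
        self.fields.contains_key(name.as_ref())
    }
}

fn main() {
    let mut fields = IndexMap::new();
    fields.insert("id".to_string(), StructField { name: "id".to_string() });
    // Direct construction works here only because this example is a single module.
    let schema = StructType { fields };
    assert!(schema.contains("id"));
    assert_eq!(schema.fields_len(), 1);
    assert_eq!(schema.fields().count(), 1);
}
```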
pub fn into_fields(self) -> impl ExactSizeIterator<Item = StructField> {
    self.fields.into_values()
}
This helps to fix the schema.fields().cloned() pattern described in #1284 whenever the StructType is not wrapped in an Arc.
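A short, self-contained illustration of why a consuming into_fields() helps when the caller owns the StructType; it again uses a simplified stand-in type and the indexmap crate rather than the kernel's actual schema type.

```rust
use indexmap::IndexMap;

pub struct StructField {
    pub name: String,
}

pub struct StructType {
    fields: IndexMap<String, StructField>,
}

impl StructType {
    // Consuming accessor: yields owned fields without cloning them.
    pub fn into_fields(self) -> impl ExactSizeIterator<Item = StructField> {
        self.fields.into_values()
    }
}

fn main() {
    let mut fields = IndexMap::new();
    fields.insert("id".to_string(), StructField { name: "id".to_string() });
    let schema = StructType { fields };
    // Ownership transfer instead of the schema.fields().cloned() pattern.
    let owned: Vec<StructField> = schema.into_fields().collect();
    assert_eq!(owned.len(), 1);
}
```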
Small bug in the way you added filtering to match_parquet_fields, but other than that I think we're about good to go.
If there's not a test that's failing right now though, could you add one that ensures that match_parquet_fields returns with kernel_field_info set to None for any metadata cols in the schema?
@nicklan I renamed
thanks! one small nit, but lgtm!
NOTE: The PR is currently stacked on #1278.
What changes are proposed in this pull request?
This PR introduces a metadata column API to kernel-rs. The API design aims to replicate the recently agreed-upon metadata column API in kernel-java.
This PR affects the following public APIs
This PR adds new methods to StructType.
How was this change tested?
New UT.