
don't error if more fields exist than expected in a struct expression #267

Merged: 4 commits merged into delta-io:main on Jun 26, 2024

Conversation

@nicklan (Collaborator) commented on Jun 24, 2024:

This allows us to evaluate a struct expression where the base data has more fields than the specified schema.

This is an alternative to #264, and more closely matches the spec, which states:

clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields.

NB: This does NOT actually drop the data from the returned expression; it just skips validation on columns that aren't specified. So if the parquet file contains:

{
  "a": 1,
  "b": 2
}

and the kernel passes a struct schema like {"name": "a", "type": "int"}, the returned data will be:

{
  "a": 1,
  "b": 2
}

This works for metadata since we can then call extract and things will "just work", but it might not be good enough for when we do the final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a much larger code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields is specified in a struct expression.

Verified that things work as expected with this change:

D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘

Closes #261

@nicklan linked an issue on Jun 24, 2024 that may be closed by this pull request
@zachschuermann (Collaborator) left a comment:

lgtm. Little nit, and for now the DAT test seems fine, but a unit test may be a good follow-up.

// build a list of kernel fields that matches the order of the arrow fields
for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
@zachschuermann (Collaborator):
nit: can we update docstring on ensure_data_types above (can't comment on it since it isn't in the diff); just maybe a more rigorous definition of 'matching' saying that we will check kernel_fields is a subset?
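
For reference, one possible wording for that docstring (my sketch of the subset semantics being requested here, not the actual text in the repo):

/// Ensure that a kernel data type "matches" an arrow data type. For struct
/// types, matching means the kernel fields are a subset of the arrow fields:
/// every kernel field must be present in the arrow schema with a compatible
/// type, while extra arrow fields are tolerated and ignored.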

// keep track of how many fields we matched up
let mut found_fields = 0;
// ensure that for the fields that we found, the types match
for (kernel_field, arrow_field) in mapped_fields.zip(arrow_fields) {
ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
Collaborator:

I think we originally had this function returning the actual schema to use... but dropped that because it wasn't actually changing the schema. Ironically we may need to reinstate that control flow? That way, unexpected struct fields just get ignored because they're not in the returned schema?

(but agree that can be a follow-up after this PR fixes the immediate DAT issue)

@nicklan (Collaborator, Author):

Yep, exactly. I started writing that for this PR and realized that munging arrow structs is... complicated, so decided to just do this quick fix.

We will need a deeper PR that passes the actual <dyn Array> rather than just the schema; then, if only a subset of a struct's fields is selected, it can deconstruct each element, put it back together without the dropped fields, and return a new array as the result.
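
A minimal sketch of what that might look like, assuming the arrow-rs StructArray API (project_struct is a hypothetical helper, not code from this PR):

use arrow_array::{Array, ArrayRef, StructArray};
use arrow_schema::Fields;

// Rebuild a StructArray keeping only the requested child fields.
fn project_struct(input: &StructArray, keep: &[&str]) -> Option<StructArray> {
    let mut fields = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();
    for name in keep {
        // find the requested field (and its position) among the existing children
        let (idx, field) = input
            .fields()
            .iter()
            .enumerate()
            .find(|(_, f)| f.name() == *name)?;
        fields.push(field.clone());
        columns.push(input.column(idx).clone());
    }
    // reassemble the struct from the kept children, preserving the null buffer
    Some(StructArray::new(
        Fields::from(fields),
        columns,
        input.nulls().cloned(),
    ))
}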

Collaborator:

Wait a minute... shouldn't we be filtering the read schema so it never even gets fetched from disk? Why would we need to filter arrow data at all?

@nicklan (Collaborator, Author):

Great point! The issue is that we don't push selection down into leaf fields; we only support selection on top-level fields. Probably the right thing is to push that down properly; then, as you note, we wouldn't have to filter the arrow data at all, and this could get simpler again...
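
For the record, a rough sketch of that pushdown, assuming arrow-schema types (prune_schema is a hypothetical helper, not code from this PR): recursively intersect the requested fields with the file schema so nested selection happens before the read, rather than filtering arrow data afterwards.

use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

// Recursively prune a file schema down to the requested fields,
// descending into nested structs instead of selecting only at the top level.
fn prune_schema(file: &Fields, requested: &Fields) -> Fields {
    requested
        .iter()
        .filter_map(|req| {
            // only keep requested fields that actually exist in the file
            let file_field = file.iter().find(|f| f.name() == req.name())?;
            match (file_field.data_type(), req.data_type()) {
                (DataType::Struct(file_children), DataType::Struct(req_children)) => {
                    // recurse so only the requested leaf fields get read
                    let pruned = prune_schema(file_children, req_children);
                    Some(Arc::new(Field::new(
                        req.name().clone(),
                        DataType::Struct(pruned),
                        file_field.is_nullable(),
                    )))
                }
                // non-struct leaves: take the file's field as-is
                _ => Some(file_field.clone()),
            }
        })
        .collect()
}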

@nicklan merged commit ed2b80b into delta-io:main on Jun 26, 2024
9 checks passed
@samansmink mentioned this pull request on Jun 27, 2024
Successfully merging this pull request may close these issues: DAT Failures (#261)