
don't error if more fields exist than expected in a struct expression #267

Merged: 4 commits merged into delta-io:main on Jun 26, 2024

Conversation

@nicklan (Collaborator) commented on Jun 24, 2024:

This allows us to evaluate a struct expression where the base data has more fields than the specified schema.

This is an alternative to #264, and more closely matches the spec, which states:

clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields.

NB: This does NOT actually drop the data from the returned expression; it just skips validation on columns that aren't specified. So if the parquet file contains:

{
  "a": 1,
  "b": 2
}

and the kernel passes a struct schema like {"name": "a", "type": "int"}, the returned data will be:

{
  "a": 1,
  "b": 2
}

This works for metadata since we can then call extract and things will "just work", but it might not be good enough for when we do the final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a much larger code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields is specified in a struct expression.

Verified that things work as expected with this change:

D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘

Closes #261

@nicklan linked an issue on Jun 24, 2024 that may be closed by this pull request
@zachschuermann (Collaborator) left a comment:

lgtm. Little nit, and for now the DAT test seems fine, but a unit test may be a good follow-up.

// build a list of kernel fields that matches the order of the arrow fields
for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
@zachschuermann (Collaborator):
nit: can we update docstring on ensure_data_types above (can't comment on it since it isn't in the diff); just maybe a more rigorous definition of 'matching' saying that we will check kernel_fields is a subset?
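
For reference, one possible wording for that docstring (my sketch of the subset semantics being requested here, not the actual text in the repo):

/// Ensure that a kernel data type "matches" an arrow data type. For struct
/// types, matching means the kernel fields are a subset of the arrow fields:
/// every kernel field must be present in the arrow schema with a compatible
/// type, while extra arrow fields are tolerated and ignored.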

// keep track of how many fields we matched up
let mut found_fields = 0;
// ensure that for the fields that we found, the types match
for (kernel_field, arrow_field) in mapped_fields.zip(arrow_fields) {
ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
Collaborator:

I think we originally had this function returning the actual schema to use... but dropped that because it wasn't actually changing the schema. Ironically we may need to reinstate that control flow? That way, unexpected struct fields just get ignored because they're not in the returned schema?

(but agree that can be a follow-up after this PR fixes the immediate DAT issue)

@nicklan (Collaborator, Author):

Yep, exactly. I started writing that for this PR and realized that munging arrow structs is... complicated, so decided to just do this quick fix.

We will need a deeper PR that passes the actual <dyn Array> rather than just the schema; then, if only a subset of a struct's fields is selected, it can deconstruct each element, put it back together without the dropped fields, and return a new array as the result.
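
A minimal sketch of what that might look like, assuming the arrow-rs StructArray API (project_struct is a hypothetical helper, not code from this PR):

use arrow_array::{Array, ArrayRef, StructArray};
use arrow_schema::Fields;

// Rebuild a StructArray keeping only the requested child fields.
fn project_struct(input: &StructArray, keep: &[&str]) -> Option<StructArray> {
    let mut fields = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();
    for name in keep {
        // find the requested field (and its position) among the existing children
        let (idx, field) = input
            .fields()
            .iter()
            .enumerate()
            .find(|(_, f)| f.name() == *name)?;
        fields.push(field.clone());
        columns.push(input.column(idx).clone());
    }
    // reassemble the struct from the kept children, preserving the null buffer
    Some(StructArray::new(
        Fields::from(fields),
        columns,
        input.nulls().cloned(),
    ))
}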

Collaborator:

Wait a minute... shouldn't we be filtering the read schema so it never even gets fetched from disk? Why would we need to filter arrow data at all?

@nicklan (Collaborator, Author):

Great point! The issue is that we don't push selection down into leaf fields; we only support selection on top-level fields. Probably the right thing is to push that down properly; then, as you note, we wouldn't have to filter the arrow data at all, and this could get simpler again...
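
For the record, a rough sketch of that pushdown, assuming arrow-schema types (prune_schema is a hypothetical helper, not code from this PR): recursively intersect the requested fields with the file schema so nested selection happens before the read, rather than filtering arrow data afterwards.

use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

// Recursively prune a file schema down to the requested fields,
// descending into nested structs instead of selecting only at the top level.
fn prune_schema(file: &Fields, requested: &Fields) -> Fields {
    requested
        .iter()
        .filter_map(|req| {
            // only keep requested fields that actually exist in the file
            let file_field = file.iter().find(|f| f.name() == req.name())?;
            match (file_field.data_type(), req.data_type()) {
                (DataType::Struct(file_children), DataType::Struct(req_children)) => {
                    // recurse so only the requested leaf fields get read
                    let pruned = prune_schema(file_children, req_children);
                    Some(Arc::new(Field::new(
                        req.name().clone(),
                        DataType::Struct(pruned),
                        file_field.is_nullable(),
                    )))
                }
                // non-struct leaves: take the file's field as-is
                _ => Some(file_field.clone()),
            }
        })
        .collect()
}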

@nicklan merged commit ed2b80b into delta-io:main on Jun 26, 2024
9 checks passed
@samansmink mentioned this pull request on Jun 27, 2024
Successfully merging this pull request may close these issues: DAT Failures (#261)