DAT Failures #261
I'm getting some failing DAT tests:

I'm on this commit (current main branch):

Then I build by running this in the project root:

Then I run the example `delta-kernel-rs/kernel/examples/read-table-multi-threaded` like so:

Note that this happens for `no_replay`, `no_stats`, `stats_as_struct`, and `with_checkpoint`.
nicklan added a commit that referenced this issue on Jun 26, 2024:
…#267)

This allows us to evaluate a `struct` expression where the base data has more fields than the specified schema. This is an alternative to #264, and matches more closely with the spec, which states:

> clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields.

NB: This does NOT actually _drop_ the data from the returned expression, it just does no validation on columns that aren't specified. So if the parquet file contains:

```json
{
  "a": 1,
  "b": 2
}
```

and kernel passes a struct schema like `{"name": "a", "type": "int"}`, the returned data will be:

```json
{
  "a": 1,
  "b": 2
}
```

This works for metadata since we can then call `extract` and things will "just work", but might not be good enough for when we do final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a lot more code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields are specified in a struct expression.

Verified that things work as expected with this change:

```
D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘
```

Closes #261
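To make the permissive behavior concrete, here is a minimal Rust sketch of "validate only what was requested" matching. The `Field` and `DataType` types and the `validate_subset` function are hypothetical illustrations, not the actual delta-kernel-rs API:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for a schema field and its type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int,
    Str,
}

struct Field {
    name: String,
    data_type: DataType,
}

/// Check that every requested field exists in `data` with a matching type.
/// Fields present in `data` but absent from `requested` pass silently,
/// mirroring the spec's "ignore unrecognized fields" rule quoted above.
fn validate_subset(requested: &[Field], data: &[Field]) -> Result<(), String> {
    let by_name: HashMap<&str, DataType> = data
        .iter()
        .map(|f| (f.name.as_str(), f.data_type))
        .collect();
    for want in requested {
        match by_name.get(want.name.as_str()) {
            Some(have) if *have == want.data_type => {}
            Some(have) => {
                return Err(format!(
                    "field `{}`: expected {:?}, found {:?}",
                    want.name, want.data_type, have
                ))
            }
            None => return Err(format!("missing field `{}`", want.name)),
        }
    }
    Ok(())
}

fn main() {
    // Mirrors the example above: the data has `a` and `b`, but the
    // requested schema only names `a`, so `b` is never inspected.
    let data = vec![
        Field { name: "a".into(), data_type: DataType::Int },
        Field { name: "b".into(), data_type: DataType::Str },
    ];
    let requested = vec![Field { name: "a".into(), data_type: DataType::Int }];
    assert!(validate_subset(&requested, &data).is_ok());
}
```

The key design point, as in the commit above, is that the loop only inspects fields named in the requested schema, so unrecognized columns in the data can never cause an error.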
nicklan added a commit that referenced this issue on Jul 16, 2024:
…265)

Previously we had two independent code paths for interacting with tables via a scan: `execute()` and `scan_data()`. `scan_data` is what most engines will use, and is a bit more complex in that it evaluates expressions over the returned add files. This meant that bugs like #261 could happen because our tests used `execute`, which didn't catch the issue.

This PR makes `execute` use `scan_data` under the hood. It's a bit more complex, but now we won't need to maintain two code paths.

Until #271 merges, this PR will fail tests, because nullable columns are not filled in as expected.
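A hedged sketch of what such a unification can look like; `Scan`, `ScanData`, and `RecordBatch` here are stand-in types, not the real kernel API:

```rust
struct ScanData; // stand-in for per-file scan metadata + transform expression
struct RecordBatch; // stand-in for materialized engine data

struct Scan {
    files: usize, // stand-in for the table state a real scan would hold
}

impl Scan {
    /// The engine-facing path: yields one `ScanData` per file to read.
    fn scan_data(&self) -> impl Iterator<Item = ScanData> + '_ {
        (0..self.files).map(|_| ScanData)
    }

    /// The convenience path, implemented on top of `scan_data` rather than
    /// as an independent (and independently buggy) second read path.
    fn execute(&self) -> Vec<RecordBatch> {
        self.scan_data()
            .map(|_sd| RecordBatch) // real code: read the file, apply transform
            .collect()
    }
}

fn main() {
    let scan = Scan { files: 3 };
    assert_eq!(scan.execute().len(), 3);
}
```

With this shape, any test that calls `execute()` necessarily exercises the `scan_data()` path as well, which is exactly how a bug like #261 would have been caught.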