-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add maxRowIndex to dv descriptor #264
Conversation
bd22f42
to
4941665
Compare
4941665
to
eb98da3
Compare
Not sure I fully understand the issue here? The Delta spec specifically mandates that:
At a glance, it sounds like the kernel code has a bug, if it cannot tolerate unrecognized fields. This is independent of whether we think kernel should recognize the field, or whether the field should be documented in the Delta spec. |
Yep, I realize this is the wrong way to fix this and will adjust. We should just drop fields we don't expect. |
Closing in favor of #267 |
…#267) This allows us to evaluate a `struct` expression where the base data has more fields than the specified schema. This is an alternative to #264, and matches more closely with the spec which states: > clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields. NB: This does NOT actually _drop_ the data from the returned expression, it just does no validation on columns that aren't specified. So if the parquet files contains: ```json { "a": 1, "b": 2, } ``` and kernel passes a struct schema like `{"name": "a", "type": "int"}`, the returned data will be: ```json { "a": 1, "b": 2, } ``` This works for metadata since we can then call `extract` and things will "just work", but might not be good enough for when we do final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a lot more code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields are specified in a struct expression. Verified that things work as expected with this change: ``` D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/'); ┌─────────┬───────┬────────────┐ │ letter │ int │ date │ │ varchar │ int64 │ date │ ├─────────┼───────┼────────────┤ │ a │ 93 │ 1975-06-01 │ │ b │ 753 │ 2012-05-01 │ │ c │ 620 │ 1983-10-01 │ │ a │ 595 │ 2013-03-01 │ │ │ 653 │ 1995-12-01 │ └─────────┴───────┴────────────┘ ``` Closes #261
This is a (so far) undocumented field that can appear in checkpoint files.
I've pinged people internally to get it added to the spec asap.
Fixes #261
Note, this is another place where #120 would have made life a lot easier
Tested by running all DAT tests in
read-table-multi-threaded
which was failing, and by compiling DuckDB against this and ensuring that works.