DAT Failures #261

Closed
samansmink opened this issue Jun 21, 2024 · 1 comment · Fixed by #267
@samansmink

I'm getting some failing DAT tests:

I'm on this commit (current main branch):

git rev-parse HEAD

6f95fd3bfaaa57698d72f539f8c6a0475a52c4e7

Then I build by running this in the project root:

cargo build --package delta_kernel_ffi --workspace --all-features

Compiling libc v0.2.155
Compiling proc-macro2 v1.0.85
Compiling unicode-ident v1.0.12
Compiling autocfg v1.3.0
...

Then I run the example delta-kernel-rs/kernel/examples/read-table-multi-threaded like so:

cargo run -- -e sync ../../../acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta

Compiling libc v0.2.155
Compiling arrow-schema v52.0.0
...
Running `/Users/sam/Development/delta-kernel-rs/target/debug/read-table-multi-threaded -e sync ../../../acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta`
Reading file:///Users/sam/Development/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/
Arrow(
    SchemaError(
        "No such field: deletionVector",
    ),
)

Note that this happens for the no_replay, no_stats, stats_as_struct, and with_checkpoint tables.

nicklan added a commit that referenced this issue Jun 26, 2024
…#267)

This allows us to evaluate a `struct` expression where the base data has
more fields than the specified schema.

This is an alternative to #264, and matches more closely with the spec
which states:
> clients can assume that unrecognized actions, fields, and/or metadata
domains are never required in order to correctly interpret the
transaction log. Clients must ignore such unrecognized fields, and
should not produce an error when reading a table that contains
unrecognized fields.

NB: This does NOT actually _drop_ the data from the returned expression; it just does no validation on columns that aren't specified. So if the parquet file contains:
```json
{
  "a": 1,
  "b": 2
}
```

and kernel passes a struct schema like `{"name": "a", "type": "int"}`,
the returned data will be:
```json
{
  "a": 1,
  "b": 2
}
```

This works for metadata since we can then call `extract` and things will
"just work", but might not be good enough for when we do final "fix-up"
via expressions. However, actually dropping the column from the arrow
data requires a lot more code change, so I'm proposing we do this for
now to fix things like #261, and then figure out the best way to "do the
right thing" when a subset of fields are specified in a struct
expression.
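The validation behavior described above can be sketched in plain Rust. This is a hypothetical illustration, not the kernel's actual API: the record is modeled as a map, and the "schema" as a list of required field names. Fields named by the schema must exist; fields the schema does not mention are simply ignored, per the spec's "clients must ignore unrecognized fields" rule.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch (not the delta-kernel-rs API): check that every field
// named by the schema exists in the record, but do NO validation on (and do
// not drop) fields the schema does not mention.
fn validate_against_schema(
    record: &BTreeMap<String, i64>,
    schema: &[&str],
) -> Result<(), String> {
    for field in schema {
        if !record.contains_key(*field) {
            // This is the failure mode from the issue, e.g. "No such field:
            // deletionVector" when a required field is genuinely missing.
            return Err(format!("No such field: {}", field));
        }
    }
    // Extra fields in `record` (e.g. "b" when the schema only names "a")
    // are left untouched rather than causing an error.
    Ok(())
}

fn main() {
    let mut record = BTreeMap::new();
    record.insert("a".to_string(), 1);
    record.insert("b".to_string(), 2);

    // Schema names only "a"; the unrecognized "b" is ignored.
    assert!(validate_against_schema(&record, &["a"]).is_ok());
    // A field the schema requires but the record lacks still errors.
    assert!(validate_against_schema(&record, &["a", "deletionVector"]).is_err());
}
```

Note that, as in the PR, the permissive path leaves `"b"` in the output rather than projecting it away; dropping it would require rewriting the underlying arrow data.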

Verified that things work as expected with this change:
```
D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘
```

Closes #261
@samansmink
Author

@nicklan I'm still seeing the same error in ed2b80b127984481adba8e59879f39b9e5f871d1, both when compiled into DuckDB delta and through the above example. I'm a little confused, because in your example in #267 this seemed to work from DuckDB.

nicklan added a commit that referenced this issue Jul 16, 2024
…265)

Previously we had two independent code paths for interacting with tables
via a scan, `execute()` or `scan_data()`.

`scan_data` is what most engines will use, and is a bit more complex in
that it evaluates expressions over the returned add files. This meant
that bugs like #261 could happen because our tests used `execute`, which
didn't catch the issue.

This PR makes `execute` use `scan_data` under the hood. It's a bit more
complex, but now we won't need to maintain two code paths.
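The unification can be sketched as follows. This is a minimal illustration with invented names (`Scan`, `Batch`), not the kernel's real types: the convenience entry point delegates to the lower-level one, so there is a single code path and a bug in it surfaces through either API.

```rust
// Hypothetical sketch (names are illustrative, not the kernel API).
struct Scan;

struct Batch(Vec<i64>);

impl Scan {
    // The lower-level path most engines use: yields data batch by batch,
    // evaluating any per-file expressions along the way.
    fn scan_data(&self) -> impl Iterator<Item = Batch> {
        vec![Batch(vec![1, 2]), Batch(vec![3])].into_iter()
    }

    // The convenience path now delegates to `scan_data` instead of
    // re-implementing the scan, so both entry points share one code path.
    fn execute(&self) -> Vec<Batch> {
        self.scan_data().collect()
    }
}

fn main() {
    let scan = Scan;
    // Both paths traverse the same logic, so tests of either cover both.
    assert_eq!(scan.execute().len(), 2);
}
```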

Until #271 merges, this PR will fail tests, because nullable columns are
not filled in as expected.