Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add maxRowIndex to dv descriptor #264

Closed

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Jun 21, 2024

This is a (so far) undocumented field that can appear in checkpoint files.

I've pinged people internally to get it added to the spec asap.

Fixes #261

Note, this is another place where #120 would have made life a lot easier

Tested by running all DAT tests in read-table-multi-threaded which was failing, and by compiling DuckDB against this and ensuring that works.

D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘

@nicklan nicklan force-pushed the add-max-row-index-to-dv-schema branch from bd22f42 to 4941665 Compare June 21, 2024 19:36
@scovich
Copy link
Collaborator

scovich commented Jun 24, 2024

Not sure I fully understand the issue here? The Delta spec specifically mandates that:

clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields.

At a glance, it sounds like the kernel code has a bug, if it cannot tolerate unrecognized fields. This is independent of whether we think kernel should recognize the field, or whether the field should be documented in the Delta spec.

@nicklan
Copy link
Collaborator Author

nicklan commented Jun 24, 2024

At a glance, it sounds like the kernel code has a bug, if it cannot tolerate unrecognized fields. This is independent of whether we think kernel should recognize the field, or whether the field should be documented in the Delta spec.

Yep, I realize this is the wrong way to fix this and will adjust. We should just drop fields we don't expect.

@nicklan
Copy link
Collaborator Author

nicklan commented Jun 24, 2024

Closing in favor of #267

@nicklan nicklan closed this Jun 24, 2024
nicklan added a commit that referenced this pull request Jun 26, 2024
…#267)

This allows us to evaluate a `struct` expression where the base data has
more fields than the specified schema.

This is an alternative to #264, and matches more closely with the spec
which states:
> clients can assume that unrecognized actions, fields, and/or metadata
domains are never required in order to correctly interpret the
transaction log. Clients must ignore such unrecognized fields, and
should not produce an error when reading a table that contains
unrecognized fields.

NB: This does NOT actually _drop_ the data from the returned expression,
it just does no validation on columns that aren't specified. So if the
parquet files contains:
```json
{
  "a": 1,
  "b": 2,
}
```

and kernel passes a struct schema like `{"name": "a", "type": "int"}`,
the returned data will be:
```json
{
  "a": 1,
  "b": 2,
}
```

This works for metadata since we can then call `extract` and things will
"just work", but might not be good enough for when we do final "fix-up"
via expressions. However, actually dropping the column from the arrow
data requires a lot more code change, so I'm proposing we do this for
now to fix things like #261, and then figure out the best way to "do the
right thing" when a subset of fields are specified in a struct
expression.

Verified that things work as expected with this change:
```
D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘
```

Closes #261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DAT Failures
2 participants