Fix Delta write stats if data schema is missing columns relative to table schema [databricks] #8287

Merged
merged 1 commit into from
May 15, 2023

Conversation

jlowe
Contributor

@jlowe jlowe commented May 12, 2023

Fixes #8263.

Delta Lake 2.2+ and Delta Lake on Databricks 11.3+ support overwriting a table while preserving the old table's columns for statistics gathering. This requires mapping the data schema onto the stats collection schema, which may refer to columns that are missing from the data schema or that appear in a different order there.

This resolves the issue by creating an exploded schema map for the data schema. As we walk the stats collection schema as usual to generate the statistics, the exploded map can be consulted to determine whether each column is present in the data and, if it is, at which ordinal.
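
For readers unfamiliar with the approach, here is a rough illustration of the idea in plain Python. It is not the code in this PR: the schema model (ordered lists of `(name, type)` pairs) and the names `explode_schema` and `resolve_stats_columns` are hypothetical, chosen only to show how a path-to-ordinal map lets a walk over the stats schema tolerate missing or reordered data columns.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical schema model (an assumption for this sketch): a struct is an
# ordered list of (name, type) pairs, where type is either a leaf type name
# (str) or a nested struct (another list of pairs).
Schema = List[Tuple[str, Any]]
Path = Tuple[str, ...]


def explode_schema(schema: Schema, prefix: Path = ()) -> Dict[Path, int]:
    """Map every column path, including nested fields, to its ordinal
    within its parent struct."""
    exploded: Dict[Path, int] = {}
    for ordinal, (name, dtype) in enumerate(schema):
        path = prefix + (name,)
        exploded[path] = ordinal
        if isinstance(dtype, list):
            exploded.update(explode_schema(dtype, path))
    return exploded


def resolve_stats_columns(
    stats_schema: Schema, data_map: Dict[Path, int], prefix: Path = ()
) -> Dict[Path, Optional[List[int]]]:
    """Walk the stats schema. For each leaf, return the data ordinals along
    its path, or None when the column is absent from the data."""
    resolved: Dict[Path, Optional[List[int]]] = {}
    for name, dtype in stats_schema:
        path = prefix + (name,)
        if isinstance(dtype, list):
            resolved.update(resolve_stats_columns(dtype, data_map, path))
        elif path in data_map:
            # Every ancestor of a present path is also in the map.
            resolved[path] = [data_map[path[: i + 1]] for i in range(len(path))]
        else:
            resolved[path] = None
    return resolved


# Data being written: column "a" is missing and the fields of "s" are reordered.
data_schema = [("b", "int"), ("s", [("y", "string"), ("x", "long")])]
# Stats collection schema derived from the table.
stats_schema = [("a", "int"), ("b", "int"), ("s", [("x", "long"), ("y", "string")])]

data_map = explode_schema(data_schema)
resolved = resolve_stats_columns(stats_schema, data_map)
```

In this example `resolved[("a",)]` is `None`, so a stats generator would skip that column or emit null statistics for it, while `resolved[("s", "x")]` gives the ordinals needed to locate the field in the data despite the reordering.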

@jlowe jlowe self-assigned this May 12, 2023
@jlowe
Contributor Author

jlowe commented May 12, 2023

build

@sameerz sameerz added the bug Something isn't working label May 14, 2023
@tgravescs
Collaborator

I manually tested this on Databricks 11.3 with the query where I found the issue, and this fixes it.

@jlowe jlowe merged commit 8dd775d into NVIDIA:branch-23.06 May 15, 2023
@jlowe jlowe deleted the fix-delta-write-stats branch May 15, 2023 15:21
Labels
bug Something isn't working
Projects
None yet
3 participants