[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution#5275
[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution#5275nsivabalan merged 6 commits intoapache:masterfrom
Conversation
…every column is present for every file (due to schema evolution, CSI config changes, etc)
259cb09 to
f5340cc
Compare
codope
left a comment
There was a problem hiding this comment.
Will this patch be able to handle column drops? Particularly, if the dropped column was part of HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS? On the reader side, we pass the latest tableSchema and not the fileSchema right? On the writer side, don't we need to cleanup this config or throw an error and ask the user to reset it. I know this is slightly tangential point. But if you think there is more work to be done for handling shema evolution comprehensively, then maybe create a followup ticket.
| // NOTE: We have to collect list of indexed columns to make sure we properly align the rows | ||
| // w/in the transposed dataset: since some files might not have all of the columns indexed | ||
| // either due to the Column Stats Index config changes, schema evolution, etc, we have | ||
| // to make sure that all of the rows w/in transposed data-frame are properly padded (with null | ||
| // values) for such file-column combinations | ||
| val indexedColumns: Seq[String] = colStatsDF.rdd.map(row => row.getString(colNameOrdinal)).distinct().collect() |
There was a problem hiding this comment.
Why not do this one level above in readColumnStatsIndex so that colStatsDF itself is correctly populated and transposeColumnStatsIndex simply transposes as today?
There was a problem hiding this comment.
colStatsDF is populated correctly -- it bears 1 row / column (let's call it "row-based"), therefore for all columns in a file we will have N rows corresponding to it (eq to the # of columns in that file).
Transposed table is "column-based", ie there's 1 row / file and each column's stat is mapped to a column in such view. Therefore only in that view we have a need to align the rows (to pad them).
|
In the interest of time for 0.11 release, here is my take. I haven't looked at the changes as such. but my 2 cents given the last min changes. We can ignore scheme evolution, CSI config changes (list of columns to index) for now. We can call out that CSI configs can be set only once and cannot be changed (list of cols to index), and may not work w/ schema evolution. enable and disable should be doable, just changing the list of columns to index on the fly is not feasible. just that we should not miss or regress any core flow by accomodating changes to support advanced use-cases like config changes and schema evolution. may be there are lot more scenarios to consider like column renaming, integrating schema evolution w/ col stats etc. So, we can take it up for 0.12. |
|
@codope yes after this patch it will be able to handle it -- on the read path we're not relying on writer's config, instead we use whatever is in the Index as the source of truth and play by that (which helps us also built in resilience against any indexing progress gaps, schema evolution, etc) @nsivabalan sorry, heading might be misleading, Schema Evolution is just one of the cases that might lead to crashes in this flow but is the one that is easier to explain. We have to bring this PR in 0.11, as it covers other critical cases as well (for ex, when query contains columns that are not indexed) |
|
Also, this PR is very limited in scope and has practically 100% test coverage. I see no risk in this PR landing. |
codope
left a comment
There was a problem hiding this comment.
LGTM. Discussed offline. Drop columns can be handled with this patch. Essentially, this patch is not about schema evolution. It is about difference in what columns have been indexed and what's being queried. Schema evolution is one use case where it would be helpful. But, to cover schema evolution comprehensively including renames, we need some more work. However, that should not block this patch from landing.
…5275) Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI. This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that. - Added appropriate aligning for the transposed CSI projection
Tips
What is the purpose of the pull request
Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI.
This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.