[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution by alexeykudinkin · Pull Request #5275 · apache/hudi

alexeykudinkin · 2022-04-09T22:27:50Z

What is the purpose of the pull request

Currently, Data Skipping does not correctly handle the case where column stats are not aligned, for example when some of the (column, file) combinations are missing from the Column Stats Index (CSI).

This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.

Brief change log

Added appropriate aligning for the transposed CSI projection

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

hudi-bot · 2022-04-10T21:50:31Z

CI report:

f5340cc Azure: SUCCESS

codope

Will this patch be able to handle column drops? Particularly, if the dropped column was part of HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS? On the reader side, we pass the latest tableSchema and not the fileSchema, right? On the writer side, don't we need to clean up this config, or throw an error and ask the user to reset it? I know this is a slightly tangential point, but if you think there is more work to be done to handle schema evolution comprehensively, then maybe create a follow-up ticket.

codope · 2022-04-11T06:36:39Z

...rk-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala

+    // NOTE: We have to collect list of indexed columns to make sure we properly align the rows
+    //       w/in the transposed dataset: since some files might not have all of the columns indexed
+    //       either due to the Column Stats Index config changes, schema evolution, etc, we have
+    //       to make sure that all of the rows w/in transposed data-frame are properly padded (with null
+    //       values) for such file-column combinations
+    val indexedColumns: Seq[String] = colStatsDF.rdd.map(row => row.getString(colNameOrdinal)).distinct().collect()


Why not do this one level above in readColumnStatsIndex so that colStatsDF itself is correctly populated and transposeColumnStatsIndex simply transposes as today?

colStatsDF is populated correctly: it holds one row per (file, column) pair (let's call it "row-based"), so a file with N indexed columns contributes N rows.

The transposed table is "column-based", i.e. there is one row per file, and each column's stats are mapped to columns in that view. Therefore only that view needs its rows aligned (padded).
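
To make the row-based vs. column-based distinction concrete, here is a minimal sketch of the padding idea. It uses plain Scala collections in place of the Spark DataFrames in the actual change, and every name in it (CsiTransposeSketch, ColStat, FileStats, transpose) is hypothetical and chosen for illustration only; it is not the code from this PR.

```scala
// Illustrative sketch only: plain Scala collections stand in for Spark DataFrames,
// and all type and method names below are hypothetical.
object CsiTransposeSketch {
  type MinMax = (Option[Long], Option[Long])

  // "Row-based" view: one record per (file, column) pair present in the index
  final case class ColStat(file: String, column: String, min: Option[Long], max: Option[Long])

  // "Column-based" view: one record per file, one (min, max) slot per indexed column
  final case class FileStats(file: String, stats: Seq[MinMax])

  def transpose(rows: Seq[ColStat]): (Seq[String], Seq[FileStats]) = {
    // Collect every column that appears anywhere in the index, in a stable order
    val indexedColumns: Seq[String] = rows.map(_.column).distinct.sorted
    val missing: MinMax = (None, None)

    val transposed = rows.groupBy(_.file).toSeq.sortBy(_._1).map { case (file, fileRows) =>
      val byColumn: Map[String, MinMax] = fileRows.map(r => r.column -> ((r.min, r.max))).toMap
      // Pad absent (file, column) combinations so every transposed row has
      // the same width and the same column order
      FileStats(file, indexedColumns.map(c => byColumn.getOrElse(c, missing)))
    }
    (indexedColumns, transposed)
  }
}
```

In this sketch a file that lacks stats for some indexed column still gets a slot for it, filled with (None, None), so all transposed rows line up positionally.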

nsivabalan · 2022-04-11T14:56:37Z

In the interest of time for the 0.11 release, here is my take. I haven't looked at the changes as such, but here are my 2 cents given the last-minute changes.

We can ignore schema evolution and CSI config changes (the list of columns to index) for now. We can call out that the CSI config (list of columns to index) can be set only once and cannot be changed, and that it may not work with schema evolution. Enabling and disabling should be doable; just changing the list of columns to index on the fly is not feasible.

We just should not miss or regress any core flow by accommodating changes to support advanced use cases like config changes and schema evolution. There may be many more scenarios to consider, like column renaming, integrating schema evolution with column stats, etc. So we can take it up for 0.12.

alexeykudinkin · 2022-04-11T15:53:32Z

@codope yes, after this patch it will be able to handle it: on the read path we're not relying on the writer's config; instead we use whatever is in the Index as the source of truth and play by that (which also helps us build in resilience against any indexing progress gaps, schema evolution, etc.)

@nsivabalan sorry, heading might be misleading, Schema Evolution is just one of the cases that might lead to crashes in this flow but is the one that is easier to explain. We have to bring this PR in 0.11, as it covers other critical cases as well (for ex, when query contains columns that are not indexed)
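
As a rough illustration of treating the index as the source of truth on the read path, the sketch below prunes files for a `column > threshold` filter. It assumes the conservative rule that a file is kept whenever the index has no usable stats for the queried column; that rule, and all names used (DataSkippingSketch, mightMatchGreaterThan, candidateFiles), are assumptions made for this example rather than the behavior or API of the patch.

```scala
// Illustrative sketch only: names and the pruning rule are assumptions,
// not taken from the Hudi code base.
object DataSkippingSketch {
  type MinMax = (Option[Long], Option[Long])

  // Decide whether a file may contain rows matching `column > threshold`,
  // consulting only what the index actually holds for that file.
  def mightMatchGreaterThan(fileStats: Map[String, MinMax], column: String, threshold: Long): Boolean =
    fileStats.get(column) match {
      case Some((_, Some(max))) => max > threshold
      // Column not indexed for this file, or max stat is null: cannot prune safely
      case _ => true
    }

  // Return the files that still have to be scanned, in sorted order
  def candidateFiles(index: Map[String, Map[String, MinMax]], column: String, threshold: Long): Seq[String] =
    index.collect {
      case (file, stats) if mightMatchGreaterThan(stats, column, threshold) => file
    }.toSeq.sorted
}
```

With this rule, a query on a column that was never indexed simply skips nothing instead of failing.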

alexeykudinkin · 2022-04-11T15:54:24Z

Also, this PR is very limited in scope and has practically 100% test coverage. I see no risk in this PR landing.

codope

LGTM. Discussed offline. Drop columns can be handled with this patch. Essentially, this patch is not about schema evolution. It is about difference in what columns have been indexed and what's being queried. Schema evolution is one use case where it would be helpful. But, to cover schema evolution comprehensively including renames, we need some more work. However, that should not block this patch from landing.

alexeykudinkin mentioned this pull request Apr 9, 2022

[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs #5244

Merged

5 tasks

nsivabalan added the priority:blocker (Production down; release blocker) label Apr 9, 2022

Alexey Kudinkin added 2 commits April 10, 2022 13:12

Fixed ColumnStatsIndexSupport to properly handle the case when not every column is present for every file (due to schema evolution, CSI config changes, etc)

f0de37e

Added tests

67416d3

alexeykudinkin changed the title ~~[HUDI-3841][Stacked on 5244] Fixing Column Stats in the presence of Schema Evolution~~ [HUDI-3841] Fixing Column Stats in the presence of Schema Evolution Apr 10, 2022

Alexey Kudinkin added 4 commits April 10, 2022 13:21

Fixed mixed up test fixtures

f1bc48a

Tidying up

cdf420a

zorder > colstats

9089109

Handle the case when min/max stats are null

f5340cc

alexeykudinkin force-pushed the ak/dskp-schm-ev branch from 259cb09 to f5340cc Compare April 10, 2022 20:27

codope reviewed Apr 11, 2022

codope approved these changes Apr 11, 2022

nsivabalan approved these changes Apr 11, 2022

nsivabalan merged commit 458fdd5 into apache:master Apr 11, 2022

nsivabalan merged 6 commits into apache:master from onehouseinc:ak/dskp-schm-ev
