[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs #5244
nsivabalan merged 11 commits into apache:master
Conversation
- public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+ // TODO rectify
+ public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;
A reminder here to remove this change before merging.
// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
...
if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
The check here should not rely on isMetadataTableEnabled (hoodie.metadata.enable) or isColumnStatsIndexEnabled (hoodie.metadata.index.column.stats.enable), which may not be the source of truth on the query side. isColumnStatsIndexAvailable should be the only source of truth for whether the col_stats partition is ready to be read in the metadata table.
As discussed offline: MT might be enabled on the write path, and therefore have the Column Stats Index available, but since we're deliberately splitting configs between the write and read paths, we have to check whether these are enabled on the read path.
We can probably go with 3 guards here:
!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled
To utilize the base metadata table itself, one has to enable it explicitly on the read path, so I prefer to guard on that first, then check whether data skipping is enabled, and only then whether the col_stats partition is available to be used.
Yes. Per the discussion, hoodie.metadata.enable is still needed to make sure the right API for fetching column stats is used, to prevent any exception. hoodie.metadata.index.column.stats.enable might not be needed. We need to revisit the abstraction and configs for reading the metadata table as a whole in a separate effort.
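The guard combination discussed above can be sketched as follows. This is a hedged illustration only: the class and method names are hypothetical, and only the three flag names mirror the ones in `HoodieFileIndex`; it is not the actual Hudi implementation.

```java
// Hypothetical sketch of the three-guard data-skipping check discussed above.
public class DataSkippingGuardSketch {
    // Data skipping proceeds only when all three read-path conditions hold:
    // the metadata table is enabled for readers, the col_stats partition
    // actually exists, and data skipping itself is requested.
    static boolean shouldUseDataSkipping(boolean isMetadataTableEnabled,
                                         boolean isColumnStatsIndexAvailable,
                                         boolean isDataSkippingEnabled) {
        return isMetadataTableEnabled && isColumnStatsIndexAvailable && isDataSkippingEnabled;
    }

    public static void main(String[] args) {
        // MT enabled on the write path but disabled for readers: no data skipping.
        System.out.println(shouldUseDataSkipping(false, true, true)); // false
        // All three enabled on the read path: data skipping can kick in.
        System.out.println(shouldUseDataSkipping(true, true, true));  // true
    }
}
```

The point of guarding on the read-path flag first is that a table written with MT enabled may still be queried by a reader that has not opted in, and the reader's own configs must win.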
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
nsivabalan left a comment about expecting hoodieWriteConfig on the read path.
@hudi-bot run azure
1 similar comment
…perly (along w/ MT, and CSI); Added config validation printing logs in case of invalid config
… cases when no top-level column is actually referenced
…key-prefix fetches; Make sure `TestHoodieFileIndex` tests all config combinations
…ld proceed (since this flag is a write-path flag, which we would like to not urge users to specify on the Read path)
6097520 to cf0ecea
// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
...
if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
Given that CSI does not have stats for non-top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI? Since anyway, for the non-top-level columns, we have to visit all data files?
Also, how do we deduce which columns have been indexed in the MDT CSI?
For example, we have two flows:
a. hoodie.metadata.index.column.stats.all_columns.enable = true, where all columns will be indexed.
b. hoodie.metadata.index.column.stats.column.list set to the list of columns to be indexed.
So, when we are looking to apply data skipping on the query side, should we check these configs to decide whether a particular column is indexed by CSI or not?
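For reference, the two write-path flows mentioned in this comment would look roughly like this as table configs. Only the config keys come from the comment itself; the values and column names are illustrative.

```properties
# Flow (a): index column stats for all columns
hoodie.metadata.index.column.stats.all_columns.enable=true

# Flow (b): index only an explicit list of columns (names illustrative)
hoodie.metadata.index.column.stats.column.list=col_a,col_b
```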
> Given that CSI does not have stats for non-top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI?

It depends on the predicate, but we will at least try to leverage it to filter on the top-level columns only.

> So, when we are looking to apply data skipping on the query side, should we check these configs to decide whether a particular column is indexed by CSI or not?

We can't do that; we have to play by what's actually in the index. This is handled when we execute the filter against the lookup table: if it doesn't contain the column referenced by the filter, it will just match all of the files.
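The "match all files when the column isn't in the index" behavior described above can be sketched like this. All names here are hypothetical, this is a simplified illustration of the idea, not the Hudi lookup code.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: if the Column Stats index does not cover the filter's
// column, the filter cannot prune anything, so all candidate files survive.
public class ColumnStatsLookupSketch {
    static Set<String> candidateFiles(Set<String> allFiles,
                                      Set<String> indexedColumns,
                                      String filterColumn,
                                      Map<String, Set<String>> filesPassingStats) {
        if (!indexedColumns.contains(filterColumn)) {
            // Column absent from the index: conservatively match all files.
            return allFiles;
        }
        // Column is indexed: keep only files whose stats could match the filter.
        return filesPassingStats.getOrDefault(filterColumn, allFiles);
    }

    public static void main(String[] args) {
        Set<String> all = Set.of("f1", "f2", "f3");
        Set<String> indexed = Set.of("a");
        Map<String, Set<String>> pruned = Map.of("a", Set.of("f1"));
        // Filter on an indexed column prunes down to the matching files.
        System.out.println(candidateFiles(all, indexed, "a", pruned).size()); // 1
        // Filter on a non-indexed column matches everything.
        System.out.println(candidateFiles(all, indexed, "b", pruned).size()); // 3
    }
}
```

This is why the query side does not need to consult the write-path indexing configs: a miss in the index degrades gracefully to "no pruning" rather than producing a wrong result.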
However, your question made me realize that we're currently deriving the index schema incorrectly. Let me address that.
@nsivabalan I'm addressing this problem in a separate PR to avoid overloading this one: #5275
…ble configs (#5244) Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ b/w the write/read paths. More details can be found in HUDI-3812. - Fixing Data Skipping configuration to respect MT configs (on the read path) - Tightening up DS handling of cases when no top-level columns are in the target query - Enhancing tests to cover all possible cases
What is the purpose of the pull request
Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ b/w the write/read paths. More details can be found in HUDI-3812.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR