[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs #5244
nsivabalan merged 11 commits into apache:master
Conversation
- public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+ // TODO rectify
+ public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;
A reminder here to remove this change before merging.
// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
...
if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
The check here should not rely on isMetadataTableEnabled (hoodie.metadata.enable) or isColumnStatsIndexEnabled (hoodie.metadata.index.column.stats.enable), which may not be the source of truth on the query side. isColumnStatsIndexAvailable should be the only source of truth for whether the col_stats partition is ready to be read in the metadata table.
As discussed offline: MT might be enabled on the write path, and therefore have the Column Stats Index available, but since we're deliberately splitting configs between the write and read paths, we have to check whether these are enabled on the read path.
We can probably go with 3 guards here:
!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled
To utilize the base metadata table itself, one has to enable it explicitly on the read path, so I prefer to guard on that first, then check whether data skipping is enabled, and only then whether the col_stats partition is available to be used.
Yes. Per the discussion, hoodie.metadata.enable is still needed to make sure the right API for fetching column stats is used, to prevent any exception. hoodie.metadata.index.column.stats.enable might not be needed. We need to revisit the abstraction and configs for reading the metadata table as a whole in a separate effort.
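The guard combination discussed above can be sketched as follows. This is a hedged illustration only: the class and method names are hypothetical, and only the three flag names mirror the ones in `HoodieFileIndex`; it is not the actual Hudi implementation.

```java
// Hypothetical sketch of the three-guard data-skipping check discussed above.
public class DataSkippingGuardSketch {
    // Data skipping proceeds only when all three read-path conditions hold:
    // the metadata table is enabled for readers, the col_stats partition
    // actually exists, and data skipping itself is requested.
    static boolean shouldUseDataSkipping(boolean isMetadataTableEnabled,
                                         boolean isColumnStatsIndexAvailable,
                                         boolean isDataSkippingEnabled) {
        return isMetadataTableEnabled && isColumnStatsIndexAvailable && isDataSkippingEnabled;
    }

    public static void main(String[] args) {
        // MT enabled on the write path but disabled for readers: no data skipping.
        System.out.println(shouldUseDataSkipping(false, true, true)); // false
        // All three enabled on the read path: data skipping can kick in.
        System.out.println(shouldUseDataSkipping(true, true, true));  // true
    }
}
```

The point of guarding on the read-path flag first is that a table written with MT enabled may still be queried by a reader that has not opted in, and the reader's own configs must win.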
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
nsivabalan left a comment about expecting hoodieWriteConfig on the read path.
@hudi-bot run azure
1 similar comment
…perly (along w/ MT, and CSI); Added config validation printing logs in case of invalid config
… cases when no top-level column is actually referenced
…key-prefix fetches; Make sure `TestHoodieFileIndex` tests all config combinations
…ld proceed (since this flag is a write-path flag, which we would like to not urge users to specify on the Read path)
6097520 to cf0ecea
// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
...
if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
Given that CSI does not have stats for non-top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI? Since anyway, for the non-top-level columns, we have to visit all data files?
Also, how do we deduce which columns have been indexed in the MDT CSI?
For example, we have two flows:
a. hoodie.metadata.index.column.stats.all_columns.enable = true, where all columns will be indexed.
b. hoodie.metadata.index.column.stats.column.list set to the list of columns to be indexed.
So, when we are looking to apply data skipping on the query side, should we check these configs to decide whether a particular column is indexed by CSI or not?
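For reference, the two write-path flows mentioned in this comment would look roughly like this as table configs. Only the config keys come from the comment itself; the values and column names are illustrative.

```properties
# Flow (a): index column stats for all columns
hoodie.metadata.index.column.stats.all_columns.enable=true

# Flow (b): index only an explicit list of columns (names illustrative)
hoodie.metadata.index.column.stats.column.list=col_a,col_b
```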
> Given that CSI does not have stats for non-top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI?

It depends on the predicate, but we will at least try to leverage it to filter on the top-level columns only.

> So, when we are looking to apply data skipping on the query side, should we check these configs to decide whether a particular column is indexed by CSI or not?

We can't do that; we have to play by what's actually in the index. This is handled when we execute the filter against the lookup table: if it doesn't contain the column referenced by the filter, it will just match all of the files.
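The "match all files when the column isn't in the index" behavior described above can be sketched like this. All names here are hypothetical, this is a simplified illustration of the idea, not the Hudi lookup code.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: if the Column Stats index does not cover the filter's
// column, the filter cannot prune anything, so all candidate files survive.
public class ColumnStatsLookupSketch {
    static Set<String> candidateFiles(Set<String> allFiles,
                                      Set<String> indexedColumns,
                                      String filterColumn,
                                      Map<String, Set<String>> filesPassingStats) {
        if (!indexedColumns.contains(filterColumn)) {
            // Column absent from the index: conservatively match all files.
            return allFiles;
        }
        // Column is indexed: keep only files whose stats could match the filter.
        return filesPassingStats.getOrDefault(filterColumn, allFiles);
    }

    public static void main(String[] args) {
        Set<String> all = Set.of("f1", "f2", "f3");
        Set<String> indexed = Set.of("a");
        Map<String, Set<String>> pruned = Map.of("a", Set.of("f1"));
        // Filter on an indexed column prunes down to the matching files.
        System.out.println(candidateFiles(all, indexed, "a", pruned).size()); // 1
        // Filter on a non-indexed column matches everything.
        System.out.println(candidateFiles(all, indexed, "b", pruned).size()); // 3
    }
}
```

This is why the query side does not need to consult the write-path indexing configs: a miss in the index degrades gracefully to "no pruning" rather than producing a wrong result.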
However, your question made me realize that we're currently deriving the index schema incorrectly. Let me address that.
@nsivabalan I'm addressing this problem in a separate PR to avoid overloading this one: #5275
…ble configs (#5244) Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ b/w the write/read paths. More details can be found in HUDI-3812. - Fixing Data Skipping configuration to respect MT configs (on the read path) - Tightening up DS handling of cases when no top-level columns are in the target query - Enhancing tests to cover all possible cases
What is the purpose of the pull request
Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ b/w the write/read paths. More details can be found in HUDI-3812.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR