Skip to content

Comments

[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs#5244

Merged
nsivabalan merged 11 commits intoapache:masterfrom
onehouseinc:ak/dskp-cfg-fix
Apr 10, 2022
Merged

[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs#5244
nsivabalan merged 11 commits intoapache:masterfrom
onehouseinc:ak/dskp-cfg-fix

Conversation

@alexeykudinkin
Copy link
Contributor

Tips

What is the purpose of the pull request

Addressing the problem of Data Skipping not respecting Metadata Table configs which might differ b/w write/read paths. More details could be found in HUDI-3812.

Brief change log

  • Fixing Data Skipping configuration to respect MT configs (on the Read path)
  • Tightening up DS handling of cases when no top-level columns are in the target query
  • Enhancing tests to cover all possible cases

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
// TODO rectify
public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reminder here to remove this change before merging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)

if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check here should not rely on isMetadataTableEnabled (hoodie.metadata.enable) and isColumnStatsIndexEnabled (hoodie.metadata.index.column.stats.enable) which may not be the source of truth on the query side. isColumnStatsIndexAvailable should be the only source of truth of whether col_stats partition is ready to read in metadata table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline: MT might be enabled on the write path, and therefore have the Column Stats index available, but since we're deliberately splitting configs for both Write/Read paths, we have to check whether these are enabled on the Read path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably go w/ 3 guards here
!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled

to utilize base metadata itself, one has to enable explicitly on the read path. So, I prefer to guard that. and then check if data skipping is enabled. And then only if col stats partition is available to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably go w/ 3 guards here
!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled

to utilize base metadata itself, one has to enable explicitly on the read path. So, I prefer to guard that. and then check if data skipping is enabled. And then only if col stats partition is available to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Per discussion, hoodie.metadata.enable is still needed to make sure the right API fetching column stats is made to prevent any exception. hoodie.metadata.index.column.stats.enable might not be needed. We need to revisit the abstraction and configs of reading metadata table as a whole in a separate effort.

@yihua yihua self-assigned this Apr 7, 2022
@yihua yihua added the priority:blocker Production down; release blocker label Apr 7, 2022
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment about expecting hoodieWriteConfig on the read path.

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs Apr 8, 2022
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

1 similar comment
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

Alexey Kudinkin added 11 commits April 9, 2022 09:53
…perly (along w/ MT, and CSI);

Added config validation printing logs in case of invalid config
… cases when no top-level column is actually referenced
…key-prefix fetches;

Make sure `TestHoodieFileIndex` tests all config combinations
…ld proceed (since this flag is a write-path flag, which we would like to not urge users to specify on the Read path)
@hudi-bot
Copy link
Collaborator

hudi-bot commented Apr 9, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

// nothing CSI in particular could be applied for)
lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)

if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that CSI does not have stats for top level columns, if predicate references both top level and non-top level columns, we gonna skip leveraging CSI is it? since anyways, for non top level column, we have to visit all data files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, how do we deduce what columns have been indexed in MDT CSI?
for eg, we have two flows.
a. hoodie.metadata.index.column.stats.all_columns.enable = true, where in all cols will be enabled.
b. hoodie.metadata.index.column.stats.column.list set to list of columns to be indexed.

So, when we are looking to apply data skipping on the query side, should we check for these configs and decided whether a particular col is indexed by CSI or not ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that CSI does not have stats for top level columns, if predicate references both top level and non-top level columns, we gonna skip leveraging CSI is it? since anyways, for non top level column, we have to visit all data files?

It depends on the predicate, but we will at least try to leverage it to filter out for top-level columns only

So, when we are looking to apply data skipping on the query side, should we check for these configs and decided whether a particular col is indexed by CSI or not ?

We can't do that, we have to play by what's actually in index: this is handled when we execute the filter against lookup table -- if it doesn't contain the column of the filter, it will just match all of the files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, your question made me realize that we're actually deriving index schema incorrectly currently. Let me address that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan i'm addressing this problem in a separate PR to avoid overloading this one: #5275

@nsivabalan nsivabalan merged commit 976840e into apache:master Apr 10, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022
…ble configs (#5244)

Addressing the problem of Data Skipping not respecting Metadata Table configs which might differ b/w write/read paths. More details could be found in HUDI-3812.

- Fixing Data Skipping configuration to respect MT configs (on the Read path)
- Tightening up DS handling of cases when no top-level columns are in the target query
- Enhancing tests to cover all possible case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants