Skip to content

Conversation

@yihua
Copy link
Contributor

@yihua yihua commented Feb 23, 2022

What is the purpose of the pull request

This PR adds the functionality to validate the column stats and bloom filters stored in metadata table against the ground truth in base files in HoodieMetadataTableValidator.

Brief change log

  • Adds two new validation tasks in HoodieMetadataTableValidator
  • --validate-all-column-stats: validate column stats for all columns in the schema
  • --validate-bloom-filters: validate bloom filters of base files
  • Adds a new class BloomFilterData to facilitate validation of bloom filters
  • Makes ParquetUtils::readMetadata static
  • Cleans up naming and adds minor changes around logging messages and levels

Verify this pull request

This PR is verified by running the HoodieMetadataTableValidator with spark-submit for all validation task to make sure that existing and new validation logic work properly. The validation of column stats is able to catch a bug, which is addressed in #4875 .

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yihua
Copy link
Contributor Author

yihua commented Feb 23, 2022

@zhangyue19921010 Feel free to review this PR

@yihua yihua force-pushed the HUDI-3465-metadata-index-validation branch from ca5cc35 to d53a320 Compare February 23, 2022 06:30
@nsivabalan nsivabalan self-assigned this Feb 23, 2022
Copy link
Contributor

@zhangyue19921010 zhangyue19921010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a few comments. Thanks for this contribution!

List<HoodieColumnRangeMetadata<String>> metadataBasedColStats = metadataTableBasedContext
.getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
List<HoodieColumnRangeMetadata<String>> fsBasedColStats = fsBasedContext
.getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get latestBaseFilenameList based on fsBasedContext and use it to get metadataBasedColStats and fsBasedColStats both. Could we do

fsBasedContext -> fileNameListFsBased -> fsBasedColStats
metadataTableBasedContext -> fileNameListMetaBased -> metadataBasedColStats

Or it's ok to reuse the same latestBaseFilenameList .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional to use the same input latestBaseFilenameList for column stats and bloom filter validation to limit the validation scope. Since there is already a validation task for the latest base file, here we don't use different logic for fetching the latest base file list and the col stats are validated only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Thanks for your explanation :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising the concern. The idea is to separate the issue due to wrong column stats from the mismatched latest base file list for clarity of validation.

.collect(Collectors.toList());
} else {
return baseFileNameList.stream().flatMap(filename ->
new ParquetUtils().readRangeFromParquetMetadata(
Copy link
Contributor

@zhangyue19921010 zhangyue19921010 Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use ParquetUtils to build HoodieColumnRangeMetadata when fs based
and use tableMetadata.getColumnStats to build it when metaBased.

Also I see that the api getColumnStats is not supported in FileSystemBackedTableMetadata

  @Override
  public Map<Pair<String, String>, HoodieMetadataColumnStats> getColumnStats(final List<Pair<String, String>> partitionNameFileNameList, final String columnName)
      throws HoodieMetadataException {
    throw new HoodieMetadataException("Unsupported operation: getColumnsStats!");
  }

Maybe we could move this logic into FileSystemBackedTableMetadata ?

Just a suggestion and it looks good to me for current logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I thought about that but that requires some refactoring inside FileSystemBackedTableMetadata and related classes and that's not trivial. So I suggest following this up later as an independent task: https://issues.apache.org/jira/browse/HUDI-3532

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

HoodieMetadataValidationContext fsBasedContext, String partitionPath) {
List<String> latestBaseFilenameList = fsBasedContext.getSortedLatestBaseFileList(partitionPath)
.stream().map(BaseFile::getFileName).collect(Collectors.toList());
List<BloomFilterData> metadataBasedBloomFilters = metadataTableBasedContext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question for latestBaseFilenameList mentioned before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this. The same reasoning as above.

.collect(Collectors.toList());
} else {
return baseFileNameList.stream()
.map(filename -> readBloomFilterFromFile(partitionPath, filename))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as

  @Override
  public Map<Pair<String, String>, ByteBuffer> getBloomFilters(final List<Pair<String, String>> partitionNameFileNameList)
      throws HoodieMetadataException {
    throw new HoodieMetadataException("Unsupported operation: getBloomFilters!");
  }

If we could unify the API, we don' need to check if (enableMetadataTable) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point as well. I would tackle that in a separate PR and refactor the bloom index altogether: https://issues.apache.org/jira/browse/HUDI-3533

.map(filename -> readBloomFilterFromFile(partitionPath, filename))
.filter(Option::isPresent)
.map(Option::get)
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also need to call .sorted() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed.

@nsivabalan
Copy link
Contributor

I have quite a few PRs and issues to catch up. I will let @zhangyue19921010 review this. Once he stamps, I can help land it.

@zhangyue19921010
Copy link
Contributor

All good. We could land it after CI jobs passed.
CC @nsivabalan :)

@apache apache deleted a comment from hudi-bot Mar 1, 2022
@hudi-bot
Copy link
Collaborator

hudi-bot commented Mar 1, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit 257052a into apache:master Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants