[HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator #4878
Conversation
@zhangyue19921010 Feel free to review this PR
Force-pushed …eMetadataTableValidator from ca5cc35 to d53a320
LGTM, just a few comments. Thanks for this contribution!
List<HoodieColumnRangeMetadata<String>> metadataBasedColStats = metadataTableBasedContext
    .getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
List<HoodieColumnRangeMetadata<String>> fsBasedColStats = fsBasedContext
    .getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
We get latestBaseFilenameList from fsBasedContext and use it to build both metadataBasedColStats and fsBasedColStats. Could we instead do
fsBasedContext -> fileNameListFsBased -> fsBasedColStats
metadataTableBasedContext -> fileNameListMetaBased -> metadataBasedColStats
Or is it OK to reuse the same latestBaseFilenameList?
It is intentional to use the same input latestBaseFilenameList for both column stats and bloom filter validation, to limit the validation scope. Since there is already a separate validation task for the latest base file list, we don't duplicate the logic for fetching it here; only the column stats themselves are validated.
Makes sense. Thanks for your explanation :)
Thanks for raising the concern. The idea is to separate issues caused by wrong column stats from issues caused by a mismatched latest base file list, for clarity of validation.
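For illustration, the comparison pattern roughly looks like the sketch below. Method and type names are taken from the code snippets in this thread; the failure handling at the end is an assumption for illustration only, not the validator's exact reporting.

List<String> latestBaseFilenameList = fsBasedContext.getSortedLatestBaseFileList(partitionPath)
    .stream().map(BaseFile::getFileName).collect(Collectors.toList());

// Both contexts are queried with the same file list, so any mismatch points at the column
// stats themselves rather than at a diverging file listing (covered by a separate task).
List<HoodieColumnRangeMetadata<String>> metadataBasedColStats = metadataTableBasedContext
    .getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
List<HoodieColumnRangeMetadata<String>> fsBasedColStats = fsBasedContext
    .getSortedColumnStatsList(partitionPath, latestBaseFilenameList);

if (!metadataBasedColStats.equals(fsBasedColStats)) {
  // Assumed failure handling for this sketch only.
  throw new HoodieValidationException("Column stats mismatch for partition " + partitionPath);
}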
      .collect(Collectors.toList());
} else {
  return baseFileNameList.stream().flatMap(filename ->
      new ParquetUtils().readRangeFromParquetMetadata(
We use ParquetUtils to build HoodieColumnRangeMetadata in the fs-based case, and tableMetadata.getColumnStats to build it in the metadata-based case. Also, I see that the getColumnStats API is not supported in FileSystemBackedTableMetadata:
@Override
public Map<Pair<String, String>, HoodieMetadataColumnStats> getColumnStats(final List<Pair<String, String>> partitionNameFileNameList, final String columnName)
    throws HoodieMetadataException {
  throw new HoodieMetadataException("Unsupported operation: getColumnsStats!");
}
Maybe we could move this logic into FileSystemBackedTableMetadata?
Just a suggestion; the current logic looks good to me.
Good call! I thought about that, but it requires some non-trivial refactoring inside FileSystemBackedTableMetadata and related classes. So I suggest following up on this later as an independent task: https://issues.apache.org/jira/browse/HUDI-3532
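To make the suggestion concrete, a very rough sketch (not actual Hudi code) of what that follow-up could look like is below. The parameter and helper names (conf, basePath, toMetadataColumnStats) are placeholders, and the readRangeFromParquetMetadata call is assumed to have the shape shown in the snippet above.

// Hypothetical shape of the HUDI-3532 follow-up: serve column stats from parquet footers
// behind the same metadata interface, so callers no longer branch on enableMetadataTable.
Map<Pair<String, String>, HoodieMetadataColumnStats> getColumnStatsFromFileSystem(
    Configuration conf, String basePath,
    List<Pair<String, String>> partitionNameFileNameList, String columnName) {
  Map<Pair<String, String>, HoodieMetadataColumnStats> result = new HashMap<>();
  for (Pair<String, String> partitionAndFile : partitionNameFileNameList) {
    Path filePath = new Path(basePath, partitionAndFile.getLeft() + "/" + partitionAndFile.getRight());
    new ParquetUtils()
        .readRangeFromParquetMetadata(conf, filePath, Collections.singletonList(columnName))
        // toMetadataColumnStats is a made-up conversion from HoodieColumnRangeMetadata
        // to the Avro-based HoodieMetadataColumnStats model.
        .forEach(range -> result.put(partitionAndFile, toMetadataColumnStats(range)));
  }
  return result;
}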
Nice.
    HoodieMetadataValidationContext fsBasedContext, String partitionPath) {
  List<String> latestBaseFilenameList = fsBasedContext.getSortedLatestBaseFileList(partitionPath)
      .stream().map(BaseFile::getFileName).collect(Collectors.toList());
  List<BloomFilterData> metadataBasedBloomFilters = metadataTableBasedContext
Same question about latestBaseFilenameList as mentioned above.
Thanks for raising this. The same reasoning as above applies.
      .collect(Collectors.toList());
} else {
  return baseFileNameList.stream()
      .map(filename -> readBloomFilterFromFile(partitionPath, filename))
Same as above:
@Override
public Map<Pair<String, String>, ByteBuffer> getBloomFilters(final List<Pair<String, String>> partitionNameFileNameList)
    throws HoodieMetadataException {
  throw new HoodieMetadataException("Unsupported operation: getBloomFilters!");
}
If we could unify the API, we wouldn't need to check if (enableMetadataTable) here.
That's a good point as well. I would tackle that in a separate PR and refactor the bloom index altogether: https://issues.apache.org/jira/browse/HUDI-3533
      .map(filename -> readBloomFilterFromFile(partitionPath, filename))
      .filter(Option::isPresent)
      .map(Option::get)
      .collect(Collectors.toList());
Do we also need to call .sorted() here?
Good catch! Fixed.
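For reference, the fs-based side then looks roughly like the following sketch, assuming BloomFilterData provides a natural ordering (otherwise a comparator would be passed to sorted()):

// Sorting both sides gives a deterministic order for the element-wise comparison;
// sorting only one side could report spurious mismatches for identical contents.
return baseFileNameList.stream()
    .map(filename -> readBloomFilterFromFile(partitionPath, filename))
    .filter(Option::isPresent)
    .map(Option::get)
    .sorted()
    .collect(Collectors.toList());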
I have quite a few PRs and issues to catch up on. I will let @zhangyue19921010 review this. Once he stamps it, I can help land it.
All good. We can land it after the CI jobs pass.
What is the purpose of the pull request
This PR adds the functionality to validate the column stats and bloom filters stored in the metadata table against the ground truth in base files in HoodieMetadataTableValidator.
Brief change log
- Adds new options to HoodieMetadataTableValidator:
  - --validate-all-column-stats: validate column stats for all columns in the schema
  - --validate-bloom-filters: validate bloom filters of base files
- Adds BloomFilterData to facilitate validation of bloom filters
- Makes ParquetUtils::readMetadata static
Verify this pull request
This PR is verified by running the HoodieMetadataTableValidator with spark-submit for all validation tasks to make sure that the existing and new validation logic work properly. The validation of column stats was able to catch a bug, which is addressed in #4875.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.