[HUDI-5420] Fix metadata table validator to exclude uncommitted log files due to retry #7517
Ack! Will review this PR later this week.
zhangyue19921010
left a comment
Overall LGTM. Thanks for this fix!
Just left two minor comments. PTAL.
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
Let's ignore MDT for now. I have a basic question about the inner workings of MOR tables: how are extraneous log files ignored when reading committed data from the DT?
Patch looks good from the MDT validation standpoint.
Force-pushed from 5cffefb to eaa6d00.
Based on my understanding, the extraneous logFile1, or a particular log block from a successful commit, is still read by a snapshot query as long as the file or the block is not corrupted (e.g., partially written). Correctness-wise, this should be okay as long as the update logic generates the same merged payload after applying the same change log twice.
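To make the "same merged payload after applying the same change log twice" condition concrete, here is a minimal sketch. It assumes a simple overwrite-with-latest-value merge; the class `IdempotentMergeSketch` and the method `applyChangeLog` are hypothetical names for illustration and are not part of Hudi's payload API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hudi's record payload or merge API.
final class IdempotentMergeSketch {

  // Applies each change by overwriting the value stored under its key.
  // Overwrite semantics are an assumption made for this sketch.
  static Map<String, String> applyChangeLog(Map<String, String> records,
                                            List<Map.Entry<String, String>> changeLog) {
    Map<String, String> merged = new HashMap<>(records);
    for (Map.Entry<String, String> change : changeLog) {
      merged.put(change.getKey(), change.getValue());
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> base = Map.of("k1", "v1", "k2", "v2");
    List<Map.Entry<String, String>> changeLog =
        List.of(Map.entry("k2", "v2-updated"), Map.entry("k3", "v3"));

    Map<String, String> once = applyChangeLog(base, changeLog);
    // Simulates a duplicate log file from a retried task being read as well.
    Map<String, String> twice = applyChangeLog(once, changeLog);

    System.out.println(once.equals(twice));
  }
}
```

With overwrite semantics the second application changes nothing. A merge that accumulates state (for example, incrementing a counter per change) would not have this property, which is the case the comment above is cautious about.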
…iles due to retry (apache#7517) When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system after the successful delta commit for some time (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) is different from the file system (providing committed data files + uncommitted log files in this case). In such a case, before this PR, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata. This PR fixes the logic of the metadata table validator to check whether the difference in the list of log files between metadata table and direct file system is due to committed log files, based on the commit metadata.
Change Logs
When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system for some time after the delta commit succeeds (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) differs from the file system listing (providing committed data files plus, in this case, uncommitted log files).
In such a case, currently, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata.
This PR fixes the logic of the metadata table validator: it now checks, based on the commit metadata, whether the log files that differ between the metadata table listing and the direct file system listing are committed, so that a difference consisting only of uncommitted log files is no longer reported as a mismatch.
The PR is tested locally with such a case. Before this PR, the metadata table validator throws an exception. After this PR, the validator succeeds.
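The check described above can be sketched roughly as follows. This is a simplified illustration, not the validator's actual code: the class `LogFileListingCheck`, the method `committedLogFilesMatch`, and the log file names are all hypothetical, and `committedLogFiles` stands in for the set of log file paths collected from the commit metadata.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the code in HoodieMetadataTableValidator.
final class LogFileListingCheck {

  // Returns true when the two listings agree on all committed log files.
  static boolean committedLogFilesMatch(Set<String> fromMetadataTable,
                                        Set<String> fromFileSystem,
                                        Set<String> committedLogFiles) {
    // Collect the paths that appear in exactly one of the two listings.
    Set<String> differing = new HashSet<>(fromMetadataTable);
    differing.addAll(fromFileSystem);
    Set<String> common = new HashSet<>(fromMetadataTable);
    common.retainAll(fromFileSystem);
    differing.removeAll(common);

    // A differing path is a real mismatch only if commit metadata lists it.
    for (String path : differing) {
      if (committedLogFiles.contains(path)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Made-up file names; log.3 plays the file left behind by a retried task.
    Set<String> committed = Set.of("f1.log.1", "f1.log.2");
    Set<String> metadataTable = Set.of("f1.log.1", "f1.log.2");
    Set<String> fileSystem = Set.of("f1.log.1", "f1.log.2", "f1.log.3");

    System.out.println(committedLogFilesMatch(metadataTable, fileSystem, committed)); // prints "true"
  }
}
```

Comparing only the differing paths against the committed set keeps the check tolerant of leftover files from retries while still failing when a committed log file is missing from either listing.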
Impact
This PR improves the robustness of the metadata table validator so that it does not fire false alarms for the valid case above.
Risk level
low
Documentation Update
N/A