Skip to content

Conversation

@yihua
Copy link
Contributor

@yihua yihua commented Dec 20, 2022

Change Logs

When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system after the successful delta commit for some time (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) is different from the file system (providing committed data files + uncommited log files in this case).

In such a case, currently, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata.

This PR fixes the logic of the metadata table validator to check whether the difference in the list of log files between metadata table and direct file system is due to committed log files, based on the commit metadata.

The PR is tested locally with such a case. Before this PR, the metadata table validator throws an exception. After this PR, the validator succeeds.

Impact

This PR improves the robustness of the metadata table validator so that it does not fire false alarms for the valid case above.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua added priority:critical Production degraded; pipelines stalled metadata labels Dec 20, 2022
@yihua yihua requested a review from nsivabalan December 20, 2022 06:49
@zhangyue19921010
Copy link
Contributor

Ack! Will review this pr later this week.

Copy link
Contributor

@zhangyue19921010 zhangyue19921010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thanks for this fixing!
Just left two minor comments. PTAL

@nsivabalan
Copy link
Contributor

Let's ignore MDT for now. I have some basic doubt on MOR table inner workings.

So, how does extraneous log files are ignored while reading a committed data from DT?
ie. let's say we make commit3 which had some spark retries. So, instead of logFile1, we now have logFile1 and logFile2, where only logFile2 is valid. our marker based re-concilliation is not going to delete nor add rollback block for logFile1.
So, when someone does snapshot read, where exactly we skip logFile1? It should be part of AbstractLogRecordReader right. I could not locate it only.

@nsivabalan
Copy link
Contributor

Patch looks good from MDT validation standpoint.

@yihua yihua force-pushed the HUDI-5420-fix-mt-validator-log-files branch from 5cffefb to eaa6d00 Compare December 28, 2022 21:19
@yihua
Copy link
Contributor Author

yihua commented Dec 28, 2022

Let's ignore MDT for now. I have some basic doubt on MOR table inner workings.

So, how does extraneous log files are ignored while reading a committed data from DT? ie. let's say we make commit3 which had some spark retries. So, instead of logFile1, we now have logFile1 and logFile2, where only logFile2 is valid. our marker based re-concilliation is not going to delete nor add rollback block for logFile1. So, when someone does snapshot read, where exactly we skip logFile1? It should be part of AbstractLogRecordReader right. I could not locate it only.

Based on my understanding, the extraneous logFile1 or particular log block from a successful commit is read for snapshot query if the file or the block is not corrupted, e.g., partially written. Correctness-wise, it should be okay as long as the update logic generates the same merged payload after applying the same change log twice.

@apache apache deleted a comment from hudi-bot Dec 29, 2022
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit fb28ad8 into apache:master Dec 29, 2022
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
…iles due to retry (apache#7517)

When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system after the successful delta commit for some time (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) is different from the file system (providing committed data files + uncommited log files in this case).  In such a case, before this PR, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata.

This PR fixes the logic of the metadata table validator to check whether the difference in the list of log files between metadata table and direct file system is due to committed log files, based on the commit metadata.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…iles due to retry (apache#7517)

When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system after the successful delta commit for some time (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) is different from the file system (providing committed data files + uncommited log files in this case).  In such a case, before this PR, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata.

This PR fixes the logic of the metadata table validator to check whether the difference in the list of log files between metadata table and direct file system is due to committed log files, based on the commit metadata.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants