-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: Add validation that delete file referenced files are not rewritten #2892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
rdblue
wants to merge
1
commit into
apache:master
from
rdblue:validate-rewrite-delete-file-data-files-exist
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reconsidered the comment, I agree that conflicts between REPLACE and OVERWRITE/INSERT (say conflict-1) is another different story compared to conflicts between REPLACE and REPLACE (say conflict-2).
For this case
The REPLACE operation remove the data files that was relied by other committed APPEND/OVERWRITE/DELTE operations, both conflict-1 and conflict-2 should be avoided because both of them will lead to incorrect data set.For the other case
The APPEND/OVERWRITE/DELETE operations removed the data files that was relied by other committing REPLACE operation, conflict-1 won't lead to incorrect data set although there will be some remaining dangling positional deletes (as you said in this comment). but it's possible to lead to incorrect data set when considering conflict-2:Step.1 : Table has FILE_A and EQ_DELETE_FILE_A;
Step.2 : RewriteAction_1 rewrite the FILE_A to another FILE_B - not commit;
Step.3 : RewriteAction_2 rewrite the EQ_DELETE_FILE_A to POS_DELETE_FILE_C which reference to FILE_A - not commit.
Step.4. : Committed RewriteAction_1 ;
Step.5 : Committed RewriteAction_2.
In the end, the POS_DELETE_FILE_C won't be able to apply to the newly rewritten FILE_B, which create the incorrect data set. Using older sequence number for RewriteAction also cannot fix this bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case I was thinking about in my last comment on the other PR. In order for
RewriteAction_1to be valid, it must reuse the sequence fromFILE_AforFILE_B. Otherwise, the rewrite on its own would have un-deleted a row becauseEQ_DELETE_FILE_Awould no longer apply.The validation in this PR can catch this case because the files referenced by
POS_DELETE_FILE_Cwould be passed to the validation. That'sFILE_Ain this case and the commit forRewriteAction_2would check thatFILE_Astill exists and would fail. I'm fine merging this PR if you think that this is something that may happen.But, I think that it is really unlikely that rewrites will alter sequence numbers to avoid applying deletes. That makes little sense because you may as well apply deletes as long as you're rewriting the data. But assuming that you wanted to, this may not even be possible if the files that are rewritten are from different sequence numbers, with different sets of equality delete files that must be applied. I think a far better option is to apply the deletes when rewriting.
I'm fine merging this if you think we need it. I'll remove the draft status so that we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the assumptions we made while implementing the new data compaction is that we will always apply deletes while compacting data and will use a new sequence number for compacted files to make sure no deletes apply to them. Is there a good use case not to do that?
Well, if we wanted to compact data files without applying deletes, the implementation would have to be really complicated to address the point above.
If there is a use case (even though there is no implementation yet), I think we should add this validation. If not, I'd probably skip it. I don't have a use case in mind right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @RussellSpitzer, any use case you have?