-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Make removal of delete files optional in manifest filtering #4370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| dropPartitions.contains(file.specId(), file.partition()) || | ||
| (isDelete && entry.sequenceNumber() > 0 && entry.sequenceNumber() < minSequenceNumber); | ||
|
|
||
| boolean nonMatchingDeleteFile = !file.content().equals(FileContent.DATA) && !evaluator.rowsMustMatch(file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid extra calls to evaluator if possible. Also, there was isDelete var already defined.
| reader.entries().forEach(entry -> { | ||
| F file = entry.file(); | ||
| boolean fileDelete = deletePaths.contains(file.path()) || | ||
| boolean markedForDelete = deletePaths.contains(file.path()) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed for consistency.
| manifest.path(), wrapper.get()); | ||
| duplicateDeleteCount += 1; | ||
| if (allRowsMatch) { | ||
| writer.delete(entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply moved the existing logic under if now.
| .commit(); | ||
|
|
||
| validateTableFiles(table, FILE_DAY_2, FILE_DAY_2_MODIFIED); | ||
| validateTableDeleteFiles(table, FILE_DAY_2_POS_DELETES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would previously fail as we should remove one delete file and one should be kept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this fix. It makes sense, you are right on the optimization tip and the missing case where it passes the check but not the actual delete, sorry I did not see this earlier.
| hasDeletedFiles = allRowsMatch; | ||
|
|
||
| if (failAnyDelete) { | ||
| throw new DeleteException(reader.spec().partitionToPath(file.partition())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In failAnyDelete case, this would trigger if we have a deleteFile but rows do not match. Probably this never happens in delete call, but wonder if we need to update the condition (failAnyDelete && allRowsMatch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I'll fix.
| table.newOverwrite() | ||
| .overwriteByRowFilter(EXPRESSION_DAY_2_ANOTHER_ID_RANGE) | ||
| .addFile(FILE_DAY_2_MODIFIED) | ||
| .validateFromSnapshot(baseSnapshot.snapshotId()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: this validate from snapshot seems to be a bit useless in reproducing the problem, especially as there should be no other transactions after this base here, would it be better to remove all the validateXX flags to simplify the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, will simplify.
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, new changes look good to me.
|
Thanks, @szehon-ho! |
…ache#4370) (cherry picked from commit ab1a3a8)
This PR is a follow-up to #4304. In that PR, we updated the method that checks whether a manifest has an entry to be deleted. However, we did not update the method that actually does the manifest filtering. As a consequence, the test added in this PR would fail with a runtime exception as one delete file should be removed and one kept. In addition, we should not introduce any extra calls to
evaluatoras they are expensive.Overall, the problem is that whenever we overwrite data by filter, we should not fail operations when it is not possible to remove delete files. Removal of delete files is optional and is an optimization we should perform if possible.