Core, Spark: Remove dangling deletes as part of RewriteDataFilesAction #9724
@szehon-ho can I ask for your eyes first?
szehon-ho
left a comment
Hi @dramaticlly, thanks for doing this. Can you add me as a co-author, as most of the code is from #6581?
Since that is the case, it would be nice if @aokolnychyi takes a look as well.
 * <p>
 *
 * <ul>
 *   <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from iceberg
Maybe we can remove 'remove-dangling-deletes='? It's a bit repetitive, i.e.,
metadata: dangling delete files will be pruned from...
 *
 * <p>
 */
public enum RemoveDanglingDeletesMode {
@aokolnychyi the thought here is that we will have further modes: STATS (for partition file stats) and FULL (run a whole rewritePositionDeletes job).
The default is NONE because removing dangling deletes creates one more snapshot and might break people who depend on listing the snapshots for information.
I am not sure we need this enum to be honest. The decision to use partition stats instead of scanning should be done by Iceberg, not users. If we detect there is a viable partition stats file, we should always use it, instead of scanning the metadata. Also, the FULL mode seems a bit awkward as it would actually rewrite deletes, rather than drop dangling.
I'd not add it for now and see if we want to reconsider this decision later.
 * <p>
 *
 * <ul>
 *   <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from
Same small comment as on the javadoc above (remove 'remove-dangling-deletes=' as it's too repetitive), i.e.,
metadata: dangling delete files will be pruned from...
// Replace the hyphen in order name with underscore to map to the enum value. For example:
// rewrite-position to REWRITE_POSITION
try {
  return RemoveDanglingDeletesMode.valueOf(
This is not needed anymore, right? (The enum values have no hyphen.)
Yes, you are right. Initially I thought of adding rewrite-position so that we could trigger the rewrite position deletes Spark action, but I removed it later. I guess I can remove this mapping now and add it back later if needed.
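For readers following the exchange above, here is a minimal sketch of the hyphen-to-underscore mapping that the diff comment describes. The enum `Mode` and its values are hypothetical and only illustrate the technique; they are not taken from the merged code. It also shows the reviewer's point: when no enum value contains an underscore, the replace step is a no-op.

```java
import java.util.Locale;

public class ModeParsingSketch {

  /** Hypothetical enum for illustration only; not the enum from the PR. */
  enum Mode {
    NONE,
    METADATA,
    REWRITE_POSITION
  }

  /** Maps an option string such as "rewrite-position" to REWRITE_POSITION. */
  static Mode fromOption(String value) {
    // Upper-case the value and swap hyphens for underscores so it matches the enum constant name.
    String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
    try {
      return Mode.valueOf(normalized);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Invalid mode: " + value, e);
    }
  }
}
```

With only single-word values such as NONE and METADATA, `Mode.valueOf(value.toUpperCase(Locale.ROOT))` would be enough.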
    return ImmutableRewriteDataFiles.Result.builder().rewriteResults(rewriteResults);
  }

  private List<DeleteFile> removeDanglingDeletes() {
Do you think we can move this to another package-private class, like RemoveDanglingDeleteSparkAction? (for better code readability)
The logic is non-trivial here and is off by default. I'd probably move it into a separate action, the existing action is already complicated. If so, I am not sure we even have to call it from rewrite data files then. If we ask the user to pass a property explicitly, I'd prefer separating the two actions and have a dedicated procedure.
I should have time to take a look this week.
Okay, I did one pass and here are my high-level notes:
- We should use RewriteFiles instead of DeleteFiles, changes in DeleteFiles should be reverted.
- I don't see a need for the enum to control the cleanup mode.
- I'd consider having a separate action but I can be convinced otherwise. Especially, given that we may account for partition stats in the future.
- I'd consider the following algorithm:
  - Extend data_files and delete_files metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each DeleteFile object already has this info.
  - Query data_files, aggregate, compute min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
  - Query delete_files, potentially projecting only strictly required columns.
  - Join the summary with delete_files on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
  - Collect the result to the driver and use SparkDeleteFile to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.
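The proposed algorithm can be sketched in plain Java, without Spark or Iceberg on the classpath. This is only an illustration under stated assumptions: `FileEntry` is a hypothetical stand-in for a metadata table row, and the partition tuple is a plain string. The predicate assumes the usual sequence number rules: a position delete applies to data files with a data sequence number less than or equal to its own, and an equality delete only to data files with a strictly smaller one. It also assumes that a delete file in a partition with no live data files is dangling, and it ignores equality deletes that apply globally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DanglingDeleteSketch {

  /** Hypothetical, simplified stand-in for a metadata table row; not an Iceberg class. */
  static final class FileEntry {
    final String path;
    final int specId;
    final String partition; // partition tuple rendered as a string for simplicity
    final long dataSequenceNumber;
    final boolean equalityDelete; // only meaningful for delete files

    FileEntry(String path, int specId, String partition, long seq, boolean equalityDelete) {
      this.path = path;
      this.specId = specId;
      this.partition = partition;
      this.dataSequenceNumber = seq;
      this.equalityDelete = equalityDelete;
    }

    /** Join key: spec ID plus partition, so partition evolution keeps specs apart. */
    String groupKey() {
      return specId + "/" + partition;
    }
  }

  /** Aggregates the min data sequence number per (spec ID, partition) over data files. */
  static Map<String, Long> minSequencePerPartition(List<FileEntry> dataFiles) {
    Map<String, Long> min = new HashMap<>();
    for (FileEntry file : dataFiles) {
      min.merge(file.groupKey(), file.dataSequenceNumber, Long::min);
    }
    return min;
  }

  /** Left-joins delete files against the summary and keeps those that match no data file. */
  static List<FileEntry> findDangling(List<FileEntry> dataFiles, List<FileEntry> deleteFiles) {
    Map<String, Long> minSeq = minSequencePerPartition(dataFiles);
    List<FileEntry> dangling = new ArrayList<>();
    for (FileEntry delete : deleteFiles) {
      Long min = minSeq.get(delete.groupKey());
      boolean isDangling;
      if (min == null) {
        // Assumption: no live data files in this partition, so nothing can be deleted.
        isDangling = true;
      } else if (delete.equalityDelete) {
        // Equality deletes apply only to data files with a strictly smaller sequence number.
        isDangling = delete.dataSequenceNumber <= min;
      } else {
        // Position deletes apply to data files with a sequence number <= their own.
        isDangling = delete.dataSequenceNumber < min;
      }
      if (isDangling) {
        dangling.add(delete);
      }
    }
    return dangling;
  }
}
```

In the Spark action the same steps would run as an aggregation and a join over the metadata tables, with only the matching delete files collected to the driver.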
Based on Anton's feedback, I will try to divide the changes into two PRs. The first PR (#9813) adds support for data sequence numbers in the data files and delete files tables. Once it is merged, I will update this PR to scan data_files first to aggregate the min data sequence number per spec/partition, then compare against delete_files. With a left join, we can identify dangling deletes and remove them in one pass. SparkDeleteFile will be used to convert Spark rows to POJOs used for pruning, in consideration of partition evolution. Finally, dangling deletes will be removed by reconstruction instead of by file path, to benefit from manifest pruning when the Iceberg table is scanned.
I guess only partitionData and path are needed; the others are not used.
Optionally can be enabled as part of RewriteDataFilesSparkAction Co-authored-by: Szehon Ho <[email protected]>
With the merge of #10203, I refactored the algorithm a bit: it now scans the entries table to get minSequenceNumberPerPartitionAndSpec and the delete files table to get the data sequence number, instead of relying on the data sequence number as a virtual column. I also identified and fixed the problem in the partition evolution tests, so that is now handled correctly. Would you like to take another look? @szehon-ho @aokolnychyi
szehon-ho
left a comment
Looks mostly good to me, did a review round on the code portion.
- Rewording documentation and add more comments
- Changed removed deletes in results from List to Iterable to save on memory
- Added BaseRemoveDanglingDeleteFiles and generate Immutable implementation
Thanks @dramaticlly! Left some more minor comments, mostly in the doc part as it's a bit of a complex algorithm, but also some others.
- Instantiate spark action without clone session
- Update javadoc to use html order list
- Inline resultBuilder in RewriteDataFilesSparkAction
szehon-ho
left a comment
This looks great to me, @dramaticlly one more comment
# Conflicts: # api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
Merged, thanks @dramaticlly!
Thanks everyone for the review and input, special thanks to @aokolnychyi for the optimized algorithm and to @szehon-ho as the original author and for the detailed review!
Backport #9724 to Spark 3.4
Backport apache#9724 to Spark 3.4
* Core, Spark 3.5: Remove dangling deletes as part of RewriteDataFilesAction (apache#9724)
* Spark 3.4: Action to remove dangling deletes (apache#11377)
* SpotlessApply
---------
Co-authored-by: Hongyue/Steve Zhang <[email protected]>
Goal: Attempt to clean up dangling deletes as part of the Spark RewriteDataFilesAction. This is controlled by the feature flag remove-dangling-deletes, which is turned off by default. Most of the code comes from #6581. For the reasoning on why we need it, the problem and design doc is here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit#Changes
figure out predicate push down for entries metadata table