Conversation

@dramaticlly (Contributor) commented Feb 13, 2024

Goal: Clean up dangling delete files as part of the Spark RewriteDataFilesAction. The behavior is controlled by the feature flag remove-dangling-deletes, which is turned off by default (see the usage sketch below). Most of the code comes from #6581. The problem statement and design doc explaining why we need this are here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit#

Changes

  • DeleteFiles extended to remove a given DeleteFile
  • RewriteDataFilesResult now provides a count of the dangling delete files removed
  • withReusableDS() moved from RewriteManifestsSparkAction to the base class so it can be reused in RewriteDataFilesAction

TODO: figure out predicate pushdown for the entries metadata table
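For context, a minimal usage sketch of the flag (assuming a SparkSession `spark` and a Table `table` are in scope; the option key is from this PR, while the result accessor name is an assumption):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// Rewrite data files; with the flag enabled, dangling delete files are dropped too.
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("remove-dangling-deletes", "true") // off by default
        .execute();

// The result is extended to report how many dangling delete files were removed
// (accessor name assumed here).
int removedCount = result.removedDeleteFilesCount();
```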

@dramaticlly (Contributor, author):

@szehon-ho can I ask for your eyes first?

@dramaticlly force-pushed the danglingDeletes branch 5 times, most recently from aca367c to abe98de on February 15, 2024 00:13
@dramaticlly marked this pull request as ready for review on February 20, 2024 18:11
@szehon-ho (Member) left a comment:

Hi @dramaticlly, thanks for doing this. Can you put me as a co-author, since most of the code is from #6581?

And as that is the case, it would be nice if @aokolnychyi takes a look as well.

* <p>
*
* <ul>
* <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from iceberg
szehon-ho (Member):

maybe we can remove 'remove-dangling-deletes='; it's a bit repetitive?

i.e.,

metadata: dangling delete files will be pruned from...

*
* <p>
*/
public enum RemoveDanglingDeletesMode {
szehon-ho (Member):

@aokolnychyi the thought here is that we will have further modes: STATS (for partition file stats) and FULL (run a whole rewritePositionDeletes job).

Default is NONE because it does create one more snapshot and might break people who depend on listing the snapshots for information.

aokolnychyi (Contributor):

I am not sure we need this enum, to be honest. The decision to use partition stats instead of scanning should be made by Iceberg, not users. If we detect there is a viable partition stats file, we should always use it instead of scanning the metadata. Also, the FULL mode seems a bit awkward, as it would actually rewrite deletes rather than drop dangling ones.

I'd not add it for now and see if we want to reconsider this decision later.

* <p>
*
* <ul>
* <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from
szehon-ho (Member):

Same small comment as the above javadoc (remove 'remove-dangling-deletes=' as it's too repetitive), i.e.,

metadata: dangling delete files will be pruned from...

// Replace the hyphen in the option name with an underscore to map to the enum value. For example:
// rewrite-position to REWRITE_POSITION
try {
return RemoveDanglingDeletesMode.valueOf(
szehon-ho (Member):

this is not needed anymore, right? (The enum values have no hyphen.)

dramaticlly (Contributor, author):

Yes, you are right. Initially I thought to add rewrite-position so that we could trigger the rewrite position deletes Spark action, but it was removed later. I guess I can remove this now and add it back later.
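For reference, the hyphen-to-underscore mapping discussed above would look roughly like this (a sketch; `modeName` is a hypothetical variable holding the user-supplied option value):

```java
import java.util.Locale;

// Map an option value such as "rewrite-position" to the enum constant REWRITE_POSITION.
RemoveDanglingDeletesMode mode =
    RemoveDanglingDeletesMode.valueOf(
        modeName.toUpperCase(Locale.ENGLISH).replace('-', '_'));
```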

return ImmutableRewriteDataFiles.Result.builder().rewriteResults(rewriteResults);
}

private List<DeleteFile> removeDanglingDeletes() {
szehon-ho (Member):

Do you think we can move this to another, package-protected class like RemoveDanglingDeleteSparkAction? (for better code readability)

aokolnychyi (Contributor):

The logic is non-trivial here and is off by default. I'd probably move it into a separate action; the existing action is already complicated. If so, I am not sure we even have to call it from rewrite data files. If we ask the user to pass a property explicitly, I'd prefer separating the two actions and having a dedicated procedure.

@aokolnychyi (Contributor):
I should have time to take a look this week.

@aokolnychyi (Contributor) left a comment:

Okay, I did one pass and here are my high-level notes:

  • We should use RewriteFiles instead of DeleteFiles; the changes in DeleteFiles should be reverted.
  • I don't see a need for the enum to control the cleanup mode.
  • I'd consider having a separate action, but I can be convinced otherwise, especially given that we may account for partition stats in the future.
  • I'd consider the following algorithm (sketched below):
    • Extend the data_files and delete_files metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each DeleteFile object already has this info.
    • Query data_files, aggregate, and compute the min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
    • Query delete_files, potentially projecting only strictly required columns.
    • Join the summary with delete_files on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
    • Collect the result to the driver and use SparkDeleteFile to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.
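For illustration, a minimal sketch of this algorithm (assuming `spark` and a table identifier `tableName` are in scope, and that the metadata tables expose a `data_sequence_number` column; column and alias names here are assumptions):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.min;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Min data sequence number of live data files per (spec_id, partition).
Dataset<Row> minSeq =
    spark.table(tableName + ".data_files")
        .groupBy(col("spec_id"), col("partition"))
        .agg(min("data_sequence_number").alias("min_data_seq"));

Dataset<Row> deleteFiles = spark.table(tableName + ".delete_files");

Column sameSpecAndPartition =
    deleteFiles.col("spec_id").equalTo(minSeq.col("spec_id"))
        .and(deleteFiles.col("partition").equalTo(minSeq.col("partition")));

// Position deletes (content = 1) apply to data files with data_seq <= delete_seq,
// so they dangle when strictly below the partition minimum; equality deletes
// (content = 2) apply to data_seq < delete_seq, so they dangle at or below it.
// A left join also catches partitions that have deletes but no live data files.
Dataset<Row> dangling =
    deleteFiles
        .join(minSeq, sameSpecAndPartition, "left")
        .where(
            minSeq.col("min_data_seq").isNull()
                .or(deleteFiles.col("content").equalTo(1)
                    .and(deleteFiles.col("data_sequence_number").lt(minSeq.col("min_data_seq"))))
                .or(deleteFiles.col("content").equalTo(2)
                    .and(deleteFiles.col("data_sequence_number").leq(minSeq.col("min_data_seq")))));
```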


@dramaticlly (Contributor, author) commented Feb 27, 2024, quoting the review above:

Okay, I did one pass and here are my high-level notes:

  • We should use RewriteFiles instead of DeleteFiles, changes in DeleteFiles should be reverted.

  • I don't see a need for the enum to control the cleanup mode.

  • I'd consider having a separate action but I can be convinced otherwise. Especially, given that we may account for partition stats in the future.

  • I'd consider the following algorithm:

    • Extend data_files and delete_files metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each DeleteFile object already has this info.
    • Query data_files, aggregate, compute min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
    • Query delete_files, potentially projecting only strictly required columns.
    • Join the summary with delete_files on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
    • Collect the result to the driver and use SparkDeleteFile to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.

Based on Anton's feedback, I will try to divide the changes into two PRs: the first PR (#9813) adds data sequence number support to the data and delete files metadata tables. Once that is merged, I will update this PR to scan data_files first and aggregate the per-spec/partition min data sequence number, then compare it against delete_files. With a left join, we can identify dangling deletes and remove them in one pass. SparkDeleteFile will be used to convert each Spark row to a POJO used for pruning, taking partition evolution into account. Finally, dangling deletes will be removed by reconstruction instead of by file path, to benefit from manifest pruning when the Iceberg table is scanned (a sketch of this removal step follows below).
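A rough sketch of that removal step (assumptions: `table` is the Iceberg Table, `danglingRows` is the List of Rows collected from the join above, and the SparkDeleteFile wiring shown is hypothetical and may not match its actual constructor):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Partitioning;
import org.apache.iceberg.RewriteFiles;
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.spark.source.SparkDeleteFile;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Rebuild DeleteFile objects from the collected rows (reconstruction rather than
// raw file paths) so later table scans still benefit from manifest pruning.
Types.StructType combinedType = DataFile.getType(Partitioning.partitionType(table));
StructType sparkType = SparkSchemaUtil.convert(new Schema(combinedType.fields()));

List<DeleteFile> danglingDeletes =
    danglingRows.stream()
        .map(row -> new SparkDeleteFile(combinedType, sparkType).wrap(row)) // hypothetical ctor
        .collect(Collectors.toList());

// Drop the dangling deletes in a single commit via RewriteFiles.
RewriteFiles rewriteFiles = table.newRewrite();
danglingDeletes.forEach(rewriteFiles::deleteFile);
rewriteFiles.commit();
```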

@zinking (Contributor) commented Feb 27, 2024

"Finally, dangling delete will be removed by reconstruction instead of by file path, to benefit manifest pruning when iceberg table was scanned."

I guess only partitionData and path are needed; the others are all unused.

Optionally can be enabled as part of RewriteDataFilesSparkAction

Co-authored-by: Szehon Ho <[email protected]>
@dramaticlly (Contributor, author):
With the merge of #10203, I refactored the algorithm a bit to scan the entries table to compute minSequenceNumberPerPartitionAndSpec, and to read the data sequence number from the delete files table instead of relying on it as a virtual column (a sketch of the aggregation follows below). I also identified and fixed the problem in the partition evolution tests, so that is now all handled correctly. Would you like to take another look? @szehon-ho @aokolnychyi
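As a rough illustration of the entries-based aggregation (a sketch only; the filter constants follow the entries metadata table schema, where status 2 means DELETED and data_file.content 0 means a data file, but treat the exact column paths as assumptions):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.min;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Min data sequence number per (spec_id, partition), computed from live entries
// in the entries metadata table instead of a virtual column on delete_files.
Dataset<Row> minSequenceNumberPerPartitionAndSpec =
    spark.table(tableName + ".entries")
        .filter(col("status").notEqual(2)) // skip DELETED entries
        .filter(col("data_file.content").equalTo(0)) // data files only
        .groupBy(col("data_file.spec_id"), col("data_file.partition"))
        .agg(min(col("sequence_number")).alias("min_data_sequence_number"));
```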

@szehon-ho (Member) left a comment:

Looks mostly good to me, did a review round on the code portion.

- Rewording documentation and add more comments
- Changed removed deletes in results from List to Iterable to save on memory
- Added BaseRemoveDanglingDeleteFiles and generate Immutable implementation
@szehon-ho (Member) left a comment:

Thanks @dramaticlly! Left some more minor comments, mostly in the doc part as it's a bit of a complex algorithm, but also some others.

- Instantiate spark action without clone session
- Update javadoc to use html order list
- Inline resultBuilder in RewriteDataFilesSparkAction
@szehon-ho (Member) left a comment:

This looks great to me. @dramaticlly, one more comment.

# Conflicts:
#	api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
@szehon-ho merged commit a8ec43d into apache:main on Oct 22, 2024
@szehon-ho (Member):

Merged, thanks @dramaticlly !

@dramaticlly deleted the danglingDeletes branch on October 22, 2024 23:14
@dramaticlly (Contributor, author):

Thanks everyone for the review and input. Special thanks to @aokolnychyi for the optimized algorithm and to @szehon-ho as the original author and for the detailed review!

dramaticlly added a commit to dramaticlly/iceberg that referenced this pull request Oct 23, 2024
amogh-jahagirdar pushed a commit that referenced this pull request Oct 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
parthchandra pushed a commit to parthchandra/iceberg that referenced this pull request Oct 22, 2025
* Core, Spark 3.5: Remove dangling deletes as part of RewriteDataFilesAction (apache#9724)

* Spark 3.4: Action to remove dangling deletes (apache#11377)

* SpotlessApply

---------

Co-authored-by: Hongyue/Steve Zhang <[email protected]>