Conversation

@dramaticlly (Contributor) commented Feb 13, 2024

Goal: Clean up dangling delete files as part of the Spark RewriteDataFilesAction. The behavior is controlled by the feature flag remove-dangling-deletes, which is turned off by default (see the usage sketch below). Most of the code comes from #6581. The problem statement and design doc explaining why we need this are here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit#

Changes

  • DeleteFiles extended to remove a given DeleteFile
  • RewriteDataFilesResult now provides a count of the dangling delete files removed
  • withReusableDS() moved from RewriteManifestsSparkAction to the base class so it can be reused in RewriteDataFilesAction

TODO: figure out predicate pushdown for the entries metadata table
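For context, a minimal usage sketch of the flag (assuming a SparkSession `spark` and a Table `table` are in scope; the option key is from this PR, while the result accessor name is an assumption):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// Rewrite data files; with the flag enabled, dangling delete files are dropped too.
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("remove-dangling-deletes", "true") // off by default
        .execute();

// The result is extended to report how many dangling delete files were removed
// (accessor name assumed here).
int removedCount = result.removedDeleteFilesCount();
```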

@dramaticlly (Contributor, author):

@szehon-ho can I ask for your eyes first?

@dramaticlly force-pushed the danglingDeletes branch 5 times, most recently from aca367c to abe98de on February 15, 2024 00:13
@dramaticlly marked this pull request as ready for review on February 20, 2024 18:11
@szehon-ho (Member) left a comment:

Hi @dramaticlly, thanks for doing this. Can you put me as a co-author, since most of the code is from #6581?

And as that is the case, it would be nice if @aokolnychyi takes a look as well.

* <p>
*
* <ul>
* <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from iceberg
szehon-ho (Member):

maybe we can remove 'remove-dangling-deletes='; it's a bit repetitive?

i.e.,

metadata: dangling delete files will be pruned from...

*
* <p>
*/
public enum RemoveDanglingDeletesMode {
szehon-ho (Member):

@aokolnychyi the thought here is that we will have further modes: STATS (for partition file stats) and FULL (run a whole rewritePositionDeletes job).

Default is NONE because it does create one more snapshot and might break people who depend on listing the snapshots for information.

aokolnychyi (Contributor):

I am not sure we need this enum, to be honest. The decision to use partition stats instead of scanning should be made by Iceberg, not users. If we detect there is a viable partition stats file, we should always use it instead of scanning the metadata. Also, the FULL mode seems a bit awkward, as it would actually rewrite deletes rather than drop dangling ones.

I'd not add it for now and see if we want to reconsider this decision later.

* <p>
*
* <ul>
* <li>If remove-dangling-deletes=metadata, then dangling delete files will be pruned from
szehon-ho (Member):

Same small comment as the above javadoc (remove 'remove-dangling-deletes=' as it's too repetitive), i.e.,

metadata: dangling delete files will be pruned from...

// Replace the hyphen in the option name with an underscore to map to the enum value. For example:
// rewrite-position to REWRITE_POSITION
try {
return RemoveDanglingDeletesMode.valueOf(
szehon-ho (Member):

this is not needed anymore, right? (The enum values have no hyphen.)

dramaticlly (Contributor, author):

Yes, you are right. Initially I thought to add rewrite-position so that we could trigger the rewrite position deletes Spark action, but it was removed later. I guess I can remove this now and add it back later.
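For reference, the hyphen-to-underscore mapping discussed above would look roughly like this (a sketch; `modeName` is a hypothetical variable holding the user-supplied option value):

```java
import java.util.Locale;

// Map an option value such as "rewrite-position" to the enum constant REWRITE_POSITION.
RemoveDanglingDeletesMode mode =
    RemoveDanglingDeletesMode.valueOf(
        modeName.toUpperCase(Locale.ENGLISH).replace('-', '_'));
```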

return ImmutableRewriteDataFiles.Result.builder().rewriteResults(rewriteResults);
}

private List<DeleteFile> removeDanglingDeletes() {
szehon-ho (Member):

Do you think we can move this to another, package-protected class like RemoveDanglingDeleteSparkAction? (for better code readability)

aokolnychyi (Contributor):

The logic is non-trivial here and is off by default. I'd probably move it into a separate action; the existing action is already complicated. If so, I am not sure we even have to call it from rewrite data files. If we ask the user to pass a property explicitly, I'd prefer separating the two actions and having a dedicated procedure.

@aokolnychyi (Contributor):
I should have time to take a look this week.

@aokolnychyi (Contributor) left a comment:

Okay, I did one pass and here are my high-level notes:

  • We should use RewriteFiles instead of DeleteFiles; the changes in DeleteFiles should be reverted.
  • I don't see a need for the enum to control the cleanup mode.
  • I'd consider having a separate action, but I can be convinced otherwise, especially given that we may account for partition stats in the future.
  • I'd consider the following algorithm (sketched below):
    • Extend the data_files and delete_files metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each DeleteFile object already has this info.
    • Query data_files, aggregate, and compute the min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
    • Query delete_files, potentially projecting only strictly required columns.
    • Join the summary with delete_files on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
    • Collect the result to the driver and use SparkDeleteFile to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.
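For illustration, a minimal sketch of this algorithm (assuming `spark` and a table identifier `tableName` are in scope, and that the metadata tables expose a `data_sequence_number` column; column and alias names here are assumptions):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.min;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Min data sequence number of live data files per (spec_id, partition).
Dataset<Row> minSeq =
    spark.table(tableName + ".data_files")
        .groupBy(col("spec_id"), col("partition"))
        .agg(min("data_sequence_number").alias("min_data_seq"));

Dataset<Row> deleteFiles = spark.table(tableName + ".delete_files");

Column sameSpecAndPartition =
    deleteFiles.col("spec_id").equalTo(minSeq.col("spec_id"))
        .and(deleteFiles.col("partition").equalTo(minSeq.col("partition")));

// Position deletes (content = 1) apply to data files with data_seq <= delete_seq,
// so they dangle when strictly below the partition minimum; equality deletes
// (content = 2) apply to data_seq < delete_seq, so they dangle at or below it.
// A left join also catches partitions that have deletes but no live data files.
Dataset<Row> dangling =
    deleteFiles
        .join(minSeq, sameSpecAndPartition, "left")
        .where(
            minSeq.col("min_data_seq").isNull()
                .or(deleteFiles.col("content").equalTo(1)
                    .and(deleteFiles.col("data_sequence_number").lt(minSeq.col("min_data_seq"))))
                .or(deleteFiles.col("content").equalTo(2)
                    .and(deleteFiles.col("data_sequence_number").leq(minSeq.col("min_data_seq")))));
```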


@dramaticlly (Contributor, author) commented Feb 27, 2024, quoting the review above:

Okay, I did one pass and here are my high-level notes:

  • We should use RewriteFiles instead of DeleteFiles, changes in DeleteFiles should be reverted.

  • I don't see a need for the enum to control the cleanup mode.

  • I'd consider having a separate action but I can be convinced otherwise. Especially, given that we may account for partition stats in the future.

  • I'd consider the following algorithm:

    • Extend data_files and delete_files metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each DeleteFile object already has this info.
    • Query data_files, aggregate, compute min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
    • Query delete_files, potentially projecting only strictly required columns.
    • Join the summary with delete_files on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
    • Collect the result to the driver and use SparkDeleteFile to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.

Based on Anton's feedback, I will try to divide the changes into two PRs: the first PR (#9813) adds data sequence number support to the data and delete files metadata tables. Once that is merged, I will update this PR to scan data_files first and aggregate the per-spec/partition min data sequence number, then compare it against delete_files. With a left join, we can identify dangling deletes and remove them in one pass. SparkDeleteFile will be used to convert each Spark row to a POJO used for pruning, taking partition evolution into account. Finally, dangling deletes will be removed by reconstruction instead of by file path, to benefit from manifest pruning when the Iceberg table is scanned (a sketch of this removal step follows below).
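A rough sketch of that removal step (assumptions: `table` is the Iceberg Table, `danglingRows` is the List of Rows collected from the join above, and the SparkDeleteFile wiring shown is hypothetical and may not match its actual constructor):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Partitioning;
import org.apache.iceberg.RewriteFiles;
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.spark.source.SparkDeleteFile;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Rebuild DeleteFile objects from the collected rows (reconstruction rather than
// raw file paths) so later table scans still benefit from manifest pruning.
Types.StructType combinedType = DataFile.getType(Partitioning.partitionType(table));
StructType sparkType = SparkSchemaUtil.convert(new Schema(combinedType.fields()));

List<DeleteFile> danglingDeletes =
    danglingRows.stream()
        .map(row -> new SparkDeleteFile(combinedType, sparkType).wrap(row)) // hypothetical ctor
        .collect(Collectors.toList());

// Drop the dangling deletes in a single commit via RewriteFiles.
RewriteFiles rewriteFiles = table.newRewrite();
danglingDeletes.forEach(rewriteFiles::deleteFile);
rewriteFiles.commit();
```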

@zinking (Contributor) commented Feb 27, 2024

"Finally, dangling delete will be removed by reconstruction instead of by file path, to benefit manifest pruning when iceberg table was scanned."

I guess only partitionData and path are needed; the others are all unused.

Optionally can be enabled as part of RewriteDataFilesSparkAction

Co-authored-by: Szehon Ho <[email protected]>
@dramaticlly (Contributor, author):
With the merge of #10203, I refactored the algorithm a bit to scan the entries table to compute minSequenceNumberPerPartitionAndSpec, and to read the data sequence number from the delete files table instead of relying on it as a virtual column (a sketch of the aggregation follows below). I also identified and fixed the problem in the partition evolution tests, so that is now all handled correctly. Would you like to take another look? @szehon-ho @aokolnychyi
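As a rough illustration of the entries-based aggregation (a sketch only; the filter constants follow the entries metadata table schema, where status 2 means DELETED and data_file.content 0 means a data file, but treat the exact column paths as assumptions):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.min;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Min data sequence number per (spec_id, partition), computed from live entries
// in the entries metadata table instead of a virtual column on delete_files.
Dataset<Row> minSequenceNumberPerPartitionAndSpec =
    spark.table(tableName + ".entries")
        .filter(col("status").notEqual(2)) // skip DELETED entries
        .filter(col("data_file.content").equalTo(0)) // data files only
        .groupBy(col("data_file.spec_id"), col("data_file.partition"))
        .agg(min(col("sequence_number")).alias("min_data_sequence_number"));
```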

@szehon-ho (Member) left a comment:

Looks mostly good to me, did a review round on the code portion.

- Rewording documentation and add more comments
- Changed removed deletes in results from List to Iterable to save on memory
- Added BaseRemoveDanglingDeleteFiles and generate Immutable implementation
@szehon-ho (Member) left a comment:

Thanks @dramaticlly! Left some more minor comments, mostly in the doc part as it's a bit of a complex algorithm, but also some others.

- Instantiate spark action without clone session
- Update javadoc to use html order list
- Inline resultBuilder in RewriteDataFilesSparkAction
@szehon-ho (Member) left a comment:

This looks great to me. @dramaticlly, one more comment.

# Conflicts:
#	api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
@szehon-ho merged commit a8ec43d into apache:main on Oct 22, 2024
@szehon-ho (Member):

Merged, thanks @dramaticlly !

@dramaticlly deleted the danglingDeletes branch on October 22, 2024 23:14
@dramaticlly (Contributor, author):

Thanks everyone for the review and input. Special thanks to @aokolnychyi for the optimized algorithm and to @szehon-ho as the original author and for the detailed review!

dramaticlly added a commit to dramaticlly/iceberg that referenced this pull request Oct 23, 2024
amogh-jahagirdar pushed a commit that referenced this pull request Oct 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
parthchandra pushed a commit to parthchandra/iceberg that referenced this pull request Oct 22, 2025
* Core, Spark 3.5: Remove dangling deletes as part of RewriteDataFilesAction (apache#9724)

* Spark 3.4: Action to remove dangling deletes (apache#11377)

* SpotlessApply

---------

Co-authored-by: Hongyue/Steve Zhang <[email protected]>