Core: Handle statistics file clean up from expireSnapshots #6090

ajantha-bhat · 2022-11-01T07:28:28Z

Currently, statistics files are safeguarded against orphan_files cleanup, but they are never removed from table metadata or from storage once their snapshots are expired or deleted.

Hence, this PR adds statistics file cleanup during expire_snapshot.

Note that this covers only the API-level cleanup (table#expireSnapshots).

Cleanup from the expire snapshots Spark action/procedure will be built on top of it in a follow-up PR.

ajantha-bhat · 2022-11-01T07:28:53Z

cc: @findepi, @rdblue, @szehon-ho

core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java

ajantha-bhat · 2022-11-07T06:31:31Z

@findepi: Thinking more about this: TableMetadata has just the list of StatisticsFile, and you mentioned that statisticsFile.snapshotId() is "ID of the Iceberg table's snapshot the statistics were computed from".
So how will the query know which statistics file to use for the current snapshot? (In the case of rewrite data files, the current snapshot ID may not be present in that list of statistics files.)

@rdblue, @findepi: Please help clear up my doubt above.

findepi · 2022-11-07T13:56:08Z

I think we should change the label of the snapshot-id entry in https://iceberg.apache.org/spec/#table-statistics (to level, not blob level)

ajantha-bhat · 2022-11-07T18:54:53Z

Sorry, I still don't see how the query engine will figure out the statistics file for the current snapshot (when the snapshot is reused).
Instead of the suggested change, can we change statisticsFile.snapshotId() to the snapshot ID of the referring snapshot? This way TableMetadata will have an entry for each snapshot ID (even in the reuse case). The snapshot file path can be reused.

@rdblue: What do you think about this?

ajantha-bhat · 2022-12-13T13:37:19Z

@rdblue, @findepi, @amogh-jahagirdar: Handled the comments. Please take a look at it again.
Also, #6267 is ready.

rdblue · 2022-12-20T16:05:03Z

core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java

    return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
  }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)


I don't think there's a reason to pass table to this method. I think this should accept a String location, a FileIO, and a snapshot ID.

This should also not use File for writing.

This was taken from existing code.
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java#L917

I will modify the current testcase.

rdblue · 2023-01-31T19:09:46Z

core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java

+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: RewriteDataFiles can reuse statistics files across operations.
+    // This test reuses stats for append just to mimic this scenario without having to run
+    // RewriteDataFiles.


Does this actually happen in RewriteDataFiles? I don't think that the same stats file should be added more than once. It's a good idea to make sure it doesn't, but that should not be the behavior of built-in operations.

@findepi mentioned reusing the stats file. I think we should not allow it, because concurrent operations can add extra stats during a rewrite operation.

We don't have any engine integration with stats in this repo, so I wrote "can reuse" instead of "will reuse".

I think this is inaccurate then. It should be enough to state that in the event that a snapshot file is for some reason reused, we want to detect that it is still referenced and not delete it from the file system.

ok. rephrased it.
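For readers following this thread, the rule being asked for (do not delete a statistics file from the file system while a retained snapshot still references it) can be sketched roughly as a set difference over file locations before and after expiration. This is only an illustration under stated assumptions: StatsEntry, StatsCleanupSketch, locations and locationsToDelete are hypothetical names made up for this sketch, and it does not use or reproduce Iceberg's StatisticsFile, TableMetadata, or the code in this PR.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one statistics file entry; not Iceberg's StatisticsFile. */
final class StatsEntry {
  final long snapshotId;
  final String path;

  StatsEntry(long snapshotId, String path) {
    this.snapshotId = snapshotId;
    this.path = path;
  }
}

/** Hypothetical helper illustrating the "still referenced" check. */
final class StatsCleanupSketch {
  private StatsCleanupSketch() {}

  /** Collects the distinct file locations referenced by the given entries. */
  static Set<String> locations(List<StatsEntry> entries) {
    Set<String> paths = new HashSet<>();
    for (StatsEntry entry : entries) {
      paths.add(entry.path);
    }
    return paths;
  }

  /**
   * Returns the locations that were referenced before expiration and are
   * referenced by no entry afterwards. A location shared with a retained
   * entry is never returned, so a reused file is kept.
   */
  static Set<String> locationsToDelete(List<StatsEntry> before, List<StatsEntry> after) {
    Set<String> expired = locations(before);
    expired.removeAll(locations(after));
    return expired;
  }
}
```

With entries for snapshots 1 and 2 both pointing at the same location and snapshot 3 pointing at another, expiring snapshots 1 and 3 would leave the shared location alone, because the entry for snapshot 2 still references it; only the location used solely by snapshot 3 would be returned for deletion.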

rdblue · 2023-01-31T19:10:49Z

Thanks, @ajantha-bhat! I made some comments in tests to fix.

ajantha-bhat · 2023-02-01T07:22:27Z

Thanks for the review. I have addressed the comments.

ajantha-bhat · 2023-02-05T17:24:24Z

If the changes are OK, please merge this PR so that I can rebase #6091 and make it ready for review.

rdblue · 2023-02-09T19:22:19Z

core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java

+    }
+  }
+
+  private StatisticsFile reuseStatsForCurrentSnapshot(


This is not for the "current" snapshot because the snapshot ID is being passed in.

When there are problems that need to be fixed in multiple places, I might just mention it once to avoid unnecessary repetition. So to keep PRs moving faster, you should always look for similar cases that also need to be fixed.

ACK.

Apologies for the back and forth. This was introduced during refactoring.

rdblue

@ajantha-bhat, looks like there are just two more things to fix. Thanks!

ajantha-bhat · 2023-02-10T04:39:51Z

Done. Thanks for the review.

ajantha-bhat · 2023-02-17T12:43:09Z

@jackye1995: Can you please consider this for the 1.2.0 release?

rdblue · 2023-02-22T18:16:26Z

core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java

+  }
+
+  private void commitStats(Table table, StatisticsFile statisticsFile) {
+    table.updateStatistics().setStatistics(statisticsFile.snapshotId(), statisticsFile).commit();


I find it odd that a single line like this is in a separate method. Seems like this could be inlined and would make the tests more readable.

rdblue · 2023-02-22T18:19:04Z

Thanks, @ajantha-bhat. Good to have this in.

ajantha-bhat · 2023-02-23T06:52:52Z

@rdblue: Thanks for the review and merge.

I have now rebased and reworked #6091 based on what I learned from this PR, so it is ready for review.

github-actions bot added the core label Nov 1, 2022

ajantha-bhat mentioned this pull request Nov 1, 2022

Spark-3.3: Handle statistics file clean up from expireSnapshots action/procedure #6091

Merged

amogh-jahagirdar reviewed Nov 3, 2022

findepi reviewed Nov 4, 2022

ajantha-bhat mentioned this pull request Nov 24, 2022

Docs: Update spec about statistics file snapshot id #6267

Merged

ajantha-bhat force-pushed the stats_expire branch from 13ae3e0 to be54044 Compare December 5, 2022 12:02

rdblue reviewed Dec 8, 2022

ajantha-bhat force-pushed the stats_expire branch from be54044 to 36e99a4 Compare December 9, 2022 07:22

ajantha-bhat marked this pull request as draft December 12, 2022 23:44

ajantha-bhat force-pushed the stats_expire branch 3 times, most recently from 911ed3d to 7d9dae0 Compare December 13, 2022 09:33

ajantha-bhat marked this pull request as ready for review December 13, 2022 09:35

amogh-jahagirdar reviewed Dec 13, 2022
