Core: Handle statistics file clean up from expireSnapshots #6090
Conversation
cc: @findepi, @rdblue, @szehon-ho
I think we should change the label of the
Sorry, I still didn't get how the query engine will figure out the statistics file for the current snapshot (when the snapshot is reused). @rdblue: What do you think about this?
Force-pushed 13ae3e0 to be54044
Force-pushed be54044 to 36e99a4
Force-pushed 911ed3d to 7d9dae0
@rdblue, @findepi, @amogh-jahagirdar: Handled the comments. Please take a look at it again.
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java (resolved review threads)
return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
}
private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)
I don't think there's a reason to pass table to this method. I think this should accept a String location, a FileIO, and a snapshot ID.
This should also not use File for writing.
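A minimal sketch of the suggested shape, with hypothetical names: `writeStatsFile` and the stubbed `OutputFile`/`FileIO`/`InMemoryFileIO` types below only stand in for Iceberg's real interfaces so the example is self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Stand-ins for Iceberg's OutputFile/FileIO, stubbed so this sketch is self-contained.
interface OutputFile {
  OutputStream create();
  String location();
}

interface FileIO {
  OutputFile newOutputFile(String location);
}

// A hypothetical in-memory FileIO, only for demonstration.
class InMemoryFileIO implements FileIO {
  final Map<String, ByteArrayOutputStream> files = new HashMap<>();

  public OutputFile newOutputFile(String location) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    files.put(location, buffer);
    return new OutputFile() {
      public OutputStream create() { return buffer; }
      public String location() { return location; }
    };
  }
}

public class StatsFileWriterSketch {
  // The suggested signature: a String location, a FileIO, and a snapshot ID,
  // instead of a Table and a java.io.File.
  static String writeStatsFile(String location, FileIO io, long snapshotId) {
    OutputFile out = io.newOutputFile(location);
    try (OutputStream stream = out.create()) {
      // The real implementation would write Puffin stats content for snapshotId here.
      stream.write(("stats-for-" + snapshotId).getBytes());
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return out.location();
  }
}
```

Decoupling the helper from `Table` and `java.io.File` lets the same code run against any storage the `FileIO` abstracts, not just the local filesystem.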
This was taken from existing code:
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java#L917
I will modify the current test case.
table.newAppend().appendFile(FILE_B).commit();
// Note: RewriteDataFiles can reuse statistics files across operations.
// This test reuses stats for append just to mimic this scenario without having to run
// RewriteDataFiles.
Does this actually happen in RewriteDataFiles? I don't think that the same stats file should be added more than once. It's a good idea to make sure it doesn't, but that should not be the behavior of built-in operations.
@findepi mentioned reusing the stats file. I think we should not allow it, because concurrent operations can add extra stats during a rewrite operation.
We don't have any engine integration with stats in this repo, so I wrote "can reuse" instead of "will reuse".
I think this is inaccurate then. It should be enough to state that in the event that a stats file is for some reason reused, we want to detect that it is still referenced and not delete it from the file system.
OK, I rephrased it.
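The behavior being discussed, keeping a reused stats file alive as long as any surviving snapshot references it, can be sketched with plain collections. `statsPathsToDelete` and its inputs are hypothetical names, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StatsCleanupSketch {
  // Given statistics entries (snapshot ID -> stats file path) and the snapshot
  // IDs that survive expiration, return the paths that are safe to delete.
  // A path is kept if ANY surviving snapshot still references it, which covers
  // the case where one stats file was (for whatever reason) reused.
  static Set<String> statsPathsToDelete(
      Map<Long, String> statsBySnapshot, Set<Long> liveSnapshotIds) {
    // First pass: collect every path still referenced by a live snapshot.
    Set<String> stillReferenced = new HashSet<>();
    for (Map.Entry<Long, String> entry : statsBySnapshot.entrySet()) {
      if (liveSnapshotIds.contains(entry.getKey())) {
        stillReferenced.add(entry.getValue());
      }
    }

    // Second pass: delete only paths of expired snapshots that nothing live uses.
    Set<String> toDelete = new HashSet<>();
    for (Map.Entry<Long, String> entry : statsBySnapshot.entrySet()) {
      if (!liveSnapshotIds.contains(entry.getKey())
          && !stillReferenced.contains(entry.getValue())) {
        toDelete.add(entry.getValue());
      }
    }
    return toDelete;
  }
}
```

For example, if snapshots 1 and 2 share one stats file and only snapshot 2 survives, that shared file must not be deleted even though snapshot 1 expired.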
Thanks, @ajantha-bhat! I left some comments on the tests to fix.
Force-pushed 15f9794 to fad3899
Thanks for the review. I have addressed the comments.
If the changes are OK, please merge this PR so that I can rebase #6091 and make it ready for review.
}
}

private StatisticsFile reuseStatsForCurrentSnapshot(
This is not for the "current" snapshot because the snapshot ID is being passed in.
When there are problems that need to be fixed in multiple places, I might just mention it once to avoid unnecessary repetition. So to keep PRs moving faster, you should always look for similar cases that also need to be fixed.
ACK.
Apologies for the back and forth. This was introduced during refactoring.
rdblue left a comment:
@ajantha-bhat, looks like there are just two more things to fix. Thanks!
Force-pushed fad3899 to e0f51fd
Done. Thanks for the review.
@jackye1995: Can you please consider this for the 1.2.0 release?
}

private void commitStats(Table table, StatisticsFile statisticsFile) {
  table.updateStatistics().setStatistics(statisticsFile.snapshotId(), statisticsFile).commit();
I find it odd that a single line like this is in a separate method. Seems like this could be inlined and would make the tests more readable.
Thanks, @ajantha-bhat. Good to have this in.
Currently, statistics files are safeguarded against orphan-files cleanup, but they are never cleaned up from table metadata or from storage once the snapshots are expired/deleted.
Hence, this PR adds handling for statistics file cleanup during snapshot expiration.
Note that this is just API-level cleanup (table#expireSnapshots). Cleanup from the expire-snapshots Spark action/procedure will be built on top of it in a follow-up PR.
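The metadata side of the change can be illustrated with a toy model: expiring a snapshot also drops its statistics entry from table metadata and reports the stats file for physical deletion. `ExpireSnapshotsSketch` and its fields are hypothetical, not Iceberg's actual classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExpireSnapshotsSketch {
  // Toy "table metadata": live snapshot IDs plus one statistics entry per snapshot.
  final Set<Long> snapshots = new HashSet<>();
  final Map<Long, String> statsFiles = new HashMap<>();

  void addSnapshot(long id, String statsPath) {
    snapshots.add(id);
    statsFiles.put(id, statsPath);
  }

  // Expire a snapshot: remove it from metadata and, as this PR adds, also
  // drop its statistics entry and report the stats file for deletion.
  List<String> expire(long id) {
    List<String> deletedStats = new ArrayList<>();
    if (snapshots.remove(id)) {
      String path = statsFiles.remove(id);
      if (path != null) {
        deletedStats.add(path); // in real code this would be a FileIO delete
      }
    }
    return deletedStats;
  }
}
```

Before this PR, only the snapshot side of this bookkeeping happened; the statistics entry and file were left behind.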