@ajantha-bhat
Member

@ajantha-bhat ajantha-bhat commented Nov 1, 2022

Currently, statistics files are safeguarded against orphan-file cleanup, but they are never removed from table metadata or from storage once their snapshots are expired or deleted.

This PR therefore handles statistics file cleanup during expire_snapshot.

Note that this covers only API-level cleanup (table#expireSnapshots).

Cleanup from the expire-snapshots Spark action/procedure will be built on top of it in a follow-up PR.
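
To make the intended behavior concrete, here is a minimal sketch of the metadata side of the cleanup. It is a simplified model, not the Iceberg implementation: the class `StatsMetadataExpirySketch`, the method `retainStats`, and the snapshot-ID-to-path map are all hypothetical stand-ins for the statistics entries kept in table metadata.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model: statistics entries are represented as
// snapshot ID -> statistics file path. Not an Iceberg class.
final class StatsMetadataExpirySketch {
  // Returns the entries that remain in table metadata after expiration:
  // only the entries whose snapshot is still live are kept.
  static Map<Long, String> retainStats(
      Map<Long, String> statsBySnapshotId, Set<Long> liveSnapshotIds) {
    Map<Long, String> retained = new TreeMap<>();
    for (Map.Entry<Long, String> entry : statsBySnapshotId.entrySet()) {
      if (liveSnapshotIds.contains(entry.getKey())) {
        retained.put(entry.getKey(), entry.getValue());
      }
    }
    return retained;
  }
}
```

Under this model, expiring a snapshot drops its statistics entry from metadata; whether the underlying file can also be deleted is a separate question, since another entry may still point at the same path.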

@github-actions github-actions bot added the core label Nov 1, 2022
@ajantha-bhat
Member Author

cc: @findepi, @rdblue, @szehon-ho

@ajantha-bhat
Member Author

@findepi: Thinking more about this: TableMetadata has just the list of StatisticsFile, and you have mentioned that statisticsFile.snapshotId() is "ID of the Iceberg table's snapshot the statistics were computed from".
So, how will the query know which statistics file to use for the current snapshot? (In the case of rewrite data files, the current snapshot ID may not be present in that list of statistics files.)

@rdblue, @findepi: Please help clear up my doubt above.
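
The concern above can be illustrated with a small sketch. It assumes a simplified model in which statistics are looked up strictly by the snapshot ID they were computed from; `StatsLookupSketch` and `statsPathFor` are hypothetical names, not part of the Iceberg API.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model: statistics entries keyed by the snapshot ID
// they were computed from. Not an Iceberg class.
final class StatsLookupSketch {
  // Finds the statistics file registered for exactly this snapshot ID.
  // Returns empty when no entry exists for it, which is what happens for a
  // newer snapshot (for example, one produced by a rewrite) under this model.
  static Optional<String> statsPathFor(Map<Long, String> statsBySnapshotId, long snapshotId) {
    return Optional.ofNullable(statsBySnapshotId.get(snapshotId));
  }
}
```

If statistics were computed for snapshot 10 and a later operation creates snapshot 11 without registering anything, a lookup for 11 comes back empty even though the statistics for 10 may still describe the same data.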

@findepi
Member

findepi commented Nov 7, 2022

I think we should change the label of the snapshot-id entry in https://iceberg.apache.org/spec/#table-statistics (to level, not blob level)

@ajantha-bhat
Member Author

> I think we should change the label of the snapshot-id entry in https://iceberg.apache.org/spec/#table-statistics (to level, not blob level)

Sorry, I still don't see how the query engine will figure out the statistics file for the current snapshot (when the snapshot is reused).
Instead of the suggested change, can we change statisticsFile.snapshotId() to the snapshot ID of the referring snapshot? This way, TableMetadata will have entries for each snapshot ID (even for the reuse case). Snapshot file path can be reused.

@rdblue: What do you think about this?

@ajantha-bhat ajantha-bhat marked this pull request as draft December 12, 2022 23:44
@ajantha-bhat ajantha-bhat force-pushed the stats_expire branch 3 times, most recently from 911ed3d to 7d9dae0 Compare December 13, 2022 09:33
@ajantha-bhat ajantha-bhat marked this pull request as ready for review December 13, 2022 09:35
@ajantha-bhat
Member Author

@rdblue, @findepi, @amogh-jahagirdar: Handled the comments. Please take a look at it again.
Also, #6267 is ready.

return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
}

private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)
Contributor

I don't think there's a reason to pass table to this method. I think this should accept a String location, a FileIO, and a snapshot ID.

This should also not use File for writing.

Member Author

table.newAppend().appendFile(FILE_B).commit();
// Note: RewriteDataFiles can reuse statistics files across operations.
// This test reuses stats for append just to mimic this scenario without having to run
// RewriteDataFiles.
Contributor

Does this actually happen in RewriteDataFiles? I don't think that the same stats file should be added more than once. It's a good idea to make sure it doesn't, but that should not be the behavior of built-in operations.

Member Author

@findepi mentioned reusing the stats file. I think we should not allow it, because concurrent operations can add extra stats during a rewrite operation.

We don't have any engine integration with stats in this repo, so I wrote "can reuse" instead of "will reuse".

Contributor

I think this is inaccurate then. It should be enough to state that in the event that a snapshot file is for some reason reused, we want to detect that it is still referenced and not delete it from the file system.

Member Author

OK, rephrased it.
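
The requirement discussed in this thread, that a file still referenced by a retained snapshot must not be deleted from the file system, can be sketched as follows. This is a simplified model under stated assumptions, not the code from this PR: `StatsFileDeletionSketch`, `pathsToDelete`, and the snapshot-ID-to-path map are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: statistics entries are snapshot ID -> file path,
// and the same path may appear under more than one snapshot ID.
final class StatsFileDeletionSketch {
  // Returns the file paths that are safe to delete: paths referenced by
  // expired snapshots, minus any path a retained snapshot still references.
  static Set<String> pathsToDelete(
      Map<Long, String> statsBySnapshotId, Set<Long> expiredSnapshotIds) {
    Set<String> expiredPaths = new HashSet<>();
    Set<String> livePaths = new HashSet<>();
    for (Map.Entry<Long, String> entry : statsBySnapshotId.entrySet()) {
      if (expiredSnapshotIds.contains(entry.getKey())) {
        expiredPaths.add(entry.getValue());
      } else {
        livePaths.add(entry.getValue());
      }
    }
    // A reused file survives as long as one retained entry points at it.
    expiredPaths.removeAll(livePaths);
    return expiredPaths;
  }
}
```

Comparing paths rather than snapshot IDs is the design choice that matters here: the check stays correct even if some operation registers the same file under more than one snapshot.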

@rdblue
Contributor

rdblue commented Jan 31, 2023

Thanks, @ajantha-bhat! I made some comments in tests to fix.

@ajantha-bhat
Member Author

> Thanks, @ajantha-bhat! I made some comments in tests to fix.

Thanks for the review. I have addressed the comments.

@ajantha-bhat
Member Author

If the changes are OK, please merge this PR so that I can rebase #6091 and make it ready for review.

}
}

private StatisticsFile reuseStatsForCurrentSnapshot(
Contributor

This is not for the "current" snapshot because the snapshot ID is being passed in.

When there are problems that need to be fixed in multiple places, I might just mention it once to avoid unnecessary repetition. So to keep PRs moving faster, you should always look for similar cases that also need to be fixed.

Member Author

ACK.

Apologies for the back and forth. This was introduced during refactoring.

Contributor

@rdblue rdblue left a comment

@ajantha-bhat, looks like there are just two more things to fix. Thanks!

@ajantha-bhat
Member Author

> @ajantha-bhat, looks like there are just two more things to fix. Thanks!

Done. Thanks for the review.

@ajantha-bhat
Member Author

@jackye1995: Can you please consider this for the 1.2.0 release?

@jackye1995 jackye1995 added this to the Iceberg 1.2.0 milestone Feb 22, 2023
@jackye1995 jackye1995 self-requested a review February 22, 2023 17:49
}

private void commitStats(Table table, StatisticsFile statisticsFile) {
table.updateStatistics().setStatistics(statisticsFile.snapshotId(), statisticsFile).commit();
Contributor

I find it odd that a single line like this is in a separate method. Seems like this could be inlined and would make the tests more readable.

@rdblue rdblue merged commit 8592dee into apache:master Feb 22, 2023
@rdblue
Contributor

rdblue commented Feb 22, 2023

Thanks, @ajantha-bhat. Good to have this in.

@ajantha-bhat
Member Author

@rdblue: Thanks for the review and merge.

I have now rebased and reworked #6091 based on what I learned from this PR, so it is ready for review.

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025

5 participants