-
Notifications
You must be signed in to change notification settings - Fork 3k
Core, API: Support cleanupMode in snapshot expiration #14287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Update cleanup strategies to support retaining data files during snapshot expiration
| * @param retain true to retain orphaned data files only reachable by expired snapshots | ||
| * @return this for method chaining | ||
| */ | ||
| default ExpireSnapshots retainOrphanedDataFiles(boolean retain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this isn't necessary to achieve the behavior of "clean up the metadata files but keep the data/delete files". There's a deleteWith API which allows for custom functions to delete based on paths. A client could just pass in a function which based on if a path is a path in the metadata location clean it up, otherwise don't do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or the passed in logic can be inverted, to skip cleanup if it's in the data location or some other heuristic but I think the point still stands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @amogh-jahagirdar ! We actually explored that option and here's what we find
- use retainOrphanedDataFiles option actually speed up the clean up process by avoiding open and read the manifest files. If only metadata (like manifest-list and manifest) are considered for clean up, then we don't need to read the manifest entries, which is usually the bottleneck and requires work distribution. That's why most snapshot expiration is taken cared of in Spark action and procedures
- use DeleteWith consumer currently only provides a file path represented in String, we can use its file suffix to differentiate metadata and data files, but with introduction of Parquet Manifest - POC #13769, we can no longer rely on
.parquetalone to tell. We can still probably rely on checking$tableLocation/data/as part of file path but this is mostly conventional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dramaticlly , I mostly meant just using the table data location not a suffix as that wouldn't be sufficient but you're right that even the table data location is based on convention.
I agree though that specifying a custom cleanup function isn't the most ideal way to solve this use case.
I think the best argument for this kind of option is reducing costs (reducing unnecessary requests to read manifests, and the compute running when doing so) for cases where we know we only want to cleanup metadata and retain the data file.
The way I look at expressing this kind of logic is a bit different; rather than expressing which files we intend to retain, I look at it as which files should we cleanup. so something like a cleanExpiredFiles(CleanupMode mode)
and the options for CleanupMode like ALL, METADATA_ONLY, NONE. I think on this path we'd deprecate the cleanExpiredFiles(boolean) option as well, no point keeping both APIs?
If we were to go with the current approach we'd have to define precedence if someone uses the existing cleanExpiredFiles and the retainDataFiles, which is why it seems cleaner to me to define a cleanExpiredFiles API with some modes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amogh-jahagirdar , I think what you proposed makes sense, as today retainDataFiles need to have prerequisites of cleanExpiredFiles=true.
But since those are public API on interface, deprecation and removal will need to go through cycles. I can update to use CleanupMode behind the scene.
| * @param retain true to retain orphaned data files only reachable by expired snapshots | ||
| * @return this for method chaining | ||
| */ | ||
| default ExpireSnapshots retainOrphanedDataFiles(boolean retain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dramaticlly , I mostly meant just using the table data location not a suffix as that wouldn't be sufficient but you're right that even the table data location is based on convention.
I agree though that specifying a custom cleanup function isn't the most ideal way to solve this use case.
I think the best argument for this kind of option is reducing costs (reducing unnecessary requests to read manifests, and the compute running when doing so) for cases where we know we only want to cleanup metadata and retain the data file.
The way I look at expressing this kind of logic is a bit different; rather than expressing which files we intend to retain, I look at it as which files should we cleanup. so something like a cleanExpiredFiles(CleanupMode mode)
and the options for CleanupMode like ALL, METADATA_ONLY, NONE. I think on this path we'd deprecate the cleanExpiredFiles(boolean) option as well, no point keeping both APIs?
If we were to go with the current approach we'd have to define precedence if someone uses the existing cleanExpiredFiles and the retainDataFiles, which is why it seems cleaner to me to define a cleanExpiredFiles API with some modes?
deprecate cleanExpiredFiles method and update tests
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dramaticlly! the core logic looks right to me, just had some comments about if we could just undo any unrelated test refactorings in favor of another PR, and some other minor comments.
| public ExpireSnapshots cleanExpiredFiles(boolean clean) { | ||
| this.cleanExpiredFiles = clean; | ||
| LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead."); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about throwing here, why not just override whatever the cleanup mode was set to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Eduard, I added such validation mainly want to prevent double intention where one set both the cleanExpired files as well as cleanupMode, so intention is not clear and override might be risky, this could lead to some unexpected results so throw early maybe helpful.
This can be found more in my unit test named testCannotSetCleanExpiredFilesAndCleanModeTogether in https://github.com/apache/iceberg/pull/14287/files?new_files_changed=true#diff-35ed4072da58b6d638da909476780a4da8d97390bd56e41f8555b15b91499b64R2071-R2091.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems ok to throw an exception here because we don't want to users to call both setters. only one should be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the condition may not be precise/perfect. e.g., if the client called cleanupLevel(ALL) (default value), it would still allow this method to go through.
the other way is to default cleanupLevel to null. but we would need to do a bit if-else check during read to apply the value.
maybe the complexity is not worth it. I am wondering if it is simpler to just go with what @nastra suggested. just rely on API deprecation to move users away from the old API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's some nuance here and value to keep:
Take what I included in the unit test for an example here, use preconditions check here prevent the case where both cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY) and cleanExpiredFiles(false) is configured on snapshot expiration, as the intention is unclear at the moment, override to either NONE or METADATA_ONLY could potentially result in undesired results.
Although unlikely, for the corner case discussed here when client called cleanupLevel(ALL) and also set the cleanExpiredFiles
- if cleanExpiredFiles = true, then cleanupLevel ends up resolve to ALL and logic is equivalent
- if cleanExpiredFiles = false, we allow such override to happen and end up with cleanupLevel=None and retain all files, I think it's acceptable as we are moving from most restrictive and least restrictive, and those files can be later cleaned with orphan removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current impleemntation is reasonable so long as there's no breakages when someone upgrades to 1.11 and using the deprecated API, which it looks like there's not since the condition is based on whether someone additionally set the new API. I also prefer the approach of failing if a non-default cleanup level is set and the old one is also set because it's pretty unlikely a user intended to do that and it forces them to resolve that ambiguity that @dramaticlly mentioned.
| @Override | ||
| public ExpireSnapshots cleanMode(CleanupMode mode) { | ||
| Preconditions.checkNotNull(mode, "CleanupMode cannot be null"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as mentioned earlier about throwing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this one, I do think this is a case where we should just be consistent with what's done in the other options in this API, which is just override what was previously set. e.g. we can set cleanExpiredFiles multiple times, and it'll just take the last. Any reason why this particular one should be different and throw @dramaticlly ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this mostly want to avoid ambiguous intention when user set both options together regardless of which option is chained first.
i.e both shall fail with exception to ask user for further clarification
table.expireSnapshots()
.cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY)
.cleanExpiredFiles(false)and
table
.expireSnapshots()
.cleanExpiredFiles(false)
.cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY)Updated preconditions to better assertion condition and message
| public ExpireSnapshots cleanExpiredFiles(boolean clean) { | ||
| this.cleanExpiredFiles = clean; | ||
| LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead."); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems ok to throw an exception here because we don't want to users to call both setters. only one should be used.
Address feedback from Amogh, Eduard and Steven
|
@nastra @amogh-jahagirdar @stevenzwu ready for another look |
| public ExpireSnapshots cleanExpiredFiles(boolean clean) { | ||
| this.cleanExpiredFiles = clean; | ||
| LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead."); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current impleemntation is reasonable so long as there's no breakages when someone upgrades to 1.11 and using the deprecated API, which it looks like there's not since the condition is based on whether someone additionally set the new API. I also prefer the approach of failing if a non-default cleanup level is set and the old one is also set because it's pretty unlikely a user intended to do that and it forces them to resolve that ambiguity that @dramaticlly mentioned.
| @Override | ||
| public ExpireSnapshots cleanMode(CleanupMode mode) { | ||
| Preconditions.checkNotNull(mode, "CleanupMode cannot be null"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this one, I do think this is a case where we should just be consistent with what's done in the other options in this API, which is just override what was previously set. e.g. we can set cleanExpiredFiles multiple times, and it'll just take the last. Any reason why this particular one should be different and throw @dramaticlly ?
|
@amogh-jahagirdar @nastra can you take another look with the latest changes? |
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my side, this looks good just minor stuff, thank you @dramaticlly for this improvement!
| super(fileIO, deleteExecutorService, planExecutorService, deleteFunc); | ||
| } | ||
|
|
||
| /** {@inheritDoc} */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Is this required? My understanding is that this is for subclasses so it inherits the parent docs and then adding additional context for the child class afterwards. I thought by default, if we just want the parent javadocs, we'll automatically inherit those? i.e. I thought only having inheritDoc by itself isn't required. It's useful when it's followed by some additional child class docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
thanks @dramaticlly for the improvement, and @amogh-jahagirdar @nastra for the review. |

This patch add
retainOrphanedDataFilesto the expire snapshots interface to allow retain the orphaned data files and also update both cleanup strategies. Most of time, we want to remove such data files as part of file clean up followed by snapshot expiration as those files are no longer referenced by any active snapshots.However, when the underlying data files are shared/shallow copied to other table, or when parquet files being added by the
add-filesprocedure https://iceberg.apache.org/docs/latest/spark-procedures/#add_files, we might want to keep the data files.This option is disabled by default
Can you help take a look? @stevenzwu @amogh-jahagirdar