Core: Support removeUnusedSortOrders in ExpireSnapshots #13975
* read manifest-list file once
* read both dataManifest and deleteManifest
gaborkaszab
left a comment
Thank you for submitting a PR for this, @munendrasn! In general I think cleaning up any kind of unused metadata makes sense.
After taking a quick look at the approach, I'm a bit hesitant with this one, though. The reason is that in order to get the used sort order IDs we have to read all the manifest entries within all the retained snapshots. If I'm not mistaken, these reads go to storage. At first glance this seems very heavyweight for a metadata cleanup, and I'm not sure the extra computation cost is worth the gain here.
Do you have any measurements of how much extra runtime this adds to the metadata cleanup process?
amogh-jahagirdar
left a comment
+1 to @gaborkaszab's point. Beyond this being a spec change, in the proposed implementation it looks like we're always additionally reading all the manifests just to obtain any sort orders that are referenced. I'm doubtful this is really worth it just for cleaning up sort orders (at the very least, this traversal should only happen if the flag for cleaning up additional metadata is set).
In the past I think we've discussed as a community that catalogs should really just take responsibility for cleaning up this additional metadata. The benefit is that we avoid implementing more complex logic on the client.
@gaborkaszab @amogh-jahagirdar
Not at the moment, but it would be a function of the number of manifest files and data files.
The sort order used for a write is only available at the ContentFile level, as per the spec. Except in the case of equality deletes, both Spark and Flink seem not to add the sortOrderId to DataFile.
Having a flag would be helpful to consumers: they can configure the sort-order expiry to run only when required. Should we have this behavior for all other metadata (schema, spec) too?
I assume you are referring to the REST OpenAPI spec. Please correct me if that's not the case.
As a follow-up: to sum up, @amogh-jahagirdar and I both feel that the cost of cleaning up sort orders is too high compared to the gain. Do you have a specific use case where the unused sort orders take up so much space that it becomes a problem? Answering some questions that came up and adding some remarks:
But all in all, before taking any further steps, I think we should weigh the cost against the gain. I'm not convinced that all these extra storage reads are worth doing to clean up sort orders, unless there is a specific, practical use case that we are solving. Maybe bringing this up on the community sync (or writing to dev@) could give more clarity on whether there is community support for this.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Based on the previous PRs for
Traverses and gets the sortOrder from DataFiles.
sortOrder is marked as optional on ContentFile and seems not to be set except in the case of EqualityDeletes.
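For readers following the cost discussion, the traversal can be sketched roughly as below. This is only an illustrative in-memory model, not the code in this PR and not the Iceberg API: `SortOrderCleanupSketch`, its nested `ContentFile`, `Manifest`, and `Snapshot` classes, and the `cleanupEnabled` flag are all hypothetical names. It assumes the default sort order is always retained and that a missing sort order ID is modeled as `null`. In a real table each manifest would be read from storage, which is where the cost raised by the reviewers comes from.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model; none of these types are Iceberg classes.
class SortOrderCleanupSketch {

  static final class ContentFile {
    final String path;
    final Integer sortOrderId; // optional: null when the writer did not record it

    ContentFile(String path, Integer sortOrderId) {
      this.path = path;
      this.sortOrderId = sortOrderId;
    }
  }

  static final class Manifest {
    final List<ContentFile> files;

    Manifest(List<ContentFile> files) {
      this.files = files;
    }
  }

  static final class Snapshot {
    final List<Manifest> manifests; // data and delete manifests together

    Snapshot(List<Manifest> manifests) {
      this.manifests = manifests;
    }
  }

  // Visits every content file of every retained snapshot and collects the
  // sort order IDs that are actually set. Cost grows with manifests + files.
  static Set<Integer> usedSortOrderIds(List<Snapshot> retained) {
    Set<Integer> used = new TreeSet<>();
    for (Snapshot snapshot : retained) {
      for (Manifest manifest : snapshot.manifests) {
        for (ContentFile file : manifest.files) {
          if (file.sortOrderId != null) {
            used.add(file.sortOrderId);
          }
        }
      }
    }
    return used;
  }

  // Returns the IDs that could be dropped. The traversal is skipped entirely
  // unless the (hypothetical) cleanup flag is set.
  static Set<Integer> removableSortOrderIds(
      Set<Integer> allIds, int defaultId, List<Snapshot> retained, boolean cleanupEnabled) {
    if (!cleanupEnabled) {
      return new TreeSet<>();
    }
    Set<Integer> removable = new TreeSet<>(allIds);
    removable.removeAll(usedSortOrderIds(retained));
    removable.remove(defaultId); // assumption: the default order is never removed
    return removable;
  }

  public static void main(String[] args) {
    List<Snapshot> retained = List.of(
        new Snapshot(List.of(
            new Manifest(List.of(
                new ContentFile("eq-delete-1.parquet", 2),
                new ContentFile("data-1.parquet", null))))));
    System.out.println(removableSortOrderIds(Set.of(1, 2, 3, 4), 1, retained, true));
  }
}
```

Note that a file with no recorded sort order ID contributes nothing to the used set, so if writers mostly leave the field unset, this kind of check would treat nearly every non-default sort order as removable.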