Skip to content

Conversation

@munendrasn
Copy link
Contributor

Based on the previous PRs for

Traverses and gets the sortOrder from DataFiles. sortOrder is marked as optional on ContentFile, seems not set for except in case of EqualityDeletes

* read manifest-list file once
* read both dataManifest and deleteManifest
@github-actions github-actions bot added the core label Sep 2, 2025
Copy link
Collaborator

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting a PR for this, @munendrasn ! In general I think cleaning up any kind of unused metadata makes sense.

After taking a quick look at the approach, I'm a bit hesitant with this one, though. The reason is that in order to get the used sort order IDs we have to read into all the manifest entries within all the retained snapshots. If I'm not mistaken these reads are from storage. At first glance this seems very heavyweight for a metadata cleanup and I'm not sure if the extra computation costs is worth the gain here.
Do you have any measurements how much extra runtime this puts to the metadata cleanup process?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @gaborkaszab point. Beyond this being a spec change, in the proposed implementation it looks like we're always additionally reading all the manifests just to obtain any sort orders that are referenced. I'm doubtful this is really worth it just for cleaning up sort orders (at the very very least this traversal should only happen if the flag for cleaning up additional metadata is set).

In the past I think we've discussed as a community that really catalogs should just take the responsibility of cleaning this additional metadata up. The benefit is we avoid implementing more complex logic on the client.

@munendrasn
Copy link
Contributor Author

@gaborkaszab @amogh-jahagirdar
Thanks for the review.

Do you have any measurements how much extra runtime this puts to the metadata cleanup process?

Not at the moment, but it would be function of Manifest files + dataFiles

in the proposed implementation it looks like we're always additionally reading all the manifests just to obtain any sort orders that are referenced

The sortOrder used for write is only available at contentFile level as per spec. Except in case of Equality delete, both spark and flink seems to be not adding the sortOrderId to DataFile.
Once the DataFile is written, sortOrder if not latest, might not be useful - can that expired? Please let me know if I'm missing any other references or usage.
Also, any pointers on improving the current impl is appreciated.

at the very very least this traversal should only happen if the flag for cleaning up additional metadata is set

having a flag would be helpful to consumer - can make it configurable to run the expiry sort-order only when required. Should we have this behavior for all other metadata - schema, spec too?

Beyond this being a spec change

I assume you are referring to REST Open API spec. Please correct me if that's the not the case.

@gaborkaszab
Copy link
Collaborator

As a follow-up: To sum-up, @amogh-jahagirdar and I both feel that the cost of cleaning up sort orders is too high compared to the gain we have. Do you have a specific use-case where the unused sort orders take up so much space it's disturbing?

Answering some questions that came up and adding some remarks:

  • Checking the implementation, this is already behind the cleanExpiredMetadata flag. I wouldn't introduce yet another flag for sort orders separately, because eventually we'd have to wire it to the expireSnapshots Spark and Flink procedures.
  • catalogs should just take the responsibility of cleaning this additional metadata up I believe eventually there was a consensus that maintaining tables through Spark procedure even makes sense, because many vendors do that still. This is why I added the cleanExpiredMetadata param to the expireSnapshots proc. This might initiate a whole different conversation, though. :)
  • I assume you are referring to REST Open API spec Yes, the new MetadataUpdate type you added, RemoveSortOrders should also be added to the REST spec, similarly to RemovePartitionSpecs and all.

But all in all, before making any further steps here, I think we should judge the cost vs gain here. I'm not convinced that all these extra storage reads worth doing to clean up sort orders, unless there is a specific, practical use-case that we are solving. Maybe bringing this up on the community sync (or writing to dev@) could give more clarity on whether there is community support for this.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 24, 2025
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants