[Iceberg] Support procedure remove_orphan_files #23267

Merged
hantangwangd merged 1 commit into prestodb:master from hantangwangd:remove_orphan_files
Jul 31, 2024
@hantangwangd
Member

@hantangwangd hantangwangd commented Jul 21, 2024

Description

This PR adds support for the procedure remove_orphan_files to the Iceberg connector. The procedure removes files that are not referenced in any metadata file of an Iceberg table and can thus be considered "orphaned".

Examples:

  • Remove any files that are not known to the table db.sample and are older than the specified timestamp:
CALL iceberg.system.remove_orphan_files('db', 'sample', TIMESTAMP '2023-08-31 00:00:00.000');
  • Remove any files that are not known to the table db.sample and are older than 3 days (the default):
CALL iceberg.system.remove_orphan_files(schema => 'db', table_name => 'sample');
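
The selection rule described above can be illustrated with a minimal sketch. This is not the PR's implementation: the function `find_orphan_files`, the in-memory file listing, and the strict "modified before the cutoff" comparison are all assumptions made for illustration. It only shows the idea that a file is a removal candidate when no metadata references it and it is older than the given timestamp.

```python
from datetime import datetime


def find_orphan_files(listed_files, referenced_paths, older_than):
    """Return paths that are unreferenced and were modified before `older_than`.

    listed_files: dict of path -> last-modified datetime (hypothetical listing
                  of everything under the table location).
    referenced_paths: set of paths reachable from the table metadata (hypothetical).
    older_than: cutoff datetime, mirroring the procedure's timestamp argument.
    """
    return sorted(
        path
        for path, modified in listed_files.items()
        if path not in referenced_paths and modified < older_than
    )


# Illustrative data only; the paths and timestamps are made up.
listed = {
    "data/a.parquet": datetime(2023, 8, 1),  # referenced by metadata, kept
    "data/b.parquet": datetime(2023, 8, 1),  # unreferenced and old, an orphan
    "data/c.parquet": datetime(2023, 9, 2),  # unreferenced but newer than the cutoff
}
referenced = {"data/a.parquet"}

orphans = find_orphan_files(listed, referenced, older_than=datetime(2023, 8, 31))
print(orphans)
```

With this data, only data/b.parquet qualifies. The age threshold matters because a file written by an in-flight operation may not yet be referenced by any committed metadata, so recent unreferenced files are left alone.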

Motivation and Context

Support removing orphan files that are not referenced in any metadata files for Iceberg

Test Plan

  • Newly added test cases in TestRemoveOrphanFilesProcedure

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==


Iceberg Connector Changes
* Add procedure `remove_orphan_files` to remove orphan files that are not referenced in any metadata files for Iceberg. :pr:`23267`

@hantangwangd hantangwangd requested a review from presto-oss July 21, 2024 06:19
@hantangwangd hantangwangd force-pushed the remove_orphan_files branch from e644017 to a9066c5 Compare July 21, 2024 07:24
@tdcmeehan tdcmeehan self-assigned this Jul 22, 2024
Contributor

@kiersten-stokes kiersten-stokes left a comment

This will be nice to have! Just NITs from me

@hantangwangd hantangwangd force-pushed the remove_orphan_files branch from a9066c5 to 4bdb47b Compare July 26, 2024 18:09
Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! Local doc build with the new table looks good. A minor suggested rephrase for active voice, and shortening for readability.

@hantangwangd hantangwangd force-pushed the remove_orphan_files branch from 4bdb47b to 0bf44a8 Compare July 29, 2024 17:30
steveburnett previously approved these changes Jul 29, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pulled the updated branch and did a new local docs build; looks good. Thanks!

Contributor

@ZacBlanco ZacBlanco left a comment

A few initial comments. Great to see us getting more parity with Iceberg's Spark procedures

Contributor

@ZacBlanco ZacBlanco left a comment

Two final things. Also, I would like to see a test which exercises the default expiry limit. I don't have a good idea on how to do that in a reasonable time frame within the test, so I am happy to leave as is if you think it would be difficult

@hantangwangd
Member Author

> I would like to see a test which exercises the default expiry limit. I don't have a good idea on how to do that in a reasonable time frame within the test, so I am happy to leave as is if you think it would be difficult

It does seem a bit difficult to test this scenario unless we change the implementation of remove_orphan_files to support configuring the default value of older_than. I'm not sure it's worth doing this. Maybe we can add such test cases once we figure out a better way to handle time in tests?
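
For illustration only, one common way to make a time-based default testable is to compute it from an injected clock rather than the system time. The sketch below is an assumption, not something proposed or implemented in this PR: `default_cutoff` and `fixed_clock` are hypothetical names, and only the 3-day default comes from the PR description.

```python
from datetime import datetime, timedelta, timezone

# 3 days is the default stated in the PR description.
DEFAULT_RETENTION = timedelta(days=3)


def default_cutoff(clock):
    """Derive the default older_than value from a caller-supplied clock (hypothetical)."""
    return clock() - DEFAULT_RETENTION


def fixed_clock():
    """A deterministic clock a test could pass in instead of the real time."""
    return datetime(2024, 7, 31, tzinfo=timezone.utc)


cutoff = default_cutoff(fixed_clock)
four_days_old = fixed_clock() - timedelta(days=4)
two_days_old = fixed_clock() - timedelta(days=2)

# A file is eligible for removal only if it predates the cutoff.
print(four_days_old < cutoff, two_days_old < cutoff)
```

With a fixed clock, a test can check both sides of the 3-day boundary without waiting for real time to pass.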

@hantangwangd hantangwangd merged commit 86fc085 into prestodb:master Jul 31, 2024
@hantangwangd hantangwangd deleted the remove_orphan_files branch July 31, 2024 15:33
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024