-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Improve remove_orphan_files #26438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve remove_orphan_files #26438
Conversation
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
| validFileNames.add("version-hint.text"); | ||
|
|
||
| scanAndDeleteInvalidFiles(table, session, schemaTableName, expiration, validFileNames.build(), fileIoProperties); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small improvement, use util which wraps executor completion service: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/util/Executors.java#L41-L62
Benefit is the executor completion service will fail the remaining futures immediately if a single future fails, while this code waits on the futures in order. Gives quicker failure feedback (and wastes less resources)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have liked to use that, the problem is it forces serial execution of reading manifest lists upfront. In the current code we get to interleave reading from manifest lists on the main thread with reading the manifest files in the threadpool.
grantatspothero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion but looks good.
Combine the scan and delete step for data and metadata into one operation to reduce the number of filesystem calls
4c4c6f9 to
a7c1c8c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the remove_orphan_files operation by consolidating filesystem scanning and reducing separate operations for data and metadata folders. It improves performance by scanning manifest files in parallel using futures and combines the validation of all file types into a single concurrent set.
- Consolidates separate data and metadata file validation into a single unified approach
- Implements parallel scanning of manifest files using futures to improve performance
- Removes the subfolder-specific scanning approach in favor of scanning the entire table location
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Description
Avoid separate scan and delete operations for "data" and "metadata" folders to reduce filesystem calls.
Scan manifest files for valid file paths in parallel.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: