-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Improve parallelism in remove_orphan_files procedure #26326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the performance of the remove_orphan_files procedure by introducing parallel file deletion operations and optimizing manifest reading.
- Adds a dedicated executor service for file delete operations to enable parallel processing
- Optimizes manifest reading by selecting only the required
file_pathcolumn - Modifies the file deletion logic to submit batches asynchronously rather than blocking on each batch
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| IcebergMetadataFactory.java | Adds injection and initialization of the new file delete executor service |
| IcebergMetadata.java | Implements parallel file deletion logic and optimizes manifest reading |
| Multiple test files | Updates test constructors to include the new executor service parameter |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
tbaeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
| fileSystem.deleteFiles(filesToDelete); | ||
| filesToDelete.clear(); | ||
| List<Location> finalFilesToDelete = filesToDelete; | ||
| deleteFutures.add(icebergFileDeleteExecutor.submit(() -> deleteFiles(finalFilesToDelete, schemaTableName, fileSystem))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth seeing whether in production use cases the object store provider will choose to return 429 Too Many Requests
Is the filesystem layer being able to cope with backpressure?
I'm looking at https://github.com/trinodb/trino/blob/master/lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3FileSystem.java and i don't see any FailSafe logic built-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall parallelism is bounded by iceberg.file-delete-threads.
In general, either the SDK or the filesystem layer is expected to handle backoff and retry on throttling errors.
Description
Improve parallelism in remove_orphan_files procedure
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: