@raunaqmorarka raunaqmorarka commented Aug 18, 2025

Description

Avoid separate scan and delete operations for "data" and "metadata" folders to reduce filesystem calls.
Scan manifest files for valid file paths in parallel.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Improve performance of remove_orphan_files. ({issue}`26438`)

validFileNames.add("version-hint.text");

scanAndDeleteInvalidFiles(table, session, schemaTableName, expiration, validFileNames.build(), fileIoProperties);
try {

@grantatspothero grantatspothero Aug 18, 2025

Small improvement: use the utility that wraps an executor completion service: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/util/Executors.java#L41-L62

The benefit is that the executor completion service fails the remaining futures immediately if a single future fails, whereas this code waits on the futures in order. That gives quicker failure feedback (and wastes fewer resources).
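
For readers unfamiliar with the pattern, here is a minimal JDK-only sketch of failing fast with a completion service. It is not the linked Trino utility; `FailFastSketch` and `runAllFailFast` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical illustration; not the Trino utility referenced above
final class FailFastSketch
{
    private FailFastSketch() {}

    // Runs all tasks and rethrows the first failure as soon as it happens.
    // Results are returned in completion order, not submission order.
    static <T> List<T> runAllFailFast(ExecutorService executor, List<Callable<T>> tasks)
            throws ExecutionException, InterruptedException
    {
        CompletionService<T> completionService = new ExecutorCompletionService<>(executor);
        List<Future<T>> futures = new ArrayList<>();
        List<T> results = new ArrayList<>();
        try {
            for (Callable<T> task : tasks) {
                futures.add(completionService.submit(task));
            }
            for (int i = 0; i < tasks.size(); i++) {
                // take() hands back futures as they complete, so a failed task
                // surfaces here even if earlier-submitted tasks are still running
                results.add(completionService.take().get());
            }
            return results;
        }
        finally {
            // No effect on futures that already completed; on failure this
            // cancels (and interrupts) whatever is still queued or running
            futures.forEach(future -> future.cancel(true));
        }
    }
}
```

By contrast, calling `get()` on each future in submission order can block on a slow first task while a later task has already failed.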

@raunaqmorarka (Member, Author)

I would have liked to use that, but the problem is that it forces serial execution of reading the manifest lists upfront. In the current code we get to interleave reading manifest lists on the main thread with reading the manifest files in the thread pool.
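
A rough sketch of that interleaving, assuming in-memory maps as stand-ins for actually reading manifest lists and manifest files. The class, method, and parameter names are hypothetical and do not come from the PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical illustration of interleaving; not the code from this PR
final class InterleavedManifestScanSketch
{
    private InterleavedManifestScanSketch() {}

    // manifestLists: manifest list path -> manifests it references (stand-in for reading the list)
    // manifests: manifest path -> files it references (stand-in for reading the manifest)
    static Set<String> collectValidFiles(
            Map<String, List<String>> manifestLists,
            Map<String, List<String>> manifests,
            ExecutorService executor)
            throws ExecutionException, InterruptedException
    {
        // Written from pool threads and the calling thread, so it must be concurrent
        Set<String> validFiles = ConcurrentHashMap.newKeySet();
        Set<String> submittedManifests = new HashSet<>();
        List<Future<?>> futures = new ArrayList<>();
        try {
            // The calling thread walks the manifest lists one by one...
            for (Map.Entry<String, List<String>> manifestList : manifestLists.entrySet()) {
                validFiles.add(manifestList.getKey());
                for (String manifest : manifestList.getValue()) {
                    // The same manifest can be referenced by several lists; read it once
                    if (!submittedManifests.add(manifest)) {
                        continue;
                    }
                    validFiles.add(manifest);
                    // ...and hands each manifest to the pool right away, so manifest
                    // reads overlap with reading the remaining manifest lists
                    futures.add(executor.submit(() -> {
                        validFiles.addAll(manifests.getOrDefault(manifest, List.of()));
                    }));
                }
            }
            // Waits on the futures in submission order (the trade-off discussed above)
            for (Future<?> future : futures) {
                future.get();
            }
            return validFiles;
        }
        finally {
            // No effect on completed futures; cancels leftovers after a failure
            futures.forEach(future -> future.cancel(true));
        }
    }
}
```

Submitting each manifest as soon as it is discovered is what a collect-everything-then-run helper would prevent, since the full task list would have to exist before any task starts.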


@grantatspothero grantatspothero left a comment

Minor suggestion but looks good.

@raunaqmorarka raunaqmorarka force-pushed the raunaq/remove-orph-manifest branch from 4c4c6f9 to a7c1c8c Compare August 19, 2025 06:18
@raunaqmorarka raunaqmorarka requested a review from Copilot August 19, 2025 06:25

Copilot AI left a comment

Pull Request Overview

This PR optimizes the remove_orphan_files operation by consolidating filesystem scanning and reducing separate operations for data and metadata folders. It improves performance by scanning manifest files in parallel using futures and combines the validation of all file types into a single concurrent set.

  • Consolidates separate data and metadata file validation into a single unified approach
  • Implements parallel scanning of manifest files using futures to improve performance
  • Removes the subfolder-specific scanning approach in favor of scanning the entire table location
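
As a rough illustration of the single-pass idea, the sketch below walks one table location once and checks every file against a single set of valid file names, rather than scanning "data" and "metadata" separately. It assumes a local filesystem via `java.nio.file` as a stand-in for the connector's filesystem API, only lists deletion candidates instead of deleting them, and uses hypothetical names (`SinglePassOrphanScanSketch`, `findOrphans`).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration on a local filesystem; not the code from this PR
final class SinglePassOrphanScanSketch
{
    private SinglePassOrphanScanSketch() {}

    // Returns files under tableLocation that are not in validFileNames
    // and were last modified before the expiration instant
    static List<Path> findOrphans(Path tableLocation, Set<String> validFileNames, Instant expiration)
            throws IOException
    {
        // One recursive listing covers every subfolder of the table location
        try (Stream<Path> paths = Files.walk(tableLocation)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(path -> !validFileNames.contains(path.getFileName().toString()))
                    .filter(path -> lastModified(path).isBefore(expiration))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static Instant lastModified(Path path)
    {
        try {
            return Files.getLastModifiedTime(path).toInstant();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The age filter keeps recently written files out of the result, so files from in-flight writes that are not yet referenced by any manifest are not treated as orphans.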

@raunaqmorarka raunaqmorarka merged commit 73cef68 into master Aug 19, 2025
89 of 92 checks passed
@raunaqmorarka raunaqmorarka deleted the raunaq/remove-orph-manifest branch August 19, 2025 08:34
@github-actions github-actions bot added this to the 477 milestone Aug 19, 2025
Labels

cla-signed iceberg Iceberg connector

6 participants