Skip to content

Conversation

@kbendick
Copy link
Contributor

@kbendick kbendick commented Jan 30, 2022

This closes #4007

As of Iceberg 0.13.0, the Spark stored procedures expire_snapshots and remove_orphan_files have an added parameter, max_concurrent_deletes, which indicates the size of the thread pool that should be instantiated to remove the relevant files in parallel.

Without this parameter, no separate thread pool is instantiated and the files are deleted sequentially in the current thread. For a high volume of deletes, this can be slow. Moreover, we should document every public parameter to procedures.

@github-actions github-actions bot added the docs label Jan 30, 2022
@kbendick kbendick force-pushed the add-max-concurrent-deletes-parameter-to-docs branch from a0e4878 to 86c1815 Compare January 30, 2022 05:46
@kbendick
Copy link
Contributor Author

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! cc @dramaticlly

@kbendick kbendick force-pushed the add-max-concurrent-deletes-parameter-to-docs branch from 6b40154 to 86726b4 Compare January 30, 2022 09:48
@kbendick kbendick force-pushed the add-max-concurrent-deletes-parameter-to-docs branch from 86726b4 to 996e3c3 Compare January 30, 2022 09:56
| `table` | ✔️ | string | Name of the table to update |
| `older_than` || timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
| `retain_last` | | int | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is a bit long. How about "by default, no threadpool is used"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconded, reads better to me personally

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be not accurate to say no thread pool at all is used, wonder if we still need to specify 'by default, no dedicated thread pool is used'. (Not sure if that's too academic).

Copy link
Contributor Author

@kbendick kbendick Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it's a bit long. Will update.

If we wanted to expand on the details, we could add an additional sentence below like older_than and retain_last are. But I think the shorter statement is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is updated.

Copy link
Contributor

@dramaticlly dramaticlly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Kyle for updating the spark procedures doc!

@rdblue rdblue merged commit 66f898c into apache:master Jan 31, 2022
@rdblue
Copy link
Contributor

rdblue commented Jan 31, 2022

Thanks, @kbendick!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Document max_concurrent_deletes parameter for the Spark procedures remove_orphan_files and expire_snapshots

4 participants