[Site] - Document max_concurrent_deletes parameter in spark stored procedures. #4008
Conversation
szehon-ho left a comment
Thanks for this! cc @dramaticlly
site/docs/spark-procedures.md
| Argument Name | Required? | Type | Description |
|---------------|-----------|------|-------------|
| `table` | ✔️ | string | Name of the table to update |
| `older_than` |  | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
| `retain_last` |  | int | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
| `max_concurrent_deletes` |  | int | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |
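For reference, a minimal sketch of how these arguments might be passed when calling the procedure; the catalog name `my_catalog` and the table `db.sample` are placeholders, not part of this change:

```sql
-- Expire snapshots older than the given timestamp, keep at least 10 ancestor
-- snapshots, and delete the expired files using a pool of 4 threads.
CALL my_catalog.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2021-06-30 00:00:00.000',
  retain_last => 10,
  max_concurrent_deletes => 4
)
```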
The description is a bit long. How about "by default, no thread pool is used"?
Seconded; reads better to me personally.
It might not be accurate to say no thread pool at all is used; I wonder if we still need to specify "by default, no dedicated thread pool is used". (Not sure if that's too academic.)
Agreed, it's a bit long. Will update.
If we wanted to expand on the details, we could add an additional sentence below, like `older_than` and `retain_last` have. But I think the shorter statement is sufficient.
This is updated.
dramaticlly left a comment
Thank you Kyle for updating the spark procedures doc!
Thanks, @kbendick!
* apache/iceberg#3723
* apache/iceberg#3732
* apache/iceberg#3749
* apache/iceberg#3766
* apache/iceberg#3787
* apache/iceberg#3796
* apache/iceberg#3809
* apache/iceberg#3820
* apache/iceberg#3878
* apache/iceberg#3890
* apache/iceberg#3892
* apache/iceberg#3944
* apache/iceberg#3976
* apache/iceberg#3993
* apache/iceberg#3996
* apache/iceberg#4008
* apache/iceberg#3758 and 3856
* apache/iceberg#3761
* apache/iceberg#2062
* apache/iceberg#3422
* remove restriction related to legacy parquet file list
This closes #4007
As of Iceberg 0.13.0, the Spark stored procedures `expire_snapshots` and `remove_orphan_files` have an added parameter, `max_concurrent_deletes`, which indicates the size of the thread pool that should be instantiated to remove the relevant files in parallel. Without this parameter, no separate thread pool is instantiated and the files are deleted sequentially in the current thread. For a high volume of deletes, this can be slow. Moreover, we should document every public parameter to procedures.
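As an illustration of the serial-versus-parallel behavior described above, the parameter could be supplied to `remove_orphan_files` as in this sketch (the catalog and table names are again placeholders):

```sql
-- Remove orphaned files under the table's location, deleting them with an
-- 8-thread pool instead of serially in the calling thread.
CALL my_catalog.system.remove_orphan_files(
  table => 'db.sample',
  max_concurrent_deletes => 8
)
```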