
@dramaticlly
Contributor

@dramaticlly dramaticlly commented Aug 1, 2022

In #4052, S3FileIO implemented a new interface to support S3 batch deletion. This PR introduces it to the Spark expire-snapshots procedure, which now deletes files in batches when the underlying FileIO supports it (that is, when it implements the SupportsBulkOperations interface; currently only S3FileIO does).

It defaults to S3 batch deletion if the catalog's FileIO supports it, and allows customization through the bulkDeleteWith method.

  • The existing bulkDelete consumer function cannot be reused because it takes a single file name at a time rather than an iterable.
  • I did not add an interface override, to keep this change as small as possible; once it is approved, we can retroactively apply it to the interface and to previous Spark versions (2.4/3.0/3.1/3.2).
  • The existing test fixtures all use HadoopTables, and it is very hard to test the integration of S3FileIO and the Spark action together in unit tests, so I am looking for a way to run some integration tests and will share the results later.
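
For illustration, the conditional dispatch described above might look roughly like the sketch below. It is a minimal, self-contained sketch and not the code in this PR: the FileIO and SupportsBulkOperations interfaces here are simplified local stand-ins for the Iceberg types, and deleteAll, RecordingIO, and RecordingBulkIO are hypothetical names introduced only for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkDeleteSketch {
  // Simplified stand-in for Iceberg's FileIO (assumption: only the delete call is modeled).
  interface FileIO {
    void deleteFile(String path);
  }

  // Simplified stand-in for Iceberg's SupportsBulkOperations.
  interface SupportsBulkOperations {
    void deleteFiles(Iterable<String> paths);
  }

  // Hypothetical helper: use one bulk call when the FileIO supports it,
  // otherwise fall back to deleting one file at a time.
  static String deleteAll(FileIO io, List<String> paths) {
    if (io instanceof SupportsBulkOperations) {
      ((SupportsBulkOperations) io).deleteFiles(paths);
      return "bulk";
    }
    paths.forEach(io::deleteFile);
    return "single";
  }

  // Test double that records every delete call it receives.
  static class RecordingIO implements FileIO {
    final List<String> calls = new ArrayList<>();

    @Override
    public void deleteFile(String path) {
      calls.add("deleteFile:" + path);
    }
  }

  // Test double that additionally supports bulk deletion.
  static class RecordingBulkIO extends RecordingIO implements SupportsBulkOperations {
    @Override
    public void deleteFiles(Iterable<String> paths) {
      List<String> batch = new ArrayList<>();
      paths.forEach(batch::add);
      calls.add("deleteFiles:" + batch.size());
    }
  }

  public static void main(String[] args) {
    List<String> paths = List.of("s3://bucket/a.avro", "s3://bucket/b.avro");
    System.out.println(deleteAll(new RecordingIO(), paths));
    System.out.println(deleteAll(new RecordingBulkIO(), paths));
  }
}
```

The instanceof check keeps the per-file path as the default, so a FileIO without bulk support behaves exactly as before.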

Similar to #5373, but for the expire-snapshots procedure; related to #4012.

CC @rdblue , @danielcweeks, @amogh-jahagirdar, @szehon-ho

Preconditions.checkArgument(
ops.io() instanceof SupportsBulkOperations,
"FileIO %s does not support bulk deletion",
table.io().getClass().getName());
Contributor Author

@dramaticlly dramaticlly Aug 2, 2022

I should probably change this table.io() to ops.io() for consistency, but given that the checks take over an hour to complete, I will handle it together with other review feedback.

@szehon-ho
Member

We probably need to synchronize with @amogh-jahagirdar on whether we need a bulkDeleteFunc, and on whether retries happen at the S3FileIO layer.

@dramaticlly
Contributor Author

Closed in favor of #6682, which covers Spark actions more comprehensively and also implements bulk deletion for HadoopFileIO.

@dramaticlly dramaticlly closed this Feb 2, 2023
@dramaticlly dramaticlly deleted the bulk branch June 28, 2023 21:43