HADOOP-18679. Add API for bulk/paged object deletion #5993
Conversation
💔 -1 overall

This message was automatically generated.
Writing up the spec made me decide we should have a `.opt()` to indicate when a bulk delete is a "background" operation, which may be executed at a rate that interferes less with live queries, e.g. smaller pages, rate-limited buildup of pages, a different throttle retry policy.
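As a rough illustration of what such a background opt might look like, here is a minimal sketch. The option name `fs.option.bulkdelete.background` appears in a later commit in this PR; the builder class and method shapes here are hypothetical, loosely following the `.opt()` pattern of the existing `openFile()` builder:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a bulk-delete builder carrying string options. */
class BulkDeleteBuilder {
    private final Map<String, String> options = new HashMap<>();

    /** Set an option; mirrors the .opt() pattern used by openFile(). */
    BulkDeleteBuilder opt(String key, String value) {
        options.put(key, value);
        return this;
    }

    /** True if the caller marked this delete as a background cleanup,
     *  letting the store pick smaller pages or a gentler retry policy. */
    boolean isBackground() {
        return Boolean.parseBoolean(
            options.getOrDefault("fs.option.bulkdelete.background", "false"));
    }
}
```

A caller would then build with something like `.opt("fs.option.bulkdelete.background", "true")` to signal lower priority.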
I am struggling a bit to understand how this all comes together. This is what I think is happening, but I don't really get it:
- Implement an iterator; the iterator has a method which implements the actual bulk delete calls to the store.
- Implement the BulkDelete and Builder interfaces. The build() method will iterate through the iterator and call a bulkDelete method on the iterator. Does bulkDelete() just create the builder and return?
- On each iteration, call DeleteProgress to update progress. If using the FAIL_FAST implementation and there are any failures, it returns false.
- If false, call abort(). (Where is abort() to be implemented?)
- Once complete, return the Outcome object.
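The steps the reviewer lists above can be sketched as a single loop. This is only an illustration of the control flow under discussion, not the PR's actual implementation; the names DeleteProgress and Outcome come from the thread, while the method signatures and page handling are guesses:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch of the paged-delete control flow described in the review thread. */
class PagedDeleteSketch {

    /** Progress callback, invoked once per deleted page.
     *  Returning false asks the operation to abort (the FAIL_FAST case). */
    interface DeleteProgress {
        boolean pageDeleted(List<String> page);
    }

    /** Hypothetical outcome object: whether all pages were deleted. */
    static final class Outcome {
        final boolean successful;
        Outcome(boolean successful) { this.successful = successful; }
    }

    /** Batch paths into pages, "delete" each page, report progress,
     *  and stop early if the callback returns false. */
    static Outcome execute(Iterator<String> paths, int pageSize,
                           DeleteProgress progress) {
        List<String> page = new ArrayList<>(pageSize);
        while (paths.hasNext()) {
            page.add(paths.next());
            if (page.size() == pageSize || !paths.hasNext()) {
                // a real store would issue the bulk delete call here
                if (!progress.pageDeleted(new ArrayList<>(page))) {
                    return new Outcome(false);   // aborted mid-operation
                }
                page.clear();
            }
        }
        return new Outcome(true);
    }
}
```

Under this reading, abort() would simply be the early return taken when the progress callback reports failure.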
* will be batched into pages and submitted to the remote filesystem/store
* for bulk deletion, possibly in parallel.
* <p>
* A remote iterator provides the list of paths to delete; all must be under
Why is the base path a requirement? To ensure things are in the same bucket (for S3), or something else?
It's for multiple mounted filesystems (viewfs) to direct to the final fs.
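To illustrate the point: if every path in the batch is required to sit under one base path, a mount table like viewfs can resolve the whole batch to a single target filesystem up front. A minimal, purely illustrative check (string-based; the real code would work on Path/URI objects):

```java
/** Illustrative check that a candidate path lies under a base path,
 *  so a mounted filesystem can route the whole batch to one target fs. */
class BasePathCheck {
    static boolean underBase(String base, String path) {
        // normalize: treat the base as a directory prefix
        String prefix = base.endsWith("/") ? base : base + "/";
        return path.equals(base) || path.startsWith(prefix);
    }
}
```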
private final boolean successful;

/**
 * Wast the operation aborted?
nit: typo, Was
Force-pushed 9317477 to d69fac0 (compare)
💔 -1 overall

This message was automatically generated.
Initial pass at writing an API for bulk deletes, targeting S3 and any store with paged delete support. Minimal design of a RemoteIterator to provide the list of paths to delete; a progress report will be issued after each page is deleted, giving an update of the files deleted and a way for the application code to abort an ongoing delete, such as after a failure.

Change-Id: I3dcbb144232d76b5d4ebf7ad080d187edd6e93e4

Including option "fs.option.bulkdelete.background" to indicate this is a background cleanup and so can be lower priority (somehow).

Change-Id: Idb55ebf2a6664fb23e3dbacd3e0ade45cb4936e1

Change-Id: I1b053f3b6573dfb53ade78073d0cdf948a0c207d
Force-pushed d69fac0 to c7b4e99 (compare)
💔 -1 overall

This message was automatically generated.
Initial pass at writing an API for bulk deletes, targeting S3 and any store with paged delete support.

Minimal design of a RemoteIterator to provide the list of paths to delete; a progress report will be issued after each page is deleted, giving an update of the files deleted and a way for the application code to abort an ongoing delete, such as after a failure.
Aspects of the implementation to make clear in the markdown spec.
How was this patch tested?
No tests yet; working on API first.
For code changes:
LICENSE, LICENSE-binary, NOTICE-binary files?