HADOOP-16570. S3A committers encounter scale issues #1442

Conversation
tested s3 ireland, ddb, auth
Force-pushed 5e8d7ce to 618e24c
Tested, s3a Ireland. FWIW, I'm planning to backport the thread cleanup and extra asserts to branch-3.1 and branch-3.2; I'm not so sure about the thread checks if we need AssertJ to be added, but well, it's an extra test-time dependency so not that troublesome.
Note that the thread tests go beyond just asking the committers if they have a thread pool; they verify that there aren't any left around. If we like this, it could be expanded to other places, such as verifying that the other FS clients are also being good citizens, especially for Hive, which creates and destroys many instances.
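A minimal sketch of what such a check might look like. This is hypothetical: hasThreadPool() stands in for whatever probe the patch actually adds, and createCommitter()/commitJob() are placeholders for the test suite's own lifecycle helpers.

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.Test;

public class ITestThreadPoolCleanup extends AbstractITCommitProtocol {

  @Test
  public void testNoThreadPoolAfterJobCommit() throws Exception {
    // run the commit lifecycle under test, then probe for a leaked pool
    AbstractS3ACommitter committer = createCommitter();
    commitJob(committer);
    assertThat(committer.hasThreadPool())
        .as("thread pool still present after job commit in %s", committer)
        .isFalse();
  }
}
```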
Force-pushed 618e24c to 1729122
nit: why final?
No real reason
made non-final
This patch is about to get bigger and more complicated: when you get really big with the scale tests, you also discover that trying to keep a list of all pending commits in the job committer is just a way to see OOM exception traces. I'm going to have to be more incremental about loading and committing files.
Force-pushed 25dde8f to ddc3d8b
tested: s3a ireland w/ ddb. not yet tested: all the way through spark
Force-pushed 05f2e8f to 2246490
Latest test run: s3 ireland. There's a new unit test which, with the current values, takes 1 min; I plan to cut the numbers back, just leaving them as-is for now to be confident that there are no scale problems with these values. I think I'll declare many more blocks per file. The slow parts of the test are actually the list operations.
As that list process is the one for the staging committers, it is only listing the consistent cluster FS (i.e. HDFS), so s3 perf won't matter. In real jobs the time to POST commits will dominate, and with that patch every pendingset file is loaded and processed in parallel.
🎊 +1 overall
This message was automatically generated.
checkstyle warnings
Force-pushed c671212 to 24ba90a
🎊 +1 overall
This message was automatically generated.
If this is not going to create the _SUCCESS marker because the file list is too large, why get all committed file names here? I think this should be inside whatever check maybeCreateSuccessMarker has to avoid the memory consumption.
We stop adding files to that list of pending/committed objects once we reach an arbitrary threshold (SUCCESS_MARKER_FILE_LIMIT == 100), so we only add that subset of entries. I'll clarify that in the comments.
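A rough illustration of the capping being described. Only SUCCESS_MARKER_FILE_LIMIT comes from the comment above; the class and method names are invented for the example, not the actual patch code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CommittedObjectTracker {
  // matches the threshold named in the comment above
  static final int SUCCESS_MARKER_FILE_LIMIT = 100;

  private final List<String> committedObjects = new ArrayList<>();

  /** Record a committed object key, but only up to the marker limit. */
  void noteCommittedObject(String key) {
    if (committedObjects.size() < SUCCESS_MARKER_FILE_LIMIT) {
      committedObjects.add(key);
    }
  }

  /** The (possibly truncated) list written to the _SUCCESS marker. */
  List<String> getCommittedObjects() {
    return Collections.unmodifiableList(committedObjects);
  }
}
```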
This seems unrelated. Why is it necessary to create the output path here?
aah, because sometimes we've had our terasort tests fail saying there's a tombstone at the far end, and I suspect it means that sometimes we somehow aren't getting that directory created. So I'm doing it preemptively
Why does CommitContext no longer require close to be called? Because the Tasks call now handles all of the failure and abort cases?
No; it's because it's the second of the two closeables. It's still in the () section, following the duration one after the semicolon.
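For readers unfamiliar with the pattern being discussed, a sketch of the two closeables inside one try-with-resources. The method names and in-scope variables (LOG, jobId, commitOperations, outputPath) are assumptions, not the exact code under review.

```java
// Both resources sit inside the try's parentheses, separated by a semicolon;
// both are closed automatically when the block exits, in reverse order.
try (DurationInfo d = new DurationInfo(LOG, "committing job %s", jobId);
     CommitOperations.CommitContext commitContext =
         commitOperations.initiateCommitOperation(outputPath)) {
  commitPendingUploads(commitContext);
}
```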
Should this explicitly call throwFailureWhenFinished()? I typically call either that one or suppressFailures() so that it is obvious to the reader how the error is handled.
This implementation doesn't have that throwFailureWhenFinished() call. I'll be explicit with suppressFailures() here and elsewhere.
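For illustration, a sketch of making the failure policy explicit in the Tasks builder. A later commit message in this PR names the call suppressExceptions(); the surrounding identifiers (pendingSetFiles, threadPool, loadAndAbort) are assumptions.

```java
// failures are caught and logged; iteration continues over the remaining items
Tasks.foreach(pendingSetFiles)
    .executeWith(threadPool)
    .suppressExceptions()
    .run(pendingSetPath -> loadAndAbort(commitContext, pendingSetPath));
```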
Failures for each failed pending set are handled in loadAndCommit so I don't think you need to handle failures here, just reverts and aborts.
Overall, looks good. The only problem I saw was loading all of the committed files for the _SUCCESS marker. Otherwise, I think everything should work, even if you end up aborting some failed files twice.
let me review that. thanks for checking this over.
Force-pushed 24ba90a to c34f3fe
Just pushed up a rebased update; it tries to address all of Ryan's comments, plus I use Tasks to parallelise the partition deletion; if a job writes to multiple partitions, this way there's a speedup. Tested, s3 ireland with -Dparallel-tests -DtestsThreadCount=12 -Ds3guard -Ddynamo
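A plausible shape for that parallel partition deletion, assuming partitionsToReplace, threadPool and fs exist in scope; this is a sketch, not the committed code.

```java
// delete each partition directory in parallel; one task per partition path
Tasks.foreach(partitionsToReplace)
    .executeWith(threadPool)
    .suppressExceptions()
    .run(partitionPath -> fs.delete(partitionPath, true));
```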
🎊 +1 overall
This message was automatically generated.
ehiggs left a comment
LGTM. Only nit was using a time unit on the config default.
+1
```diff
- public static final int THREAD_POOL_SHUTDOWN_DELAY = 30;
+ public static final int THREAD_POOL_SHUTDOWN_DELAY_SECS = 30;
```
will do. thanks
This patch explicitly shuts down the thread pool in job cleanup and after task commit, task abort, job abort and job commit.

The alternative strategy would be to always destroy the threads in the same method they were used, but as two operations are normally parallelized back-to-back (listing the pending files, then committing or aborting them), retaining the pool is useful. And there isn't any close() method or similar in the OutputCommitter interface to place it.

To test this, a probe for the committer having a thread pool was added, and the AbstractITCommitProtocol test extended to verify that there was no thread pool after the various commit and abort lifecycle operations.

To verify that the tests themselves were valid, destroyThreadPool() initially *did not* actually destroy the pool; the fact that the modified tests then all failed provided evidence that all paths followed in those tests successfully cleaned up. Once the method did close the thread pool, all these failing tests passed.

Note: I also switched to the HadoopExecutors thread pool factory; I considered moving to one of the caching thread pools but decided to keep this change simpler for ease of backport. For a trunk-only fix I'd consider asking the target S3A FS for its store context and creating a thread pool from it, which would just be a restricted fraction of the store's own pool.

Change-Id: Ib2765d70aae2658535e07da268899d72824094f4
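A simplified sketch of the lazily created, explicitly destroyed pool this commit describes. Field and method names are illustrative, the 30-second timeout mirrors the THREAD_POOL_SHUTDOWN_DELAY constant discussed above, and the HadoopExecutors call assumes its factory mirrors java.util.concurrent.Executors.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.util.concurrent.HadoopExecutors;

class CommitterPool {
  private ExecutorService threadPool;

  /** Demand-create the pool via the HadoopExecutors factory. */
  synchronized ExecutorService getThreadPool(int threads) {
    if (threadPool == null) {
      threadPool = HadoopExecutors.newFixedThreadPool(threads);
    }
    return threadPool;
  }

  /** Destroy the pool if present; safe to call repeatedly. */
  synchronized void destroyThreadPool() {
    if (threadPool != null) {
      threadPool.shutdown();
      try {
        threadPool.awaitTermination(30, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        threadPool = null;
      }
    }
  }
}
```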
Based on some hints from Kevin Risden, two test suites now verify that after the test run there are no unexpected threads, once we strip out a set of known-but-unstoppable threads we can expect to see. This is used in ITestS3AClosedFS to verify that an FS instance doesn't leak any threads, which is useful for future regression testing. For the S3A committer tests we only scan for outstanding committer pool threads; the rest are unimportant.

Also: some comments/diagnostics in the instrumentation class. The s3a FS close() also explicitly shuts down the thread pools. With a min size of 0 they'll eventually stop anyway; this just guarantees it happens faster.

Change-Id: I2081e327ac8fb57eb38a3d119f02efce6232bad2
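An illustrative version of that check: snapshot the live threads, strip a whitelist of known-but-unstoppable ones, and assert nothing else remains. The whitelist predicate here is a placeholder, not the patch's actual set.

```java
import java.util.Set;
import java.util.stream.Collectors;

import static org.assertj.core.api.Assertions.assertThat;

public final class ThreadLeakCheck {
  private ThreadLeakCheck() { }

  /** Fail if any thread outside the known set is still alive. */
  public static void assertNoLeakedThreads() {
    Set<String> unexpected = Thread.getAllStackTraces().keySet().stream()
        .map(Thread::getName)
        .filter(name -> !isKnownThread(name))
        .collect(Collectors.toSet());
    assertThat(unexpected)
        .describedAs("threads leaked by the filesystem")
        .isEmpty();
  }

  /** Placeholder whitelist: JVM housekeeping threads and the test runner. */
  private static boolean isKnownThread(String name) {
    return name.equals("main")
        || name.startsWith("Finalizer")
        || name.startsWith("Reference Handler");
  }
}
```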
This avoids needing to store the entire list of files and their commit information during job commit.

Consequences:
* Each task's .pendingset file is loaded in its own thread, but the files to commit listed inside the file are sequentially committed in that thread (see the sketch after this message). I don't know what the performance consequences will be.
* It's harder to abort things. Rather than abort all commits we know about by way of the files, I'm just going to abort all uploads under the destination path. We do that anyway as a fail-safe.
* The partitioned committer is going to have to load the files twice in "REPLACE" mode: the first time to identify the partitions being written to, and then to delete their contents.

I haven't written the change for the partitioned committer yet; I'll make sure the core commit process is working first. When I do implement it, I might spread the actual delete operations across the thread pool, as we know how long delete can take -don't we?

Change-Id: Ife66eb020f6dc9c08ce9e1ea001e94ea91b28f86
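A sketch of that structure, assuming the code builds on the existing PendingSet and SinglePendingCommit classes plus a commitContext helper; identifiers beyond those are illustrative.

```java
// each .pendingset file gets its own task; the commits listed inside it
// are then applied sequentially within that task's thread
Tasks.foreach(pendingSetFiles)
    .executeWith(threadPool)
    .run(pendingSetPath -> {
      PendingSet pendingSet = PendingSet.load(fs, pendingSetPath);
      for (SinglePendingCommit commit : pendingSet.getCommits()) {
        commitContext.commitOrFail(commit);
      }
    });
```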
* Incomplete reimplementation of revert semantics for staging committers
* Partitioned committer implements replace as a load-and-apply sequence

Change-Id: If78520bfa1f7c12ed4c1a5be4d330bc923659224
…dingset files

* There are pendingset-level commit/abort/revert operations to manage committing work and the (best-effort) rolling back from failures.
* The Tasks API is used within these operations to choreograph commit/abort/revert actions. However, no thread pool is currently created for that work. I didn't want to use the pool which schedules the file loading, as deadlock would have been inevitable. A separate thread pool could be created; however, unless it was actually bigger than the current pool, there would be no extra parallelisation. One special case: there were only a few tasks but they generated many, many files. I'm not worrying about that.
* The mock tests of the committers have been reworked for this world, including explicitly creating and saving multiple .pendingset files to better stress the commit process.
* I've also moved to AssertJ assertions while trying to debug mismatches between the expected and actual values. I'd split the blame for those failing tests equally between setting up the mock state and me getting revert and abort to work as the existing test cases expected.
* Some more detail in the Javadocs to explain what is going on.

Regarding the state of the patch: the tests are all happy; now I want to see what Yetus says. I also want to make PartitionedStagingCommitter.replacePartitions() do its read of all .pendingset files in parallel, so it can build up the list of partitions to replace a bit faster when there is the output of many tasks to process.

I'm also actually wondering what it would take to use the MockS3AClient here across more tests. Currently it is good (and with this patch better) at simulating incomplete multipart uploads, including tracking of active uploads. We could probably expand this to model more of the final state of the store, for example actually simulating the persisted state of the store. Worth a thought, though it is probably a moving target.

Change-Id: Idfa8198a920664f2fefe441d317b8e0fb681d368
This is a unit test, using mocking as a substitute for talking to S3; we are testing client-side scale, not that of the store communications.

Change-Id: Ide6dab0b5b08a845a88553b9085d0cf06426a7cb
Change-Id: I85a35d84f306e5f369eaacac0cff38febe1ccac0
…tify slow points

Listing files is surprisingly slow. Theories:
* the listFiles() call is the wrong scan for local (and HDFS?)
* overuse of Java 8 streams/maps, etc.

Explore #2 and then worry about #1. We must stay with listFiles() for the magic committer's scans of S3, but for the staging committers, we just need to flat-list the source dir with a filter (a sketch follows).

Change-Id: I7e29b6004e71b146500a95c9822c5eed17390fb4
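A sketch of that flat listing for the staging committers: one shallow list of the source directory with a suffix filter, rather than a recursive listFiles() treewalk. The directory variable and the exact suffix are assumptions.

```java
import org.apache.hadoop.fs.FileStatus;

// one shallow LIST with a filter, instead of a recursive treewalk
FileStatus[] pendingSets = fs.listStatus(jobAttemptDir,
    path -> path.getName().endsWith(".pendingset"));
```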
…t checks.

The partitioned staging committer will do this while identifying parent directories if it needs to replace those partitions.

Change-Id: I4f83eaafc244e92d5d937d3edb55c9dcc8b0e254
* Explicitly call suppressExceptions()
* Remove onFailure handler in commitPendingUploads
* Explain why the active commit list of pending files doesn't overload the _SUCCESS file

Also, the partitioned committer deletes partition paths in parallel for a bit more speed; minimum one LIST/POST per directory, plus on S3Guard some extra IO to DDB.

Change-Id: I750f421e826f7df738149afeb04afd35a0d44d9b
Change-Id: I32b7475b16e1d5cec5bbd29932d4d70e3bf47d73
Force-pushed 60d0679 to ac41d9e
🎊 +1 overall
This message was automatically generated.
Rebasing against trunk (HADOOP-16207), I'm getting some failures which I'm blaming on test setup code, such as the fact that temp dirs on forked runs are coming in wrong; I'll address that here.
Also filed: https://issues.apache.org/jira/browse/HADOOP-16632
The failed assertion was caused by a speculative task writing its .pending output file to its attempt directory after the job had completed. This is my first full trace of what happens during a partition, and I am pleased the actual output of the job was correct. We just can't prevent partitioned MR tasks from writing to the attempt directories after the job completes, and as there is a risk that pending uploads may be outstanding, document the need to have a lifecycle rule to clean these up. Which people should have anyway.
Change-Id: Iffef642be1f24b08c5e6369f2200327e8ad256e4
🎊 +1 overall
This message was automatically generated.
ok, merged in. thank you for the reviews.
This patch addresses two scale issues:
Thread pool leakage
The patch explicitly shuts down the thread pool in job cleanup and after task commit, task abort, job abort and job commit.
The alternative strategy would be to always destroy the threads in the same method they were used, but as two operations are normally parallelized back-to-back (listing the pending files, then committing or aborting them), retaining the pool is useful. And there isn't any close() method or similar in the OutputCommitter interface to place it.
To test this, a probe for the committer having a thread pool was added, and the AbstractITCommitProtocol test extended to verify that there was no thread pool after the various commit and abort lifecycle operations.
To verify that the tests themselves were valid, destroyThreadPool() initially did not actually destroy the pool; the fact that the modified tests then all failed provided evidence that all paths followed in those tests successfully cleaned up. Once the method did close the thread pool, all these failing tests passed.
Note: I also switched to the HadoopExecutors thread pool factory; I considered moving to one of the caching thread pools but decided to keep this change simpler for ease of backport. For a trunk-only fix I'd consider asking the target S3A FS for its store context and creating a thread pool from it, which would just be a restricted fraction of the store's own pool.
OOM on job commit for jobs with many thousands of tasks, each generating tens of files.
Instead of loading all pending commits into memory as a single list, the list of files to load is the sole list which is passed around; .pendingset files are loaded and processed in isolation, and reloaded if necessary for any abort/rollback operation.
The parallel commit/abort/revert operations now work at the .pendingset level, rather than that of individual pending commit files. The existing parallelized Tasks API is still used to commit those files, but with a null thread pool, so as to serialize the operations.
This could slow down the commit operation in the following situations:
* The job will be blocked waiting for the largest tasks to complete.
I am not going to worry about these.