
Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Mar 4, 2021

S3A rename() updated to:

  1. switch from verifying that the dest parent path is present and a directory to simply verifying that it isn't a file.
  2. avoid the race condition where thread 1 has deleted the last subdir/file but not yet recreated a parent marker.
  3. don't downgrade FileNotFoundException to 'false' if the source doesn't exist.
  4. raise FileAlreadyExistsException if the dest path exists or the parent is a file.

If thread/process 1 deleted the subdir dest/subdir1 and there
are no sibling subdirectories, then the dir dest would not
exist until maybeCreateFakeParentDirectory() had performed a
LIST and, if needed, a PUT of a marker.
This creates a window in which thread/process 2, trying to rename staging/subdir2
into dest, could fail with "parent does not exist".
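
To make the window concrete, here's a minimal sketch of the race using only the public FileSystem API. The bucket and paths (example-bucket, dest/, staging/) are made up for illustration; the actual regression test for this is ITestRenameDeleteRace.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameDeleteRaceDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical bucket/paths, purely for illustration.
    Path dest = new Path("s3a://example-bucket/dest");
    Path lastSubdir = new Path(dest, "subdir1");
    Path srcSubdir = new Path("s3a://example-bucket/staging/subdir2");
    FileSystem fs = FileSystem.get(dest.toUri(), new Configuration());

    // Thread 1: delete the only child of dest/. Until
    // maybeCreateFakeParentDirectory() recreates the marker (LIST + PUT),
    // there is no object behind dest/ at all.
    CompletableFuture<Boolean> deleter = CompletableFuture.supplyAsync(() -> {
      try {
        return fs.delete(lastSubdir, true);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    // Thread 2: rename into dest/ in the same window. Before this patch,
    // the "dest parent must be a directory" probe could fail here with
    // "parent does not exist".
    boolean renamed = fs.rename(srcSubdir, new Path(dest, "subdir2"));

    deleter.join();
    System.out.println("rename returned " + renamed);
  }
}
```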

And guess what:

  1. Hive still thinks renaming directories from staging to the production dataset is a good idea.
  2. It can spawn many threads to do this in parallel (maybe even to compensate for the slow
    rename() performance of S3).
  3. On a sufficiently overloaded system (lots of threads already doing the parallel renames)
    the window of parent-dir-not-found can last long enough for things to fail.

Prior to S3 being consistent, this wouldn't have been an issue:

  • rename()'s need to list everything underneath the source dir wasn't safe to use with delayed list consistency
  • so S3Guard was effectively mandatory.
  • and delayed LIST consistency may even have reduced the window size.

The fix, going from verifying that the parent dir exists to simply making sure that it
isn't a file, weakens the requirement "parent dir must exist"; but file:// already doesn't
require that.
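
For clarity, here's the shape of the weakened probe. This is not the actual S3AFileSystem code; the helper name and wiring are invented for the sketch:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of the weakened dest-parent probe; not the real S3AFileSystem code. */
final class DestParentCheck {
  static void checkDestParent(FileSystem fs, Path destParent)
      throws IOException {
    try {
      FileStatus st = fs.getFileStatus(destParent);  // HEAD and/or LIST
      if (st.isFile()) {
        // A file where the parent directory should be: always fatal.
        throw new FileAlreadyExistsException(
            "Destination parent is a file: " + destParent);
      }
      // A directory (real or just a marker): acceptable.
    } catch (FileNotFoundException e) {
      // Old behaviour: fail with "parent does not exist".
      // New behaviour: tolerate it; a sibling thread may be in the window
      // between deleting the last child and recreating the dir marker.
    }
  }
}
```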

Consequences

  1. You can rename under a path which doesn't exist. file:// lets you do that already,
    and the FS contract tests have a switch for it.
  2. Exceptions get raised where false would previously have been returned, so Hive no longer
    loses the details of what went wrong. This isn't consistent with HDFS, but is with other
    stores (the FS contract tests have a switch for this too); see the sketch after this list.
  3. You can rename a few layers under a file, but create() lets you do that already,
    and nobody has noticed in production.
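
As referenced in consequence 2, a caller that wants to cope with both reporting styles can check the boolean and catch the exceptions. A hedged sketch (class and message strings invented):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Handling both HDFS-style (false) and S3A-style (exception) failures. */
final class RenameHandling {
  static void commitRename(FileSystem fs, Path src, Path dest)
      throws IOException {
    try {
      if (!fs.rename(src, dest)) {
        // HDFS-style reporting: all you know is "it failed".
        throw new IOException("rename(" + src + ", " + dest + ") returned false");
      }
    } catch (FileNotFoundException e) {
      // S3A with this change: the source path does not exist.
      throw new IOException("rename source missing: " + src, e);
    } catch (FileAlreadyExistsException e) {
      // S3A with this change: dest exists, or a parent of dest is a file.
      throw new IOException("rename destination unusable: " + dest, e);
    }
  }
}
```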

If directory marker retention was enabled, there's more likelihood that an empty dir marker existed; if it did, the race condition wouldn't arise. But as there are no guarantees that the marker will be there, it's not safe to rely on.

@steveloughran
Contributor Author

Testing with -Dparallel-tests -DtestsThreadCount=8 -Dmarkers=keep -Dscale

If the store is set to raise exceptions, a lot of tests which expect rename(bad options) to return false now get exceptions. One of the contract tests would downgrade if the raised exception was FileAlreadyExistsException and the contract xml said that was ok. I'm reluctant to go with a bigger patch.

This PR is so that Hive and friends can get better reporting on errors, rather than have them lost. It will be optional.

@steveloughran steveloughran force-pushed the s3/HADOOP-16721-rename-resilience branch from f2622c7 to 6b3a984 Compare March 8, 2021 14:14
@steveloughran
Contributor Author

Testing: s3 london with markers==keep and delete, s3guard on and off.

  • change in error reporting of s3a FS matched with relevant changes in s3a.xml for contract tests.
  • skip tests verifying that you can't rename 2+ levels under a file.
  • Failures related to the common-crawl endpoints and ITestAssumeRole.testAssumeRoleBadInnerAuth: known and fixed in HADOOP-17511 (#2675, "Add audit/telemetry logging to S3A connector").

@iwasakims
Member

LGTM overall. Tests against Tokyo region worked. I just left comments for nits.

$ mvn verify -Dtest=x -Dit.test=ITestS3AContractRename,ITestlRenameDeleteRace -Ds3guard
...
[INFO] Running org.apache.hadoop.fs.s3a.impl.ITestlRenameDeleteRace
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.044 s - in org.apache.hadoop.fs.s3a.impl.ITestlRenameDeleteRace
[INFO] Running org.apache.hadoop.fs.s3a.ITestS3GuardListConsistency
[INFO] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 35.603 s - in org.apache.hadoop.fs.s3a.ITestS3GuardListConsistency
[INFO] Running org.apache.hadoop.fs.contract.s3a.ITestS3AContractRename
[WARNING] Tests run: 22, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 12.523 s - in org.apache.hadoop.fs.contract.s3a.ITestS3AContractRename

@iwasakims
Member

I got an error on ITestS3AContractDistCp, but it seems unrelated; I cannot reproduce it by running ITestS3AContractDistCp alone with -Dit.test=ITestS3AContractDistCp.

$  mvn verify -Dtest=x -Dit.test=ITestS3AContract*
...
[INFO] Running org.apache.hadoop.fs.contract.s3a.ITestS3AContractDistCp
[ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 47.658 s <<< FAILURE! - in org.apache.hadoop.fs.contract.s3a.ITestS3AContractDistCp
[ERROR] testNonDirectWrite(org.apache.hadoop.fs.contract.s3a.ITestS3AContractDistCp)  Time elapsed: 4.055 s  <<< FAILURE!
java.lang.AssertionError: Expected 2 renames for a non-direct write distcp expected:<2> but was:<0>
        at org.apache.hadoop.fs.contract.s3a.ITestS3AContractDistCp.testNonDirectWrite(ITestS3AContractDistCp.java:98)

@iwasakims
Member

  1. don't downgrade FileNotFoundException to 'false' if source doesn't exist
  2. raise FileAlreadyExistsException if dest path exists or parent is a file.

Should this JIRA be marked as incompatible for applications assuming the existing behavior?

@steveloughran steveloughran force-pushed the s3/HADOOP-16721-rename-resilience branch from 715b635 to d37997a Compare March 10, 2021 13:40
fs.s3a.rename.raises.exceptions: raise exceptions on rename failures
fs.s3a.rename.reduced.probes: don't look for parent dir (LIST), just verify
 it isn't a file.

The reduced probe not only saves money, it avoids race conditions
where one thread deleting a subdir can cause LIST <parent> to fail before
a dir marker is recreated.

Note:

* file:// rename() creates parent dirs, so this isn't too dangerous.
* tests will switch modes.

We could always just do the HEAD; topic for discussion. This patch: optional

Change-Id: Ic0f8a410b45fef14ff522cb5aa1ae2bc19c8eeee
Check for parent dir is now always !file, not isDirectory; avoids
race conditions with parallel deletes of subdirectories under destination

Also switches the connector to raise exceptions on conditions
which don't trigger exceptions on HDFS (it just returns false):

* source file missing
* destination existing

Some other connectors already report these as failures. With this
change S3 moves away from HDFS behaviour, but towards one with better
error reporting, and which we know is handled in existing applications.

Documented the changes in filesystem.md and troubleshooting_s3a.md

Change-Id: I5daf54636e63c19f273c86519d5eeb3cbeffeb49
…turning false.

about to revert.

Change-Id: I0d91330e2a59e8c6863ebeedce17573e69e2bef5
revert option to raise exceptions instead of returning false; explicitly
raising FileNotFoundException/FileAlreadyExistsException is better.

Change-Id: Ica78e0abf5a1444227e4843288679a2d3441dd70
Change-Id: Ic8787cb6f87d4b5cda23c6ced1e73922b665e797
modified:   src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md
modified:   src/test/java/org/apache/hadoop/fs/s3a/S3ATestUtils.java
renamed:    src/test/java/org/apache/hadoop/fs/s3a/impl/ITestlRenameDeleteRace.java -> src/test/java/org/apache/hadoop/fs/s3a/impl/ITestRenameDeleteRace.java

moved the "disable s3guard" test setup into S3ATestUtils; new suites are going
to be adopting it until we pull out S3Guard completely.

Change-Id: I0303aec2f6e4c98685da62e9f31c7e44a1e67d41
Change-Id: I0dc69dd913d7f9ca36930ab09da1060bdfee8970
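
On the "disable s3guard" test setup moved into S3ATestUtils: roughly what such a helper does, sketched with the standard config keys. The actual method name and signature in S3ATestUtils may differ.

```java
import org.apache.hadoop.conf.Configuration;

/** Sketch of a "disable S3Guard" test helper; names are illustrative only. */
final class S3GuardTestSupport {
  static void disableS3Guard(Configuration conf) {
    // Force the null metadata store so the raw-S3 code path is exercised.
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore");
    // Disable authoritative mode so listings always go to S3.
    conf.setBoolean("fs.s3a.metadatastore.authoritative", false);
    // Real test setup must also clear any per-bucket overrides
    // (fs.s3a.bucket.<bucket>.metadatastore.impl), or they take priority.
  }
}
```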
@steveloughran steveloughran force-pushed the s3/HADOOP-16721-rename-resilience branch from d37997a to 3eff875 Compare March 10, 2021 13:42
Change-Id: I288e65899302adcebab5221508a005784cfe1d89
@steveloughran steveloughran force-pushed the s3/HADOOP-16721-rename-resilience branch from 3eff875 to b10cc28 Compare March 10, 2021 15:10
@steveloughran
Contributor Author

testing: s3 london, most recently with -Dparallel-tests -DtestsThreadCount=6 -Dmarkers=keep -Ds3guard -Ddynamo -Dscale

Member

@iwasakims iwasakims left a comment


+1. The ITestRenameDeleteRace test failed without the S3AFileSystem fix, as expected, and passed with the fix regardless of -Ds3guard. Thanks, @steveloughran.

@steveloughran
Contributor Author

ah, thanks, lovely. Just tweaked the imports slightly for better backporting; I'll merge once the next compile is good, then do a cherry-pick and retest for branch-3.3.

@steveloughran steveloughran merged commit bcd9c67 into apache:trunk Mar 11, 2021
asfgit pushed a commit that referenced this pull request Mar 11, 2021
The S3A connector's rename() operation now raises FileNotFoundException if
the source doesn't exist; a FileAlreadyExistsException if the destination
exists and is unsuitable for the source file/directory.

When renaming to a path which does not exist, the connector no longer checks
for the destination parent directory existing; instead it simply verifies
that there is no file immediately above the destination path.
This is needed to avoid race conditions with delete() and rename()
calls working on adjacent subdirectories.

Contributed by Steve Loughran.
@steveloughran steveloughran deleted the s3/HADOOP-16721-rename-resilience branch October 15, 2021 19:49
kiran-maturi pushed a commit to kiran-maturi/hadoop that referenced this pull request Nov 24, 2021
jojochuang pushed a commit to jojochuang/hadoop that referenced this pull request May 23, 2023

Conflicts:
	hadoop-tools/hadoop-aws/src/test/resources/contract/s3a.xml
        hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Change-Id: I1996625d750f62c3e51686ff5317bd47ca0233bf
