Skip to content

Comments

HADOOP-18242. ABFS Rename Failure when tracking metadata is in an incomplete state#4331

Merged
steveloughran merged 7 commits intoapache:trunkfrom
mehakmeet:HADOOP-18242
Jun 27, 2022
Merged

HADOOP-18242. ABFS Rename Failure when tracking metadata is in an incomplete state#4331
steveloughran merged 7 commits intoapache:trunkfrom
mehakmeet:HADOOP-18242

Conversation

@mehakmeet
Copy link
Contributor

@mehakmeet mehakmeet commented May 19, 2022

ABFS rename fails intermittently when the Storage-blob tracking
metadata is in an incomplete state. This surfaces as the error code
404 and an error message of "RenameDestinationParentPathNotFound"

To mitigate this issue, when a request fails with this response.
the ABFS client issues a HEAD call on the source file
and then retry the rename op.

ABFS filesystem statistics track when this occurs with new counters
rename_recovery
metadata_incomplete_rename_failures
rename_path_attempts

This is very rare occurrence and appears to be triggered under certain
heavy load conditions, just as HADOOP-18163 is.

Contributed by Mehakmeet Singh.

@mehakmeet
Copy link
Contributor Author

CC: @mukund-thakur @steveloughran
Reproducing this exact issue seems tricky. Also, need to check if this is the only error code that corresponds to this issue, if the tracking metadata is in an incomplete state, can we not see other 404s too?

@mehakmeet
Copy link
Contributor Author

Also, need to check if this is the only error code that corresponds to this issue, if the tracking metadata is in an incomplete state, can we not see other 404s too?

@snvijaya please provide some info on this if you could. Thanks.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made some suggestions. to really test that resilience, you will need to use mockito to inject a failure

}

@Test
public void testRenameWithNoDestinationParentDir() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add similar test case for resilient rename api, here or in ITestAbfsManifestStoreOperations

Copy link
Contributor

@snvijaya snvijaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have a look at the code comment given in AbfsClient.
As mentioned by Steve, driver test code uses mockito to recreate negative test cases.

@mehakmeet
Copy link
Contributor Author

Thanks for the review @snvijaya, had some pending changes, going to address your comments in the next commit alongside the change required to the condition of httpStatus code in the metadata incomplete scenario.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed. the less we use mockito, the less to break. so use new() over mocking where possible

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Happy with is. Mehakmeet, is this ready to go in?

Copy link
Contributor

@mukund-thakur mukund-thakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few minor feedbacks.

@mehakmeet
Copy link
Contributor Author

mehakmeet commented Jun 18, 2022

Mehakmeet, is this ready to go in?

@steveloughran
Waiting on MSFT to provide a key detail on what kind of HttpStatus codes we can expect apart from ParentNotFound. Yet to hear from them. Apart from that this is ready to go in.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1

how about we merge this and see if it is enough to eliminate the problem -if we see different failures with the same fix, we can extend it.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was about to merge, did a final review, and realised that we should pick up the source etag and use it in the inner rename call


// Doing a HEAD call resolves the incomplete metadata state and
// then we can retry the rename operation.
getPathStatus(source, false, tracingContext);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had one more thought here. the path status contains the etag, doesn't it? so if sourceEtag was null, now we can set it. That way if the rename failure is followed
immediately buy the rename-failure-but-it-really-happened event of HADOOP-18163, we are lined up for recovery

@apache apache deleted a comment from hadoop-yetus Jun 24, 2022
@apache apache deleted a comment from hadoop-yetus Jun 24, 2022
@apache apache deleted a comment from hadoop-yetus Jun 24, 2022
@apache apache deleted a comment from hadoop-yetus Jun 24, 2022
@steveloughran
Copy link
Contributor

really close to merge, just had one final thought before we hit the merge button. already updated the PR text in preparation for the merge (i was writing it up while looking at the code)

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 40m 18s trunk passed
+1 💚 compile 0m 53s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 0m 46s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 41s trunk passed
+1 💚 mvnsite 0m 50s trunk passed
+1 💚 javadoc 0m 48s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 0m 40s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 26s trunk passed
+1 💚 shadedclient 23m 53s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 24m 16s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 35s the patch passed
+1 💚 compile 0m 38s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 0m 38s the patch passed
+1 💚 compile 0m 33s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 33s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 23s the patch passed
+1 💚 mvnsite 0m 37s the patch passed
+1 💚 javadoc 0m 28s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 0m 27s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 11s the patch passed
+1 💚 shadedclient 23m 1s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 2s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 42s The patch does not generate ASF License warnings.
102m 34s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4331/7/artifact/out/Dockerfile
GITHUB PR #4331
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux c50fd83515c3 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / fc45b7a
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4331/7/testReport/
Max. process+thread count 533 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4331/7/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor

due diligence question: what did you test against?

@steveloughran
Copy link
Contributor

+1, merging.

tested myself against azure cardiff; transient failure of TestAbfsClientThrottlingAnalyzer in the parallel run; went away on a standalone one. that test is too brittle.

@steveloughran steveloughran merged commit 823f5ee into apache:trunk Jun 27, 2022
mehakmeet added a commit to mehakmeet/hadoop that referenced this pull request Jun 30, 2022
…omplete state (apache#4331)

ABFS rename fails intermittently when the Storage-blob tracking
metadata is in an incomplete state. This surfaces as the error code
404 and an error message of "RenameDestinationParentPathNotFound"

To mitigate this issue, when a request fails with this response.
the ABFS client issues a HEAD call on the source file
and then retries the rename operation again

ABFS filesystem statistics track when this occurs with new counters
  rename_recovery
  metadata_incomplete_rename_failures
  rename_path_attempts

This is very rare occurrence and appears to be triggered under certain
heavy load conditions, just as with HADOOP-18163.

Contributed by Mehakmeet Singh.
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…omplete state (apache#4331)


ABFS rename fails intermittently when the Storage-blob tracking
metadata is in an incomplete state. This surfaces as the error code
404 and an error message of "RenameDestinationParentPathNotFound"

To mitigate this issue, when a request fails with this response.
the ABFS client issues a HEAD call on the source file
and then retries the rename operation again

ABFS filesystem statistics track when this occurs with new counters
  rename_recovery
  metadata_incomplete_rename_failures
  rename_path_attempts

This is very rare occurrence and appears to be triggered under certain
heavy load conditions, just as with HADOOP-18163.

Contributed by Mehakmeet Singh.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants