-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-18425. ABFS rename recovery #5494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-18425. ABFS rename recovery #5494
Conversation
If "fs.azure.enable.rename.resilience" is true, then do a HEAD of the source file before the rename, which can then be used to recover from the failure, as the manifest committer does (HADOOP-18163). Change-Id: Ia417f1501f7274662eb9ff919c6378fb913b476b HADOOP-18425. ABFS rename resilience through etags only get the etag on HNS stores Change-Id: I9faffa78294e1782f0b2db3d1c997ec3fe53637c
1. move config checks of rename resilience flag into AbfsClient 2. only getPathStatus on rename if enabled 3. resilience handling logs when unable to recover from a dir 4. and when it successfully recovers a file. Change-Id: I58b5f11e4c9b7c1a1d809d2db47a3cafe51f2060
|
tests good except for timeout in all the lease runs not sure what is up there. I had been playing with leases recently but my site settings haven't set any. the timeouts are only 30s, so maybe its just a slow test run. no vpn invoved though |
|
and yes, a test run on its own works |
|
reviews encouraged from people, especially @sreeb-msft @snvijaya @saxenapranav @hiteshs @mukund-thakur @mehakmeet @taklwu |
|
💔 -1 overall
This message was automatically generated. |
mukund-thakur
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some minor comments.
| this.statistics = fs.statistics; | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cut extra lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental edit; will revert
| final boolean recovered = result.getStatusCode() == HttpURLConnection.HTTP_OK | ||
| && sourceEtag.equals(extractEtagHeader(result)); | ||
| } catch (AzureBlobFileSystemException ignored) { | ||
| LOG.info("File rename has taken place: recovery completed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this log statement is incorrect. It will depend on the value of the recovered flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| // Server has returned HTTP 404, which means rename source no longer | ||
| // exists. Check on destination status and if its etag matches | ||
| // that of the source, consider it to be a success. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these comments not valid here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will cut back
| && (op.getResult().getStatusCode() == HttpURLConnection.HTTP_NOT_FOUND) | ||
| && isNotEmpty(sourceEtag)) { | ||
|
|
||
| if (!(op.isARetriedRequest()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here.
| // etag passed in, so source is a file | ||
| final boolean hasEtag = isEmpty(sourceEtag); | ||
| boolean isDir = !hasEtag; | ||
| if (!hasEtag && renameResilience) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sourceEtag should be passed down only for HNS account as eTag does not remain the same when its FNS (as rename = copy to destination and delete source).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but AFAIK there's no way to check in the client whether it is available.
it can only get passed in through the manifest committer, and as ABFS.createResilientCommitSupport() requires etag preservation, it won't be doing it on a non-HNS store.
| /** | ||
| * Enable resilient rename. | ||
| */ | ||
| private boolean renameResilience; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: lets have it final.
| import org.apache.hadoop.fs.EtagSource; | ||
| import org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation; | ||
| import org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imports not required.
| // not an error | ||
| return false; | ||
| } | ||
| LOG.debug("Source not found on retry of rename({}, {}) isDir {} etag {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can come even if op was retried and statusCode is not equal to 404
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing code only handles 404. Are you saying there are other errors we should look for, such as some 500+ value? if so: which ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function gets invoked for any AbfsBlobFileSystemException whether it be because of 404 or any other: https://github.com/steveloughran/hadoop/blob/70a8d0fb3956b2a0e2c343373475ef7a0e75ab08/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java#L607.
now because at line 668, it is because :
op.isARetriedRequest() || op.getResult().getStatusCode() != HttpUrlConnection.HTTP_NOT_FOUND
any retried operation will come on this line.
| LOG.debug("Source not found on retry of rename({}, {}) isDir {} etag {}", | ||
| source, destination, isDir, sourceEtag); | ||
| if (isDir) { | ||
| // directory recovery is not supported. | ||
| // log and fail. | ||
| LOG.info("rename directory {} to {} failed; unable to recover", | ||
| source, destination); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this becomes equavalent to:
op.isARetriedRequest() || op.getResult().getStatusCode != HttpURLConnection.HTTP_NOT_FOUND
The older condition:
if ((op.isARetriedRequest())
&& (op.getResult().getStatusCode() == HttpURLConnection.HTTP_NOT_FOUND)
is changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let me copy and paste again. i think i was trying to do the inversion logic but may have stopped partway through.
| source, destination); | ||
| return false; | ||
| } | ||
| if (isNotEmpty(sourceEtag)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per code in renamePath(),
isDir = !isNotEmpty(sourceEtag)
Should we use the same relation in this method to reduce confusion, instead of having a new argument isDir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but I want to log slightly differently. though actually it is confusion as isDir is true if resilience is off. will cut
| // specifying AbfsHttpOperation mock behavior | ||
|
|
||
| // mock object representing the 404 path not found result | ||
| AbfsHttpOperation mockHttp404Op = Mockito.mock(AbfsHttpOperation.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be awesome if we can use server integration in picture. Have added a comment on @sreeb-msft 's pr #5488 (comment). Suggested changes in saxenapranav@5247e12
| final List<AbfsHttpHeader> requestHeaders = createDefaultHeaders(); | ||
|
|
||
| // etag passed in, so source is a file | ||
| final boolean hasEtag = isEmpty(sourceEtag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be
hasEtag = !isEmpty(sourceEtag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh, nice catch. scarily nice. what happens when i code on a weekend
+ new itest, ITestAbfsContractRenameWithoutResilience this disables resilience and so verifies the normal codepath is still good. Change-Id: Ib2663c70afb112c9430043e94d75e9ddf7b3c313
Change-Id: I1db3878ee12ea082e00438781e1ae86af9850ff7
Integration testing all happy; had to do some work to get my auth mechanism work through the tests. Added test for dir handling, and commit renaming working through the failure. First time it's had this test, fwiw Change-Id: I89f7763d03d1a24a1a43361b001bfa5830804bc1
|
running the itests with the new pr and it is interesting because: when resilience is enabled, ITestAbfsFileSystemContractRename fail because of different outcomes from what is defined in the contract XML file. that is actually what is expected according to https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_rename.28Path_src.2C_Path_d.29, so I'm not sure why it doesn't happen today. will investigate |
|
so hdfs returns false if source not found, or dest exists <property>
<name>fs.contract.rename-returns-false-if-dest-exists</name>
<value>true</value>
</property>
<property>
<name>fs.contract.rename-returns-false-if-source-missing</name>
<value>true</value>
</property>abfs does the same. s3a client blows up with meaningful errors <property>
<name>fs.contract.rename-returns-false-if-source-missing</name>
<value>false</value>
</property>
<property>
<name>fs.contract.rename-returns-false-if-dest-exists</name>
<value>false</value>
</property>I'm more in favour of the "blow up" strategy; if you look at almost all uses of rename it is code like: if (!rename(src, dest)) throw new Exception("rename failed we don't know why")pand nobody ever seems to complain about the s3a failure...but then of course rename is broken there for other reasons, so maybe code just avoids it (e.g. committers). |
Change-Id: Iceb0042b2d97725d0864d138d3a522f29fb5c867
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
mehakmeet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, looks good.
| import java.util.concurrent.TimeUnit; | ||
|
|
||
| import org.apache.hadoop.classification.VisibleForTesting; | ||
| import org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import seem to be in the wrong place here.
| HadoopExecutors.newScheduledThreadPool(this.abfsConfiguration.getNumLeaseThreads(), tf)); | ||
| // rename resilience | ||
| renameResilience = abfsConfiguration.getRenameResilience(); | ||
| LOG.debug("Rename resilience is {}",renameResilience); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after ",".
| } | ||
|
|
||
| /** | ||
| * Even with failures, having |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incomplete javadocs?
Description of PR
This is @sreeb-msft's PR of #5488 with
The handling is now exclusively in the AbfsClient class.
How was this patch tested?
abfs test in progress; azure cardiff.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?