HDDS-10256. Retry block allocation when SCM is in safe mode. #6189
sadanand48
left a comment
Thanks @ashishkumar50 for the patch. Left a few minor comments; otherwise looks good.
...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java
jojochuang
left a comment
This code is executed in the OM's request handler when OM receives the request from the client (Client --> OM --> SCM).
IMO it's probably not a good idea to sleep inside a request handler: if there are many clients, their requests will all block once the request handler thread pool is exhausted, which may cause other requests to fail.
Would it make more sense to retry on the client side?
Yes, makes sense. Moved the retry logic to the client side.
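The concern about sleeping inside a request handler can be illustrated with a small, self-contained sketch. It is not Ozone code: `HandlerPoolDemo` and its method are hypothetical names, a plain `ExecutorService` stands in for the OM handler pool, and a `CountDownLatch` stands in for the retry sleep so the behaviour is deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo; a fixed pool stands in for the OM request handler threads. */
final class HandlerPoolDemo {
    private HandlerPoolDemo() { }

    /**
     * Returns true if an unrelated request cannot complete while every
     * handler thread is parked in a simulated retry wait.
     */
    static boolean unrelatedRequestIsStarved(int handlerThreads) {
        ExecutorService handlers = Executors.newFixedThreadPool(handlerThreads);
        // Stands in for "sleep until the next retry"; released in the finally block.
        CountDownLatch retryWaitOver = new CountDownLatch(1);
        try {
            // Occupy every handler thread with a request that waits to retry.
            for (int i = 0; i < handlerThreads; i++) {
                handlers.submit(() -> {
                    retryWaitOver.await();
                    return "allocated";
                });
            }
            // A request that needs no retry at all still has to queue behind them.
            Future<String> unrelated = handlers.submit(() -> "ok");
            try {
                unrelated.get(200, TimeUnit.MILLISECONDS);
                return false; // a handler thread was free
            } catch (TimeoutException e) {
                return true; // no handler thread picked the request up in time
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            retryWaitOver.countDown(); // let the parked handlers finish
            handlers.shutdown();
        }
    }
}
```

With the wait moved to the client, each handler thread returns as soon as SCM rejects the allocation, so only the caller that chose to retry pays for the delay.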
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java
smengcl
left a comment
Thanks @ashishkumar50. Overall LGTM.
...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java
Is this change specific to the new feature, i.e. does it need to go into the feature branch instead of
It's not related to the new feature, but the issue was seen during HBase testing and is impacting HBase.
IMO it is specific to HBase; I don't think we've experienced this problem with other applications. I'm fine with forward-porting it to the master branch (assuming no code conflicts).
jojochuang
left a comment
Looks good. This should be enough to deal with transient issues. Applications like HBase should have their own retry mechanism if they want to deal with longer downtime.
For future reference: the HDFS client would retry 10 times in a row, non-stop, before aborting. So I think Ozone is more resilient than HDFS in this regard.
Co-authored-by: ashishk <[email protected]> (cherry picked from commit 370b9d7)
What changes were proposed in this pull request?
When SCM is in safe mode during a restart (or rolling restart), block allocation fails. Instead of failing the operation immediately, OM can retry a few times (5 times, every 3 seconds) to get block info from SCM before failing the operation.
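The retry behaviour described above (a bounded number of attempts with a fixed pause between them) might be sketched as follows. This is an illustrative sketch only, not the code from the patch: `SafeModeException` and `FixedDelayRetry` are hypothetical names, and the review discussion indicates the actual retry ended up on the client side rather than in OM.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the error reported while SCM is in safe mode. */
class SafeModeException extends RuntimeException {
    SafeModeException(String message) {
        super(message);
    }
}

/** Hypothetical helper: bounded retries with a fixed delay between attempts. */
final class FixedDelayRetry {
    private FixedDelayRetry() { }

    /**
     * Runs op up to maxAttempts times, sleeping delayMillis between attempts.
     * Only SafeModeException triggers a retry; any other exception propagates
     * immediately. The last SafeModeException is rethrown once attempts run out.
     */
    static <T> T call(Supplier<T> op, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SafeModeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (SafeModeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw last;
    }
}
```

Under these assumptions, the policy in the description would correspond to something like `FixedDelayRetry.call(() -> allocateBlock(), 5, 3000L)`, where `allocateBlock` is a placeholder for whatever call reaches SCM.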
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10256
How was this patch tested?
Integration test