HDDS-10256. Retry block allocation when SCM is in safe mode. #6189

ashishkumar50 · 2024-02-07T13:05:01Z

What changes were proposed in this pull request?

When SCM is in safe mode during a restart (or rolling restart), block allocation fails. Instead of failing the operation immediately, OM can retry a few times (5 times, once every 3 seconds) to get block info from SCM before failing the operation.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10256

How was this patch tested?

Integration test

sadanand48

Thanks @ashishkumar50 for the patch. Left a few minor comments; otherwise looks good.

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java

jojochuang

This code is executed in the OM's request handler when OM receives the request from the client. Client --> OM --> SCM.

IMO it's probably not a good idea to sleep inside a request handler. If there are many clients, their requests will all get blocked because the thread pool for the request handlers is exhausted, which may cause other requests to fail.

Would it make more sense to retry at client side?
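The concern above can be illustrated with a small, self-contained sketch. It is not Ozone code: `HandlerStarvationSketch`, `fastRequestStarved`, and the pool size are hypothetical names and values chosen for illustration, and a `CountDownLatch` stands in for handlers that are sleeping while they wait for SCM to leave safe mode. It only shows that once every thread in a fixed-size pool is waiting, an unrelated fast request cannot start.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names come from Ozone.
final class HandlerStarvationSketch {

  /**
   * Fills a fixed-size "handler" pool with waiting tasks, then submits one
   * fast task. Returns true if the fast task could not finish within
   * probeMillis because every handler thread was busy waiting.
   */
  static boolean fastRequestStarved(int handlerThreads, long probeMillis)
      throws Exception {
    ExecutorService handlers = Executors.newFixedThreadPool(handlerThreads);
    // Stand-in for "SCM has left safe mode".
    CountDownLatch scmLeavesSafeMode = new CountDownLatch(1);
    try {
      // Each of these simulates a handler that waits inside the request.
      for (int i = 0; i < handlerThreads; i++) {
        handlers.submit(() -> {
          scmLeavesSafeMode.await();
          return null;
        });
      }
      // An unrelated request that would normally return immediately.
      Future<String> fast = handlers.submit(() -> "ok");
      boolean starved;
      try {
        fast.get(probeMillis, TimeUnit.MILLISECONDS);
        starved = false;
      } catch (TimeoutException e) {
        starved = true;
      }
      // Release the waiting handlers so the fast request can finally run.
      scmLeavesSafeMode.countDown();
      fast.get();
      return starved;
    } finally {
      scmLeavesSafeMode.countDown();
      handlers.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("fast request starved: " + fastRequestStarved(2, 200L));
  }
}
```

With two handler threads both occupied, the third submission sits in the queue until the latch is released, which is the failure mode described above.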

ashishkumar50 · 2024-02-09T07:11:34Z

Yes, that makes sense. Moved the retry logic to the client side.
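A client-side retry of this kind can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `SafeModeRetrySketch`, `ScmInSafeModeException`, `IoCall`, and `callWithRetry` are hypothetical names, and the exception type merely stands in for whatever error the client receives while SCM is in safe mode. The 5 retries and 3-second wait mentioned in the PR description would be passed in as parameters.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these names come from the patch.
final class SafeModeRetrySketch {

  /** Stand-in for the error a client sees while SCM is in safe mode. */
  static final class ScmInSafeModeException extends IOException {
    ScmInSafeModeException(String message) {
      super(message);
    }
  }

  /** A call that may fail with an IOException, e.g. a block allocation. */
  interface IoCall<T> {
    T call() throws IOException;
  }

  /**
   * Runs the call, retrying up to maxRetries times with a fixed wait while
   * it keeps failing with ScmInSafeModeException. Any other IOException is
   * propagated immediately, and the last safe-mode failure is rethrown once
   * the retries are used up.
   */
  static <T> T callWithRetry(IoCall<T> call, int maxRetries, long waitMillis)
      throws IOException, InterruptedException {
    int retries = 0;
    while (true) {
      try {
        return call.call();
      } catch (ScmInSafeModeException e) {
        if (retries >= maxRetries) {
          throw e;
        }
        retries++;
        // Tell the user why the command appears to hang.
        System.err.println("SCM is in safe mode; retry " + retries + " of "
            + maxRetries + " in " + waitMillis + " ms");
        Thread.sleep(waitMillis);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] attempts = {0};
    // Simulated call: fails twice, then succeeds. Short wait for the demo.
    String result = callWithRetry(() -> {
      if (++attempts[0] < 3) {
        throw new ScmInSafeModeException("SCM is in safe mode");
      }
      return "block allocated";
    }, 5, 10L);
    System.out.println(result + " after " + attempts[0] + " attempts");
  }
}
```

Because the wait happens in the caller's own thread, a slow SCM restart delays only that client and leaves the OM request handlers free for other requests.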

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java

smengcl

Thanks @ashishkumar50. Overall LGTM.

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

adoroszlai · 2024-02-09T09:24:22Z

Is this change specific to the new feature, i.e. does it need to go into the feature branch instead of master?

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

ashishkumar50 · 2024-02-09T10:43:28Z

It's not related to the new feature, but the issue was seen during HBase testing and is impacting HBase.
@jojochuang, can you please help confirm whether it can be merged into master?

jojochuang · 2024-02-09T18:57:50Z

IMO it is specific to HBase; I don't think we've experienced this problem with other applications. I'm fine with forward-porting it to the master branch (assuming no code conflicts).

jojochuang

Looks good. This should be enough to deal with transient issues. Applications like HBase should have their own retry mechanism if they want to deal with longer downtime.

For future reference: the HDFS client would retry 10 times in a row, non-stop, before aborting. So I think Ozone is more resilient than HDFS in this regard.

HDDS-10256. Block allocation retry when SCM is in safe mode.

33b4661

jojochuang force-pushed the HDDS-7593 branch from 76a573a to 1c20d84 on February 8, 2024 06:22

Merge pull request #10 from ashishkumar50/HDDS-7593

5c900a1

Rebase

sadanand48 reviewed Feb 8, 2024

Fix review comments

1925293

jojochuang requested a review from smengcl February 8, 2024 18:01

jojochuang added the hbase (HBase on Ozone support) label Feb 8, 2024

jojochuang requested a review from ChenSammi February 8, 2024 18:01

jojochuang reviewed Feb 8, 2024

Move retry logic to client

c1ba765

smengcl reviewed Feb 9, 2024

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java

Optimize test case

5f18f97

smengcl reviewed Feb 9, 2024

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

Add CLI message to show wait reason

9336828

smengcl reviewed Feb 9, 2024

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

Update CLI message

643789d

smengcl approved these changes Feb 9, 2024

jojochuang approved these changes Feb 10, 2024

jojochuang merged commit 370b9d7 into apache:HDDS-7593 Feb 10, 2024

jojochuang pushed a commit that referenced this pull request Feb 10, 2024

HDDS-10256. Retry block allocation when SCM is in safe mode. (#6189)

7c79246

Co-authored-by: ashishk <[email protected]> (cherry picked from commit 370b9d7)

smengcl pushed a commit to smengcl/hadoop-ozone that referenced this pull request Mar 6, 2024

HDDS-10256. Retry block allocation when SCM is in safe mode. (apache#…

ef445a4

…6189) Co-authored-by: ashishk <[email protected]>

ivandika3 pushed a commit to ivandika3/ozone that referenced this pull request Oct 12, 2025

HDDS-10256. Retry block allocation when SCM is in safe mode. (apache#…

f415534

…6189)
