@ashishkumar50
Contributor

What changes were proposed in this pull request?

When SCM is in safe mode during a restart (or rolling restart), block allocation fails. Instead of failing the operation immediately, OM can retry a few times (5 times, once every 3 seconds) to get block info from SCM before failing the operation.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10256

How was this patch tested?

Integration test

Contributor

@sadanand48 left a comment

Thanks @ashishkumar50 for the patch. Left a few minor comments; otherwise looks good.

@jojochuang requested a review from smengcl February 8, 2024 18:01
@jojochuang added the "hbase" (HBase on Ozone support) label Feb 8, 2024
@jojochuang requested a review from ChenSammi February 8, 2024 18:01
Contributor

@jojochuang left a comment

This code is executed in the OM's request handler when OM receives the request from the client. Client --> OM --> SCM.

IMO it's probably not a good idea to sleep inside a request handler: if there are many clients, their requests will all be blocked once the request handler thread pool is exhausted, which may cause other requests to fail.

Would it make more sense to retry at client side?
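
To make the concern concrete, here is a minimal sketch of handler starvation. It is not OM code: `HandlerPoolDemo`, `quickRequestStarved`, and the latch that stands in for "SCM leaves safe mode" are all hypothetical names, and a plain fixed-size `ExecutorService` is assumed as a stand-in for the request handler pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only models a fixed-size handler pool.
final class HandlerPoolDemo {
    private HandlerPoolDemo() { }

    /** Returns true if an unrelated quick request was stuck behind waiting handlers. */
    static boolean quickRequestStarved(int handlerThreads) throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(handlerThreads);
        // Stand-in for the moment SCM leaves safe mode.
        CountDownLatch scmLeavesSafeMode = new CountDownLatch(1);
        try {
            // Occupy every handler with a request that waits in place,
            // as a sleep-and-retry loop inside the handler would.
            for (int i = 0; i < handlerThreads; i++) {
                handlers.submit(() -> {
                    scmLeavesSafeMode.await();
                    return null;
                });
            }
            // An unrelated request that needs no SCM call at all.
            Future<String> quick = handlers.submit(() -> "ok");
            boolean starved;
            try {
                quick.get(200, TimeUnit.MILLISECONDS);
                starved = false;
            } catch (TimeoutException e) {
                starved = true; // no free handler thread picked it up in time
            }
            // Free the handlers; the queued request can now run.
            scmLeavesSafeMode.countDown();
            quick.get(5, TimeUnit.SECONDS);
            return starved;
        } finally {
            handlers.shutdownNow();
        }
    }
}
```

With every handler thread waiting, the quick request sits in the queue until a handler is released, even though it never touches SCM.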

@ashishkumar50
Contributor Author

> This code is executed in the OM's request handler when OM receives the request from the client. Client --> OM --> SCM.
>
> IMO it's probably not a good idea to sleep inside a request handler: if there are many clients, their requests will all be blocked once the request handler thread pool is exhausted, which may cause other requests to fail.
>
> Would it make more sense to retry at client side?

Yes, makes sense. Moved the retry logic to the client side.
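
For readers following along, a client-side retry with a fixed delay could look roughly like the sketch below. This is not the code from the patch: `ScmInSafeModeException`, `SafeModeRetry`, and `callWithRetry` are hypothetical names, and the only values taken from this PR are the budget of 5 retries at 3-second intervals, which gives SCM roughly 15 seconds to leave safe mode.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical exception standing in for "SCM is in safe mode". */
class ScmInSafeModeException extends IOException {
    ScmInSafeModeException(String message) {
        super(message);
    }
}

/** Hypothetical helper; the waiting happens on the client thread, not in an OM handler. */
final class SafeModeRetry {
    private SafeModeRetry() { }

    /**
     * Runs the call and retries only on ScmInSafeModeException.
     * maxRetries counts the additional attempts made after the first one.
     */
    static <T> T callWithRetry(Callable<T> call, int maxRetries, long delayMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return call.call();
            } catch (ScmInSafeModeException e) {
                if (retries >= maxRetries) {
                    throw e; // budget exhausted: surface the last failure
                }
                retries++;
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw e;
                }
            }
        }
    }
}
```

A caller would wrap the block allocation request, for example `SafeModeRetry.callWithRetry(() -> allocate(), 5, 3000L)`, where `allocate()` is a placeholder for whatever client call reaches OM. Any other exception type propagates immediately without a retry.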

Contributor

@smengcl left a comment

Thanks @ashishkumar50. Overall LGTM.

@adoroszlai
Contributor

Is this change specific to the new feature, i.e. does it need to go into the feature branch instead of master?

@ashishkumar50
Contributor Author

> Is this change specific to the new feature, i.e. does it need to go into the feature branch instead of master?

It's not related to the new feature, but the issue was seen during HBase testing and is impacting HBase.
@jojochuang, can you please help confirm whether it can be merged into master?

@jojochuang
Contributor

IMO it is specific to HBase; I don't think we've experienced this problem with other applications. I'm fine with forward-porting it to the master branch (assuming no code conflicts).

Contributor

@jojochuang left a comment

Looks good. This should be enough to deal with transient issues. Applications like HBase should have their own retry mechanism if they want to deal with longer downtime.

For future reference: the HDFS client would retry 10 times in a row, non-stop, before aborting. So I think Ozone is more resilient than HDFS in this regard.

@jojochuang merged commit 370b9d7 into apache:HDDS-7593 Feb 10, 2024
jojochuang pushed a commit that referenced this pull request Feb 10, 2024
smengcl pushed a commit to smengcl/hadoop-ozone that referenced this pull request Mar 6, 2024
ivandika3 pushed a commit to ivandika3/ozone that referenced this pull request Oct 12, 2025