-
Notifications
You must be signed in to change notification settings - Fork 592
HDDS-11231. Make Recon start more resilient #6987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…e restart or start failures.
…e restart or start failures.
|
@dombizita @ArafatKhan2198 Kindly review. |
| return true; | ||
| } | ||
|
|
||
| private void printFileAndKeyTableCount() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has different handling for fileTable and keyTable, the reason for this is not obvious. We return if keyTable is null even if fileTable might not be. This is making business logic assumptions for OM? Why not write this in a manner that does not assume which table can or cannot be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks for pointing out @kerneltime , I think I was debugging something and missed to handle null for both. I have corrected now. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kerneltime pls re-review.
ArafatKhan2198
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch @devmadhuu Can you please find the comments given by me.
...on/src/main/java/org/apache/hadoop/ozone/recon/spi/impl/OzoneManagerServiceProviderImpl.java
Show resolved
Hide resolved
...on/src/main/java/org/apache/hadoop/ozone/recon/spi/impl/OzoneManagerServiceProviderImpl.java
Outdated
Show resolved
Hide resolved
...on/src/main/java/org/apache/hadoop/ozone/recon/spi/impl/OzoneManagerServiceProviderImpl.java
Outdated
Show resolved
Hide resolved
ArafatKhan2198
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch @devmadhuu
LGTM !
...on/src/main/java/org/apache/hadoop/ozone/recon/spi/impl/OzoneManagerServiceProviderImpl.java
Outdated
Show resolved
Hide resolved
...on/src/main/java/org/apache/hadoop/ozone/recon/spi/impl/OzoneManagerServiceProviderImpl.java
Outdated
Show resolved
Hide resolved
dombizita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this and addressing the review comments @devmadhuu, looks good to me!
| LOG.error("Failed fetching a full snapshot from Ozone Manager"); | ||
| } | ||
| } catch (IOException e) { | ||
| throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why do we want to mask IOException and send a RuntimeException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why do we want to mask
IOExceptionand send aRuntimeException?
Here this is a special corner case we would like to handle, so we know that when Recon OM DB snapshot was getting corrupted for unknown failures (RuntimeException) and OMMetaManager service was failing to start with RuntimeException thrown, then we would like to handle by falling back to full snapshot and if full snapshot fetch also failed due to any exception including IOException, we know that we are still not able to bring the OMMetaManager service up, so want to throw RuntimeException as was thrown earlier in its parent catch block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have logged IOException as error now. Pls re-review.
|
Thanks for working on this @devmadhuu & thanks @kerneltime @dombizita for the review. |
What changes were proposed in this pull request?
This PR change is to improve Recon starts and restarts more gracefully in case of failures during start or restart. Ozone Recon component has few paths where exception handling needs improvement in order to avoid abrupt crash and shutdown of Recon.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11231
How was this patch tested?
Patch is tested manually and with existing test cases.