@devmadhuu
Contributor

What changes were proposed in this pull request?

This PR makes Recon start and restart more gracefully when failures occur during startup or restart. The Ozone Recon component has a few code paths where exception handling needs improvement to avoid abrupt crashes and shutdowns of Recon.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11231

How was this patch tested?

The patch was tested manually and with the existing test cases.

@devmadhuu marked this pull request as ready for review on July 25, 2024, 10:34
@devmadhuu
Contributor Author

@dombizita @ArafatKhan2198 Kindly review.

return true;
}

private void printFileAndKeyTableCount() throws IOException {

Contributor

This method handles fileTable and keyTable differently, and the reason is not obvious. We return if keyTable is null even though fileTable might not be. Does this make business-logic assumptions about OM? Why not write this so that it does not assume which table can or cannot be null?
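A minimal sketch of the symmetric handling suggested here, not the code from this PR. The class `TableCountSketch`, the helpers `describe` and `describeAll`, and the use of `LongSupplier` as a stand-in for a table are all hypothetical names chosen for illustration. The only idea taken from the comment is that neither table gets special treatment when it is null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; the real Recon tables are not modelled here. */
final class TableCountSketch {

  /** Builds one report line; a null table is reported, not treated as a reason to stop. */
  static String describe(String name, LongSupplier rowCount) {
    if (rowCount == null) {
      return name + ": not available";
    }
    return name + ": " + rowCount.getAsLong() + " rows";
  }

  /** Applies the same null check to every table, in insertion order. */
  static List<String> describeAll(Map<String, LongSupplier> tables) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, LongSupplier> entry : tables.entrySet()) {
      lines.add(describe(entry.getKey(), entry.getValue()));
    }
    return lines;
  }

  public static void main(String[] args) {
    // LinkedHashMap allows null values, which stand in for a missing table.
    Map<String, LongSupplier> tables = new LinkedHashMap<>();
    tables.put("fileTable", () -> 42L);
    tables.put("keyTable", null);
    // Prints "fileTable: 42 rows" and then "keyTable: not available".
    describeAll(tables).forEach(System.out::println);
  }
}
```

Because both tables go through the same helper, a null keyTable no longer prevents the fileTable count from being reported, and the reverse also holds.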

Contributor Author

Oh, thanks for pointing that out, @kerneltime. I think I was debugging something and missed handling null for both tables. I have corrected it now. Thanks.

Contributor Author

@kerneltime please re-review.

@adoroszlai changed the title from "HDDS-11231. Ozone Recon - Make Recon restart more resilient and handle restart or start failures." to "HDDS-11231. Make Recon start more resilient" on Jul 26, 2024
Contributor

@ArafatKhan2198 left a comment

Thanks for the patch, @devmadhuu. Could you please take a look at my comments?

Contributor

@ArafatKhan2198 left a comment

Thanks for the patch, @devmadhuu.
LGTM!

Contributor

@dombizita left a comment

Thanks for working on this and addressing the review comments @devmadhuu, looks good to me!

LOG.error("Failed fetching a full snapshot from Ozone Manager");
}
} catch (IOException e) {
throw new RuntimeException(e);

Contributor

Question: Why do we want to mask IOException and send a RuntimeException?

Contributor Author

Question: Why do we want to mask IOException and send a RuntimeException?

This is a special corner case. When the Recon OM DB snapshot was corrupted by an unknown failure, the OMMetaManager service failed to start and threw a RuntimeException. We handle that case by falling back to a full snapshot. If the full snapshot fetch also fails with any exception, including IOException, we still cannot bring the OMMetaManager service up, so we throw a RuntimeException, just as the parent catch block did before.
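The fallback described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `SnapshotFallbackSketch`, `SnapshotSource`, `startWithFallback`, and the returned status strings are hypothetical, and what the real code does after logging a failed fetch is not visible in this thread.

```java
import java.io.IOException;

/** Hypothetical illustration of "fall back to a full snapshot, else rethrow". */
final class SnapshotFallbackSketch {

  /** Hypothetical stand-in for whatever fetches a full snapshot from Ozone Manager. */
  interface SnapshotSource {
    boolean fetchFullSnapshot() throws IOException;
  }

  static String startWithFallback(Runnable startFromLocalDb, SnapshotSource source) {
    try {
      startFromLocalDb.run();
      return "started from local DB";
    } catch (RuntimeException startFailure) {
      // The local DB could not be used, so try a full snapshot instead.
      try {
        if (source.fetchFullSnapshot()) {
          return "recovered via full snapshot";
        }
        // Assumption: report the failure and return a status instead of crashing.
        System.err.println("Failed fetching a full snapshot from Ozone Manager");
        return "not started";
      } catch (IOException e) {
        // The service is still down: surface an unchecked exception, as the
        // outer failure did, keeping the IOException as the cause.
        RuntimeException wrapped =
            new RuntimeException("Full snapshot fallback failed", e);
        wrapped.addSuppressed(startFailure);
        throw wrapped;
      }
    }
  }

  public static void main(String[] args) {
    String outcome = startWithFallback(
        () -> { throw new IllegalStateException("corrupt local DB"); },
        () -> true);
    System.out.println(outcome); // prints "recovered via full snapshot"
  }
}
```

Passing the IOException as the cause and attaching the original startup failure as a suppressed exception keeps both failures visible in the stack trace, which speaks to the reviewer's concern about masking the IOException.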

Contributor Author

I have now logged the IOException as an error. Please re-review.

@ArafatKhan2198
Contributor

Thanks for working on this @devmadhuu & thanks @kerneltime @dombizita for the review.

@ArafatKhan2198 merged commit 9533066 into apache:master on Jul 31, 2024
devabhishekpal pushed a commit to devabhishekpal/ozone that referenced this pull request on Aug 8, 2024