HDDS-7721. Make OM Ratis roles available in /prom endpoint #4140
Conversation
|
Do you want to include |
|
@kerneltime Thanks for the suggestion, I included the config for OM and SCM. |
|
Question: Why the need to format the metric as Would it make sense to break it up into individual values so it can be charted? It could be |
|
@kerneltime Thanks for looking into this.
I wanted to avoid duplicating code, so I got the same string we are using in We can either split it up in new tags or have a Map and then use that for the tag.
We want to track the leader and if we are presenting info only for the current node, then we would have to go over all of them just to find the leader. If we want to simplify it, we could have a tag with only the info for the leader and skip all the followers. Do you have any suggestions about the format? This fix is based on this discussion and #3791. |
|
@kerneltime I updated the patch with the changes you requested. Now |
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
|
@neils-dev Thanks for the review! I have addressed your comments. |
|
Thanks @xBis7. I took a look at the prom It would be much cleaner to keep the tags to a min and just use the gauge to reflect the leader and changes to the leader. This can be seen when failover is rendered on prometheus with simplified tags, two traces this time, see - https://github.com/neils-dev/play/blob/main/images/failovers_just_gauge.png. Failover is the criss-cross when rendered. |
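The "criss-cross" on failover described above can be illustrated with a small sketch. It assumes one gauge per OM node (1 for the leader, 0 otherwise) and no extra tags; the class name `FailoverGaugeSketch`, the method `sample`, and the node ids `om1`/`om2` are hypothetical and not taken from the patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the OMHAMetrics code from this PR.
final class FailoverGaugeSketch {

  // Returns one gauge sample per node: 1 if the node is the leader, else 0.
  static Map<String, Integer> sample(List<String> nodeIds, String leaderId) {
    Map<String, Integer> gauges = new LinkedHashMap<>();
    for (String id : nodeIds) {
      gauges.put(id, id.equals(leaderId) ? 1 : 0);
    }
    return gauges;
  }

  public static void main(String[] args) {
    List<String> nodes = List.of("om1", "om2");
    // Before failover: om1 leads.
    System.out.println(sample(nodes, "om1")); // {om1=1, om2=0}
    // After failover: om2 leads, so the two traces swap values.
    System.out.println(sample(nodes, "om2")); // {om1=0, om2=1}
  }
}
```

Charting the two series over time would show them crossing at the moment the leader changes, which is the pattern visible in the linked screenshot.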
|
@neils-dev I've tested it locally and I can see what you are referring to. I will remove |
|
Hey @xBis7 ! Seemed like some test was failing, please help take a look! Thanks~ |
|
Hey @DaveTeng0, the failure seems unrelated. Check here, I had a green build on the workflow of my fork. A lot of tests are flaky but I don't have write privileges to rerun the failed ones on the PR workflow. |
adoroszlai
left a comment
Thanks @xBis7 for working on this.
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
|
@adoroszlai I've updated the patch, with the changes you requested. Can you please take another look? |
adoroszlai
left a comment
Thanks @xBis7 for updating the patch, LGTM.
If I understand correctly, @neils-dev's comment was also addressed previously in e836659.
@adoroszlai Yes, it has been addressed. |
|
Thanks @xBis7 , @adoroszlai, I'm going to take another look. |
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
|
@neils-dev I've addressed your comments. Can you please take another look? |
|
Once @neils-dev takes a final look, this PR would be ready to be merged! |
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ha/OMHAMetrics.java
|
@neils-dev Double checking the I hadn't thought about the case where ratis is disabled but I looked into it and tested it. Since, in that scenario there are no leaders or followers, |
|
@neils-dev I've added some unit tests, so that we can check the case where we provide an empty |
neils-dev
left a comment
Left a few comments. Thanks. Please remove the file TestOzoneFsSnapshot.java, which was unintentionally added in your latest commit.
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/TestOzoneFsSnapshot.java
Thanks for updating that. |
|
@neils-dev Thanks for the reviews, I've updated the patch. Let me know how it looks. |
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
|
@neils-dev I've updated the patch to address the latest comment. |
neils-dev
left a comment
Thanks @xBis7 . Looks good. Look to merge pending a green CI.
|
Thanks @xBis7 for the patch, @kerneltime, @neils-dev for the review. |
RaftPeer leader = null;
try {
  leader = omRatisServer.getLeader();
} catch (IOException ex) {
  LOG.error("IOException while getting the " +
      "Ratis server leader.", ex);
}
if (Objects.nonNull(leader)) {
  String leaderId = leader.getId().toString();

  // If leaderId is empty, then leader is undefined
  // and current OM is neither leader nor follower.
  // OMHAMetrics shouldn't be registered in that case.
  if (!Strings.isNullOrEmpty(leaderId)) {
I had to update this part of the code after merging the PR due to a conflicting change that had been just merged (HDDS-6743, getLeader() no longer throws IOException).
Upon closer inspection, I think these if-else blocks have a problem. Please correct me if I'm wrong, but if leader is undefined then leader will be null, so the unregistration will not happen. (This seems to have been the case even before HDDS-6743.) If we have any leader information, its id cannot be null.
I think it should be:
if (Objects.nonNull(leader)) {
  // init
} else {
  // unregister
}
@adoroszlai You are right. This was missed because the code was
String leaderId = "";
try{
leader =
} catch() {
}
if (Objects.nonNull(leader)) {
}
// leaderId could be deliberately left empty down here due to failure to get the leader
// after refactoring `String leaderId = "";` was removed.
If we have any leader information, its id cannot be null.
I didn't know that.
How can we handle this now since the code has been merged?
Thanks @xBis7 for checking.
If we have any leader information, its id cannot be null.
I didn't know that.
How can we handle this now since the code has been merged?
@adoroszlai Thanks! I'll create a patch shortly for HDDS-8009.
Thanks @adoroszlai for merging and uncovering this change. With getLeader() no longer throwing an exception, we can clean up the null check and the handling of that condition.
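The if/else proposed in the review thread above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the merged code or the HDDS-8009 fix: `LeaderMetricsSketch`, `update`, and the `registered` flag are hypothetical stand-ins for the real metrics registration, and a plain `String` stands in for the Ratis leader object.

```java
import java.util.Objects;

// Hypothetical stand-in for the register/unregister decision; not Ozone code.
final class LeaderMetricsSketch {
  private boolean registered;
  private String leaderId = "";

  // currentLeaderId is null when no leader is known.
  void update(String currentLeaderId) {
    if (Objects.nonNull(currentLeaderId)) {
      // init: a known leader always has an id, so no empty-string check is needed.
      leaderId = currentLeaderId;
      registered = true;
    } else {
      // unregister: reached whenever the leader is undefined.
      leaderId = "";
      registered = false;
    }
  }

  boolean isRegistered() {
    return registered;
  }

  String getLeaderId() {
    return leaderId;
  }

  public static void main(String[] args) {
    LeaderMetricsSketch metrics = new LeaderMetricsSketch();
    metrics.update("om1");
    System.out.println(metrics.isRegistered()); // true
    metrics.update(null);
    System.out.println(metrics.isRegistered()); // false
  }
}
```

The point of the sketch is that the unregister branch hangs off the null check itself, so it cannot be skipped when the leader is undefined.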
What changes were proposed in this pull request?
Ratis roles are created as part of the OMMXBean class, which makes them available only for /jmx. This information should be available for both the /jmx and /prom endpoints of the OM.
In order to make the info available for the /prom endpoint, this patch adds a new metrics class that holds info about the component registering it, showing the user whether the current OM is a leader or a follower. The info is presented with a gauge (1 for leader, 0 for follower) so that it can be charted.
There will be a follow-up PR with a similar approach for the SCM.
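As a rough illustration of the gauge described above, the sketch below maps a node's role to 1 or 0 and renders it as a line in the Prometheus text exposition format. The metric name `om_ha_leader_state`, the label `node_id`, and the class `OmRoleGaugeSketch` are assumptions made for this example; the names actually emitted by the patch may differ.

```java
// Hypothetical sketch; does not use the Hadoop metrics2 classes from the patch.
final class OmRoleGaugeSketch {

  // 1 if this node is the leader, 0 if it is a follower (or no leader is known).
  static int roleGauge(String nodeId, String leaderId) {
    return nodeId.equals(leaderId) ? 1 : 0;
  }

  // Renders the gauge as a single Prometheus text-format sample.
  static String promLine(String nodeId, String leaderId) {
    return String.format("om_ha_leader_state{node_id=\"%s\"} %d",
        nodeId, roleGauge(nodeId, leaderId));
  }

  public static void main(String[] args) {
    System.out.println(promLine("om1", "om1")); // om_ha_leader_state{node_id="om1"} 1
    System.out.println(promLine("om2", "om1")); // om_ha_leader_state{node_id="om2"} 0
  }
}
```

A numeric value, rather than a role encoded in a string tag, is what allows the series to be plotted directly.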
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7721
How was this patch tested?
A new test was added under hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java.
This patch was also tested manually in docker clusters for both HA and non-HA like so
in /hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone
in /hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone-ha