HDDS-5401. Add more metrics to ReplicationManager to help monitor replication progress. #2382
Conversation
HDDS-5401. Add more metrics to ReplicationManager to help monitor replication progress.
@ChenSammi please help review this, thanks~

Hi @bshashikant, could you please help review this as Sammi is on vacation this week? Thanks~
() -> metrics.incrNumDeleteCmdsCompleted());

metrics.setInflightReplication(inflightReplication.size());
metrics.setInflightDeletion(inflightDeletion.size());
Can we keep the existing way of getting the value of these two metrics? There are other places where the map size changes, so this would avoid setting the value every time.
Hmmm, I'll investigate whether there is a way; the solution isn't so direct because we have moved the metrics into a separate class.
I can think of two options:

- Make the new Metrics class an inner class of ReplicationManager, so it can access RM's instance variables directly.
- Pass the replicationManager instance to the new Metrics class and add getters for inflight Replication / Deletion, then call those getters when the metrics are requested.

Option 2 is probably better, but there may be other ways to do this too.
I think there may be a small bug here, due to where you are setting the inflight Replication / Deletion gauges. You have set them after removing any completed pending items, but before the container is processed for over / under replication. The last container to be processed may be under-replicated, and hence it would be missed from the metrics until the next run of the RM.
Thanks @sodonnel, the bug you mention is real, so I'll keep the two original metrics behaving as before, and probably take option 2 to implement this.
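For reference, a minimal sketch of option 2, assuming hypothetical getInflightReplicationCount() / getInflightDeletionCount() getters on ReplicationManager (the record and gauge names here are illustrative, not the final API):

```java
import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.lib.Interns;

public class ReplicationManagerMetrics implements MetricsSource {

  private final ReplicationManager replicationManager;

  ReplicationManagerMetrics(ReplicationManager manager) {
    this.replicationManager = manager;
  }

  @Override
  public void getMetrics(MetricsCollector collector, boolean all) {
    // Read the inflight map sizes lazily: RM keeps mutating its maps as
    // before, and the gauges are computed only when metrics are requested.
    collector.addRecord("ReplicationManagerMetrics")
        .addGauge(Interns.info("InflightReplication",
            "Containers with pending replication"),
            replicationManager.getInflightReplicationCount())
        .addGauge(Interns.info("InflightDeletion",
            "Containers with pending deletion"),
            replicationManager.getInflightDeletionCount());
  }
}
```

This keeps the two original gauges consistent with the maps at all times, so nothing needs to call a setter on every map mutation.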
Assert.assertTrue(datanodeCommandHandler.received(
    SCMCommandProto.Type.deleteContainerCommand,
    replicaOne.getDatanodeDetails()));
Assert.assertEquals(currentDeleteCommandCount + 1,
Can we add more tests to check the other newly added metrics?
Sure, I'll try my best to cover more metrics; it may not be straightforward to simulate timeouts, though.
| LOG.info("Sending replicate container command for container {}" + | ||
| " to datanode {} from datanodes {}", | ||
| container.containerID(), datanode, sources); | ||
| metrics.incrNumReplicateCmdsSent(); |
I suggest moving the metrics update to after the command is sent.
Yes, it should be updated after the command is sent.
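For illustration, the suggested ordering might look like this, where sendCommand and replicateCommand are hypothetical stand-ins for however the command is actually dispatched:

```java
LOG.info("Sending replicate container command for container {}" +
    " to datanode {} from datanodes {}",
    container.containerID(), datanode, sources);
sendCommand(replicateCommand, datanode);  // dispatch the command first
metrics.incrNumReplicateCmdsSent();       // then count it as sent
```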
LOG.info("Sending delete container command for container {}" +
    " to datanode {}", container.containerID(), datanode);
metrics.incrNumDeleteCmdsSent();
Same as above.
Thanks @guihecheng for working on this. I just left a few comments. Hi @sodonnel, would you help take a look at this patch? There are two changes to the ReplicationManager metrics: one is the metric name changing from INFLIGHT_REPLICATION to inflightReplication, the other is the metrics source name changing from ReplicationManager to ReplicationManagerMetrics. Though I think the new names are more consistent with the metrics style of the other modules, we would like to have your thoughts here.
private MutableGaugeLong inflightDeletion;

@Metric("Number of replicate commands sent.")
private MutableCounterLong numReplicateCmdsSent;
Generally, I suggest changing Replicate -> Replication and Delete -> Deletion in all related metric names and descriptions.
OK, then I shall also change some existing code to stay consistent with the naming.
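For example, the renamed counter declarations would become (sketch only; the descriptions are illustrative):

```java
@Metric("Number of replication commands sent.")
private MutableCounterLong numReplicationCmdsSent;

@Metric("Number of deletion commands sent.")
private MutableCounterLong numDeletionCmdsSent;
```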
@ChenSammi thanks for mentioning me. I think this change looks generally good aside from the points you already raised. I checked with a few others on the team, and changing the metrics names should be OK. I don't think anything integrated with Ozone depends on these names so far.
Oh, this is good news, thanks again for the confirmation.
Incr after commands sent. Rename: replicate -> replication, delete -> deletion.
assertDeleteScheduled(1);

// Make a timeout
Thread.sleep(2000);
I'd like to avoid these sleeps in the tests. I have created #2425 to add a Clock into ReplicationManager. If we can get that reviewed and committed, then we can rebase this patch and remove the timeouts.
Ah, that's great, I see that it is merged now and I'll do a rebase soon, thanks~
Oh, I noticed that there is another improvement PR #2429; I shall wait for it to be merged and rebase on that.
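Once that Clock is in, a test could advance time instead of sleeping. A rough sketch, where TestClock, its fastForward helper, createReplicationManager, processAll, and the timeout getter are all assumptions based on this discussion rather than the exact API:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import org.junit.Assert;

// TestClock is assumed to be a mutable java.time.Clock used only by tests.
TestClock clock = new TestClock(Instant.now(), ZoneOffset.UTC);
ReplicationManager replicationManager = createReplicationManager(clock);

replicationManager.processAll();       // schedules the delete command
// Jump past the command timeout instead of Thread.sleep(...).
clock.fastForward(Duration.ofMinutes(31).toMillis());
replicationManager.processAll();       // the pending command is now timed out

Assert.assertEquals(1, metrics.getNumDeletionCmdsTimeout());
```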
- Touched some stuff about 'move', mostly the one metric for it.
- Merged a conflict test case for cmd timeout.
- Some cleanups for the test.

Conflicts:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestReplicationManager.java
@sodonnel @ChenSammi updated, thanks~
JacksonYao287 left a comment:
Thanks @guihecheng for this work, please fix the CI first.
Ah, let's retrigger the CI, since these metrics should not fail the MR jobs.
return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
    "SCM Replication manager (closed container replication) related "
        + "metrics",
    new ReplicationManagerMetrics(manager));
Thanks for the work! NIT: if the RM goes start -> stop -> start, ReplicationManagerMetrics#create will be called twice, and thus new ReplicationManagerMetrics(manager) will be called twice; it seems a little confusing to create the ReplicationManagerMetrics twice. I think it is better to use register instead of create for ReplicationManagerMetrics, and to make ReplicationManagerMetrics a singleton.
Oh, thanks for the comments; adopting the singleton pattern seems reasonable at first glance.
But there are things to be made clear:

- When does start -> stop -> start happen, and for what purpose?
- If we adopt a singleton, then there's a behavior change:
  - With a singleton, we have only one metrics object, so after an RM restart (without restarting the daemon, of course) counting continues from the old metrics base. Are those values still consistent with the inflightReplication / inflightDeletion maps in RM?
  - Without a singleton, RM restarts with a freshly created object, with everything initialized to 0.

So what is the desired behavior?
In general I prefer to avoid singletons, as they can make testing difficult, especially around mini-cluster tests.
I think it's reasonable for an RM restart to reset the counts to zero, as it is a new instance. An RM restart should only happen in tests; to restart it in a real cluster, you must restart all of SCM.
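So keeping the non-singleton lifecycle could just mean pairing the create shown in the quoted snippet with an unregister on stop. A sketch, reusing METRICS_SOURCE_NAME from that snippet (the unRegister name and the stop hook are assumptions):

```java
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

public static ReplicationManagerMetrics create(ReplicationManager manager) {
  // Each RM start registers a fresh source, so all counters restart at zero.
  return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
      "SCM Replication manager (closed container replication) related "
          + "metrics",
      new ReplicationManagerMetrics(manager));
}

public void unRegister() {
  // Called from RM stop so that the next start can register again cleanly.
  DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
}
```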
Fix CI k8s suite failure by merging master.
The last patch LGTM, +1.
Thanks @guihecheng for the contribution and @sodonnel @JacksonYao287 for the code review.
What changes were proposed in this pull request?
Add more metrics to ReplicationManager to help monitor replication progress.
Note that the metrics are placed in a new, separate class.
A more detailed description can be found in the JIRA.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-5401
How was this patch tested?
Extended unit tests.
Manual testing.