HDDS-5401. Add more metrics to ReplicationManager to help monitor replication progress. #2382
Conversation
HDDS-5401. Add more metrics to ReplicationManager to help monitor replication progress.
@ChenSammi please help review this, thanks~

Hi @bshashikant, could you please help review this as Sammi is on vacation this week? Thanks~
() -> metrics.incrNumDeleteCmdsCompleted());

metrics.setInflightReplication(inflightReplication.size());
metrics.setInflightDeletion(inflightDeletion.size());
Can we keep the existing way of getting the value of these two metrics? There are other places where the map size changes, so this would avoid setting the value every time.
Hmmm, I'll investigate whether there is a way; the solution isn't so direct because we have moved the metrics into a separate class.
I can think of two options:

- Make the new Metrics class an inner class of ReplicationManager, so it can access RM's instance variables directly.
- Pass the replicationManager instance to the new Metrics class and add getters for inflight Replication / Deletion, then call those getters when the metrics are requested.

Option 2 is probably better, but there may be other ways to do this too.
I think there may be a small bug here, due to where you are setting the inflight Replication / Deletion gauges. You have set them after removing any completed pending items, but before the container is processed for over / under replication. The last container to be processed may be under-replicated, and hence it would be missed from the metrics until the next run of the RM.
Thanks @sodonnel, the bug you mention is real, so I'll keep the two original metrics behaving as before, and probably take option 2 to implement this.
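For reference, a minimal sketch of option 2, assuming hypothetical getInflightReplicationCount() / getInflightDeletionCount() getters on ReplicationManager (the record and gauge names here are illustrative, not the final API):

```java
import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.lib.Interns;

public class ReplicationManagerMetrics implements MetricsSource {

  private final ReplicationManager replicationManager;

  ReplicationManagerMetrics(ReplicationManager manager) {
    this.replicationManager = manager;
  }

  @Override
  public void getMetrics(MetricsCollector collector, boolean all) {
    // Read the inflight map sizes lazily: RM keeps mutating its maps as
    // before, and the gauges are computed only when metrics are requested.
    collector.addRecord("ReplicationManagerMetrics")
        .addGauge(Interns.info("InflightReplication",
            "Containers with pending replication"),
            replicationManager.getInflightReplicationCount())
        .addGauge(Interns.info("InflightDeletion",
            "Containers with pending deletion"),
            replicationManager.getInflightDeletionCount());
  }
}
```

This keeps the two original gauges consistent with the maps at all times, so nothing needs to call a setter on every map mutation.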
Assert.assertTrue(datanodeCommandHandler.received(
    SCMCommandProto.Type.deleteContainerCommand,
    replicaOne.getDatanodeDetails()));
Assert.assertEquals(currentDeleteCommandCount + 1,
Can we add more tests to check the other newly added metrics?
Sure, I'll try my best to cover more metrics; it may not be straightforward to simulate timeouts, though.
| LOG.info("Sending replicate container command for container {}" + | ||
| " to datanode {} from datanodes {}", | ||
| container.containerID(), datanode, sources); | ||
| metrics.incrNumReplicateCmdsSent(); |
I suggest moving the metrics update to after the command is sent.
Yes, it should be updated after the command is sent.
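For illustration, the suggested ordering might look like this, where sendCommand and replicateCommand are hypothetical stand-ins for however the command is actually dispatched:

```java
LOG.info("Sending replicate container command for container {}" +
    " to datanode {} from datanodes {}",
    container.containerID(), datanode, sources);
sendCommand(replicateCommand, datanode);  // dispatch the command first
metrics.incrNumReplicateCmdsSent();       // then count it as sent
```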
LOG.info("Sending delete container command for container {}" +
    " to datanode {}", container.containerID(), datanode);
metrics.incrNumDeleteCmdsSent();
Same as above.
Thanks @guihecheng for working on this. I just left a few comments. Hi @sodonnel, would you help take a look at this patch? There are two changes to the ReplicationManager metrics: one is the metric name changing from INFLIGHT_REPLICATION to inflightReplication, the other is the metrics source name changing from ReplicationManager to ReplicationManagerMetrics. Though I think the new names are more consistent with the metrics style of the other modules, we would like to have your thoughts here.
private MutableGaugeLong inflightDeletion;

@Metric("Number of replicate commands sent.")
private MutableCounterLong numReplicateCmdsSent;
Generally, I suggest changing Replicate -> Replication and Delete -> Deletion in all related metric names and descriptions.
OK, then I shall also change some existing code to stay consistent with the naming.
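For example, the renamed counter declarations would become (sketch only; the descriptions are illustrative):

```java
@Metric("Number of replication commands sent.")
private MutableCounterLong numReplicationCmdsSent;

@Metric("Number of deletion commands sent.")
private MutableCounterLong numDeletionCmdsSent;
```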
@ChenSammi thanks for mentioning me. I think this change looks generally good aside from the points you already raised. I checked with a few others on the team, and changing the metrics names should be OK. I don't think anything integrated with Ozone depends on these names so far.
Oh, this is good news, thanks again for the confirmation.
Incr after commands sent. Rename: replicate -> replication, delete -> deletion.
assertDeleteScheduled(1);

// Make a timeout
Thread.sleep(2000);
I'd like to avoid these sleeps in the tests. I have created #2425 to add a Clock into ReplicationManager. If we can get that reviewed and committed, then we can rebase this patch and remove the timeouts.
Ah, that's great, I see that it is merged now and I'll do a rebase soon, thanks~
Oh, I noticed that there is another improvement PR #2429; I shall wait for it to be merged and rebase on that.
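Once that Clock is in, a test could advance time instead of sleeping. A rough sketch, where TestClock, its fastForward helper, createReplicationManager, processAll, and the timeout getter are all assumptions based on this discussion rather than the exact API:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import org.junit.Assert;

// TestClock is assumed to be a mutable java.time.Clock used only by tests.
TestClock clock = new TestClock(Instant.now(), ZoneOffset.UTC);
ReplicationManager replicationManager = createReplicationManager(clock);

replicationManager.processAll();       // schedules the delete command
// Jump past the command timeout instead of Thread.sleep(...).
clock.fastForward(Duration.ofMinutes(31).toMillis());
replicationManager.processAll();       // the pending command is now timed out

Assert.assertEquals(1, metrics.getNumDeletionCmdsTimeout());
```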
- Touched some stuff about 'move', mostly the one metric for it.
- Merged a conflict test case for cmd timeout.
- Some cleanups for the test.

Conflicts:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestReplicationManager.java
@sodonnel @ChenSammi updated, thanks~
JacksonYao287 left a comment:
Thanks @guihecheng for this work, please fix the CI first.
Ah, let's retrigger the CI, since these metrics should not fail the MR jobs.
return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
    "SCM Replication manager (closed container replication) related "
        + "metrics",
    new ReplicationManagerMetrics(manager));
Thanks for the work! NIT: if the RM goes start -> stop -> start, ReplicationManagerMetrics#create will be called twice, and thus new ReplicationManagerMetrics(manager) will be called twice; it seems a little confusing to create the ReplicationManagerMetrics twice. I think it is better to use register instead of create for ReplicationManagerMetrics, and to make ReplicationManagerMetrics a singleton.
Oh, thanks for the comments; adopting the singleton pattern seems reasonable at first glance.
But there are things to be made clear:

- When does start -> stop -> start happen, and for what purpose?
- If we adopt a singleton, then there's a behavior change:
  - With a singleton, we have only one metrics object, so after an RM restart (without restarting the daemon, of course) counting continues from the old metrics base. Are those values still consistent with the inflightReplication / inflightDeletion maps in RM?
  - Without a singleton, RM restarts with a freshly created object, with everything initialized to 0.

So what is the desired behavior?
In general I prefer to avoid singletons, as they can make testing difficult, especially around mini-cluster tests.
I think it's reasonable for an RM restart to reset the counts to zero, as it is a new instance. An RM restart should only happen in tests; to restart it in a real cluster, you must restart all of SCM.
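So keeping the non-singleton lifecycle could just mean pairing the create shown in the quoted snippet with an unregister on stop. A sketch, reusing METRICS_SOURCE_NAME from that snippet (the unRegister name and the stop hook are assumptions):

```java
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

public static ReplicationManagerMetrics create(ReplicationManager manager) {
  // Each RM start registers a fresh source, so all counters restart at zero.
  return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
      "SCM Replication manager (closed container replication) related "
          + "metrics",
      new ReplicationManagerMetrics(manager));
}

public void unRegister() {
  // Called from RM stop so that the next start can register again cleanly.
  DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
}
```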
Fix CI k8s suite failure by merging master.
The last patch LGTM, +1.
Thanks @guihecheng for the contribution and @sodonnel @JacksonYao287 for the code review.
What changes were proposed in this pull request?
Add more metrics to ReplicationManager to help monitor replication progress.
Note that the metrics are placed in a new, separate class.
A more detailed description can be found in the JIRA.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-5401
How was this patch tested?
Extended unit tests.
Manual testing.