Skip to content

Conversation

@guihecheng
Copy link
Contributor

What changes were proposed in this pull request?

Add more metrics to ReplicationManager to help monitor replication progress.
Note that metrics are placed in a separated new class.
More detailed description could be found in the JIRA.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5401

How was this patch tested?

extended ut.
manual test.

@guihecheng
Copy link
Contributor Author

@ChenSammi please help review this, thanks~

@guihecheng
Copy link
Contributor Author

Hi @bshashikant , could you please help review this as Sammi is on vacation this week, thanks~

() -> metrics.incrNumDeleteCmdsCompleted());

metrics.setInflightReplication(inflightReplication.size());
metrics.setInflightDeletion(inflightDeletion.size());
Copy link
Contributor

@ChenSammi ChenSammi Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep the existing way of getting the value of these two metrices? There is other places where the map size is changed. So to avoid setting the value every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I'll investigate whether there are some ways, the solution seems not so direct because we have separated metrics into a separate class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can think of two options:

  1. Make the new Metrics class an inner class of ReplicationManager, so it can access RM's instance varaibles directly.

  2. Pass the replicationManager instance to the new Metrics class and add a getter for inflight Replication / Deletion. Then call those getters when the metrics are requested.

Option 2 is probably better, but there may be other ways to do this too.

I think there may be a small bug here, due to where you are setting the inFlightRep / Deletion. You have sent it after removing any completed pending items, but before the container is processed for over / under replication. On the last container to be processed, it may be under replicated and hence it would be missed from the metrics until the next run of the RM.

Copy link
Contributor Author

@guihecheng guihecheng Jul 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sodonnel , the mentioned bug is true, so I'll try to keep the 2 original metrics behave as before, and maybe take option 2 to implement.

Assert.assertTrue(datanodeCommandHandler.received(
SCMCommandProto.Type.deleteContainerCommand,
replicaOne.getDatanodeDetails()));
Assert.assertEquals(currentDeleteCommandCount + 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more test to check the other newly added metrices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll try my best to cover more metrics, maybe it is not so direct to simulate timeouts.

LOG.info("Sending replicate container command for container {}" +
" to datanode {} from datanodes {}",
container.containerID(), datanode, sources);
metrics.incrNumReplicateCmdsSent();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to move the metrics operation after the command is sent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be updated after sent.


LOG.info("Sending delete container command for container {}" +
" to datanode {}", container.containerID(), datanode);
metrics.incrNumDeleteCmdsSent();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@ChenSammi
Copy link
Contributor

Thanks @guihecheng for working on this. I just left a few comments.

Hi @sodonnel, would you help to have a look at this patch? There are two changes to the ReplicationManager metrics. One is the metrics name changing from INFLIGHT_REPLICATION to inflightReplication, another is the metrics source name changing from ReplicationManager to ReplicationManagerMetrics. Though I think the new names are more consistent with current other module metrics style, we would like to have your thoughts here.

private MutableGaugeLong inflightDeletion;

@Metric("Number of replicate commands sent.")
private MutableCounterLong numReplicateCmdsSent;
Copy link
Contributor

@ChenSammi ChenSammi Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally suggest change Replicate -> Replication, Delete -> Deletion in all related metric name and descriptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then I shall do changes to some existing codes to stay consistent with the naming.

@sodonnel
Copy link
Contributor

@ChenSammi thanks for mentioning me. I think this change looks generally good aside from the points you already raised. I checked with a few others on the team and changing the metrics names should be ok. I don't think anything is integrated with Ozone that depends on these names so far.

@guihecheng
Copy link
Contributor Author

@ChenSammi thanks for mentioning me. I think this change looks generally good aside from the points you already raised. I checked with a few others on the team and changing the metrics names should be ok. I don't think anything is integrated with Ozone that depends on these names so far.

Oh, this is a good news, thanks again for confirmation.

markgui added 2 commits July 16, 2021 15:26
Incr after commands sent.
Rename: replicate -> replication, delete -> deletion.
assertDeleteScheduled(1);

// Make a timeout
Thread.sleep(2000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid these sleeps in the tests. I have created #2425 to add a Clock into Replication Manager. If we can get that reviewed and committed then we can rebase this patch and remove the timeouts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's great, I see that it is merged now and I'll do a rebase soon, thanks~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I noticed that there is another improvement pr #2429, I shall wait for it merged and rebase on that.

- Touched some stuff about 'move', mostly the one metric for it.
- Merged a conflict test case for cmd timeout.
- Some cleanups for the test.

Conflicts:
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestReplicationManager.java
@guihecheng
Copy link
Contributor Author

@sodonnel @ChenSammi updated, thanks~
Also invite @JacksonYao287 for review since some stuff of 'move' is changed.

Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @guihecheng for this work, please fix the CI first

@guihecheng
Copy link
Contributor Author

thanks @guihecheng for this work, please fix the CI first

Ah, let's retrigger the CI since these metrics should not fail the mr jobs.

return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
"SCM Replication manager (closed container replication) related "
+ "metrics",
new ReplicationManagerMetrics(manager));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the work! NIT: if RM start->stop->start, ReplicationManagerMetrics#create will be call twice ,and thus new ReplicationManagerMetrics(manager) will be called twice, it seems a little confuse to create the ReplicationManagerMetrics twice. i think it is better off using register instead of create for ReplicationManagerMetrics, and making ReplicationManagerMetrics singleton

Copy link
Contributor Author

@guihecheng guihecheng Jul 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for the comments, it seems to be reasonable to adopt singleton pattern at first glance.
But actually there are things to be made clear:

  1. When does start->stop->start happen? and the purpose ?
  2. If we adopt singleton, then there's a behavior change:
    • With singleton, we have only one metrics object, and after restart RM(without restarting the daemon of course) will start counting from an old metrics base and are these metrics consistent with the inflightReplication/inflightDeletion maps in RM?
    • Without singleton, RM restart with a freshly created object all init to 0.
      So what is the desired behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I prefer to avoid singletons, as they can make testing difficult, especially around Mini-Cluster tests.

I think its reasonable for a RM restart to reset the counts to zero, as it is a new instance. A RM restart should only happen in tests, as to restart it in a real cluster, you must restart all of SCM.

@guihecheng
Copy link
Contributor Author

ping @JacksonYao287 @ChenSammi @sodonnel

@ChenSammi
Copy link
Contributor

The last patch LGTM, +1.

@ChenSammi ChenSammi merged commit b447ffc into apache:master Aug 2, 2021
@ChenSammi
Copy link
Contributor

Thanks @guihecheng for the contribution and @sodonnel @JacksonYao287 for th code review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants