
Conversation

@runzhiwang
Contributor

@runzhiwang runzhiwang commented Sep 1, 2020

What changes were proposed in this pull request?

What's the problem?

When multi-raft is enabled, the leader distribution across datanodes is not balanced. In my test there are 72 datanodes, and each datanode engages in 6 pipelines, so there are 144 pipelines. As the image shows, the leader counts of 4 of the datanodes are 0, 0, 4, and 2, which is not balanced. A Ratis leader not only accepts client requests but also replicates the log to 2 followers, while a follower only replicates the log from the leader, so a leader's load is at least 3 times a follower's. So we need to balance the leaders.

[image: leader counts per datanode before balancing]

How to improve?

With the guidance of @szetszwo, I implemented RATIS-967, which not only supports priorities in leader election, but also lets a lower-priority leader try to yield leadership to a higher-priority peer once that peer's log has caught up, to handle the case where a higher-priority leader loses leadership.

So in Ozone:

  1. assign the suggested leader a higher priority and the 2 followers a lower priority, so that the leader distribution becomes balanced.
  2. record the suggested leader in Pipeline; when creating a new pipeline, choose 3 datanodes, find the pipelines on each datanode, calculate the suggested-leader count on each datanode, then choose the datanode with the minimum suggested-leader count as the leader (a sketch follows this list).
  3. store the suggested leader in the pipeline table, so that SCM does not lose a pipeline's suggested leader when SCM restarts.
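A minimal sketch of the selection rule in step 2, assuming a precomputed suggested-leader count per datanode; the class and method names are illustrative, not the exact code in this patch. The chosen datanode is then given a higher Ratis priority than the two followers, which RATIS-967 honors during leader election.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: among the 3 chosen datanodes, suggest as leader the one that
// is already the suggested leader of the fewest pipelines. Names are
// illustrative, not the exact code in this patch.
final class MinSuggestedLeaderCountChooser {
  static String chooseSuggestedLeader(
      List<String> candidateDnUuids,
      Map<String, Integer> suggestedLeaderCountByDn) {
    return candidateDnUuids.stream()
        .min(Comparator.comparingInt(
            (String dn) -> suggestedLeaderCountByDn.getOrDefault(dn, 0)))
        .orElseThrow(() -> new IllegalStateException("no candidates"));
  }
  // The winner is then built as a Ratis peer with a higher priority,
  // e.g. RaftPeer.newBuilder().setId(id).setPriority(2).build()
  // (assumes the RaftPeer priority API introduced by RATIS-967).
}
```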

As the following image shows, there are 72 datanodes, and each datanode engages in 6 pipelines, so there are 144 pipelines.
The leader count of each datanode is 2, with no exceptions: the leader distribution is balanced.

[image: leader counts per datanode after balancing; every datanode has 2]

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2922

How was this patch tested?

Added new unit tests.

@runzhiwang
Contributor Author

@bshashikant @lokeshj1703 @mukul1987 @elek @xiaoyuyao Could you help review this patch? Thank you very much.

@amaliujia
Contributor

Thanks @runzhiwang. This is awesome work! I will also try to help review this PR.

@runzhiwang runzhiwang force-pushed the balance-leader branch 2 times, most recently from d5b2d64 to 73b8622 on September 2, 2020 06:29
@runzhiwang runzhiwang closed this Sep 2, 2020
@runzhiwang runzhiwang reopened this Sep 2, 2020
@runzhiwang runzhiwang force-pushed the balance-leader branch 9 times, most recently from ae54831 to b58df38 on September 7, 2020 23:44
Contributor

@GlenGeng-awx GlenGeng-awx left a comment

Thanks for this fantastic feature! Just have some inline comments.

As discussed, we can handle the suggestLeaderCount in an easier and more accurate way by leveraging the pipeline report.

@runzhiwang runzhiwang force-pushed the balance-leader branch 3 times, most recently from 289a380 to da89a36 on September 11, 2020 06:49
@runzhiwang
Copy link
Contributor Author

@GlenGeng Thanks for the review. I have updated the patch.

try {
  Pipeline pipeline = getPipelineStateManager().getPipeline(pipelineID);
  if (!pipeline.isClosed()
      && dn.getUuid().equals(pipeline.getSuggestedLeaderId())) {
Contributor

Should we use getLeaderId() instead of getSuggestedLeaderId() here, to reflect the actual leader count?

Contributor

And the method name can be changed to getLeaderCount(), so that the suggested leader is determined by the actual leader count.

Contributor Author

@runzhiwang runzhiwang Sep 17, 2020

RATIS-967 guarantees that the high-priority node acts as leader when the pipeline is created. In a long-running cluster the leader may crash and some follower will take over leadership, but when the old leader restarts and catches up with the current leader's log, it can grab leadership back via RATIS-967.

So consider the following case. There are 3 servers, s1, s2, s3, and 2 pipelines: the first pipeline's leader is s1 and the second pipeline's leader is s2, and both leaders are suggested leaders with high priority. Then s1 crashes, and suppose s3 takes over the first pipeline's leadership. Then s1 restarts, but has not yet grabbed back leadership of the first pipeline. If we use getLeaderId() instead of getSuggestedLeaderId() to reflect the actual leader count, then when we create the third pipeline we find the leader counts on s1, s2, s3 are 0, 1, 1, so we select s1 as the suggested leader. But s1 then grabs back leadership of the first pipeline via RATIS-967, so the leader counts on s1, s2, s3 become 2, 1, 0, which is not balanced.
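A tiny hypothetical walkthrough of this scenario in code; the data and names are illustrative, not taken from the patch:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical walkthrough of the s1/s2/s3 scenario above: counting by
// suggested leader keeps the plan stable while s1 is still catching up.
public class LeaderCountScenario {
  record PipelineInfo(String suggestedLeader, String actualLeader) {}

  public static void main(String[] args) {
    // After s1 crashed, s3 took over the first pipeline, and s1
    // restarted but has not yet reclaimed leadership:
    List<PipelineInfo> pipelines = List.of(
        new PipelineInfo("s1", "s3"),   // pipeline 1
        new PipelineInfo("s2", "s2"));  // pipeline 2

    // Actual-leader counts are {s2=1, s3=1}: s1 looks free, so the
    // third pipeline would be planned on s1; once s1 reclaims pipeline
    // 1 via RATIS-967, s1 leads 2 and s3 leads 0 -> unbalanced.
    Map<String, Long> actual = pipelines.stream().collect(
        Collectors.groupingBy(PipelineInfo::actualLeader,
            Collectors.counting()));

    // Suggested-leader counts are {s1=1, s2=1}: the third pipeline is
    // planned on s3, and the eventual distribution is 1/1/1 -> balanced.
    Map<String, Long> suggested = pipelines.stream().collect(
        Collectors.groupingBy(PipelineInfo::suggestedLeader,
            Collectors.counting()));

    System.out.println("actual=" + actual + " suggested=" + suggested);
  }
}
```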

Contributor

> then s1 grabs back leadership of the first pipeline via RATIS-967

Does RATIS-967 always give up leadership when s1 is back online, even if the current leader works fine? I think this is specific to RATIS-967 w.r.t. how the priority is enforced. Is there any performance impact on the pipeline from forcing the leader to be the original one?

Instead of forcing the leaders of pipelines P1, P2, P3 like

    S1   S2   S3
    P1   P2   P3

I'm thinking: in the case of S1 being temporarily down, why don't we keep P1's leader on S3 and create P3 with its leader on S1? This gives the higher level more flexibility to choose leaders:

    S1   S2   S3
    P3   P2   P1

Another situation I'm thinking of is that writers on a pipeline with a slow leader (e.g., hardware slowness) may not be able to recover via a leader change.

Contributor Author

@runzhiwang runzhiwang Sep 18, 2020

@xiaoyuyao Good point, I have also thought about this.

> Is there any performance impact on the pipeline from forcing the leader to be the original one?

If there is a performance problem, I can improve the forced leader change to complete within 1 second. I already know how to improve it, but have not implemented it yet.

> Another situation I'm thinking of is that writers on a pipeline with a slow leader (e.g., hardware slowness) may not be able to recover via a leader change.

We can detect a slow leader via some metric, decrease the slow leader's priority, and select a faster datanode and increase its priority, so that the faster datanode grabs leadership from the slow leader.

> In the case of S1 being temporarily down, why don't we keep P1's leader on S3 and create P3 with its leader on S1? This gives the higher level more flexibility to choose leaders.

I want the cluster's leader distribution to be as we planned; if the plan is not appropriate, we can adjust it by changing priorities.

If the leader distribution depends entirely on the hardware rather than on the plan, we may lose control of it, because the leaderId in SCM is reported by the datanodes and may be stale. For example, the datanodes report:

    S1   S2   S3
    P1   P2

then P1's leader transfers to S3, but SCM has not yet received that report and allocates P3's leader to S3; then:

    S1   S2   S3
         P2   P1
              P3

It is not balanced now.

Contributor

Good point, the planned leader distribution should work with RATIS-967. But how do we scale this? Planning a weight for each node as a leader can be difficult when the cluster has thousands of nodes.

Contributor Author

@runzhiwang runzhiwang Sep 22, 2020

> Planning a weight for each node as a leader can be difficult when the cluster has thousands of nodes.

If every node has similar hardware, i.e. CPU and memory, we just plan the weights as we do now and assign each node the same leader count; that is cheap and reasonable.

I think the only case we need to consider is when some nodes' hardware is obviously weaker than the others'. The weaker datanodes should engage in fewer pipelines than the stronger datanodes, but Ozone does not support this yet. If we support it, the maximum leader count of each datanode should be at most (1/3 * the number of pipelines it engages in), and among the 3 candidate datanodes we select as leader the one with the lowest value of (leader count / number of pipelines it engages in); this is also cheap (see the sketch below). We can change this if the requirement arises in the future, but for now allocating the same leader count to every datanode is enough.
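A hedged sketch of that ratio rule, purely illustrative since Ozone does not support per-node pipeline weights yet; every name here is an assumption:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the heterogeneous-hardware extension discussed
// above; Ozone does not implement this, and all names are illustrative.
final class WeightedLeaderChooser {
  record Candidate(String dnUuid, int leaderCount, int pipelineCount) {}

  // A node's leader share among the pipelines it engages in.
  static double leaderRatio(Candidate c) {
    return c.pipelineCount() == 0 ? 0.0
        : (double) c.leaderCount() / c.pipelineCount();
  }

  // Pick the candidate with the lowest leader ratio, capping a node's
  // leaders at one third of the pipelines it engages in.
  static Candidate choose(List<Candidate> candidates) {
    return candidates.stream()
        .filter(c -> c.leaderCount() <= c.pipelineCount() / 3.0)
        .min(Comparator.comparingDouble(WeightedLeaderChooser::leaderRatio))
        .orElseThrow(() -> new IllegalStateException("no eligible node"));
  }
}
```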

Contributor

@xiaoyuyao xiaoyuyao left a comment

Thanks @runzhiwang for adding this useful feature. It looks good to me overall. A few comments added inline.

@runzhiwang
Contributor Author

@xiaoyuyao Thanks for the review. I have updated the patch.

@runzhiwang
Contributor Author

@xiaoyuyao @bshashikant I have updated the patch. Could you help review it again? Thank you very much.

@bshashikant
Contributor

Can we also make the policy configurable? Also, a policy with no priorities at all should be defined, in case this turns out to be a performance killer.

@runzhiwang runzhiwang force-pushed the balance-leader branch 6 times, most recently from 8eee20e to eefb819 on October 6, 2020 08:51
@runzhiwang
Contributor Author

runzhiwang commented Oct 6, 2020

@GlenGeng @bshashikant Thanks for the review. I have updated the patch.

> Can we also make the policy configurable? Also, a policy with no priorities at all should be defined, in case this turns out to be a performance killer.

The policy can be configured via ozone.scm.pipeline.leader-choose.policy, and a no-priority policy named RandomLeaderChoosePolicy is defined.

@bshashikant
Contributor

> The policy can be configured via ozone.scm.pipeline.leader-choose.policy, and a no-priority policy named RandomLeaderChoosePolicy is defined.

@runzhiwang, please correct me if I am wrong: RandomLeaderChoosePolicy still chooses a datanode randomly, and this choice is suggested to Ratis while creating the pipeline.
With NO_PRIORITY, I meant that we should not have any recommendation for a leader at all (as is currently the case). Usually in such cases, whichever node starts the Ratis leader election first becomes the leader.

@runzhiwang
Contributor Author

runzhiwang commented Oct 6, 2020

@bshashikant Thanks for the suggestion. Actually, RandomLeaderChoosePolicy does not choose a datanode: it returns null in chooseLeader, and then all the datanodes are assigned the same priority, as is currently the case. The name RandomLeaderChoosePolicy seems confusing; sorry for the misleading name. Do you have a better name? A rough sketch of the policy shape is below.
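A sketch of the pluggable policy shape under discussion, assuming a chooseLeader method that may return null; the interface and signatures are illustrative, while the config key ozone.scm.pipeline.leader-choose.policy is the one mentioned above:

```java
import java.util.List;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;

// Illustrative sketch; the interface shape is an assumption, not the
// exact API merged in the patch. The concrete policy class is selected
// through the ozone.scm.pipeline.leader-choose.policy key.
interface LeaderChoosePolicy {
  /** Returns the suggested leader, or null to make no suggestion. */
  DatanodeDetails chooseLeader(List<DatanodeDetails> dns);
}

// The no-priority policy (renamed DefaultLeaderChoosePolicy later in
// this thread): returning null means every peer gets the same priority,
// preserving the current behavior where whichever node starts the Ratis
// leader election first becomes the leader.
class RandomLeaderChoosePolicy implements LeaderChoosePolicy {
  @Override
  public DatanodeDetails chooseLeader(List<DatanodeDetails> dns) {
    return null;  // no recommendation; all peers share one priority
  }
}
```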

@bshashikant
Contributor

bshashikant commented Oct 7, 2020

> Actually, RandomLeaderChoosePolicy does not choose a datanode: it returns null in chooseLeader, and then all the datanodes are assigned the same priority, as is currently the case. The name RandomLeaderChoosePolicy seems confusing; do you have a better name?

I guess this can be named "DefaultLeaderChoosePolicy", and it should be made the default until we measure the performance of the minimum-leader-count policy and see the results. What do you think?

@runzhiwang
Contributor Author

> I guess this can be named "DefaultLeaderChoosePolicy", and it should be made the default until we measure the performance of the minimum-leader-count policy and see the results. What do you think?

@bshashikant I agree. I have updated the patch.

@GlenGeng-awx
Contributor

+1
Thanks for the work! LGTM

@runzhiwang
Contributor Author

@xiaoyuyao Could you help merge this patch? Thanks a lot.

@ChenSammi ChenSammi merged commit 8fab5f2 into apache:master Oct 19, 2020
@ChenSammi
Contributor

LGTM, +1.

Thanks @runzhiwang for the contribution and @bshashikant @xiaoyuyao @GlenGeng for the review.

@runzhiwang
Contributor Author

@ChenSammi Thanks for merging it. @bshashikant @xiaoyuyao @GlenGeng Thanks for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Oct 19, 2020
* master:
  HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468)
  HDDS-4158. Provide a class type for Java based configuration (apache#1407)
  HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM.
  HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371)
  HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490)
  HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493)
  HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499)
  HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)