
Conversation

@runzhiwang
Contributor

@runzhiwang runzhiwang commented Sep 1, 2020

What changes were proposed in this pull request?

What's the problem?

When multi-raft is enabled, the leader distribution across datanodes is not balanced. In my test there are 72 datanodes, and each datanode engages in 6 pipelines, so there are 144 pipelines. As the image shows, the leader counts of 4 of the datanodes are 0, 0, 4, and 2, which is not balanced. A Ratis leader not only accepts client requests but also replicates the log to 2 followers, while a follower only replicates the log from the leader, so a leader's load is at least 3 times a follower's. So we need to balance the leaders.

[image: leader counts per datanode before balancing]

How to improve?

With the guidance of @szetszwo, I implemented RATIS-967, which not only supports priorities in leader election, but also lets a lower-priority leader try to yield leadership to a higher-priority peer once that peer's log has caught up, to handle the case where a higher-priority leader loses leadership.

So in Ozone:

  1. assign the suggested leader a higher priority and the 2 followers a lower priority, so that the leader distribution becomes balanced.
  2. record the suggested leader in Pipeline; when creating a new pipeline, choose 3 datanodes, find the pipelines on each datanode, calculate the suggested-leader count on each datanode, then choose the datanode with the minimum suggested-leader count as the leader (a sketch follows this list).
  3. store the suggested leader in the pipeline table, so that SCM does not lose a pipeline's suggested leader when SCM restarts.
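A minimal sketch of the selection rule in step 2, assuming a precomputed suggested-leader count per datanode; the class and method names are illustrative, not the exact code in this patch. The chosen datanode is then given a higher Ratis priority than the two followers, which RATIS-967 honors during leader election.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: among the 3 chosen datanodes, suggest as leader the one that
// is already the suggested leader of the fewest pipelines. Names are
// illustrative, not the exact code in this patch.
final class MinSuggestedLeaderCountChooser {
  static String chooseSuggestedLeader(
      List<String> candidateDnUuids,
      Map<String, Integer> suggestedLeaderCountByDn) {
    return candidateDnUuids.stream()
        .min(Comparator.comparingInt(
            (String dn) -> suggestedLeaderCountByDn.getOrDefault(dn, 0)))
        .orElseThrow(() -> new IllegalStateException("no candidates"));
  }
  // The winner is then built as a Ratis peer with a higher priority,
  // e.g. RaftPeer.newBuilder().setId(id).setPriority(2).build()
  // (assumes the RaftPeer priority API introduced by RATIS-967).
}
```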

As the following image shows, there are 72 datanodes, and each datanode engages in 6 pipelines, so there are 144 pipelines.
The leader count of each datanode is 2, with no exceptions: the leader distribution is balanced.

[image: leader counts per datanode after balancing; every datanode has 2]

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2922

How was this patch tested?

Added new unit tests.

@runzhiwang
Contributor Author

@bshashikant @lokeshj1703 @mukul1987 @elek @xiaoyuyao Could you help review this patch? Thank you very much.

@amaliujia
Contributor

Thanks @runzhiwang. This is awesome work! I will also try to help review this PR.

@runzhiwang runzhiwang force-pushed the balance-leader branch 2 times, most recently from d5b2d64 to 73b8622 on September 2, 2020 06:29
@runzhiwang runzhiwang closed this Sep 2, 2020
@runzhiwang runzhiwang reopened this Sep 2, 2020
@runzhiwang runzhiwang force-pushed the balance-leader branch 9 times, most recently from ae54831 to b58df38 on September 7, 2020 23:44
Contributor

@GlenGeng-awx GlenGeng-awx left a comment

Thanks for this fantastic feature! Just have some inline comments.

As discussed, we can handle the suggestLeaderCount in an easier and more accurate way by leveraging the pipeline report.

@runzhiwang runzhiwang force-pushed the balance-leader branch 3 times, most recently from 289a380 to da89a36 on September 11, 2020 06:49
@runzhiwang
Copy link
Contributor Author

@GlenGeng Thanks for the review. I have updated the patch.

try {
  Pipeline pipeline = getPipelineStateManager().getPipeline(pipelineID);
  if (!pipeline.isClosed()
      && dn.getUuid().equals(pipeline.getSuggestedLeaderId())) {
Contributor

Should we use getLeaderId() instead of getSuggestedLeaderId() here, to reflect the actual leader count?

Contributor

And the method name can be changed to getLeaderCount(), so that the suggested leader is determined by the actual leader count.

Contributor Author

@runzhiwang runzhiwang Sep 17, 2020

RATIS-967 guarantees that the high-priority node acts as leader when the pipeline is created. In a long-running cluster the leader may crash and some follower will take over leadership, but when the old leader restarts and catches up with the current leader's log, it can grab leadership back via RATIS-967.

So consider the following case. There are 3 servers, s1, s2, s3, and 2 pipelines: the first pipeline's leader is s1 and the second pipeline's leader is s2, and both leaders are suggested leaders with high priority. Then s1 crashes, and suppose s3 takes over the first pipeline's leadership. Then s1 restarts, but has not yet grabbed back leadership of the first pipeline. If we use getLeaderId() instead of getSuggestedLeaderId() to reflect the actual leader count, then when we create the third pipeline we find the leader counts on s1, s2, s3 are 0, 1, 1, so we select s1 as the suggested leader. But s1 then grabs back leadership of the first pipeline via RATIS-967, so the leader counts on s1, s2, s3 become 2, 1, 0, which is not balanced.
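A tiny hypothetical walkthrough of this scenario in code; the data and names are illustrative, not taken from the patch:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical walkthrough of the s1/s2/s3 scenario above: counting by
// suggested leader keeps the plan stable while s1 is still catching up.
public class LeaderCountScenario {
  record PipelineInfo(String suggestedLeader, String actualLeader) {}

  public static void main(String[] args) {
    // After s1 crashed, s3 took over the first pipeline, and s1
    // restarted but has not yet reclaimed leadership:
    List<PipelineInfo> pipelines = List.of(
        new PipelineInfo("s1", "s3"),   // pipeline 1
        new PipelineInfo("s2", "s2"));  // pipeline 2

    // Actual-leader counts are {s2=1, s3=1}: s1 looks free, so the
    // third pipeline would be planned on s1; once s1 reclaims pipeline
    // 1 via RATIS-967, s1 leads 2 and s3 leads 0 -> unbalanced.
    Map<String, Long> actual = pipelines.stream().collect(
        Collectors.groupingBy(PipelineInfo::actualLeader,
            Collectors.counting()));

    // Suggested-leader counts are {s1=1, s2=1}: the third pipeline is
    // planned on s3, and the eventual distribution is 1/1/1 -> balanced.
    Map<String, Long> suggested = pipelines.stream().collect(
        Collectors.groupingBy(PipelineInfo::suggestedLeader,
            Collectors.counting()));

    System.out.println("actual=" + actual + " suggested=" + suggested);
  }
}
```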

Contributor

> then s1 grabs back leadership of the first pipeline via RATIS-967

Does RATIS-967 always give up leadership when s1 is back online, even if the current leader works fine? I think this is specific to RATIS-967 w.r.t. how the priority is enforced. Is there any performance impact on the pipeline from forcing the leader to be the original one?

Instead of forcing the leaders of pipelines P1, P2, P3 like

    S1   S2   S3
    P1   P2   P3

I'm thinking: in the case of S1 being temporarily down, why don't we keep P1's leader on S3 and create P3 with its leader on S1? This gives the higher level more flexibility to choose leaders:

    S1   S2   S3
    P3   P2   P1

Another situation I'm thinking of is that writers on a pipeline with a slow leader (e.g., hardware slowness) may not be able to recover via a leader change.

Contributor Author

@runzhiwang runzhiwang Sep 18, 2020

@xiaoyuyao Good point, I have also thought about this.

> Is there any performance impact on the pipeline from forcing the leader to be the original one?

If there is a performance problem, I can improve the forced leader change to complete within 1 second. I already know how to improve it, but have not implemented it yet.

> Another situation I'm thinking of is that writers on a pipeline with a slow leader (e.g., hardware slowness) may not be able to recover via a leader change.

We can detect a slow leader via some metric, decrease the slow leader's priority, and select a faster datanode and increase its priority, so that the faster datanode grabs leadership from the slow leader.

> In the case of S1 being temporarily down, why don't we keep P1's leader on S3 and create P3 with its leader on S1? This gives the higher level more flexibility to choose leaders.

I want the cluster's leader distribution to be as we planned; if the plan is not appropriate, we can adjust it by changing priorities.

If the leader distribution depends entirely on the hardware rather than on the plan, we may lose control of it, because the leaderId in SCM is reported by the datanodes and may be stale. For example, the datanodes report:

    S1   S2   S3
    P1   P2

then P1's leader transfers to S3, but SCM has not yet received that report and allocates P3's leader to S3; then:

    S1   S2   S3
         P2   P1
              P3

It is not balanced now.

Contributor

Good point, the planned leader distribution should work with RATIS-967. But how do we scale this? Planning a weight for each node as a leader can be difficult when the cluster has thousands of nodes.

Contributor Author

@runzhiwang runzhiwang Sep 22, 2020

> Planning a weight for each node as a leader can be difficult when the cluster has thousands of nodes.

If every node has similar hardware, i.e. CPU and memory, we just plan the weights as we do now and assign each node the same leader count; that is cheap and reasonable.

I think the only case we need to consider is when some nodes' hardware is obviously weaker than the others'. The weaker datanodes should engage in fewer pipelines than the stronger datanodes, but Ozone does not support this yet. If we support it, the maximum leader count of each datanode should be at most (1/3 * the number of pipelines it engages in), and among the 3 candidate datanodes we select as leader the one with the lowest value of (leader count / number of pipelines it engages in); this is also cheap (see the sketch below). We can change this if the requirement arises in the future, but for now allocating the same leader count to every datanode is enough.
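A hedged sketch of that ratio rule, purely illustrative since Ozone does not support per-node pipeline weights yet; every name here is an assumption:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the heterogeneous-hardware extension discussed
// above; Ozone does not implement this, and all names are illustrative.
final class WeightedLeaderChooser {
  record Candidate(String dnUuid, int leaderCount, int pipelineCount) {}

  // A node's leader share among the pipelines it engages in.
  static double leaderRatio(Candidate c) {
    return c.pipelineCount() == 0 ? 0.0
        : (double) c.leaderCount() / c.pipelineCount();
  }

  // Pick the candidate with the lowest leader ratio, capping a node's
  // leaders at one third of the pipelines it engages in.
  static Candidate choose(List<Candidate> candidates) {
    return candidates.stream()
        .filter(c -> c.leaderCount() <= c.pipelineCount() / 3.0)
        .min(Comparator.comparingDouble(WeightedLeaderChooser::leaderRatio))
        .orElseThrow(() -> new IllegalStateException("no eligible node"));
  }
}
```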

Contributor

@xiaoyuyao xiaoyuyao left a comment

Thanks @runzhiwang for adding this useful feature. It looks good to me overall. A few comments added inline.

@runzhiwang
Contributor Author

@xiaoyuyao Thanks for the review. I have updated the patch.

@runzhiwang
Contributor Author

@xiaoyuyao @bshashikant I have updated the patch. Could you help review it again? Thank you very much.

@bshashikant
Contributor

Can we also make the policy configurable? Also, a policy with no priorities at all should be defined, in case this turns out to be a performance killer.

@runzhiwang runzhiwang force-pushed the balance-leader branch 6 times, most recently from 8eee20e to eefb819 on October 6, 2020 08:51
@runzhiwang
Contributor Author

runzhiwang commented Oct 6, 2020

@GlenGeng @bshashikant Thanks for the review. I have updated the patch.

> Can we also make the policy configurable? Also, a policy with no priorities at all should be defined, in case this turns out to be a performance killer.

The policy can be configured via ozone.scm.pipeline.leader-choose.policy, and a no-priority policy named RandomLeaderChoosePolicy is defined.

@bshashikant
Contributor

> The policy can be configured via ozone.scm.pipeline.leader-choose.policy, and a no-priority policy named RandomLeaderChoosePolicy is defined.

@runzhiwang, please correct me if I am wrong: RandomLeaderChoosePolicy still chooses a datanode randomly, and this choice is suggested to Ratis while creating the pipeline.
With NO_PRIORITY, I meant that we should not have any recommendation for a leader at all (as is currently the case). Usually in such cases, whichever node starts the Ratis leader election first becomes the leader.

@runzhiwang
Contributor Author

runzhiwang commented Oct 6, 2020

@bshashikant Thanks for the suggestion. Actually, RandomLeaderChoosePolicy does not choose a datanode: it returns null in chooseLeader, and then all the datanodes are assigned the same priority, as is currently the case. The name RandomLeaderChoosePolicy seems confusing; sorry for the misleading name. Do you have a better name? A rough sketch of the policy shape is below.
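A sketch of the pluggable policy shape under discussion, assuming a chooseLeader method that may return null; the interface and signatures are illustrative, while the config key ozone.scm.pipeline.leader-choose.policy is the one mentioned above:

```java
import java.util.List;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;

// Illustrative sketch; the interface shape is an assumption, not the
// exact API merged in the patch. The concrete policy class is selected
// through the ozone.scm.pipeline.leader-choose.policy key.
interface LeaderChoosePolicy {
  /** Returns the suggested leader, or null to make no suggestion. */
  DatanodeDetails chooseLeader(List<DatanodeDetails> dns);
}

// The no-priority policy (renamed DefaultLeaderChoosePolicy later in
// this thread): returning null means every peer gets the same priority,
// preserving the current behavior where whichever node starts the Ratis
// leader election first becomes the leader.
class RandomLeaderChoosePolicy implements LeaderChoosePolicy {
  @Override
  public DatanodeDetails chooseLeader(List<DatanodeDetails> dns) {
    return null;  // no recommendation; all peers share one priority
  }
}
```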

@bshashikant
Contributor

bshashikant commented Oct 7, 2020

> Actually, RandomLeaderChoosePolicy does not choose a datanode: it returns null in chooseLeader, and then all the datanodes are assigned the same priority, as is currently the case. The name RandomLeaderChoosePolicy seems confusing; do you have a better name?

I guess this can be named "DefaultLeaderChoosePolicy", and it should be made the default until we measure the performance of the minimum-leader-count policy and see the results. What do you think?

@runzhiwang
Contributor Author

> I guess this can be named "DefaultLeaderChoosePolicy", and it should be made the default until we measure the performance of the minimum-leader-count policy and see the results. What do you think?

@bshashikant I agree. I have updated the patch.

@GlenGeng-awx
Contributor

+1
Thanks for the work! LGTM

@runzhiwang
Contributor Author

@xiaoyuyao Could you help merge this patch? Thanks a lot.

@ChenSammi ChenSammi merged commit 8fab5f2 into apache:master Oct 19, 2020
@ChenSammi
Contributor

LGTM, +1.

Thanks @runzhiwang for the contribution and @bshashikant @xiaoyuyao @GlenGeng for the review.

@runzhiwang
Contributor Author

@ChenSammi Thanks for merging it. @bshashikant @xiaoyuyao @GlenGeng Thanks for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Oct 19, 2020
* master:
  HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468)
  HDDS-4158. Provide a class type for Java based configuration (apache#1407)
  HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM.
  HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371)
  HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490)
  HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493)
  HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499)
  HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)