HDDS-2922. Balance ratis leader distribution in datanodes #1371

Conversation
@bshashikant @lokeshj1703 @mukul1987 @elek @xiaoyuyao Could you help review this patch? Thank you very much.

Thanks @runzhiwang. This is awesome work! I will also try to help review this PR.
Thanks for this fantastic feature! Just have some inline comments.
As discussed, we can handle the suggestLeaderCount in an easier and more accurate way by leveraging the pipeline report.
@GlenGeng Thanks for the review, I have updated the patch.
```java
try {
  Pipeline pipeline = getPipelineStateManager().getPipeline(pipelineID);
  if (!pipeline.isClosed()
      && dn.getUuid().equals(pipeline.getSuggestedLeaderId())) {
```
Should we use getLeaderId() instead of getSuggestedLeaderId() here, to reflect the actual leader count?
And the method name can be changed to getLeaderCount(), so that the suggested leader is determined by the actual leader count.
RATIS-967 guarantees that the high-priority node acts as leader when the pipeline is created. In a long-running cluster the leader may crash, and some follower will take over leadership; but when the old leader restarts and catches up with the current leader's log, it can reclaim leadership via RATIS-967.
So consider the following case. There are 3 servers, s1, s2, s3, and 2 pipelines; the first pipeline's leader is s1, the second pipeline's leader is s2, and both leaders are suggested leaders with high priority. Then s1 crashes, and suppose s3 takes over leadership of the first pipeline. Then s1 restarts but has not yet reclaimed leadership of the first pipeline. If we use getLeaderId() instead of getSuggestedLeaderId() to reflect the actual leader count, then when we create the third pipeline we find the leader counts on s1, s2, s3 are 0, 1, 1, so we select s1 as the suggested leader. But then s1 reclaims leadership of the first pipeline via RATIS-967, so the leader counts on s1, s2, s3 become 2, 1, 0, which is not balanced.
> then s1 grab the leadership of the first pipeline by RATIS-967

Does RATIS-967 always give up the leadership when s1 is back online, even if the current leader works fine? I think this is specific to RATIS-967 w.r.t. how the priority is enforced. Is there any performance impact on the pipeline from forcing the leader to be the original one?

I'm thinking: instead of forcing the leaders of pipelines P1, P2, P3 like

```
S1  S2  S3
P1  P2  P3
```

in the case of S1 temporarily down, why don't we keep P1's leader on S3 and create P3 with its leader on S1? This gives higher levels more flexibility to choose leaders:

```
S1  S2  S3
P3  P2  P1
```

Another situation I'm thinking of: writers on a pipeline with a slow leader (e.g., hardware slowness) may not be able to recover by leader change.
@xiaoyuyao Good point, I have also thought about this.

> Any performance impact on the pipeline of forcing leader to be the original one?

If there is a performance problem, I can improve it so that the forced leader change completes within 1 second. I already know how to do it, but have not implemented it yet.

> Another situation I'm thinking of is writers on pipeline with slow leader (e.g., hardware slowness) may not be able to recover by leader change.

We can detect a slow leader through metrics, decrease its priority, and pick a faster datanode and increase its priority, so that the faster datanode takes leadership from the slow leader.

> In the case of S1 temporarily down, why don't we keep P1 leader on S3 and create P3 with leader on S1?

I want the cluster's leader distribution to be as we planned; if the plan is not appropriate, we can adjust it by changing priorities.
If the leader distribution depends entirely on hardware rather than on the plan, we may lose control of the leader distribution, because the leaderId in SCM is reported by the datanodes and may be stale. For example, the datanodes report:

```
S1  S2  S3
P1  P2
```

Then P1's leader transfers to S3, but SCM has not yet received this report, so SCM allocates P3's leader to S3:

```
S1  S2  S3
    P2  P1
        P3
```

It is not balanced now.
Good point, a planned leader distribution should work with RATIS-967. How do we scale with this? Planning a weight for each node as a leader can be difficult when the cluster has thousands of nodes.
> Plan weight for each node as a leader when the cluster has thousands of nodes can be difficult.

If each node has similar hardware (CPU, memory), we just plan the weights as we do now and assign each node the same leader count; that is cheap and reasonable.
I think the only case we need to consider is when some nodes' hardware is obviously weaker than the others'. The weaker datanodes should engage in fewer pipelines than the stronger ones, but Ozone does not support this yet. If we support it, the maximum leader count of each datanode should be at most (1/3) * (the number of pipelines it engages in), and we select as leader the datanode with the lowest value of (leader count / number of pipelines it engages in) among the 3 datanodes; this is also cheap. We can change this if a requirement arises in the future, but for now it is enough to allocate the same leader count to each datanode.
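For illustration only, here is a minimal sketch of the ratio-based selection described above, assuming a hypothetical NodeStats view of the per-datanode statistics (these accessors are stand-ins, not the actual Ozone API):

```java
import java.util.Comparator;
import java.util.List;

final class RatioBasedLeaderChooser {

  /** Hypothetical per-datanode statistics; not the actual Ozone API. */
  interface NodeStats {
    String uuid();
    int leaderCount();     // pipelines this node currently leads
    int pipelineCount();   // pipelines this node engages in
  }

  /**
   * Pick the suggested leader among the candidate datanodes of a new
   * pipeline: the one with the lowest (leader count / pipeline count).
   */
  static NodeStats chooseLeader(List<NodeStats> candidates) {
    return candidates.stream()
        .min(Comparator.comparingDouble(n ->
            n.pipelineCount() == 0
                ? 0.0  // a node engaged in no pipeline yet is preferred
                : (double) n.leaderCount() / n.pipelineCount()))
        .orElseThrow(() -> new IllegalArgumentException("no candidates"));
  }
}
```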
xiaoyuyao left a comment:
Thanks @runzhiwang for adding this useful feature. It looks good to me overall. A few comments added inline.
@xiaoyuyao Thanks for the review. I have updated the patch.
@xiaoyuyao @bshashikant I have updated the patch. Could you help review it again? Thank you very much.
Can we also make the policy configurable? Also, one policy should be defined with no priority at all, in case this turns out to be a performance killer.
@GlenGeng @bshashikant Thanks for the review. I have updated the patch. The policy can be configured.
@runzhiwang, please correct me if I am wrong: RandomLeaderChoosePolicy still chooses a datanode randomly, and this is suggested to Ratis while creating the pipeline.
@bshashikant Thanks for the suggestions. Actually, RandomLeaderChoosePolicy does not choose a datanode; it returns null in chooseLeader, so all the datanodes are assigned the same priority, as is currently the case. The name RandomLeaderChoosePolicy seems confusing, sorry for the misleading name. Do you have a better one?
I guess this can be named "DefaultLeaderChoosePolicy", and it should be made the default until and unless we measure the performance with the minimum-leader-count policy and see the results. What do you think?
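For concreteness, a minimal sketch of the two policies being discussed, under the contract described above (chooseLeader returning null means no suggestion, so all peers get equal priority). The Node type and the method signatures are illustrative stand-ins, not the exact Ozone interfaces:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// "Node" stands in for Ozone's DatanodeDetails; the leader counts would
// come from SCM's pipeline state in the real code.
interface LeaderChoosePolicy {
  /** Return the suggested leader, or null to give all peers equal priority. */
  Node chooseLeader(List<Node> candidates, Map<Node, Integer> leaderCounts);
}

final class Node {
  final String uuid;
  Node(String uuid) { this.uuid = uuid; }
}

// Pre-patch behavior: no suggestion, every datanode gets the same priority.
final class DefaultLeaderChoosePolicy implements LeaderChoosePolicy {
  @Override
  public Node chooseLeader(List<Node> candidates, Map<Node, Integer> counts) {
    return null;
  }
}

// Suggest the candidate that currently leads the fewest pipelines.
final class MinLeaderCountChoosePolicy implements LeaderChoosePolicy {
  @Override
  public Node chooseLeader(List<Node> candidates, Map<Node, Integer> counts) {
    return candidates.stream()
        .min(Comparator.comparingInt(n -> counts.getOrDefault(n, 0)))
        .orElse(null);
  }
}
```

The concrete policy class would then be picked via an SCM configuration property, as requested earlier in the thread.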
@bshashikant I agree. I have updated the patch.
+1

@xiaoyuyao Could you help merge this patch? Thanks a lot.

LGTM +1. Thanks @runzhiwang for the contribution, and @bshashikant @xiaoyuyao @GlenGeng for the review.

@ChenSammi Thanks for merging it. @bshashikant @xiaoyuyao @GlenGeng Thanks for the review.
* master:
  HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468)
  HDDS-4158. Provide a class type for Java based configuration (apache#1407)
  HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM.
  HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371)
  HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490)
  HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493)
  HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499)
  HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)
What changes were proposed in this pull request?
What's the problem?
When multi-raft is enabled, the leader distribution across datanodes is not balanced. In my test there are 72 datanodes, each engaging in 6 pipelines, so there are 144 pipelines. As the image shows, the leader counts of the 4 datanodes are 0, 0, 4, 2, which is not balanced. A Ratis leader not only accepts client requests but also replicates the log to its 2 followers, while a follower only replicates the log from the leader, so the leader's load is at least 3 times that of a follower. So we need to balance the leaders.
How to improve?
With the guidance of @szetszwo, I implemented RATIS-967, which not only supports priorities in leader election, but also lets a lower-priority leader try to yield leadership to a higher-priority peer once that peer's log has caught up, to handle the case where the higher-priority leader loses leadership.
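For context, a minimal sketch of how a suggested leader's priority can be attached to the Raft group, assuming a Ratis version where RaftPeer.newBuilder().setPriority(...) is available (the priority support that RATIS-967-style elections rely on); the addresses and the group id below are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeer;
import org.apache.ratis.protocol.RaftPeerId;

final class PriorityGroupSketch {
  /** Give the suggested leader a higher election priority than its peers. */
  static RaftGroup buildGroup(List<String> addresses, String suggestedLeader) {
    List<RaftPeer> peers = new ArrayList<>();
    for (String addr : addresses) {
      peers.add(RaftPeer.newBuilder()
          .setId(RaftPeerId.valueOf(addr))  // peer id derived from address, for brevity
          .setAddress(addr)
          .setPriority(addr.equals(suggestedLeader) ? 1 : 0)
          .build());
    }
    return RaftGroup.valueOf(RaftGroupId.randomId(), peers);
  }
}
```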
So in Ozone, when creating a pipeline, SCM picks a suggested leader and assigns that datanode a higher Ratis priority.
As the following image shows, there are 72 datanodes, each engaging in 6 pipelines, so there are 144 pipelines.
The leader count of each datanode is 2, without exception: we achieve a balanced leader distribution.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-2922
How was this patch tested?
Added a new unit test.