HDDS-5517. Support multiple container moves from a source datanode in one balance iteration #2808
Conversation
@lokeshj1703 @siddhantsangwan please take a look at this JIRA as a priority. I think it will speed up the move process very much.
lokeshj1703
left a comment
@JacksonYao287 Thanks for working on this!
Currently we iterate over overutilized nodes. Maybe it is better to have selection criteria for source datanodes as well. The default implementation would be to just return the DN with the highest utilisation (also accounting for the size being moved).
Thanks @lokeshj1703 for the review. The suggestion looks good; I will do this in a new commit.
@lokeshj1703 @siddhantsangwan @ChenSammi PTAL, thanks!
siddhantsangwan
left a comment
@JacksonYao287 The changes mostly look good. How about we make the source datanode selection criteria methods part of the existing ContainerBalancerSelectionCriteria class instead of creating a new one?
```java
// TODO: use a quicker data structure, which will have better
// performance when changing or deleting one element at a time
overUtilizedNodes.sort((a, b) -> {
  double currentUsageOfA = a.calculateUtilization(
      sizeLeavingNode.get(a.getDatanodeDetails()));
  double currentUsageOfB = b.calculateUtilization(
      sizeLeavingNode.get(b.getDatanodeDetails()));
  // in descending order
  return Double.compare(currentUsageOfB, currentUsageOfA);
});
```
My understanding of this method is unclear, so I might be wrong. I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.
> I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.

Yes, that is correct. getNextCandidateSourceDataNode always tries to return the source datanode with the highest usage. Thanks very much for pointing out this mistake!
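The fix discussed above — subtracting sizeLeavingNode from a node's used space before computing utilization, then sorting descending — could look roughly like this. A minimal sketch only: NodeUsage and its fields are illustrative stand-ins for Ozone's DatanodeUsageInfo, not the real types.

```java
import java.util.*;

public class SourceSort {

  // Illustrative stand-in for a datanode usage record.
  static final class NodeUsage {
    final String id;
    final long capacity;
    final long used;

    NodeUsage(String id, long capacity, long used) {
      this.id = id;
      this.capacity = capacity;
      this.used = used;
    }

    // Utilization after accounting for bytes already scheduled to leave.
    double utilizationAfterLeaving(long sizeLeaving) {
      return (double) (used - sizeLeaving) / capacity;
    }
  }

  // Sort candidate sources by effective utilization, descending.
  static List<NodeUsage> sortSourcesDescending(
      List<NodeUsage> nodes, Map<String, Long> sizeLeavingNode) {
    List<NodeUsage> sorted = new ArrayList<>(nodes);
    sorted.sort((a, b) -> Double.compare(
        b.utilizationAfterLeaving(sizeLeavingNode.getOrDefault(b.id, 0L)),
        a.utilizationAfterLeaving(sizeLeavingNode.getOrDefault(a.id, 0L))));
    return sorted;
  }

  public static void main(String[] args) {
    // Node "a" is 90% full, but 30 bytes are scheduled to leave it,
    // so node "b" (80% full) becomes the more utilized source.
    List<NodeUsage> nodes = Arrays.asList(
        new NodeUsage("a", 100, 90),
        new NodeUsage("b", 100, 80));
    Map<String, Long> leaving = new HashMap<>();
    leaving.put("a", 30L);
    System.out.println(sortSourcesDescending(nodes, leaving).get(0).id);
  }
}
```

The key point is that the pending move size is subtracted from the used space inside the comparison, so repeated selections of the same node see its shrinking effective usage.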
Also, please note the following concern that was raised here for an earlier version of balancer.

Thanks very much @siddhantsangwan for the review, I have updated the patch, please take a look!
I do get your point. I think the main difference between our ideas is what exactly the balancer should do.

Looks good, so that we have only one criterion for all the selections. What do you think @lokeshj1703?
lokeshj1703
left a comment
@JacksonYao287 I think there are three different changes being made in this PR. It is better to separate them into different PRs.
- I was thinking we should have something similar to FindTargetStrategy for the source side, like FindSourceStrategy.
- Let's have a separate PR for the withinThreshold nodes removal. I think there was a use case where accounting for them was important, e.g. where there are 5 over-utilised and 5 within-threshold nodes. cc @siddhantsangwan I think the user shouldn't have to adjust the threshold in this case.
- Also, by default we shouldn't allow all the data to be moved from a source/target. It is important to limit this, otherwise it is possible for all balancing to happen on 2-4 nodes.
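The third point — capping how much data may be scheduled to leave (or enter) one datanode per iteration — amounts to a simple guard before each move is scheduled. A sketch under assumed names; the limit field and method names here are illustrative, not Ozone's actual configuration keys.

```java
/**
 * Sketch of a per-iteration cap on data scheduled to move off one
 * datanode, so balancing cannot concentrate on just a few nodes.
 */
public class MoveLimit {

  private final long maxSizeLeavingSource;

  public MoveLimit(long maxSizeLeavingSource) {
    this.maxSizeLeavingSource = maxSizeLeavingSource;
  }

  /**
   * Returns true if one more container of containerSize bytes may be
   * scheduled to leave a node that already has alreadyLeaving bytes
   * scheduled out in this iteration.
   */
  public boolean canScheduleMove(long alreadyLeaving, long containerSize) {
    return alreadyLeaving + containerSize <= maxSizeLeavingSource;
  }
}
```

The same check, with a sizeEnteringNode counter, would bound how much data a single target accepts per iteration.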
I think we should keep withinThreshold nodes. Removing them implies the user has to do a lot of work, first analyzing the cluster and then calculating a suitable threshold. The user can understandably expect to introduce one new node to the cluster and have the balancer balance it with the default threshold. I don't see any downside in the current logic related to withinThreshold nodes. There's a bug that was pointed out by
An easy fix for this is to remove

Thanks @lokeshj1703 and @siddhantsangwan for the review!
We can add a FindSourceStrategy interface and a default implementation for it. This can be done by refactoring
Sure, let us create another PR to solve this and fix the related bug. In this patch, we focus on supporting multiple container moves from a source datanode. I will add withinThreshold nodes back in a new commit.
In this patch, before matching a target with a source, we will sort all the target datanodes in ascending order and all the source datanodes in descending order by usage rate, considering

@lokeshj1703 @siddhantsangwan I have refactored the patch according to the comments, please take a look. If it looks good to you, I will improve the code comments in a new commit.
I will create a separate patch to add withinThreshold nodes back into the candidate target and source datanodes, and fix the potential bug with withinThreshold nodes. I think after refactoring, we can do this more gracefully.
lokeshj1703
left a comment
@JacksonYao287 Thanks for updating the PR! I have a few minor comments inline.
Regarding the max limits, I don't think we can support very large limits. 500 G means 500 GB of data can move from/to a datanode in one iteration. I am not sure how much replication a DN can support per minute. Maybe that should be determined first.

Thanks @lokeshj1703 for the review, I have updated this patch, please take a look.
We can discuss this in @siddhantsangwan's jira, which will make the default configurations smarter.
```java
double currentUsageOfA = a.calculateUtilization(
    sizeEnteringNode.get(a.getDatanodeDetails()));
double currentUsageOfB = b.calculateUtilization(
    sizeEnteringNode.get(b.getDatanodeDetails()));
return Double.compare(currentUsageOfA, currentUsageOfB);
```
Sorry for not bringing this up earlier! But we will need to handle the case in the comparator where utilisation is the same for two nodes; otherwise two nodes with the same utilisation cannot coexist. It would be better to make a similar change for the source side as well.
+1 o.w.
Good point, thanks for pointing this out.

@lokeshj1703 thanks for pointing out the mistake, I have updated this patch, please take a look.
```java
if (ret != 0) {
  return ret;
}
return a.hashCode() - b.hashCode();
```
Can we do datanode details UUID comparison instead?
sure, will do this.
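The suggested UUID tie-break could look roughly like this: compare utilization first, and fall back to a stable UUID comparison so the comparator defines a total order and equal-utilization nodes both survive in a sorted collection. A sketch only: the Node type is an illustrative stand-in, though Ozone's DatanodeDetails does carry a UUID.

```java
import java.util.*;

public class TieBreak {

  // Illustrative stand-in for a datanode with its computed utilization.
  static final class Node {
    final UUID uuid;
    final double utilization;

    Node(UUID uuid, double utilization) {
      this.uuid = uuid;
      this.utilization = utilization;
    }
  }

  // Ascending by utilization; the UUID comparison breaks ties so the
  // ordering is consistent and no node is silently dropped as a "duplicate".
  static final Comparator<Node> BY_UTILIZATION =
      Comparator.comparingDouble((Node n) -> n.utilization)
                .thenComparing(n -> n.uuid);

  public static void main(String[] args) {
    NavigableSet<Node> nodes = new TreeSet<>(BY_UTILIZATION);
    nodes.add(new Node(UUID.randomUUID(), 0.5));
    nodes.add(new Node(UUID.randomUUID(), 0.5)); // same utilization
    System.out.println(nodes.size()); // both nodes are kept
  }
}
```

Unlike the hashCode subtraction in the snippet above, comparing UUIDs cannot overflow and gives a deterministic order across runs.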
lokeshj1703
left a comment
Pending clean CI.
@lokeshj1703 @siddhantsangwan thank you for the review! The CI failure seems not caused by this patch.

@JacksonYao287 We will have to get a clean CI before merging. That is the process we follow for Ozone. Maybe you can try rebasing the PR on current master.

@lokeshj1703 sure, thanks, I have merged the current master branch into this patch. Let's wait for a clean CI.

@JacksonYao287 Thanks for the contribution! @siddhantsangwan Thanks for the review! I have committed the PR to the master branch.

Thanks @lokeshj1703 and @siddhantsangwan for the review!
```java
if (currentSize != null) {
  sizeLeavingNode.put(dui, currentSize + size);
  // reorder according to the latest sizeLeavingNode
  potentialSources.add(nodeManager.getUsageInfo(dui));
  return;
}
```
Hey @JacksonYao287 Can you help me understand what's happening in this method? I don't think the usage info for a node will get updated during an iteration, since DU/DF don't run during an iteration.
Let me explain.
When selecting a source datanode, we always want to select the one with the largest storage usage. Here, I use a PriorityQueue, which is fast at getting the top element. When getNextCandidateSourceDataNode is called, PriorityQueue#poll is called, which gets and removes the top source datanode from the PriorityQueue.
Actually, now that we support moving multiple containers from one datanode, a datanode can be selected as a source multiple times in one iteration.
There are two reasons to call potentialSources.add(nodeManager.getUsageInfo(dui)):
1. It adds the datanode back to the PriorityQueue, so it can be selected as a source again.
2. When we update sizeLeavingNode, the usage of this datanode is considered changed (the reported usage minus sizeLeaving), so we need to sort all candidate source datanodes according to the latest usage and get the top one. When the datanode is added back, the PriorityQueue re-sorts it (it uses a heap, so this is fast), and we can get the next top one.
Is it clear now?
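The poll/update/re-add pattern described above can be sketched as follows. This is a simplified illustration, not the actual Ozone classes: the Node type, field names, and method names are stand-ins for DatanodeUsageInfo and the balancer's real helpers.

```java
import java.util.*;

public class SourceQueue {

  // Illustrative stand-in for a datanode usage record.
  static final class Node {
    final String id;
    final long capacity;
    final long used;

    Node(String id, long capacity, long used) {
      this.id = id;
      this.capacity = capacity;
      this.used = used;
    }
  }

  private final Map<String, Long> sizeLeavingNode = new HashMap<>();

  // Max-heap on effective usage; the id comparison is a stable tie-break.
  private final PriorityQueue<Node> potentialSources =
      new PriorityQueue<>((a, b) -> {
        int ret = Double.compare(effectiveUsage(b), effectiveUsage(a));
        return ret != 0 ? ret : a.id.compareTo(b.id);
      });

  // Reported usage minus bytes already scheduled to leave, as a fraction.
  private double effectiveUsage(Node n) {
    return (double) (n.used - sizeLeavingNode.getOrDefault(n.id, 0L))
        / n.capacity;
  }

  void addSource(Node n) {
    sizeLeavingNode.putIfAbsent(n.id, 0L);
    potentialSources.add(n);
  }

  /** Poll (get and remove) the most-utilized candidate source. */
  Node getNextCandidateSourceDataNode() {
    return potentialSources.poll();
  }

  /**
   * Record bytes scheduled to leave the polled node, then re-insert it so
   * the heap re-sorts it by its updated effective usage. This is why the
   * node can be selected as a source multiple times in one iteration.
   */
  void increaseSizeLeaving(Node n, long size) {
    sizeLeavingNode.merge(n.id, size, Long::sum);
    potentialSources.add(n);
  }
}
```

Note that the comparator key is only mutated between poll and re-add, while the node is outside the heap; mutating keys of elements still inside a PriorityQueue would corrupt its ordering.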
What changes were proposed in this pull request?
Support multiple container moves from a source datanode
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5517
How was this patch tested?
Unit tests.