HDDS-4927. Determine over and under utilized datanodes in Container Balancer. #2230
Conversation
linyiqun left a comment
@siddhantsangwan, some minor comments from me. Please have a look.
Before initializing the iteration, can we also clear the related node lists, like overUtilizedNodes/underUtilizedNodes? It would be more understandable to make the list-clearing logic part of the balance method rather than clearing the lists outside of it.
Yes, thanks for pointing this out.
I don't fully get this. Why are underUtilizedNodes added to the source list rather than the target nodes, as the original logic did?
Consider the case where there are no overUtilizedNodes in the cluster, only underUtilizedNodes and nodes whose utilization is within the limits. Then the underUtilizedNodes need to be balanced and become the source nodes to which data will be moved.
So here the term 'source nodes' is used for nodes that need to be balanced. Target nodes will be chosen from the list of over-utilized, above-average, under-utilized, or below-average nodes as necessary. Do you have another approach in mind?
So here the term 'source nodes' has been used for nodes that need to be balanced.
Okay, so the meaning of source node is a little different from before.
Yes. However, the meaning of source nodes might change according to the algorithm for moving containers. The term will be settled once the exact algorithm is finalized.
@siddhantsangwan The PR shows commits from HDDS-4925 as well. Can you please take a look?
GlenGeng-awx left a comment
Thanks @siddhantsangwan for the work. We are looking forward to this feature.
NIT: merge the two info logs. In a multi-threaded context, there might be intervening logs between lines 114 and 115.
Okay.
NIT: rename nodes to nodeUsageInfos; nodes is misleading here.
Yes, changing that.
Why catch NPE here?
ArithmeticException means nodes is empty, which leads to a divide by zero. How about skipping this iteration if nodes is empty? Say:
nodes = nodeManager.getMostOrLeastUsedDatanodes(true);
if (nodes.isEmpty()) {
  return true;
}
The balancer should not work if SCM hasn't heard from any datanodes.
Makes sense. But shouldn't we return false then?
I think we will also need to handle safe mode. Balancer should not operate when SCM is in safe mode.
NIT: merge lines 166 and 167 together.
What if clusterAvgUtilisation is less than the threshold, e.g., for an empty cluster? Does a negative lowerLimit make sense here?
A negative lower limit and an upper limit greater than 1 will not lead to errors (as checked by the unit test). But in that case we could return false early, since the cluster is balanced.
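For illustration, a minimal sketch of that early return, assuming the limits are simply the cluster average plus/minus the threshold (the class and method names here are hypothetical, not taken from the patch):

class BalancingPrecheck {
  // Sketch only. If the window [lowerLimit, upperLimit] around the cluster
  // average already covers every possible utilization value, no datanode can
  // be classified as over- or under-utilized, so the iteration can stop early.
  static boolean shouldContinueBalancing(double clusterAvgUtilization,
      double threshold) {
    double lowerLimit = clusterAvgUtilization - threshold;
    double upperLimit = clusterAvgUtilization + threshold;
    // A negative lowerLimit rules out under-utilized nodes; an upperLimit
    // above 1 rules out over-utilized nodes. Both together mean "balanced".
    return !(lowerLimit < 0.0 && upperLimit > 1.0);
  }
}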
Question: say we have a 10-DN cluster where all of the DNs are at 95% usage, and then one empty DN is added to rebalance the cluster. Given a threshold of 10%, it seems the balancer will not work in this case, since those 10 DNs will not reach the upperLimit. Have we considered corner cases like this?
In that case, the newly added DN will be under-utilized. Balancer will recognize this and then try to move data into it from the other 10 DNs, which are now above-average utilized. However, the exact algorithm for choosing a particular source DN and the containers to move is still in progress.
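Working through the numbers of this example, assuming the same average-plus/minus-threshold limits as in the sketch above (a standalone illustrative snippet, not code from the patch):

public class CornerCaseMath {
  public static void main(String[] args) {
    // 10 DNs at 95% utilization plus one empty DN, threshold = 10%.
    double avgUtilization = (10 * 0.95 + 0.0) / 11;   // ≈ 0.8636
    double upperLimit = avgUtilization + 0.10;        // ≈ 0.9636
    double lowerLimit = avgUtilization - 0.10;        // ≈ 0.7636
    System.out.printf("avg=%.4f lower=%.4f upper=%.4f%n",
        avgUtilization, lowerLimit, upperLimit);
    // The ten full DNs (0.95) stay below upperLimit, so none is over-utilized,
    // but the new DN (0.0) falls below lowerLimit and is picked as under-utilized.
  }
}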
NIT: better to use a lock-free variable to avoid contention, and print an error instead of throwing RuntimeException.
private final AtomicBoolean balancerRunning = new AtomicBoolean(false);

public void start(ContainerBalancerConfiguration balancerConfiguration) {
  if (!balancerRunning.compareAndSet(false, true)) {
    LOG.error("Container Balancer is already running.");
    return;
  }
  // ...
}
NIT: containsNode is called multiple times; it may be simpler and quicker to change listToSearch to a hash set and do an existence check. For example, declare overUtilizedNodes and underUtilizedNodes as hash sets.
That's a great suggestion. The current implementation assumes that the order of nodes in terms of utilization is important, so the balancer can focus on the most over- or under-utilized nodes first. That's why a sorted list is being used.
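If both the utilization ordering and cheap membership checks turn out to matter, one possible compromise is sketched below. NodeUsage is a stand-in for the real DatanodeUsageInfo type, and none of the names come from the patch:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Sketch only: NodeUsage stands in for DatanodeUsageInfo.
class NodeUsage {
  final UUID id;
  final double utilization;

  NodeUsage(UUID id, double utilization) {
    this.id = id;
    this.utilization = utilization;
  }
}

class OverUtilizedTracker {
  // Keeps a list for "most over-utilized first" iteration order...
  private final List<NodeUsage> nodes = new ArrayList<>();
  // ...plus a set of ids for O(1) membership checks instead of list scans.
  private final Set<UUID> ids = new HashSet<>();

  void add(NodeUsage node) {
    nodes.add(node);
    ids.add(node.id);
  }

  // Called once after all nodes are added; most over-utilized first.
  List<NodeUsage> inUtilizationOrder() {
    nodes.sort((a, b) -> Double.compare(b.utilization, a.utilization));
    return nodes;
  }

  // Replaces repeated containsNode() scans over the list.
  boolean contains(UUID id) {
    return ids.contains(id);
  }
}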
The comparator can return 0 for two different datanodes with the same utilisation. We will need to handle that case.
Good point.
You can use CollectionUtils.intersection() and CollectionUtils.union() to simplify the code.
…itialising iteration.
lokeshj1703 left a comment
@siddhantsangwan Thanks for working on the PR! I have added a few comments inline.
Changed the term
…verage. Other changes include preventing Container Balancer from operating while SCM is in safe mode. Code clean up.
this.clusterRemaining = 0L;
this.overUtilizedNodes = new ArrayList<>();
this.underUtilizedNodes = new ArrayList<>();
Maybe it is better to put these initialization operations in the constructor, so the start function only does the start work.
lokeshj1703 left a comment
@siddhantsangwan Thanks for updating the PR! I have added a few more minor comments based on recent changes.
# Conflicts:
#   hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
Thanks for the reviews. I have addressed the comments.
lokeshj1703 left a comment
@siddhantsangwan Thanks for updating the PR! The changes look good to me. +1.
Can you create another jira for #2230 (comment), or handle it in the next jira?
I have handled that case in the
If, hypothetically, you have 5 nodes with the same utilisation, then containsNode might not return true with the current change, because binary search can return the index of any node with the same utilisation. This logic would return false since the returned datanode might have a different id. The comparator would need to handle the case of same utilisation.
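One way to handle such ties, sketched with the NodeUsage stand-in from the earlier sketch (not the actual comparator in the patch), is to break them on a stable unique key such as the datanode UUID, so that binary search and equality agree:

import java.util.Comparator;

class NodeUsageComparators {
  // Order primarily by utilization, then break ties by UUID so that two
  // different datanodes with the same utilization never compare as equal,
  // and binary search can find the exact node it is looking for.
  static final Comparator<NodeUsage> BY_UTILIZATION_THEN_ID =
      Comparator.comparingDouble((NodeUsage n) -> n.utilization)
          .thenComparing((NodeUsage n) -> n.id);
}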
Yes, this case can be handled in another Jira. Thanks for pointing this out.
@JacksonYao287 and @linyiqun, can you please review the changes? Any comments are welcome.
GlenGeng-awx left a comment
Thanks @siddhantsangwan for the work. Just some inline comments.
@Metric(about = "The amount of Giga Bytes that have been moved to achieve " +
    "balance.")
private LongMetric gigaBytesMoved;
NIT: gigaBytesMoved to dataSizeBalancedGB
@Metric(about = "Number of containers that Container Balancer has moved" +
    " until now.")
private LongMetric numContainersMoved;
NIT: numContainersMoved to movedContainerNum
@Metric(about = "The total number of datanodes that need to be balanced.")
private LongMetric totalNumDatanodesToBalance;
NIT: totalNumDatanodesToBalance to datanodeNumToBalance
@Metric(about = "Number of datanodes that Container Balancer has balanced " +
    "until now.")
private LongMetric numDatanodesBalanced;
NIT: numDatanodesBalanced to datanodeNumBalanced
@Metric(about = "Utilisation value of the current maximum utilised datanode.")
private double maxUtilizedDatanodeRatio;
NIT: maxUtilizedDatanodeRatio to maxDatanodeUtilizedRatio
@Metric(about = "The total amount of used space in GigaBytes that needs to " +
    "be balanced.")
private LongMetric totalSizeToBalanceGB;
NIT: totalSizeToBalanceGB to dataSizeToBalanceGB
public void stop() {
  LOG.info("Stopping Container Balancer...");
  balancerRunning.set(false);
NIT: remove line 319; one info line is sufficient here, since no actual work needs to be done.
 * @param datanodeUsageInfo DatanodeUsageInfo to calculate utilization for
 * @return Utilization value
 */
private double calculateUtilization(DatanodeUsageInfo datanodeUsageInfo) {
NIT: make it a static helper function.
try {
  return clusterUsed / (double) clusterCapacity;
} catch (ArithmeticException e) {
NIT: better not to handle ArithmeticException; instead, check nodes.size() != 0 at the entrance of the function.
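A rough sketch of that alternative, validating inputs up front instead of catching ArithmeticException. The parameters stand in for values the balancer would already have aggregated from the datanode usage reports; nothing here is taken from the actual patch:

import java.util.OptionalDouble;

class ClusterUtilization {
  // Sketch of the suggestion above: skip the iteration when SCM has not heard
  // from any datanodes (or they report zero capacity), rather than dividing
  // and catching the resulting exception.
  static OptionalDouble avgUtilization(int nodeCount, long clusterUsed,
      long clusterCapacity) {
    if (nodeCount == 0 || clusterCapacity == 0) {
      // Nothing to balance in this iteration.
      return OptionalDouble.empty();
    }
    return OptionalDouble.of(clusterUsed / (double) clusterCapacity);
  }
}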
Updated PR. Please take a look.
LGTM. We can merge this PR for now, and start our future development based on it.
 */
public boolean start(ContainerBalancerConfiguration balancerConfiguration) {
Maybe it is better to move these configuration initialization operations to the constructor, so start just does the start work without any parameter.
This was initially decided in order to support configuration changes via restart. An admin could restart the balancer, and the balancer was supposed to load the new configuration values.
@JacksonYao287 if there are other suggestions for this function, we can take them up in #2278. If it is OK, I will commit this PR.
If we put Container Balancer inside SCM, the configuration is just the one loaded by SCM. So the configuration cannot be reloaded unless SCM is restarted.
According to our design decision, params will be passed from the command line, which will be implemented by #2278; the conf-related code will therefore be removed in the future.
I see. We will not need the configuration param in that case. I see that change is made in #2278.
@siddhantsangwan Thanks for the contribution! @linyiqun @GlenGeng @JacksonYao287 Thanks for the reviews! I have committed the PR to master branch.
What changes were proposed in this pull request?
ContainerBalancer will identify over-utilized, under-utilized, and within-threshold utilized datanodes at the start of each iteration. Based on this, it will determine whether balancing should continue. Unit tests cover this functionality.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4927
How was this patch tested?
Added unit test TestContainerBalancer.
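For illustration, the classification step described above might look roughly like the following sketch. It reuses the NodeUsage stand-in and the average-plus/minus-threshold limits from the earlier sketches; none of the names come from the actual patch:

import java.util.ArrayList;
import java.util.List;

class NodeClassifier {
  final List<NodeUsage> overUtilizedNodes = new ArrayList<>();
  final List<NodeUsage> underUtilizedNodes = new ArrayList<>();
  final List<NodeUsage> withinThresholdNodes = new ArrayList<>();

  // Classify each datanode against the window [lowerLimit, upperLimit]
  // (cluster average minus/plus threshold). Balancing should continue only
  // if at least one node falls outside the window.
  boolean classify(List<NodeUsage> nodes, double lowerLimit, double upperLimit) {
    for (NodeUsage node : nodes) {
      if (node.utilization > upperLimit) {
        overUtilizedNodes.add(node);
      } else if (node.utilization < lowerLimit) {
        underUtilizedNodes.add(node);
      } else {
        withinThresholdNodes.add(node);
      }
    }
    return !overUtilizedNodes.isEmpty() || !underUtilizedNodes.isEmpty();
  }
}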