Skip to content

Conversation

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan siddhantsangwan commented May 10, 2021

What changes were proposed in this pull request?

ContainerBalancer will identify over, under, and within threshold utilized nodes at the start of each iteration. Based on this, it will determine whether balancing should continue. Unit tests to test this functionality.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4927

How was this patch tested?

Added unit test TestContainerBalancer

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan , some minor comments from me. Please have a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before initialize the iteration, can we do the cleanup operation for related node list additionally, like overUtilizedNodes/underUtilizedNodes.. It will look more understandable that we make node list clear logic as part of balance method rather than we clear list out side of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for pointing this out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully get this. Why underUtilizedNodes are added into source list, not target nodes like original logic did?

Copy link
Contributor Author

@siddhantsangwan siddhantsangwan May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where there are no overUtilizedNodes in the cluster. Only underUtilizedNodes and other nodes having utilization within the limits are present. Then underUtilizedNodes need to be balanced and become the source nodes to which data will be moved.

So here the term 'source nodes' has been used for nodes that need to be balanced. Target nodes will be chosen from the list of over utilized, above average, under utilized or below average nodes as necessary. Do you have another approach in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here the term 'source nodes' has been used for nodes that need to be balanced.

Okay, so the meaning of source node is a little different with before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. However, the meaning of source nodes might change according to the algorithm for moving containers. The term will be definite once the exact algorithm is final.

@lokeshj1703
Copy link
Contributor

@siddhantsangwan The PR shows commits from HDDS-4925 as well. Can you please take a look?

Copy link
Contributor

@GlenGeng-awx GlenGeng-awx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @siddhantsangwan for the work. We are looking forward to this feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. merge the two info. In multi-thread context, there might be intervening logs between 114 and 115.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. nodeUsageInfos. nodes is misleading here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, changing that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why catching NPE here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArithmeticException means nodes is empty, which leading to divide 0. How about skip this iteration if nodes is empty ? say

nodes = nodeManager.getMostOrLeastUsedDatanodes(true);
if (nodes.empty()) {
  return true.
}

The balancer should not work if SCM haven't heard any datanodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. But shouldn't we return false then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will also need to handle safe mode. Balancer should not operate when SCM is in safe mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. merge 166 and 167 together.
what if clusterAvgUtilisation is less than threshold, e.g., for an empty cluster. Does a negative lowerLimit make sense here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A negative lower limit and upper limit greater than 1 will not lead to errors(as checked by the unit test). But in that case we could return false early since the cluster is balanced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:

say we have a 10 DN cluster, the usage of all of them is 95%, then one empty DN is added to rebalance the cluster. Given the threshold is 10%, it seems the balancer will not work in this case, since that 10 DN will not achieve upperLimit.

Have we consider corner case like this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the newly added DN will be under utilized. Balancer will recognize this and then try to move data into it from the other 10 DNs that are now above average utilized. However, the exact algorithm for choosing particular source DN and containers to move is under progress.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. Better use a lock free variable to avoid contention, and print a error instead of throwing RuntimeException.

private final AtomicBoolean balancerRunning = new AtomicBoolean(false);

public void start(ContainerBalancerConfiguration balancerConfiguration) {
  if (!balancerRunning.compareAndSet(false, true)) {
    LOG.error("Container Balancer is already running.");
    return;
  }

  /// 
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. containsNode is called multi times, better change listToSearch to hash set, and do the existence check, which may be simpler and quicker.

For example, declare overUtilizedNodes and underUtilizedNodes to be hash set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great suggestion. The current implementation is assuming that the order of nodes in terms of utilization is important so balancer can focus on nodes that are most over or under utilized first. That's why a sorted list is being used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparator can return 0 for two different datanodes with same utilisation. We will need to handle that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use CollectionUtils.intersection(), CollectionUtils.intersection(), CollectionUtils.union() to simplify the code.

Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for working on the PR! I have added a few comments inline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will also need to handle safe mode. Balancer should not operate when SCM is in safe mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparator can return 0 for two different datanodes with same utilisation. We will need to handle that case.

@siddhantsangwan
Copy link
Contributor Author

Changed the term source nodes to unBalanced nodes to avoid ambiguity. Source node can mean 'node from which data is leaving' as expected. @linyiqun

…verage.

Other changes include preventing Container Balancer from operating while SCM is in safe mode. Code clean up.
this.clusterRemaining = 0L;

this.overUtilizedNodes = new ArrayList<>();
this.underUtilizedNodes = new ArrayList<>();
Copy link
Contributor

@JacksonYao287 JacksonYao287 May 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is better to put these initialize operations in constructor, and the start function will just only do the start work

Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for updating the PR! I have added a few more minor comments based on recent changes.

@siddhantsangwan
Copy link
Contributor Author

Thanks for the reviews. I have addressed the comments.

Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for updating the PR! The changes look good to me. +1.

Can you create another jira for #2230 (comment) or handle it in next jira?

@siddhantsangwan
Copy link
Contributor Author

@siddhantsangwan Thanks for updating the PR! The changes look good to me. +1.

Can you create another jira for #2230 (comment) or handle it in next jira?

I have handled that case in the containsNode method. Do you mean changing the comparator itself?

@lokeshj1703
Copy link
Contributor

If hypothetically you have 5 nodes with same utilisation then containsNode might not return true with the current change. Because binary search can return index of any node with same utilisation.

return index >= 0 && listToSearch.get(index).equals(node);

This logic would return false since the returned datanode might have a different id.

Comparator would need to handle the case of same utilisation.

@siddhantsangwan
Copy link
Contributor Author

Because binary search can return index of any node with same utilisation.

Yes, this case can be handled in another Jira. Thanks for pointing this out.

@siddhantsangwan
Copy link
Contributor Author

@JacksonYao287 and @linyiqun can you please review the changes? Any comments are welcome.

Copy link
Contributor

@GlenGeng-awx GlenGeng-awx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @siddhantsangwan for the work. Just some inline comments.


@Metric(about = "The amount of Giga Bytes that have been moved to achieve " +
"balance.")
private LongMetric gigaBytesMoved;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: gigaBytesMoved to dataSizeBalancedGB


@Metric(about = "Number of containers that Container Balancer has moved" +
" until now.")
private LongMetric numContainersMoved;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: numContainersMoved to movedContainerNum

private LongMetric numContainersMoved;

@Metric(about = "The total number of datanodes that need to be balanced.")
private LongMetric totalNumDatanodesToBalance;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: totalNumDatanodesToBalance to datanodeNumToBalance


@Metric(about = "Number of datanodes that Container Balancer has balanced " +
"until now.")
private LongMetric numDatanodesBalanced;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: numDatanodesBalanced to datanodeNumBalanced

private LongMetric numDatanodesBalanced;

@Metric(about = "Utilisation value of the current maximum utilised datanode.")
private double maxUtilizedDatanodeRatio;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: maxUtilizedDatanodeRatio to maxDatanodeUtilizedRatio


@Metric(about = "The total amount of used space in GigaBytes that needs to " +
"be balanced.")
private LongMetric totalSizeToBalanceGB;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT totalSizeToBalanceGB -> dataSizeToBalanceGB

public void stop() {
LOG.info("Stopping Container Balancer...");
balancerRunning = false;
balancerRunning.set(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. remove line 319, one info line is sufficient here, since no actual work needs to be done here.

* @param datanodeUsageInfo DatanodeUsageInfo to calculate utilization for
* @return Utilization value
*/
private double calculateUtilization(DatanodeUsageInfo datanodeUsageInfo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. make it to be a static helper function.


try {
return clusterUsed / (double) clusterCapacity;
} catch (ArithmeticException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT. better not handle ArithmeticException, instead check nodes.size() != 0 at the entrance of function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR. Please take a look.

@GlenGeng-awx
Copy link
Contributor

LGTM. We can merge this PR for now, and start our future development based on it.

*/
public void start(ContainerBalancerConfiguration balancerConfiguration) {
this.balancerRunning = true;
public boolean start(ContainerBalancerConfiguration balancerConfiguration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is better to move these configuration initialization operations to constructor, start just do the start work without any parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was initially decided in order to support configuration change using restart. Admin could restart balancer and balancer was supposed to load the new configuration values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JacksonYao287 if there are other suggestions for this function we can take them up in #2278. If it is ok I will commit this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we put container balancer inside SCM , the configuration is just the one loaded by scm. so ,the configuration can not be reloaded unless restarting scm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our design decision, params will be based from command line, which will be implemented by #2278, thereby the conf related code will be removed in future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We will not need the configuration param in that case. I see that change is made in #2278.

@siddhantsangwan siddhantsangwan changed the title HDDS-4927. Add support for initializing an iteration in ContainerBalancer. Add unit tests. HDDS-4927. Determine over and under utilized datanodes in Container Balancer. May 26, 2021
@lokeshj1703
Copy link
Contributor

@siddhantsangwan Thanks for the contribution! @linyiqun @GlenGeng @JacksonYao287 Thanks for the reviews! I have committed the PR to master branch.

errose28 added a commit to errose28/ozone that referenced this pull request Jun 1, 2021
…ing-upgrade-master-merge

* upstream/master: (76 commits)
  HDDS-5280. Make XceiverClientManager creation when necessary in ContainerOperationClient (apache#2289)
  HDDS-5272. Make ozonefs.robot execution repeatable (apache#2280)
  HDDS-5123. Use the pre-created apache/ozone-testkrb5 image during secure acceptance tests (apache#2165)
  HDDS-4993. Add guardrail for reserved buffer size when DN reads a chunk (apache#2058)
  HDDS-4936. Change ozone groupId from org.apache.hadoop to org.apache.ozone (apache#2018)
  HDDS-4043. allow deletion from Trash directory without -skipTrash option (apache#2110)
  HDDS-4927. Determine over and under utilized datanodes in Container Balancer. (apache#2230)
  HDDS-5273. Handle unsecure cluster convert to secure cluster for SCM. (apache#2281)
  HDDS-5158. Add documentation for SCM HA Security. (apache#2205)
  HDDS-5275. Datanode Report Publisher publishes one extra report after DN shutdown (apache#2283)
  HDDS-5241. SCM UI should have leader/follower and Primordial SCM information (apache#2260)
  HDDS-5219. Limit number of bad volumes by dfs.datanode.failed.volumes.tolerated. (apache#2243)
  HDDS-5252. PipelinePlacementPolicy filter out datanodes with not enough space. (apache#2271)
  HDDS-5191. Increase default pvc storage size (apache#2219)
  HDDS-5073. Use ReplicationConfig on client side  (apache#2136)
  HDDS-5250. Build integration tests with Maven cache (apache#2269)
  HDDS-5236. Require block token for more operations (apache#2254)
  HDDS-5266 Misspelt words in S3MultipartUploadCommitPartRequest.java line 202 (apache#2279)
  HDDS-5249. Race Condition between Full and Incremental Container Reports (apache#2268)
  HDDS-5142. Make generic streaming client/service for container re-replication, data read, scm/om snapshot download (apache#2256)
  ...

Conflicts:
	hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java
	hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
	hadoop-ozone/dist/src/main/compose/testlib.sh
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestStorageContainerManager.java
	hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
	hadoop-ozone/ozone-manager/pom.xml
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/utils/OzoneManagerRatisUtils.java
	hadoop-ozone/s3gateway/pom.xml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants