
Conversation

@JacksonYao287
Contributor

What changes were proposed in this pull request?

Support multiple container moves from a source datanode

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5517

How was this patch tested?

UT

@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan please take a look at this jira as a priority; I think it will speed up the move process significantly.
By the way, I think we should not limit maxSizeEnteringTarget and maxSizeLeavingSource by default; we can specify the values on the command line if needed.

Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 Thanks for working on this!
Currently we iterate over overutilized nodes. Maybe it is better to have a selection criteria for source datanodes as well. Default implementation would be to just return the DN with highest utilisation (also including size being moved).

@JacksonYao287
Contributor Author

Maybe it is better to have a selection criteria for source datanodes as well. Default implementation would be to just return the DN with highest utilisation (also including size being moved).

Thanks @lokeshj1703 for the review. The suggestion looks good; I will do this in a new commit.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5517 branch 3 times, most recently from bed454a to 7a4b044 Compare November 10, 2021 06:25
@JacksonYao287 JacksonYao287 changed the title HDDS-5517. Support multiple container moves from a source datanode HDDS-5517. Support multiple container moves from a source datanode in one balance iteration Nov 10, 2021
@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan @ChenSammi PTAL, thanks!

Contributor

@siddhantsangwan siddhantsangwan left a comment

@JacksonYao287 The changes mostly look good. How about we make the source datanodes selection criteria methods a part of the existing ContainerBalancerSelectionCriteria class instead of creating a new one?

Comment on lines 50 to 59
//TODO: use a quicker data structure, which will have
// better performance when changing or deleting a single element
overUtilizedNodes.sort((a, b) -> {
double currentUsageOfA = a.calculateUtilization(
sizeLeavingNode.get(a.getDatanodeDetails()));
double currentUsageOfB = b.calculateUtilization(
sizeLeavingNode.get(b.getDatanodeDetails()));
//in descending order
return Double.compare(currentUsageOfB, currentUsageOfA);
});
Contributor

My understanding of this method is unclear, so I might be wrong. I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.

Contributor Author

I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.

Yes, that is correct. getNextCandidateSourceDataNode should always try to return the source datanode with the highest usage. Thanks very much for pointing out this mistake!
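A sketch of the corrected ordering being discussed (with illustrative stand-in types, not the actual Ozone `DatanodeUsageInfo` API): the size already scheduled to leave a node is subtracted from its used space before utilization is computed, and sources are sorted in descending order of that effective utilization.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the fix discussed above: utilization is computed
// on (used - sizeLeaving), i.e. the space that will remain after pending
// moves, and sources are sorted in descending order of it.
// Class and field names are stand-ins, not the actual Ozone types.
public class SourceOrdering {

  static class Node {
    final String id;
    final long capacity;
    final long used;
    final long sizeLeaving; // bytes already scheduled to leave this node

    Node(String id, long capacity, long used, long sizeLeaving) {
      this.id = id;
      this.capacity = capacity;
      this.used = used;
      this.sizeLeaving = sizeLeaving;
    }

    double currentUsage() {
      return (double) (used - sizeLeaving) / capacity;
    }
  }

  static void sortDescending(List<Node> overUtilizedNodes) {
    // Descending by effective usage, so the fullest node comes first.
    overUtilizedNodes.sort(
        (a, b) -> Double.compare(b.currentUsage(), a.currentUsage()));
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>(List.of(
        new Node("x", 100, 95, 30),   // effective 65%
        new Node("y", 100, 90, 0)));  // effective 90%
    sortDescending(nodes);
    System.out.println(nodes.get(0).id); // prints "y"
  }
}
```

Note how "y" leads despite a lower raw usage, because "x" already has 30 units scheduled to leave.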

@siddhantsangwan
Contributor

Also, please note the following concern that was raised here for an earlier version of the balancer.

Question:

say we have a 10 DN cluster where the usage of all of them is 95%, and then one empty DN is added to rebalance the cluster. Given the threshold is 10%, it seems the balancer will not work in this case, since those 10 DNs will not reach the upperLimit.

Have we considered corner cases like this?

By removing withinThresholdUtilizedNodes, we rely on the user setting a suitable threshold to make balancing work in such cases.
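Working the numbers of this corner case makes it concrete: with ten datanodes at 95% and one empty node, the cluster average utilization is 10 × 95% / 11 ≈ 86.4%, so with a 10% threshold the upper limit is ≈ 96.4% and none of the full nodes is classified as over-utilized. This is just the arithmetic, not the balancer's actual code:

```java
// Back-of-the-envelope check of the corner case above: this is only the
// arithmetic, not the balancer's actual implementation.
public class ThresholdExample {

  public static void main(String[] args) {
    int fullNodes = 10;
    double fullUtilization = 0.95; // ten nodes at 95%
    double emptyUtilization = 0.0; // one freshly added node
    double threshold = 0.10;

    double average =
        (fullNodes * fullUtilization + emptyUtilization) / (fullNodes + 1);
    double upperLimit = average + threshold;

    // upperLimit is about 0.9636, so the 95% nodes are NOT over-utilized
    // and the balancer leaves them within threshold.
    System.out.printf("avg=%.4f upper=%.4f overUtilized=%b%n",
        average, upperLimit, fullUtilization > upperLimit);
  }
}
```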

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 12, 2021

Thanks very much @siddhantsangwan for the review. I have updated the patch; please take a look!

By removing withinThresholdUtilizedNodes, we rely on the user setting a suitable threshold to make balancing work in such cases.

I do get your point. I think the main difference between our ideas is what the balancer should actually do.
In your view, the balancer should always try its best to make the cluster more balanced, no matter what the threshold is.
In mine, the balancer only tries to balance the cluster to the degree the threshold specifies; if we want the cluster to be more balanced, we should specify a smaller threshold.
On one hand, in practice it is easy to specify a smaller threshold if we find the current one has no effect.
On the other hand, if we want the cluster to be more balanced, we cannot rely on the balancer's best effort with a large threshold, because how balanced it will make the cluster is uncertain; specifying a smaller threshold will definitely work as expected.

@JacksonYao287
Contributor Author

How about we make the source datanodes selection criteria methods a part of the existing ContainerBalancerSelectionCriteria class instead of creating a new one?

Looks good, so that we have only one criteria class for all the selections. What do you think @lokeshj1703?

Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 I think there are three different changes being made in this PR. It is better to separate these into different PRs.

  1. I was thinking we should have something similar to FindTargetStrategy for the source, like a FindSourceStrategy.
  2. Let's have a separate PR for the withinThreshold nodes removal. I think there was a use case where accounting for them was important, where there are 5 over-utilized and 5 within-threshold nodes. cc @siddhantsangwan I think the user shouldn't have to adjust the threshold in this case.
  3. Also, by default we shouldn't allow an unlimited amount of data to be moved from a source/target. It is important to limit this, otherwise it is possible for all balancing to happen on 2-4 nodes.

@siddhantsangwan
Contributor

I think we should keep within-threshold nodes. Removing them implies the user has to do a lot of work in first analyzing the cluster and then calculating a suitable threshold. The user can reasonably expect to introduce one new node to the cluster and have the balancer balance it with the default threshold.

I don't see any downside in the current logic related to within-threshold nodes. There's a bug that was pointed out by
@JacksonYao287:

in some cases, a container may be moved from one withinThresholdUtilized node to another withinThresholdUtilized node

An easy fix for this is to remove withinThresholdNodes from potentialTargets when matching within threshold nodes with under utilized nodes.

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 15, 2021

thanks @lokeshj1703 and @siddhantsangwan for the review!

  1. I was thinking we should have sth similar to FindTargetStrategy for source like FindSourceStrategy.

We can add a FindSourceStrategy interface and a default implementation for it. This can be done by refactoring SourceDataNodeSelectionCriteria.
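A minimal shape for such a strategy might look like the sketch below. The interface and method names here are guesses for illustration only, not the final Ozone API; the default implementation simply hands back the node with the highest effective utilization (reported usage minus the size already scheduled to leave).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a FindSourceStrategy mirroring FindTargetStrategy.
// Method and type names are illustrative, not the final Ozone API.
interface FindSourceStrategySketch {
  String getNextCandidateSourceDataNode();          // null when exhausted
  void increaseSizeLeaving(String node, long size); // track pending moves
}

// Default implementation: always return the node with the highest
// effective utilization (reported usage minus size already leaving).
public class HighestUtilizationSource implements FindSourceStrategySketch {
  private final Map<String, long[]> nodes = new HashMap<>(); // id -> {capacity, used}
  private final Map<String, Long> sizeLeaving = new HashMap<>();

  public void add(String id, long capacity, long used) {
    nodes.put(id, new long[] {capacity, used});
    sizeLeaving.put(id, 0L);
  }

  private double effectiveUtilization(String id) {
    long[] cu = nodes.get(id);
    return (double) (cu[1] - sizeLeaving.get(id)) / cu[0];
  }

  @Override
  public String getNextCandidateSourceDataNode() {
    return nodes.keySet().stream()
        .max(Comparator.comparingDouble(this::effectiveUtilization))
        .orElse(null);
  }

  @Override
  public void increaseSizeLeaving(String node, long size) {
    sizeLeaving.merge(node, size, Long::sum);
  }
}
```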

  2. Let's have a separate PR for withinThreshold nodes removal. I think there was a use case where accounting them was important where there are 5 over utilized and 5 within threshold nodes.

Sure, let us create another PR to solve that and fix the related bug. In this patch we focus on supporting multiple container moves from a source datanode. I will add the withinThreshold nodes back in a new commit.

  3. Also by default we shouldn't allow all size of data to be moved from source/target. It is important to limit otherwise it is possible for all balancing to happen in 2-4 nodes.

In this patch, before matching a target with a source, we sort all the target datanodes in ascending order and all the source datanodes in descending order by usage rate, taking sizeEnteringNode and sizeLeavingNode into account, so that we always try to match the source datanode with the biggest usage to the target datanode with the smallest usage. It is almost impossible for all balancing to happen on only 2-4 nodes, unless those 2-4 nodes have a much bigger or smaller usage than the others. I think if that case happens, it makes sense to move data only among these very unbalanced datanodes.
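The pairing described above can be sketched greedily (with stand-in types, not the actual Ozone `DatanodeUsageInfo` code): sources sorted descending and targets ascending by effective usage, then matched position by position, so the fullest source meets the emptiest target first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the greedy pairing described above: the fullest
// candidate source is matched with the emptiest candidate target.
// Names are stand-ins for the actual Ozone types.
public class GreedyMatching {

  static class Node {
    final String id;
    final double effectiveUsage; // usage adjusted for pending moves

    Node(String id, double effectiveUsage) {
      this.id = id;
      this.effectiveUsage = effectiveUsage;
    }
  }

  // Returns (source, target) id pairs, fullest-with-emptiest first.
  static List<String[]> match(List<Node> sources, List<Node> targets) {
    List<Node> src = new ArrayList<>(sources);
    List<Node> tgt = new ArrayList<>(targets);
    src.sort(Comparator.comparingDouble((Node n) -> n.effectiveUsage).reversed());
    tgt.sort(Comparator.comparingDouble((Node n) -> n.effectiveUsage));
    List<String[]> pairs = new ArrayList<>();
    for (int i = 0; i < Math.min(src.size(), tgt.size()); i++) {
      pairs.add(new String[] {src.get(i).id, tgt.get(i).id});
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<Node> sources = List.of(new Node("s1", 0.80), new Node("s2", 0.95));
    List<Node> targets = List.of(new Node("t1", 0.10), new Node("t2", 0.40));
    // First pair: fullest source s2 with emptiest target t1.
    System.out.println(String.join("->", match(sources, targets).get(0))); // prints "s2->t1"
  }
}
```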

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 16, 2021

@lokeshj1703 @siddhantsangwan I have refactored the patch according to the comments, please take a look. If it looks good to you, I will improve the code comments in a new commit.

in some cases, a container may be moved from one withinThresholdUtilized node to another withinThresholdUtilized node

I will create a separate patch to add the withinThreshold nodes back into the candidate target and source datanodes and fix their potential bug. I think after this refactoring, we can do that more gracefully.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5517 branch 4 times, most recently from 1e3a8c5 to f0e0ff0 Compare November 16, 2021 12:28
Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 Thanks for updating the PR! I have a few minor comments inline.

@lokeshj1703
Contributor

Regarding the max limits, I don't think we can support very large limits. 500G means 500GB of data can move from/to a datanode in one iteration. I am not sure how much replication a DN can support per minute; maybe that should be determined first.
Further, we also have move timeouts; I think a move would definitely time out with this much payload.

@JacksonYao287
Contributor Author

Thanks @lokeshj1703 for the review. I have updated this patch; please take a look.

Regarding the max limits, I don't think we can support very large limits. 500G means 500GB of data can move from/to a datanode in one iteration. I am not sure how much replication a DN can support per minute; maybe that should be determined first. Further, we also have move timeouts; I think a move would definitely time out with this much payload.

We can discuss this in @siddhantsangwan's jira, which will make the default configurations smarter.

Comment on lines 66 to 70
double currentUsageOfA = a.calculateUtilization(
sizeEnteringNode.get(a.getDatanodeDetails()));
double currentUsageOfB = b.calculateUtilization(
sizeEnteringNode.get(b.getDatanodeDetails()));
return Double.compare(currentUsageOfA, currentUsageOfB);
Contributor

@lokeshj1703 lokeshj1703 Nov 18, 2021


Sorry for not bringing this up earlier! But we will need to handle the case in the comparator where utilisation is the same for two nodes; otherwise two nodes with the same utilisation cannot coexist. It would be better to make a similar change for the source as well.

Contributor

+1 o.w.

Contributor Author

Good point, thanks for pointing this out.

@JacksonYao287
Contributor Author

@lokeshj1703 thanks for pointing out the mistake. I have updated this patch; please take a look.

if (ret != 0) {
return ret;
}
return a.hashCode() - b.hashCode();
Contributor

Can we do datanode details UUID comparison instead?

Contributor Author

sure, will do this.
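The hashCode subtraction in the snippet above is also fragile: the difference can overflow, and hash codes give no stable total order. A UUID-based tie-break, as suggested, might look roughly like this (a sketch with illustrative stand-in types, not the actual Ozone DatanodeUsageInfo):

```java
import java.util.Comparator;
import java.util.UUID;

// Sketch of the tie-break discussed above: when two nodes have equal
// utilization, fall back to comparing their UUIDs instead of subtracting
// hash codes (which can overflow and gives no stable order).
// Types are illustrative stand-ins for the actual Ozone classes.
public class UuidTieBreak {

  static class Node {
    final UUID uuid;
    final double usage;

    Node(UUID uuid, double usage) {
      this.uuid = uuid;
      this.usage = usage;
    }
  }

  static final Comparator<Node> BY_USAGE_THEN_UUID =
      Comparator.comparingDouble((Node n) -> n.usage)
          .thenComparing(n -> n.uuid);

  public static void main(String[] args) {
    Node a = new Node(UUID.fromString("00000000-0000-0000-0000-000000000001"), 0.5);
    Node b = new Node(UUID.fromString("00000000-0000-0000-0000-000000000002"), 0.5);
    // Equal usage: the ordering is decided by UUID, so both nodes can
    // coexist in a sorted structure such as a TreeSet or PriorityQueue.
    System.out.println(BY_USAGE_THEN_UUID.compare(a, b) < 0); // prints "true"
  }
}
```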

Contributor

@lokeshj1703 lokeshj1703 left a comment

Pending clean CI.

@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan thank you for the review! The CI failure seems not to be caused by this patch.
I have tested this patch on my Kubernetes cluster, and it works as expected. The balancing process is greatly accelerated.

@lokeshj1703
Contributor

@JacksonYao287 We will have to get a clean CI before merging. That is the process we follow for Ozone. Maybe you can try rebasing the PR on the current master.

@JacksonYao287
Contributor Author

@lokeshj1703 sure, thanks. I have merged the current master branch into this patch. Let's wait for a clean CI.

@lokeshj1703 lokeshj1703 merged commit 52e619c into apache:master Nov 23, 2021
@lokeshj1703
Contributor

@JacksonYao287 Thanks for the contribution! @siddhantsangwan Thanks for review! I have committed the PR to master branch.

@JacksonYao287
Contributor Author

thanks @lokeshj1703 and @siddhantsangwan for the review!

@JacksonYao287 JacksonYao287 deleted the HDDS-5517 branch November 23, 2021 07:29
Comment on lines +87 to +91
if(currentSize != null) {
sizeLeavingNode.put(dui, currentSize + size);
//reorder according to the latest sizeLeavingNode
potentialSources.add(nodeManager.getUsageInfo(dui));
return;
Contributor

Hey @JacksonYao287, can you help me understand what's happening in this method? I don't think the usage info for a node will get updated during an iteration, since DU/DF don't run during an iteration.

Contributor Author

Let me explain this.
When selecting a source datanode, we always want to select the one with the largest storage usage. Here I use a PriorityQueue, which makes it fast to get the top element. When getNextCandidateSourceDataNode is called, PriorityQueue#poll is invoked, which gets and removes the top source datanode from the PriorityQueue.

Actually, now that we support moving multiple containers from one datanode, a datanode can be selected as a source multiple times in one iteration.

There are two reasons to call potentialSources.add(nodeManager.getUsageInfo(dui)):
1. It adds the datanode back to the PriorityQueue, so it can be selected as a source again.
2. When we update sizeLeavingNode, the usage of this datanode is considered to have changed (the reported usage minus sizeLeaving), so we need to re-rank all the candidate source datanodes according to the latest usage and take the new top one. When the datanode is added back, the PriorityQueue re-sorts the nodes (it uses a heap, so this is very fast), and we can then get the next top one.

Is this clear now?
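The poll-update-re-add cycle described above can be sketched as follows (illustrative names and types, not the actual Ozone classes): the queue is ordered by effective utilization, poll() removes the fullest node, and re-adding the node after updating sizeLeaving lets the heap re-rank it.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of the pattern explained above: a PriorityQueue ordered by
// effective utilization; poll() removes the fullest node, and after
// updating sizeLeaving the node is re-added so the heap re-ranks it.
// Names are illustrative, not the actual Ozone classes.
public class SourceQueueSketch {

  final Map<String, long[]> usage = new HashMap<>();   // id -> {capacity, used}
  final Map<String, Long> sizeLeaving = new HashMap<>();
  final PriorityQueue<String> potentialSources;

  SourceQueueSketch() {
    // Highest effective utilization first (descending order).
    potentialSources = new PriorityQueue<>(
        Comparator.comparingDouble(this::effectiveUtilization).reversed());
  }

  double effectiveUtilization(String id) {
    long[] cu = usage.get(id);
    return (double) (cu[1] - sizeLeaving.get(id)) / cu[0];
  }

  void add(String id, long capacity, long used) {
    usage.put(id, new long[] {capacity, used});
    sizeLeaving.put(id, 0L);
    potentialSources.add(id);
  }

  // poll() gets and removes the current top source datanode.
  String getNextCandidateSourceDataNode() {
    return potentialSources.poll();
  }

  // Record an outgoing move and re-add the node so it can be selected
  // again, ranked by its updated effective utilization.
  void increaseSizeLeaving(String id, long size) {
    sizeLeaving.merge(id, size, Long::sum);
    potentialSources.add(id);
  }
}
```

Note that the node is always polled out of the queue before its sizeLeaving is updated, which avoids mutating the sort key of an element that is still inside the heap.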
