
Conversation

@JacksonYao287
Contributor

What changes were proposed in this pull request?

Support multiple container moves from a source datanode

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5517

How was this patch tested?

UT

@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan please take a look at this jira as a priority; I think it will speed up the move process significantly.
By the way, I think we should not limit maxSizeEnteringTarget and maxSizeLeavingSource by default; we can specify the values on the command line if needed.

Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 Thanks for working on this!
Currently we iterate over overutilized nodes. Maybe it is better to have a selection criteria for source datanodes as well. Default implementation would be to just return the DN with highest utilisation (also including size being moved).

@JacksonYao287
Contributor Author

Maybe it is better to have a selection criteria for source datanodes as well. Default implementation would be to just return the DN with highest utilisation (also including size being moved).

Thanks @lokeshj1703 for the review. The suggestion looks good; I will do this in a new commit.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5517 branch 3 times, most recently from bed454a to 7a4b044 Compare November 10, 2021 06:25
@JacksonYao287 JacksonYao287 changed the title HDDS-5517. Support multiple container moves from a source datanode HDDS-5517. Support multiple container moves from a source datanode in one balance iteration Nov 10, 2021
@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan @ChenSammi PTAL, thanks!

Contributor

@siddhantsangwan siddhantsangwan left a comment

@JacksonYao287 The changes mostly look good. How about we make the source datanodes selection criteria methods a part of the existing ContainerBalancerSelectionCriteria class instead of creating a new one?

Comment on lines 50 to 59
//TODO: use a quicker data structure, which will have
// better performance when changing or deleting a single element
overUtilizedNodes.sort((a, b) -> {
double currentUsageOfA = a.calculateUtilization(
sizeLeavingNode.get(a.getDatanodeDetails()));
double currentUsageOfB = b.calculateUtilization(
sizeLeavingNode.get(b.getDatanodeDetails()));
//in descending order
return Double.compare(currentUsageOfB, currentUsageOfA);
});
Contributor

My understanding of this method is unclear, so I might be wrong. I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.

Contributor Author

I think we want to sort by reducing the used space (subtracting sizeLeavingNode) and then calculating utilization.

Yes, that is correct. getNextCandidateSourceDataNode should always try to return the source datanode with the highest usage. Thanks very much for pointing out this mistake!
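A sketch of the corrected ordering being discussed (with illustrative stand-in types, not the actual Ozone `DatanodeUsageInfo` API): the size already scheduled to leave a node is subtracted from its used space before utilization is computed, and sources are sorted in descending order of that effective utilization.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the fix discussed above: utilization is computed
// on (used - sizeLeaving), i.e. the space that will remain after pending
// moves, and sources are sorted in descending order of it.
// Class and field names are stand-ins, not the actual Ozone types.
public class SourceOrdering {

  static class Node {
    final String id;
    final long capacity;
    final long used;
    final long sizeLeaving; // bytes already scheduled to leave this node

    Node(String id, long capacity, long used, long sizeLeaving) {
      this.id = id;
      this.capacity = capacity;
      this.used = used;
      this.sizeLeaving = sizeLeaving;
    }

    double currentUsage() {
      return (double) (used - sizeLeaving) / capacity;
    }
  }

  static void sortDescending(List<Node> overUtilizedNodes) {
    // Descending by effective usage, so the fullest node comes first.
    overUtilizedNodes.sort(
        (a, b) -> Double.compare(b.currentUsage(), a.currentUsage()));
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>(List.of(
        new Node("x", 100, 95, 30),   // effective 65%
        new Node("y", 100, 90, 0)));  // effective 90%
    sortDescending(nodes);
    System.out.println(nodes.get(0).id); // prints "y"
  }
}
```

Note how "y" leads despite a lower raw usage, because "x" already has 30 units scheduled to leave.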

@siddhantsangwan
Contributor

Also, please note the following concern that was raised here for an earlier version of the balancer.

Question:

say we have a 10 DN cluster where the usage of all of them is 95%, and then one empty DN is added to rebalance the cluster. Given the threshold is 10%, it seems the balancer will not work in this case, since those 10 DNs will not reach the upperLimit.

Have we considered corner cases like this?

By removing withinThresholdUtilizedNodes, we rely on the user setting a suitable threshold to make balancing work in such cases.
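Working the numbers of this corner case makes it concrete: with ten datanodes at 95% and one empty node, the cluster average utilization is 10 × 95% / 11 ≈ 86.4%, so with a 10% threshold the upper limit is ≈ 96.4% and none of the full nodes is classified as over-utilized. This is just the arithmetic, not the balancer's actual code:

```java
// Back-of-the-envelope check of the corner case above: this is only the
// arithmetic, not the balancer's actual implementation.
public class ThresholdExample {

  public static void main(String[] args) {
    int fullNodes = 10;
    double fullUtilization = 0.95; // ten nodes at 95%
    double emptyUtilization = 0.0; // one freshly added node
    double threshold = 0.10;

    double average =
        (fullNodes * fullUtilization + emptyUtilization) / (fullNodes + 1);
    double upperLimit = average + threshold;

    // upperLimit is about 0.9636, so the 95% nodes are NOT over-utilized
    // and the balancer leaves them within threshold.
    System.out.printf("avg=%.4f upper=%.4f overUtilized=%b%n",
        average, upperLimit, fullUtilization > upperLimit);
  }
}
```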

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 12, 2021

Thanks very much @siddhantsangwan for the review. I have updated the patch; please take a look!

By removing withinThresholdUtilizedNodes, we rely on the user setting a suitable threshold to make balancing work in such cases.

I do get your point. I think the main difference between our ideas is what the balancer should actually do.
In your view, the balancer should always try its best to make the cluster more balanced, no matter what the threshold is.
In mine, the balancer only tries to balance the cluster to the degree the threshold specifies; if we want the cluster to be more balanced, we should specify a smaller threshold.
On one hand, in practice it is easy to specify a smaller threshold if we find the current one has no effect.
On the other hand, if we want the cluster to be more balanced, we cannot rely on the balancer's best effort with a large threshold, because how balanced it will make the cluster is uncertain; specifying a smaller threshold will definitely work as expected.

@JacksonYao287
Contributor Author

How about we make the source datanodes selection criteria methods a part of the existing ContainerBalancerSelectionCriteria class instead of creating a new one?

Looks good, so that we have only one criteria class for all the selections. What do you think @lokeshj1703?

Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 I think there are three different changes being made in this PR. It is better to separate these into different PRs.

  1. I was thinking we should have something similar to FindTargetStrategy for the source, like a FindSourceStrategy.
  2. Let's have a separate PR for the withinThreshold nodes removal. I think there was a use case where accounting for them was important, where there are 5 over-utilized and 5 within-threshold nodes. cc @siddhantsangwan I think the user shouldn't have to adjust the threshold in this case.
  3. Also, by default we shouldn't allow an unlimited amount of data to be moved from a source/target. It is important to limit this, otherwise it is possible for all balancing to happen on 2-4 nodes.

@siddhantsangwan
Contributor

I think we should keep within-threshold nodes. Removing them implies the user has to do a lot of work in first analyzing the cluster and then calculating a suitable threshold. The user can reasonably expect to introduce one new node to the cluster and have the balancer balance it with the default threshold.

I don't see any downside in the current logic related to within-threshold nodes. There's a bug that was pointed out by
@JacksonYao287:

in some cases, a container may be moved from one withinThresholdUtilized node to another withinThresholdUtilized node

An easy fix for this is to remove withinThresholdNodes from potentialTargets when matching within threshold nodes with under utilized nodes.

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 15, 2021

thanks @lokeshj1703 and @siddhantsangwan for the review!

  1. I was thinking we should have sth similar to FindTargetStrategy for source like FindSourceStrategy.

We can add a FindSourceStrategy interface and a default implementation for it. This can be done by refactoring SourceDataNodeSelectionCriteria.
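A minimal shape for such a strategy might look like the sketch below. The interface and method names here are guesses for illustration only, not the final Ozone API; the default implementation simply hands back the node with the highest effective utilization (reported usage minus the size already scheduled to leave).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a FindSourceStrategy mirroring FindTargetStrategy.
// Method and type names are illustrative, not the final Ozone API.
interface FindSourceStrategySketch {
  String getNextCandidateSourceDataNode();          // null when exhausted
  void increaseSizeLeaving(String node, long size); // track pending moves
}

// Default implementation: always return the node with the highest
// effective utilization (reported usage minus size already leaving).
public class HighestUtilizationSource implements FindSourceStrategySketch {
  private final Map<String, long[]> nodes = new HashMap<>(); // id -> {capacity, used}
  private final Map<String, Long> sizeLeaving = new HashMap<>();

  public void add(String id, long capacity, long used) {
    nodes.put(id, new long[] {capacity, used});
    sizeLeaving.put(id, 0L);
  }

  private double effectiveUtilization(String id) {
    long[] cu = nodes.get(id);
    return (double) (cu[1] - sizeLeaving.get(id)) / cu[0];
  }

  @Override
  public String getNextCandidateSourceDataNode() {
    return nodes.keySet().stream()
        .max(Comparator.comparingDouble(this::effectiveUtilization))
        .orElse(null);
  }

  @Override
  public void increaseSizeLeaving(String node, long size) {
    sizeLeaving.merge(node, size, Long::sum);
  }
}
```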

  2. Let's have a separate PR for withinThreshold nodes removal. I think there was a use case where accounting them was important where there are 5 over utilized and 5 within threshold nodes.

Sure, let us create another PR to solve that and fix the related bug. In this patch we focus on supporting multiple container moves from a source datanode. I will add the withinThreshold nodes back in a new commit.

  3. Also by default we shouldn't allow all size of data to be moved from source/target. It is important to limit otherwise it is possible for all balancing to happen in 2-4 nodes.

In this patch, before matching a target with a source, we sort all the target datanodes in ascending order and all the source datanodes in descending order by usage rate, taking sizeEnteringNode and sizeLeavingNode into account, so that we always try to match the source datanode with the biggest usage to the target datanode with the smallest usage. It is almost impossible for all balancing to happen on only 2-4 nodes, unless those 2-4 nodes have a much bigger or smaller usage than the others. I think if that case happens, it makes sense to move data only among these very unbalanced datanodes.
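The pairing described above can be sketched greedily (with stand-in types, not the actual Ozone `DatanodeUsageInfo` code): sources sorted descending and targets ascending by effective usage, then matched position by position, so the fullest source meets the emptiest target first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the greedy pairing described above: the fullest
// candidate source is matched with the emptiest candidate target.
// Names are stand-ins for the actual Ozone types.
public class GreedyMatching {

  static class Node {
    final String id;
    final double effectiveUsage; // usage adjusted for pending moves

    Node(String id, double effectiveUsage) {
      this.id = id;
      this.effectiveUsage = effectiveUsage;
    }
  }

  // Returns (source, target) id pairs, fullest-with-emptiest first.
  static List<String[]> match(List<Node> sources, List<Node> targets) {
    List<Node> src = new ArrayList<>(sources);
    List<Node> tgt = new ArrayList<>(targets);
    src.sort(Comparator.comparingDouble((Node n) -> n.effectiveUsage).reversed());
    tgt.sort(Comparator.comparingDouble((Node n) -> n.effectiveUsage));
    List<String[]> pairs = new ArrayList<>();
    for (int i = 0; i < Math.min(src.size(), tgt.size()); i++) {
      pairs.add(new String[] {src.get(i).id, tgt.get(i).id});
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<Node> sources = List.of(new Node("s1", 0.80), new Node("s2", 0.95));
    List<Node> targets = List.of(new Node("t1", 0.10), new Node("t2", 0.40));
    // First pair: fullest source s2 with emptiest target t1.
    System.out.println(String.join("->", match(sources, targets).get(0))); // prints "s2->t1"
  }
}
```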

@JacksonYao287
Contributor Author

JacksonYao287 commented Nov 16, 2021

@lokeshj1703 @siddhantsangwan I have refactored the patch according to the comments, please take a look. If it looks good to you, I will improve the code comments in a new commit.

in some cases, a container may be moved from one withinThresholdUtilized node to another withinThresholdUtilized node

I will create a separate patch to add the withinThreshold nodes back into the candidate target and source datanodes and fix their potential bug. I think after this refactoring, we can do that more gracefully.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5517 branch 4 times, most recently from 1e3a8c5 to f0e0ff0 Compare November 16, 2021 12:28
Contributor

@lokeshj1703 lokeshj1703 left a comment

@JacksonYao287 Thanks for updating the PR! I have a few minor comments inline.

@lokeshj1703
Contributor

Regarding the max limits, I don't think we can support very large limits. 500G means 500GB of data can move from/to a datanode in one iteration. I am not sure how much replication a DN can support per minute; maybe that should be determined first.
Further, we also have move timeouts; I think a move would definitely time out with this much payload.

@JacksonYao287
Contributor Author

Thanks @lokeshj1703 for the review. I have updated this patch; please take a look.

Regarding the max limits, I don't think we can support very large limits. 500G means 500GB of data can move from/to a datanode in one iteration. I am not sure how much replication a DN can support per minute; maybe that should be determined first. Further, we also have move timeouts; I think a move would definitely time out with this much payload.

We can discuss this in @siddhantsangwan's jira, which will make the default configurations smarter.

Comment on lines 66 to 70
double currentUsageOfA = a.calculateUtilization(
sizeEnteringNode.get(a.getDatanodeDetails()));
double currentUsageOfB = b.calculateUtilization(
sizeEnteringNode.get(b.getDatanodeDetails()));
return Double.compare(currentUsageOfA, currentUsageOfB);
Contributor

@lokeshj1703 lokeshj1703 Nov 18, 2021


Sorry for not bringing this up earlier! But we will need to handle the case in the comparator where utilisation is the same for two nodes; otherwise two nodes with the same utilisation cannot coexist. It would be better to make a similar change for the source as well.

Contributor

+1 o.w.

Contributor Author

Good point, thanks for pointing this out.

@JacksonYao287
Contributor Author

@lokeshj1703 thanks for pointing out the mistake. I have updated this patch; please take a look.

if (ret != 0) {
return ret;
}
return a.hashCode() - b.hashCode();
Contributor

Can we do datanode details UUID comparison instead?

Contributor Author

sure, will do this.
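The hashCode subtraction in the snippet above is also fragile: the difference can overflow, and hash codes give no stable total order. A UUID-based tie-break, as suggested, might look roughly like this (a sketch with illustrative stand-in types, not the actual Ozone DatanodeUsageInfo):

```java
import java.util.Comparator;
import java.util.UUID;

// Sketch of the tie-break discussed above: when two nodes have equal
// utilization, fall back to comparing their UUIDs instead of subtracting
// hash codes (which can overflow and gives no stable order).
// Types are illustrative stand-ins for the actual Ozone classes.
public class UuidTieBreak {

  static class Node {
    final UUID uuid;
    final double usage;

    Node(UUID uuid, double usage) {
      this.uuid = uuid;
      this.usage = usage;
    }
  }

  static final Comparator<Node> BY_USAGE_THEN_UUID =
      Comparator.comparingDouble((Node n) -> n.usage)
          .thenComparing(n -> n.uuid);

  public static void main(String[] args) {
    Node a = new Node(UUID.fromString("00000000-0000-0000-0000-000000000001"), 0.5);
    Node b = new Node(UUID.fromString("00000000-0000-0000-0000-000000000002"), 0.5);
    // Equal usage: the ordering is decided by UUID, so both nodes can
    // coexist in a sorted structure such as a TreeSet or PriorityQueue.
    System.out.println(BY_USAGE_THEN_UUID.compare(a, b) < 0); // prints "true"
  }
}
```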

Contributor

@lokeshj1703 lokeshj1703 left a comment

Pending clean CI.

@JacksonYao287
Contributor Author

@lokeshj1703 @siddhantsangwan thank you for the review! The CI failure seems not to be caused by this patch.
I have tested this patch on my Kubernetes cluster, and it works as expected. The balancing process is greatly accelerated.

@lokeshj1703
Contributor

@JacksonYao287 We will have to get a clean CI before merging. That is the process we follow for Ozone. Maybe you can try rebasing the PR on the current master.

@JacksonYao287
Contributor Author

@lokeshj1703 sure, thanks. I have merged the current master branch into this patch. Let's wait for a clean CI.

@lokeshj1703 lokeshj1703 merged commit 52e619c into apache:master Nov 23, 2021
@lokeshj1703
Contributor

@JacksonYao287 Thanks for the contribution! @siddhantsangwan Thanks for review! I have committed the PR to master branch.

@JacksonYao287
Contributor Author

thanks @lokeshj1703 and @siddhantsangwan for the review!

@JacksonYao287 JacksonYao287 deleted the HDDS-5517 branch November 23, 2021 07:29
Comment on lines +87 to +91
if(currentSize != null) {
sizeLeavingNode.put(dui, currentSize + size);
//reorder according to the latest sizeLeavingNode
potentialSources.add(nodeManager.getUsageInfo(dui));
return;
Contributor

Hey @JacksonYao287, can you help me understand what's happening in this method? I don't think the usage info for a node will get updated during an iteration, since DU/DF don't run during an iteration.

Contributor Author

Let me explain this.
When selecting a source datanode, we always want to select the one with the largest storage usage. Here I use a PriorityQueue, which makes it fast to get the top element. When getNextCandidateSourceDataNode is called, PriorityQueue#poll is invoked, which gets and removes the top source datanode from the PriorityQueue.

Actually, now that we support moving multiple containers from one datanode, a datanode can be selected as a source multiple times in one iteration.

There are two reasons to call potentialSources.add(nodeManager.getUsageInfo(dui)):
1. It adds the datanode back to the PriorityQueue, so it can be selected as a source again.
2. When we update sizeLeavingNode, the usage of this datanode is considered to have changed (the reported usage minus sizeLeaving), so we need to re-rank all the candidate source datanodes according to the latest usage and take the new top one. When the datanode is added back, the PriorityQueue re-sorts the nodes (it uses a heap, so this is very fast), and we can then get the next top one.

Is this clear now?
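The poll-update-re-add cycle described above can be sketched as follows (illustrative names and types, not the actual Ozone classes): the queue is ordered by effective utilization, poll() removes the fullest node, and re-adding the node after updating sizeLeaving lets the heap re-rank it.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of the pattern explained above: a PriorityQueue ordered by
// effective utilization; poll() removes the fullest node, and after
// updating sizeLeaving the node is re-added so the heap re-ranks it.
// Names are illustrative, not the actual Ozone classes.
public class SourceQueueSketch {

  final Map<String, long[]> usage = new HashMap<>();   // id -> {capacity, used}
  final Map<String, Long> sizeLeaving = new HashMap<>();
  final PriorityQueue<String> potentialSources;

  SourceQueueSketch() {
    // Highest effective utilization first (descending order).
    potentialSources = new PriorityQueue<>(
        Comparator.comparingDouble(this::effectiveUtilization).reversed());
  }

  double effectiveUtilization(String id) {
    long[] cu = usage.get(id);
    return (double) (cu[1] - sizeLeaving.get(id)) / cu[0];
  }

  void add(String id, long capacity, long used) {
    usage.put(id, new long[] {capacity, used});
    sizeLeaving.put(id, 0L);
    potentialSources.add(id);
  }

  // poll() gets and removes the current top source datanode.
  String getNextCandidateSourceDataNode() {
    return potentialSources.poll();
  }

  // Record an outgoing move and re-add the node so it can be selected
  // again, ranked by its updated effective utilization.
  void increaseSizeLeaving(String id, long size) {
    sizeLeaving.merge(id, size, Long::sum);
    potentialSources.add(id);
  }
}
```

Note that the node is always polled out of the queue before its sizeLeaving is updated, which avoids mutating the sort key of an element that is still inside the heap.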
