HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria #6050

symious · 2024-01-22T08:15:53Z

What changes were proposed in this pull request?

HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria

Sorting all the containers is very time consuming. This patch caches the sort result to improve balancer speed.
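As a rough illustration of the idea (not the code in this patch), the sketch below keeps one sorted set per node and rebuilds it only after an explicit invalidation. All names here (`SortedCandidateCache`, `candidates`, `invalidate`, `sortCount`) are hypothetical, nodes are plain strings, and the `Long` values stand in for per-container used bytes.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration: none of these names come from the Ozone code base.
final class SortedCandidateCache {
  // Stand-in for the per-node container listing; values play the role of used bytes.
  private final Map<String, List<Long>> usedBytesByNode;
  // Cached sort result per node.
  private final Map<String, NavigableSet<Long>> cache = new HashMap<>();
  // Counts how often the expensive sort actually runs.
  private int sortCount = 0;

  SortedCandidateCache(Map<String, List<Long>> usedBytesByNode) {
    this.usedBytesByNode = usedBytesByNode;
  }

  // Returns the cached sorted set, sorting only on the first request per node.
  NavigableSet<Long> candidates(String node) {
    NavigableSet<Long> sorted = cache.get(node);
    if (sorted == null) {
      sorted = sortDescending(usedBytesByNode.getOrDefault(node, List.of()));
      cache.put(node, sorted);
    }
    return sorted;
  }

  // Drops the cached result, for example when the node disappears.
  void invalidate(String node) {
    cache.remove(node);
  }

  int sortCount() {
    return sortCount;
  }

  private NavigableSet<Long> sortDescending(List<Long> values) {
    sortCount++;
    NavigableSet<Long> set = new TreeSet<>(Comparator.reverseOrder());
    set.addAll(values);
    return set;
  }
}
```

Repeated calls to `candidates` for the same node reuse the first sort result. The price is that the cache has to be kept consistent with later changes, which is what much of the review discussion is about.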

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10160

How was this patch tested?

The existing balancer tests passing should be sufficient.

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

sumitagrawl

@symious Thanks for working on this. Can we avoid the need to sort and iterate over all containers every time? I have given a few suggestions.

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

siddhantsangwan

Thanks @symious for working on this. Are you planning to make more such perf improvements for balancer? And did you happen to perform any benchmarks to identify the bottlenecks?

siddhantsangwan · 2024-01-22T12:19:14Z

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

+        addNodeToSetMap(node);
+      }
+      // In case the node is removed
+      nodeManager.getContainers(node);


Are we making this call just to see if an exception gets thrown? In that case this is a bit awkward and confusing. Does node manager provide an API that we can use first to check if SCM knows the node, and then get its containers (or remove them from the cache if the node isn't there anymore)?

Yes, since currently there is no explicit method to check if a node exists, this is used for this check.

Maybe getNodeStatus would be a better way to check that the node is still there? getContainers creates a new HashSet of all the containers on the node, so it is somewhat expensive if we don't use the returned containers.

I agree with @sodonnel. Would isNodeRegistered be even better?

ozone/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java

Lines 627 to 634 in ac597ad

    @Override
    public Boolean isNodeRegistered(DatanodeDetails datanodeDetails) {
      try {
        nodeStateManager.getNode(datanodeDetails);
        return true;
      } catch (NodeNotFoundException e) {
        return false;
      }
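To make the trade-off discussed above concrete, here is a minimal sketch that contrasts probing for a node by fetching (and copying) its containers with a plain membership check. `Registry`, `NodeMissingException`, and both `existsVia...` helpers are hypothetical stand-ins, not the real `NodeManager` API. The `copies` counter only exists to show how often a defensive copy is made.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins; these are not the Ozone NodeManager classes.
final class NodeCheckSketch {
  static final class NodeMissingException extends Exception {
    NodeMissingException(String node) {
      super("Unknown node: " + node);
    }
  }

  static final class Registry {
    private final Map<String, Set<Long>> containers = new HashMap<>();
    // Number of defensive copies handed out by getContainers.
    int copies = 0;

    void register(String node, Set<Long> ids) {
      containers.put(node, ids);
    }

    void remove(String node) {
      containers.remove(node);
    }

    // Cheap membership check, analogous in spirit to isNodeRegistered.
    boolean isRegistered(String node) {
      return containers.containsKey(node);
    }

    // Returns a fresh copy on every call, so it costs O(n) per invocation.
    Set<Long> getContainers(String node) throws NodeMissingException {
      Set<Long> ids = containers.get(node);
      if (ids == null) {
        throw new NodeMissingException(node);
      }
      copies++;
      return new HashSet<>(ids);
    }
  }

  // Existence check by side effect: copies the whole set and discards it.
  static boolean existsViaGetContainers(Registry registry, String node) {
    try {
      registry.getContainers(node);
      return true;
    } catch (NodeMissingException e) {
      return false;
    }
  }

  // Existence check that states its intent and copies nothing.
  static boolean existsViaLookup(Registry registry, String node) {
    return registry.isRegistered(node);
  }
}
```

Both helpers give the same answer, but only the second avoids allocating a set that is immediately thrown away, and it reads as a check rather than as a call whose result is ignored.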

symious · 2024-01-23T02:29:33Z

Thanks @symious for working on this. Are you planning to make more such perf improvements for balancer? And did you happen to perform any benchmarks to identify the bottlenecks?

@sumitagrawl We noticed that the interval between SCM sending requests is quite long, about 20~30 seconds at worst, and most of that time is spent in NavigableSet.add(). That is why we made this improvement.

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerTask.java

sumitagrawl

LGTM

guohao-rosicky · 2024-01-26T06:28:32Z

This was discussed at the last weekly meeting. Thanks @symious for providing the PR so quickly.

adoroszlai · 2024-01-26T06:28:52Z

@symious can you please take a look at #6050 (comment) (and #6050 (comment))?

guohao-rosicky · 2024-01-26T06:32:35Z

@sumitagrawl We noticed that the interval between SCM sending requests is quite long, about 20~30 seconds at worst, and most of that time is spent in NavigableSet.add(). That is why we made this improvement.

Yes, building the NavigableSet is very time consuming, even longer than 30s in my environment!

symious · 2024-01-26T06:49:59Z

can you please take a look at #6050 (comment) (and #6050 (comment))?

@sodonnel @adoroszlai Updated to use isNodeRegistered, PTAL.

adoroszlai · 2024-01-26T08:07:29Z

Updated to use isNodeRegistered, PTAL.

Thanks @symious for updating the patch. It seems TestContainerBalancerTask needs to be tweaked, too.

https://github.com/apache/ozone/actions/runs/7664891568/job/20890035849?pr=6050#step:5:1806

symious · 2024-01-26T08:46:23Z

It seems TestContainerBalancerTask needs to be tweaked, too.

@adoroszlai Updated, PTAL.

adoroszlai · 2024-01-26T10:05:39Z

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

+    try {
+      // Initialize containerSet for node
+      setMap.computeIfAbsent(node, n -> {
+        try {
+          addNodeToSetMap(n);
+          return setMap.get(n);
+        } catch (NodeNotFoundException e) {
+          LOG.warn("Could not find Datanode {} while selecting candidate " +
+              "containers for Container Balancer.", n.toString(), e);
+          return null;
+        }
+      });
+    } catch (Exception e) {
+      LOG.error("An unexpected error occurred while processing the node.", e);
+      setMap.remove(node);
+      return Collections.emptySet();
    }

-    containerIDSet.removeIf(
-        containerID -> shouldBeExcluded(containerID, node, sizeMovedAlready));
-    return containerIDSet;
+    return setMap.get(node);


Sorry, I just realized that this is a bit more complex than needed. addNodeToSetMap shouldn't update setMap while being called via computeIfAbsent.

This can be simplified to:

    Set<ContainerID> containers = setMap.computeIfAbsent(node, this::getCandidateContainers);
    return containers != null ? containers : Collections.emptySet();

With a small change in addNodeToSetMap (renamed to getCandidateContainers, as it no longer changes setMap):

    private NavigableSet<ContainerID> getCandidateContainers(DatanodeDetails node) {
      NavigableSet<ContainerID> newSet =
          new TreeSet<>(orderContainersByUsedBytes().reversed());
      try {
        Set<ContainerID> idSet = nodeManager.getContainers(node);
        if (excludeContainers != null) {
          idSet.removeAll(excludeContainers);
        }
        if (selectedContainers != null) {
          idSet.removeAll(selectedContainers);
        }
        newSet.addAll(idSet);
        return newSet;
      } catch (NodeNotFoundException e) {
        LOG.warn("Could not find Datanode {} while selecting candidate " +
            "containers for Container Balancer.", node, e);
        return null;
      }
    }
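The suggestion above relies on two documented properties of `Map.computeIfAbsent`: when the mapping function returns null, no mapping is recorded, and the mapping function should not modify the map itself (recent `HashMap` implementations may throw `ConcurrentModificationException` if it does). The following is a minimal, self-contained sketch of the first property. The class name, the `source` map standing in for the node manager, and the `loads` counter are assumptions made for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; nodes are strings and container IDs are longs.
final class ComputeIfAbsentSketch {
  private final Map<String, Set<Long>> setMap = new HashMap<>();
  // Stand-in for the node manager: a missing key models a removed node.
  private final Map<String, Set<Long>> source;
  // Counts invocations of the mapping function.
  int loads = 0;

  ComputeIfAbsentSketch(Map<String, Set<Long>> source) {
    this.source = source;
  }

  Set<Long> candidates(String node) {
    // The mapping function only computes a value; it never touches setMap.
    Set<Long> result = setMap.computeIfAbsent(node, this::load);
    return result != null ? result : Collections.emptySet();
  }

  private Set<Long> load(String node) {
    loads++;
    Set<Long> ids = source.get(node);
    // Returning null tells computeIfAbsent to record no mapping.
    return ids == null ? null : new TreeSet<>(ids);
  }

  boolean isCached(String node) {
    return setMap.containsKey(node);
  }
}
```

A known node is loaded once and then served from the map, while an unknown node yields an empty set, leaves no entry behind, and is looked up again on the next call.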

The changes look good, updated, PTAL.

adoroszlai

Thanks @symious for updating the patch.

adoroszlai · 2024-01-26T17:44:39Z

@sodonnel would you like to take another look?

adoroszlai · 2024-01-29T09:59:24Z

@siddhantsangwan would you like to take another look?

siddhantsangwan · 2024-01-29T11:18:14Z

@siddhantsangwan would you like to take another look?

Yes, will review again.

siddhantsangwan

Mostly looks good, just two minor comments.

siddhantsangwan · 2024-01-30T05:55:37Z

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

+   * 1. Container must not be in ExcludedContainers.
+   * 2. Container must not be in SelectedContainers.


Points 1 and 2 duplicate points 6 and 4, respectively.

@siddhantsangwan Thanks for the detailed review. Updated, PTAL.

siddhantsangwan · 2024-01-30T05:56:09Z

...n/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerSelectionCriteria.java

+   * 7. Container should be closed.
+   * 8. If the {@link LegacyReplicationManager} is enabled, then the container should not be an EC container.
+   * @param node DatanodeDetails for which to find candidate containers.
+   * @return Set of candidate containers that satisfy the criteria.


@return should be updated.

siddhantsangwan

@symious thanks for the update.

adoroszlai · 2024-01-30T09:06:23Z

Thanks @symious for the patch, @guohao-rosicky, @siddhantsangwan, @sodonnel, @sumitagrawl for the review.

…DDS-10870 (apache#221)

* CDPD-65217. HDDS-10085. Improve method name in ContainerBalancerSelectionCriteria (apache#5957) (cherry picked from commit b932e16)
* CDPD-65645. HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria (apache#6050) (cherry picked from commit 95666eb)
* CDPD-52140. HDDS-7252. Polled source Datanodes are wrongly not re-considered for balancing in Container Balancer (apache#6305) (cherry picked from commit fdc38b5)
* CDPD-70112. HDDS-10869. SCMNodeManager#getUsageInfo memory occupancy optimization (apache#6737) (cherry picked from commit 39baf0f)
* CDPD-70110. HDDS-10871. ContainerBalancerSelectionCriteria memory occupancy optimization (apache#6738) (cherry picked from commit 19d4419)
* CDPD-70111. HDDS-10870. moveSelectionToFutureMap cleanup when future complete (apache#6746) (cherry picked from commit 4f02853)
* CDPD-78919. HDDS-12231. Logging in Container Balancer is too verbose. (apache#7826) (cherry picked from commit 371792f)

---------

Conflicts:
1. Forced to include CDPD-65217 (minor commit that only changes a method name) and CDPD-52140 to resolve conflicts. No major changes made.
2. Had to change a test method to use the full Assertions.assertEquals instead of just assertEquals.
3. CDPD-70111 - had to merge manually to exclude changes from other unrelated commits in ContainerBalancerTask.

---------

Co-authored-by: Siddhant Sangwan <[email protected]>
Co-authored-by: Symious <[email protected]>
Co-authored-by: Tejaskriya <[email protected]>
Co-authored-by: hao guo <[email protected]>

HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria

63b257a

symious mentioned this pull request Jan 22, 2024

HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria #6032

Closed

symious added 2 commits January 22, 2024 16:55

HDDS-10160. Sort out the code

e18496a

HDDS-10160. Filter selected and excluded containers

968cde7

symious commented Jan 22, 2024

sumitagrawl reviewed Jan 22, 2024

siddhantsangwan reviewed Jan 22, 2024

symious added 3 commits January 23, 2024 15:25

HDDS-10160. check on iteration

2f4ec85

Remove containers from candidationContainerSet

05c5a05

Remove containerId from candidateContainerSet

54d336f

sumitagrawl reviewed Jan 24, 2024

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerTask.java

HDDS-10160. Add comment

7f3b085

sumitagrawl approved these changes Jan 24, 2024

adoroszlai requested a review from siddhantsangwan January 24, 2024 08:24

guohao-rosicky approved these changes Jan 26, 2024

HDDS-10160. Use isNodeRegistered

92c7ba0

adoroszlai requested a review from sodonnel January 26, 2024 07:14

HDDS-10160. Implement MockNodeManager's isNodeRegistered()

7b590d2

adoroszlai reviewed Jan 26, 2024

HDDS-10160. Code cleaning

b2c3bcb

adoroszlai approved these changes Jan 26, 2024

siddhantsangwan reviewed Jan 30, 2024

HDDS-10160. Update comments

8c00ae7

siddhantsangwan approved these changes Jan 30, 2024

adoroszlai merged commit 95666eb into apache:master Jan 30, 2024
