Skip to content

Conversation

@symious
Copy link
Contributor

@symious symious commented Jan 22, 2024

What changes were proposed in this pull request?

HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria

The sort of all the containers is very time consuming, this patch is to cache the sort result and improve the balancer speed.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10160

How was this patch tested?

Original balancer tests running successfully would be enough.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious Thanks for working over this, can we see avoid need of sorting and iterating of all containers every time. Given few suggestions

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @symious for working on this. Are you planning to make more such perf improvements for balancer? And did you happen to perform any benchmarks to identify the bottlenecks?

addNodeToSetMap(node);
}
// In case the node is removed
nodeManager.getContainers(node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we making this call just to see if an exception gets thrown? In that case this is a bit awkward and confusing. Does node manager provide an API that we can use first to check if SCM knows the node, and then get its containers (or remove them from the cache is the node isn't there anymore)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since currently there is no explicit method to check if a node exists, this is used for this check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe getNodeStatus would be a better one to check the node is still there? getcontainers creates a new HashSet of all the containers on the node, so it is somewhat expensive if we don't use those returned contaienrs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @sodonnel. Would isNodeRegistered be even better?

@Override
public Boolean isNodeRegistered(DatanodeDetails datanodeDetails) {
try {
nodeStateManager.getNode(datanodeDetails);
return true;
} catch (NodeNotFoundException e) {
return false;
}

@symious
Copy link
Contributor Author

symious commented Jan 23, 2024

Thanks @symious for working on this. Are you planning to make more such perf improvements for balancer? And did you happen to perform any benchmarks to identify the bottlenecks?

@sumitagrawl We noticed the interval of SCM sending requests is quite long, about 20~30 seconds at worst, and most of the time is cost on NavigatbleSet.add(), that's why we did this improvement.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@guohao-rosicky
Copy link
Contributor

This was discussed at the last weekly meeting, and thanks @symious for providing the pr so quickly.

@adoroszlai
Copy link
Contributor

@symious can you please take a look at #6050 (comment) (and #6050 (comment))?

@guohao-rosicky
Copy link
Contributor

guohao-rosicky commented Jan 26, 2024

@sumitagrawl We noticed the interval of SCM sending requests is quite long, about 20~30 seconds at worst, and most of the time is cost on NavigatbleSet.add(), that's why we did this improvement.

Yes building NavigatbleSet is very time consuming, even longer than 30s in my environment!

@symious
Copy link
Contributor Author

symious commented Jan 26, 2024

can you please take a look at #6050 (comment) (and #6050 (comment))?

@sodonnel @adoroszlai Update to use isNodeRegistered, PTAL.

@adoroszlai adoroszlai requested a review from sodonnel January 26, 2024 07:14
@adoroszlai
Copy link
Contributor

Update to use isNodeRegistered, PTAL.

Thanks @symious for updating the patch. It seems TestContainerBalancerTask needs to be tweaked, too.

https://github.com/apache/ozone/actions/runs/7664891568/job/20890035849?pr=6050#step:5:1806

@symious
Copy link
Contributor Author

symious commented Jan 26, 2024

It seems TestContainerBalancerTask needs to be tweaked, too.

@adoroszlai Updated, PTAL.

Comment on lines 99 to 117
try {
// Initialize containerSet for node
setMap.computeIfAbsent(node, n -> {
try {
addNodeToSetMap(n);
return setMap.get(n);
} catch (NodeNotFoundException e) {
LOG.warn("Could not find Datanode {} while selecting candidate " +
"containers for Container Balancer.", n.toString(), e);
return null;
}
});
} catch (Exception e) {
LOG.error("An unexpected error occurred while processing the node.", e);
setMap.remove(node);
return Collections.emptySet();
}

containerIDSet.removeIf(
containerID -> shouldBeExcluded(containerID, node, sizeMovedAlready));
return containerIDSet;
return setMap.get(node);
Copy link
Contributor

@adoroszlai adoroszlai Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just realized that this is a bit more complex than needed. addNodeToSetMap shouldn't update setMap while being called via computeIfAbsent.

This can be simplified to:

    Set<ContainerID> containers = setMap.computeIfAbsent(node, this::getCandidateContainers);
    return containers != null ? containers : Collections.emptySet();

With a small change in addNodeToSetMap (renamed to getCandidateContainers, as it no longer changes setMap):

  private NavigableSet<ContainerID> getCandidateContainers(DatanodeDetails node) {
    NavigableSet<ContainerID> newSet =
        new TreeSet<>(orderContainersByUsedBytes().reversed());
    try {
      Set<ContainerID> idSet = nodeManager.getContainers(node);
      if (excludeContainers != null) {
        idSet.removeAll(excludeContainers);
      }
      if (selectedContainers != null) {
        idSet.removeAll(selectedContainers);
      }
      newSet.addAll(idSet);
      return newSet;
    } catch (NodeNotFoundException e) {
      LOG.warn("Could not find Datanode {} while selecting candidate " +
          "containers for Container Balancer.", node, e);
      return null;
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good, updated, PTAL.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @symious for updating the patch.

@adoroszlai
Copy link
Contributor

@sodonnel would you like to take another look?

@adoroszlai
Copy link
Contributor

@siddhantsangwan would you like to take another look?

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan would you like to take another look?

Yes, will review again.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just two minor comments.

Comment on lines 158 to 159
* 1. Container must not be in ExcludedContainers.
* 2. Container must not be in SelectedContainers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Points 1 and 2 duplicate points 6 and 4, respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for the detailed review. Updated, PTAL.

* 7. Container should be closed.
* 8. If the {@link LegacyReplicationManager} is enabled, then the container should not be an EC container.
* @param node DatanodeDetails for which to find candidate containers.
* @return Set of candidate containers that satisfy the criteria.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@return should be updated.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious thanks for the update.

@adoroszlai adoroszlai merged commit 95666eb into apache:master Jan 30, 2024
@adoroszlai
Copy link
Contributor

Thanks @symious for the patch, @guohao-rosicky, @siddhantsangwan, @sodonnel, @sumitagrawl for the review.

ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
…DDS-10870 (apache#221)

* CDPD-65217. HDDS-10085. Improve method name in ContainerBalancerSelectionCriteria (apache#5957)

(cherry picked from commit b932e16)

* CDPD-65645. HDDS-10160. Cache sort results in ContainerBalancerSelectionCriteria (apache#6050)

(cherry picked from commit 95666eb)

* CDPD-52140. HDDS-7252. Polled source Datanodes are wrongly not re-considered for balancing in Container Balancer (apache#6305)

(cherry picked from commit fdc38b5)

* CDPD-70112. HDDS-10869. SCMNodeManager#getUsageInfo memory occupancy optimization (apache#6737)

(cherry picked from commit 39baf0f)

* CDPD-70110. HDDS-10871. ContainerBalancerSelectionCriteria memory occupancy optimization (apache#6738)

(cherry picked from commit 19d4419)

* CDPD-70111. HDDS-10870. moveSelectionToFutureMap cleanup when future complete (apache#6746)

(cherry picked from commit 4f02853)

* CDPD-78919. HDDS-12231. Logging in Container Balancer is too verbose. (apache#7826)

(cherry picked from commit 371792f)

---------

Conflicts:

1. Forced to include CDPD-65217 (minor commit that only changes a method name) and CDPD-52140 to resolve conflicts. No major changes made.
2. Had to change a test method to use the full Assertions.assertEquals instead of just assertEquals.
3. CDPD-70111 - had to merge manually to exclude changes from other unrelated commits in ContainerBalancerTask.


---------

Co-authored-by: Siddhant Sangwan <[email protected]>
Co-authored-by: Symious <[email protected]>
Co-authored-by: Tejaskriya <[email protected]>
Co-authored-by: hao guo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants