Conversation

@Montura
Contributor

@Montura Montura commented Dec 11, 2023

Dynamic adaptation (introduced in HDDS-5526) for datanodes.involved.max.percentage.per.iteration in the container balancer doesn't work well in some cases.

Sometimes the number of under-utilized nodes is not sufficient to satisfy the limit on the maximum percentage of datanodes participating in a balancing iteration (datanodes.involved.max.percentage.per.iteration). As a result, the collections of source and target datanodes are reset and balancing is skipped (see comment).

The issue can easily be reproduced when the cluster has few nodes (< 10), for example 4 or 5. To work around this case, datanodes.involved.max.percentage.per.iteration has to be set to 100.
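A minimal, self-contained Java sketch of the arithmetic behind the issue (illustrative only; the class below is not the balancer's actual code, and the 20% default is the one discussed further down in this conversation):

// Illustrative only: shows why the default 20% ratio truncates to zero on a
// 4-node cluster, and why setting the percentage to 100 avoids that.
public class DatanodeLimitSketch {
  public static void main(String[] args) {
    int totalNodesInCluster = 4;
    double defaultRatio = 0.20;     // datanodes.involved.max.percentage.per.iteration = 20 (default)
    double workaroundRatio = 1.0;   // datanodes.involved.max.percentage.per.iteration = 100

    // The narrowing cast truncates toward zero, so 4 * 0.2 = 0.8 becomes 0.
    System.out.println((int) (defaultRatio * totalNodesInCluster));    // prints 0 -> no datanode may be involved
    System.out.println((int) (workaroundRatio * totalNodesInCluster)); // prints 4 -> all datanodes may be involved
  }
}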

@siddhantsangwan wrote a small documentation page with some user-facing details about the container balancer.

What changes were proposed in this pull request?

Introduced the TestableCluster class so it can be reused in tests for clusters with different numbers of datanodes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9889

How was this patch tested?

hdds.scm.container.balancer.TestContainerBalancerTask is reworked:

  1. Extracted two classes:
  • hdds.scm.container.balancer.MockedSCM for setting up a testable hdds.scm.server.StorageContainerManager
  • hdds.scm.container.balancer.TestableCluster for creating a test cluster with the required number of datanodes
  2. Added the TestContainerBalancerDatanodeNodeLimit test class with 3 tests extracted from TestContainerBalancerTask, run against clusters with different node counts (see the sketch below).
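A self-contained sketch of the parameterized style used to cover clusters of different sizes (illustrative only; it does not use the new MockedSCM/TestableCluster helpers and only checks the truncated per-iteration datanode limit at the default 20% ratio):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

// Illustrative sketch, not the PR's actual test code.
class DatanodeLimitTruncationSketch {

  @ParameterizedTest
  @CsvSource({"4,0", "5,1", "9,1", "10,2", "20,4"})
  void maxDatanodesToInvolveIsTruncatedTowardZero(int nodesInCluster, int expectedLimit) {
    // Default value of datanodes.involved.max.percentage.per.iteration is 20%.
    double maxRatio = 0.2;
    assertEquals(expectedLimit, (int) (maxRatio * nodesInCluster));
  }
}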

@Montura Montura changed the title from "Working on parametrized tests to run on clusters with different datan…" to "HDDS-9889. Configure adaptation for datanode limits in ContainerBalancer" on Dec 11, 2023
@Montura
Contributor Author

Montura commented Dec 11, 2023

@JacksonYao287, @sumitagrawl, please review the changes

@adoroszlai
Contributor

@Montura please merge the latest master into your branch; the compile error (not caused by this PR) is fixed in 582a5ce

@Montura
Contributor Author

Montura commented Dec 11, 2023

@adoroszlai, UPD: done!

@Montura Montura force-pushed the amikhalev/datanode_limits branch 4 times, most recently from eed7a9a to f8dc3fc on December 15, 2023 08:20
@Montura
Contributor Author

Montura commented Dec 15, 2023

UPD: Today I rebased this PR on the master branch (to get the latest changes)

@adoroszlai
Contributor

UPD: Today I rebased this PR on the master branch (to get the latest changes)

Thanks @Montura. You only need to update from master if there is a conflict, or if failing checks need code from master. Also, please use merge, not rebase.

@siddhantsangwan @sumitagrawl can you please review?

@Montura Montura force-pushed the amikhalev/datanode_limits branch from f8dc3fc to 5d175ae on December 15, 2023 15:45
@Montura
Contributor Author

Montura commented Dec 20, 2023

@siddhantsangwan @sumitagrawl could you please review?

@Montura
Contributor Author

Montura commented Dec 21, 2023

UPD: Today I merged the master branch into this PR to resolve conflicts

@Montura
Contributor Author

Montura commented Dec 26, 2023

@siddhantsangwan @sumitagrawl could you please review?

@adoroszlai
Contributor

@Montura please keep in mind that the end of the year is usually holiday season in many places

@adoroszlai
Contributor

@Montura Sorry about the code conflicts. This PR does not allow edits from maintainers; is that intentional? If it did, I'd try to keep it updated after merging PRs that touch the same files.

@adoroszlai
Contributor

@siddhantsangwan @sumitagrawl please take a look at the patch to provide high-level feedback until the conflicts are resolved

@Montura
Contributor Author

Montura commented Jan 8, 2024

@Montura Sorry about the code conflicts. This PR does not allow edits from maintainers; is that intentional? If it did, I'd try to keep it updated after merging PRs that touch the same files.

Tomorrow I'll resolve the conflicts; it wasn't intentional to forbid PR editing for maintainers. It's my first PR here, I'll do better next time.

@adoroszlai
Contributor

it wasn't intentional to forbid PR editing for maintainers. It's my first PR here, I'll do better next time.

No worries.

@siddhantsangwan
Contributor

@Montura Thanks for working on this. I'm trying to understand the problem. If you wrote a test that fails without your fix, please point me to it.

@Montura
Contributor Author

Montura commented Jan 9, 2024

@Montura Thanks for working on this. I'm trying to understand the problem. If you wrote a test that fails without your fix, please point me to it.

Sure, I'm merging the current master now; when I finish, I'll point you to the test

@siddhantsangwan
Contributor

Sometimes the number of under-utilized nodes is not sufficient to satisfy the limit on the maximum percentage of datanodes participating in a balancing iteration (datanodes.involved.max.percentage.per.iteration). As a result, the collections of source and target datanodes are reset and balancing is skipped.

I didn't really get this. Can you please elaborate? It'd be helpful to have a small example where you describe this problem.

@Montura
Contributor Author

Montura commented Jan 9, 2024

Sometimes the number of under-utilized nodes is not sufficient to satisfy the limit on the maximum percentage of datanodes participating in a balancing iteration (datanodes.involved.max.percentage.per.iteration). As a result, the collections of source and target datanodes are reset and balancing is skipped.

I didn't really get this. Can you please elaborate? It'd be helpful to have a small example where you describe this problem.

Let's imagine that you have a cluster with a total number of DNs in the range [4, 9] (4, 5, 6, 7, 8, or 9).

Then the maximum number of DNs that can be involved in balancing for such clusters will be at most 1, because the default value of maxDatanodesRatioToInvolvePerIteration is 0.2 (20%). So the next two methods will skip balancing when the DN count is less than 10.

// ContainerBalancerTask#adaptWhenNearingIterationLimits
int maxDatanodesToInvolve = (int) (config.getMaxDatanodesRatioToInvolvePerIteration() * totalNodesInCluster);
if (countDatanodesInvolvedPerIteration + 1 == maxDatanodesToInvolve) {
    // Restricts potential target datanodes to nodes that have already been selected
}

// ContainerBalancerTask#adaptOnReachingIterationLimits
int maxDatanodesToInvolve = (int) (config.getMaxDatanodesRatioToInvolvePerIteration() * totalNodesInCluster);
if (countDatanodesInvolvedPerIteration == maxDatanodesToInvolve) {
    // Restricts potential source and target datanodes to nodes that have already been selected
}

// The narrowing cast truncates toward zero:
// 4  * 0.2 = 0.8 -> (int) 0
// 5  * 0.2 = 1.0 -> (int) 1
// 6  * 0.2 = 1.2 -> (int) 1
// 7  * 0.2 = 1.4 -> (int) 1
// 8  * 0.2 = 1.6 -> (int) 1
// 9  * 0.2 = 1.8 -> (int) 1
// 10 * 0.2 = 2.0 -> (int) 2

Per the Java spec for primitive narrowing conversion, the floating-point value is rounded to an integer value V, rounding toward zero using IEEE 754 round-toward-zero mode (§4.2.3):

The Java programming language uses round toward zero when converting a floating value to an integer (§5.1.3), which acts, in this case, as though the number were truncated, discarding the mantissa bits. Rounding toward zero chooses as its result the format's value closest to and no greater in magnitude than the infinitely precise result.

So we get under-utilized nodes that will never take part in balancing at all. All clusters with a DN count > 3 and < 10 will start balancing and do nothing because of the action in the ContainerBalancerTask#adaptWhenNearingIterationLimits method.

@siddhantsangwan
Contributor

Ah, I understand. Yes, I've seen this happen in some small clusters. The recommendation is to increase the value of datanodes.involved.max.percentage.per.iteration accordingly. For example, it can be set to 100 for clusters of 15 Datanodes or less so that all Datanodes may be involved in balancing. Do you have any reason not to do this and make a code change instead? It doesn't make sense to have a configuration datanodes.involved.max.percentage.per.iteration which imposes a limit, and then have another configuration adapt.balance.when.reach.the.limit which effectively disables the former limit. Why not just change datanodes.involved.max.percentage.per.iteration?

@Montura
Contributor Author

Montura commented Jan 9, 2024

Ok, that makes sense.

Let me rewrite the tests to verify the desired behavior by increasing the value of datanodes.involved.max.percentage.per.iteration. And I'll revert the changes to the properties in hdds.scm.container.balancer.ContainerBalancerConfiguration.

What do you think?

UPD: I've updated the PR to use the datanodes.involved.max.percentage.per.iteration property in the containerBalancerShouldObeyMaxDatanodesToInvolveLimit test

@Montura Montura force-pushed the amikhalev/datanode_limits branch 2 times, most recently from 0332673 to 0239124 on January 9, 2024 12:12
@Montura Montura force-pushed the amikhalev/datanode_limits branch from 1c19524 to fdb7ed6 on April 9, 2024 13:21
@Montura Montura force-pushed the amikhalev/datanode_limits branch from fdb7ed6 to 89bd705 on April 9, 2024 13:24
@Montura Montura force-pushed the amikhalev/datanode_limits branch from 492863c to 1f126d1 on April 10, 2024 10:29
@Montura
Contributor Author

Montura commented Apr 10, 2024

@siddhantsangwan, I applied all your suggestions. Please take a look at the PR once again

@adoroszlai adoroszlai changed the title from "HDDS-9889. Refatoring tests related to dynamical adaptation for datanode limits in ContainerBalancer" to "HDDS-9889. Refactor tests related to dynamical adaptation for datanode limits in ContainerBalancer" on Apr 10, 2024
@Montura Montura requested a review from siddhantsangwan April 22, 2024 14:19
Contributor

@siddhantsangwan siddhantsangwan left a comment

@Montura Thanks for the update. LGTM.

@siddhantsangwan
Contributor

@Montura can you push an empty commit with no changes, and the commit message saying "Trigger CI"?
Github is showing me:

Unable to re-run one or more workflows. Check if the workflows are already running, are more than 30 days old, or are disabled.

for 1 workflow. @adoroszlai any idea?

@Montura
Contributor Author

Montura commented Apr 30, 2024

@Montura can you push an empty commit with no changes, and the commit message saying "Trigger CI"? Github is showing me:

Unable to re-run one or more workflows. Check if the workflows are already running, are more than 30 days old, or are disabled.

for 1 workflow. @adoroszlai any idea?

Empty commit is disabled, some change is required. Let's wait for @adoroszlai

@adoroszlai
Contributor

Empty commit is disabled

git commit --allow-empty
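
For example, to create the empty commit with the message requested above and push it:

git commit --allow-empty -m "Trigger CI"
git push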

@Montura
Contributor Author

Montura commented May 1, 2024

Empty commit is disabled

git commit --allow-empty

Done

@Montura
Contributor Author

Montura commented May 1, 2024

Merge please

@adoroszlai adoroszlai merged commit 78a7e7a into apache:master May 1, 2024
@adoroszlai
Contributor

Thanks @Montura for continued efforts on this. Thanks @siddhantsangwan for the review.

@Montura Montura deleted the amikhalev/datanode_limits branch May 6, 2024 06:30
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
…e limits in ContainerBalancer (apache#5758)

(cherry picked from commit 78a7e7a)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024
…e limits in ContainerBalancer (apache#5758)

(cherry picked from commit 78a7e7a)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 18, 2024
…e limits in ContainerBalancer (apache#5758)

(cherry picked from commit 78a7e7a)
vtutrinov pushed a commit to vtutrinov/ozone that referenced this pull request Jul 15, 2025
…e limits in ContainerBalancer (apache#5758)

(cherry picked from 78a7e7a)