
Conversation

@siddhantsangwan
Contributor

What changes were proposed in this pull request?

Currently ContainerBalancer uses ReplicationManager#move, which internally uses LegacyReplicationManager. This Jira proposes integrating MoveManager with ContainerBalancer such that it uses MoveManager#move if "hdds.scm.replication.enable.legacy" is false and ReplicationManager#move if it's true.
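A minimal sketch of that selection, for illustration only; the config key is the one named above, but the method signatures, variable names, and default value are assumptions rather than the exact Ozone code:

// Sketch only: pick the move implementation based on the legacy RM flag.
// Everything except the config key is illustrative.
boolean legacyEnabled = ozoneConfiguration.getBoolean(
    "hdds.scm.replication.enable.legacy", true);

CompletableFuture<MoveManager.MoveResult> future = legacyEnabled
    ? replicationManager.move(containerID, source, target)
    : moveManager.move(containerID, source, target);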

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-8153

How was this patch tested?

Added a unit test and modified existing ones.

Contributor

@adoroszlai left a comment

LGTM so far.

Comment on lines 789 to 791
ReplicationManager.ReplicationManagerConfiguration rmConf =
    ozoneConfiguration.getObject(
        ReplicationManager.ReplicationManagerConfiguration.class);
Contributor

nit: I think this should be stored in the constructor.
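A minimal sketch of the suggestion, assuming a field assigned once in the constructor; the constructor shape shown here is illustrative, not the actual ContainerBalancerTask signature:

// Sketch only: read ReplicationManagerConfiguration once in the constructor and
// keep it in a field instead of fetching it again later.
private final ReplicationManager.ReplicationManagerConfiguration rmConf;

ContainerBalancerTask(OzoneConfiguration ozoneConfiguration /* , ... */) {
  this.rmConf = ozoneConfiguration.getObject(
      ReplicationManager.ReplicationManagerConfiguration.class);
}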

if (result == MoveManager.MoveResult.COMPLETED) {
  sizeActuallyMovedInLatestIteration +=
      containerInfo.getUsedBytes();
  if (LOG.isDebugEnabled()) {
Contributor

NIT: We usually don't need to wrap debug calls in if (LOG.isDebugEnabled) if the parameters to the log are simple getters, which I think they are here.
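For illustration, the guarded call could collapse to a plain parameterized debug call; the message text and arguments below are a sketch of what the surrounding code appears to log, not the exact lines in the PR:

// Sketch only: with SLF4J parameterized logging, formatting happens only when
// debug is enabled, and the arguments here are cheap getters, so the
// isDebugEnabled() guard adds nothing.
LOG.debug("Container {} moved; counted {} bytes towards this iteration",
    containerInfo.containerID(), containerInfo.getUsedBytes());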

@siddhantsangwan siddhantsangwan marked this pull request as ready for review March 14, 2023 12:25
@siddhantsangwan
Contributor Author

Thanks for the reviews. I've addressed them in the latest commit.

private String excludeContainers = "";

@Config(key = "move.timeout", type = ConfigType.TIME, defaultValue = "30m",
@Config(key = "move.timeout", type = ConfigType.TIME, defaultValue = "60m",
Contributor

We can do this in another PR, but there is a timeout hard coded into MoveManager right now. Probably we need to pass this value into MoveManager somehow so it passes a sensible timeout to RM when scheduling the command. That will then set the DN deadline and the pending Ops timeout.
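A rough sketch of the idea, assuming a hypothetical setter on MoveManager and a timeout accessor on the balancer configuration; the real API may differ and is exactly what the follow-up would decide:

// Hypothetical sketch: forward the balancer's configured move timeout to
// MoveManager so RM schedules the command with a matching datanode deadline and
// pending-op timeout. setMoveTimeout() is an assumed method, not the real API.
moveManager.setMoveTimeout(balancerConfiguration.getMoveTimeout().toMillis());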

Contributor Author

Yes, I've reverted any changes in this PR related to timeouts. We can do that in the next PR.


@Config(key = "balancing.iteration.interval", type = ConfigType.TIME,
defaultValue = "70m", tags = {ConfigTag.BALANCER}, description =
defaultValue = "130m", tags = {ConfigTag.BALANCER}, description =
Contributor

Should the timeout and interval not be quite close in time? If commands time out after 60 minutes, and the interval is 130m, does that mean the balancer will go idle for some time in between?

@siddhantsangwan
Contributor Author

siddhantsangwan commented Mar 15, 2023

I noticed a subtle bug in this PR. We're only constructing MoveManager once when ContainerBalancerTask is constructed. But ContainerBalancerTask works in iterations and resets and reuses any collections or members that hold state. Since MoveManager tracks moves using ContainerID as the key in a hash map, we need to reset its state between iterations too.
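A rough sketch of one possible fix, clearing the tracked moves alongside the other per-iteration resets; the reset method name is hypothetical and not necessarily what the final change does:

// Hypothetical sketch: drop MoveManager's per-ContainerID tracking map before a
// new iteration starts, like the other stateful members the task already resets.
// resetState() is an assumed method name, not the real MoveManager API.
moveManager.resetState();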

Contributor

@sodonnel left a comment

These changes are good to commit if we get green CI.

I had a discussion with @siddhantsangwan and we know we need to take MoveManager construction out of ContainerBalancerTask and inject the dependency, and also make the MoveManager instance register with ContainerReplicaPendingOps.

We are going to do that in a followup PR immediately after this one.
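For context, a rough sketch of what that follow-up could look like; the MoveManager constructor arguments, the subscription method, and the task constructor shape are all assumptions, not the actual Ozone code:

// Hypothetical sketch: build MoveManager outside the task, subscribe it to
// pending replica op events, and inject it into ContainerBalancerTask.
MoveManager moveManager = new MoveManager(replicationManager, containerManager);
containerReplicaPendingOps.registerSubscriber(moveManager);
ContainerBalancerTask task =
    new ContainerBalancerTask(scm, moveManager /* , ... */);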

@siddhantsangwan
Contributor Author

Thanks for the reviews. Merging now that CI is green. Follow up Jira: https://issues.apache.org/jira/browse/HDDS-8167

@siddhantsangwan siddhantsangwan merged commit 2fc0117 into apache:master Mar 15, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Mar 16, 2023
* master: (262 commits)
  HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391)
  ...