-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-8153. Integrate ContainerBalancer with MoveManager #4391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far.
| ReplicationManager.ReplicationManagerConfiguration rmConf = | ||
| ozoneConfiguration.getObject( | ||
| ReplicationManager.ReplicationManagerConfiguration.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this should be stored in the constructor.
| if (result == MoveManager.MoveResult.COMPLETED) { | ||
| sizeActuallyMovedInLatestIteration += | ||
| containerInfo.getUsedBytes(); | ||
| if (LOG.isDebugEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: We usually don't need to wrap debug calls in if (LOG.isDebugEnabled) if the parameters to the log are simple getters, which I think they are here.
|
Thanks for the reviews. I've addressed them in the latest commit. |
| private String excludeContainers = ""; | ||
|
|
||
| @Config(key = "move.timeout", type = ConfigType.TIME, defaultValue = "30m", | ||
| @Config(key = "move.timeout", type = ConfigType.TIME, defaultValue = "60m", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do this in another PR, but there is a timeout hard coded into MoveManager right now. Probably we need to pass this value into MoveManager somehow so it passes a sensible timeout to RM when scheduling the command. That will then set the DN deadline and the pending Ops timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've reverted any changes in this PR related to timeouts. We can do that in the next PR.
|
|
||
| @Config(key = "balancing.iteration.interval", type = ConfigType.TIME, | ||
| defaultValue = "70m", tags = {ConfigTag.BALANCER}, description = | ||
| defaultValue = "130m", tags = {ConfigTag.BALANCER}, description = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the timeout and interval not be quite close in time? If commands timeout after 60 mins, and the interval is 130m, does that mean the balancer will go idle for some time in between?
|
I noticed a subtle bug in this PR. We're only constructing |
…balancer's each iteration
sodonnel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes are good to commit if we get green CI.
I had a discussion with @siddhantsangwan and we know we need to take MoveManager construction out of ContainerBalancerTask and inject the dependency, and also make the MoveManager instance register with ContainerReplicaPendingOps.
We are going to do that in a followup PR immediately after this one.
|
Thanks for the reviews. Merging now that CI is green. Follow up Jira: https://issues.apache.org/jira/browse/HDDS-8167 |
* master: (262 commits) HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391) HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357) HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402) HDDS-8065. Provide GNU long options (apache#4394) HDDS-7930. [addendum] input stream does not refresh expired block token. HDDS-7930. input stream does not refresh expired block token. (apache#4378) HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244) HDDS-8076. Use container cache in Key listing API. (apache#4346) HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374) HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382) HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386) HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388) HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371) HDDS-8133. Create ozone sh key checksum command (apache#4375) HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379) HDDS-8118. Fail container delete on non empty chunks dir (apache#4367) HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315) HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370) HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368) close db connection (apache#4366) ...
What changes were proposed in this pull request?
Currently ContainerBalancer uses ReplicationManager#move which internally uses LegacyReplicationManager. This jira proposes integrating MoveManager with ContainerBalancer such that it uses MoveManager#move if "hdds.scm.replication.enable.legacy" is false and ReplicationManager#move if it's true.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8153
How was this patch tested?
Added a UT and modified existing ones