@adoroszlai adoroszlai commented May 14, 2023

What changes were proposed in this pull request?

TestDecommissionAndMaintenance#testContainerIsReplicatedWhenAllNodesGotoMaintenance fails with the new replication manager (i.e. if legacy is disabled). If all replicas are starting maintenance, underreplication is not fixed. RatisReplicationCheckHandler skips the container because it has no healthy replicas, and RatisUnhealthyReplicationCheckHandler skips it because it has no unhealthy ones either. Decommissioning and maintenance replicas are counted separately, so the information about their health is lost.

This change fixes the problem by counting healthy/unhealthy decom/maint replicas separately, and including them in total healthy/unhealthy counts (getHealthyReplicaCount() and getUnhealthyReplicaCount()).
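
To illustrate the idea, here is a minimal sketch of such counting. It is a simplified, hypothetical model, not the code in this PR: `ReplicaCountSketch`, `NodeState`, `Replica` and `getHealthyCount(...)` are names made up for the example. Only `getHealthyReplicaCount()` and `getUnhealthyReplicaCount()` mirror the method names mentioned above.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical, simplified model of per-container replica counting.
 * It is NOT the Ozone implementation; all names here are illustrative.
 */
final class ReplicaCountSketch {

  /** Simplified operational state of the datanode hosting a replica. */
  enum NodeState { IN_SERVICE, DECOMMISSIONING, ENTERING_MAINTENANCE }

  /** A replica reduced to its node's state and a healthy/unhealthy flag. */
  static final class Replica {
    final NodeState nodeState;
    final boolean healthy;

    Replica(NodeState nodeState, boolean healthy) {
      this.nodeState = nodeState;
      this.healthy = healthy;
    }
  }

  // Counters indexed by NodeState.ordinal(), one array per health status,
  // so health is kept for decommissioning and maintenance replicas too.
  private final int[] healthy = new int[NodeState.values().length];
  private final int[] unhealthy = new int[NodeState.values().length];

  ReplicaCountSketch(List<Replica> replicas) {
    for (Replica r : replicas) {
      int[] bucket = r.healthy ? healthy : unhealthy;
      bucket[r.nodeState.ordinal()]++;
    }
  }

  /** Healthy replicas hosted on nodes in the given state only. */
  int getHealthyCount(NodeState state) {
    return healthy[state.ordinal()];
  }

  /** Unhealthy replicas hosted on nodes in the given state only. */
  int getUnhealthyCount(NodeState state) {
    return unhealthy[state.ordinal()];
  }

  /** Total healthy replicas, including decommissioning and maintenance ones. */
  int getHealthyReplicaCount() {
    return Arrays.stream(healthy).sum();
  }

  /** Total unhealthy replicas, including decommissioning and maintenance ones. */
  int getUnhealthyReplicaCount() {
    return Arrays.stream(unhealthy).sum();
  }

  public static void main(String[] args) {
    // All three replicas are healthy but on nodes entering maintenance.
    Replica r = new Replica(NodeState.ENTERING_MAINTENANCE, true);
    ReplicaCountSketch count = new ReplicaCountSketch(Arrays.asList(r, r, r));
    System.out.println("healthy, in service: "
        + count.getHealthyCount(NodeState.IN_SERVICE));
    System.out.println("healthy, total: " + count.getHealthyReplicaCount());
  }
}
```

In this model, a container whose replicas are all on nodes entering maintenance has no healthy in-service replicas, yet its total healthy count is non-zero, so a check based on the total no longer treats the container as having no healthy replicas.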

It also includes some refactoring as separate commits, reducing code duplication and duplicate calculation of some values.

https://issues.apache.org/jira/browse/HDDS-8616

How was this patch tested?

A new unit test is added to reproduce the problem.

The legacy replication manager is disabled in TestDecommissionAndMaintenance, since the test now passes with the new one.

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4972183233

@adoroszlai adoroszlai self-assigned this May 14, 2023
@adoroszlai adoroszlai added the scm label May 14, 2023
@sodonnel sodonnel left a comment

LGTM. Thanks for splitting the refactoring changes into separate commits - makes it a lot easier to review.

@adoroszlai adoroszlai merged commit 775d74f into apache:master May 16, 2023
@adoroszlai adoroszlai deleted the HDDS-8616 branch May 16, 2023 15:06
@adoroszlai

Thanks @sodonnel for the review.

errose28 added a commit to errose28/ozone that referenced this pull request May 17, 2023
* master: (78 commits)
  HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager (apache#4688)
  HDDS-7241. EC: Reconstruction could fail with orphan blocks. (apache#4718)
  HDDS-8577. [Snapshot] Disable compaction log when loading metadata for snapshot (apache#4697)
  HDDS-7080. EC: Offline reconstruction needs better logging (apache#4719)
  HDDS-8626. Config thread pool in ReplicationServer (apache#4715)
  HDDS-8616. Underreplication not fixed if all replicas start decommissioning (apache#4711)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8615. Explicitly show EC block type in 'ozone debug chunkinfo' command output (apache#4706)
  HDDS-8623. Delete duplicate getBucketInfo in OMKeyCommitRequest (apache#4712)
  HDDS-8339. Recon Show the number of keys marked for Deletion in Recon UI. (apache#4519)
  HDDS-8572. Support CodecBuffer for protobuf v3 codecs. (apache#4693)
  HDDS-8010. Improve DN warning message when getBlock does not find the block. (apache#4698)
  HDDS-8621. IOException is never thrown in SCMRatisServer.getRatisRoles(). (apache#4710)
  HDDS-8463. S3 key uniqueness in deletedTable (apache#4660)
  HDDS-8584. Hadoop client write slowly when stream enabled (apache#4703)
  HDDS-7732. EC: Verify block deletion from missing EC containers (apache#4705)
  HDDS-8581. Avoid random ports in integration tests (apache#4699)
  HDDS-8504. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (apache#4694)
  HDDS-8576. Close RocksDB instance in RDBStore if RDBStore's initialization fails after RocksDB instance creation (apache#4692)
  ...