HDDS-8616. Underreplication not fixed if all replicas start decommissioning #4711

adoroszlai · 2023-05-14T13:20:25Z

What changes were proposed in this pull request?

TestDecommissionAndMaintenance#testContainerIsReplicatedWhenAllNodesGotoMaintenance fails with the new replication manager (i.e. if legacy is disabled). If all replicas are starting maintenance, underreplication is not fixed: RatisReplicationCheckHandler skips the container because there are no healthy replicas, and RatisUnhealthyReplicationCheckHandler skips it because there are no unhealthy ones either. Decommissioning and maintenance replicas are counted separately, so the information about their health is lost.

This change fixes the problem by counting healthy/unhealthy decom/maint replicas separately, and including them in total healthy/unhealthy counts (getHealthyReplicaCount() and getUnhealthyReplicaCount()).
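
The counting idea can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: the class `ReplicaCountSketch`, its `OpState` enum, the `Replica` model and all field names are hypothetical simplifications. Only the method names getHealthyReplicaCount() and getUnhealthyReplicaCount() are taken from the description above. It assumes each replica carries an operational state and a single healthy/unhealthy flag.

```java
import java.util.List;

public class ReplicaCountSketch {

    // Hypothetical, simplified operational states (not the real Ozone enum).
    enum OpState { IN_SERVICE, DECOMMISSIONING, ENTERING_MAINTENANCE }

    // Hypothetical replica model: the node's operational state plus a health flag.
    static final class Replica {
        final OpState opState;
        final boolean healthy;

        Replica(OpState opState, boolean healthy) {
            this.opState = opState;
            this.healthy = healthy;
        }
    }

    // Keeps one healthy and one unhealthy counter per operational state,
    // so health information survives for decom/maint replicas.
    static final class Counts {
        int inServiceHealthy;
        int inServiceUnhealthy;
        int decomHealthy;
        int decomUnhealthy;
        int maintHealthy;
        int maintUnhealthy;

        Counts(List<Replica> replicas) {
            for (Replica r : replicas) {
                switch (r.opState) {
                case DECOMMISSIONING:
                    if (r.healthy) { decomHealthy++; } else { decomUnhealthy++; }
                    break;
                case ENTERING_MAINTENANCE:
                    if (r.healthy) { maintHealthy++; } else { maintUnhealthy++; }
                    break;
                default:
                    if (r.healthy) { inServiceHealthy++; } else { inServiceUnhealthy++; }
                    break;
                }
            }
        }

        // Total healthy count includes healthy decom/maint replicas.
        int getHealthyReplicaCount() {
            return inServiceHealthy + decomHealthy + maintHealthy;
        }

        // Total unhealthy count includes unhealthy decom/maint replicas.
        int getUnhealthyReplicaCount() {
            return inServiceUnhealthy + decomUnhealthy + maintUnhealthy;
        }
    }

    public static void main(String[] args) {
        // All three replicas are healthy but entering maintenance.
        List<Replica> replicas = List.of(
            new Replica(OpState.ENTERING_MAINTENANCE, true),
            new Replica(OpState.ENTERING_MAINTENANCE, true),
            new Replica(OpState.ENTERING_MAINTENANCE, true));
        Counts counts = new Counts(replicas);
        System.out.println("healthy=" + counts.getHealthyReplicaCount()
            + ", unhealthy=" + counts.getUnhealthyReplicaCount());
    }
}
```

In this sketch, a container whose replicas are all healthy but entering maintenance reports a non-zero healthy count, so a check that skips on "no healthy replicas" would no longer skip it.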

It also includes some refactoring as separate commits, reducing code duplication and duplicate calculation of some values.

https://issues.apache.org/jira/browse/HDDS-8616

How was this patch tested?

A new unit test is added to reproduce the problem.

The legacy replication manager is disabled in TestDecommissionAndMaintenance, since the test now passes with the new one.

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4972183233

sodonnel

LGTM. Thanks for splitting the refactoring changes into separate commits - makes it a lot easier to review.

adoroszlai · 2023-05-16T15:06:43Z

Thanks @sodonnel for the review.

* master: (78 commits)
  HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager (apache#4688)
  HDDS-7241. EC: Reconstruction could fail with orphan blocks. (apache#4718)
  HDDS-8577. [Snapshot] Disable compaction log when loading metadata for snapshot (apache#4697)
  HDDS-7080. EC: Offline reconstruction needs better logging (apache#4719)
  HDDS-8626. Config thread pool in ReplicationServer (apache#4715)
  HDDS-8616. Underreplication not fixed if all replicas start decommissioning (apache#4711)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8615. Explicitly show EC block type in 'ozone debug chunkinfo' command output (apache#4706)
  HDDS-8623. Delete duplicate getBucketInfo in OMKeyCommitRequest (apache#4712)
  HDDS-8339. Recon Show the number of keys marked for Deletion in Recon UI. (apache#4519)
  HDDS-8572. Support CodecBuffer for protobuf v3 codecs. (apache#4693)
  HDDS-8010. Improve DN warning message when getBlock does not find the block. (apache#4698)
  HDDS-8621. IOException is never thrown in SCMRatisServer.getRatisRoles(). (apache#4710)
  HDDS-8463. S3 key uniqueness in deletedTable (apache#4660)
  HDDS-8584. Hadoop client write slowly when stream enabled (apache#4703)
  HDDS-7732. EC: Verify block deletion from missing EC containers (apache#4705)
  HDDS-8581. Avoid random ports in integration tests (apache#4699)
  HDDS-8504. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (apache#4694)
  HDDS-8576. Close RocksDB instance in RDBStore if RDBStore's initialization fails after RocksDB instance creation (apache#4692)
  ...

adoroszlai added 10 commits May 14, 2023 12:03

Extract duplicated HealthResult construction to RatisContainerReplicaCount

e39c079

ReplicaCount already calculated in handle, reuse in checkReplication

99f4023

testUnderReplicatedDueToAllDecommissioning()

99e1e1c

HDDS-8616. Underreplication not fixed if all replicas start decommissioning

c519e5d

Disable LegacyReplicationManager in TestDecommissionAndMaintenance

a5d4631

Improve RatisContainerReplicaCount toString

9fb7754

Consider (or not) unhealthy decom/maint in getDecommissionCount() and getMaintenanceCount()

bca43dd

Skip duplicate calculation of redundancyDelta

d380d1e

inSufficientDueToDecommission can be private

321f481

includePendingAdd is always false in inSufficientDueToDecommission()

3c2a350

adoroszlai self-assigned this May 14, 2023

adoroszlai added the scm label May 14, 2023

adoroszlai requested review from siddhantsangwan and sodonnel May 14, 2023 14:38

sodonnel approved these changes May 16, 2023

adoroszlai merged commit 775d74f into apache:master May 16, 2023

adoroszlai deleted the HDDS-8616 branch May 16, 2023 15:06
