
Conversation

@sodonnel
Contributor

What changes were proposed in this pull request?

If you run the decommission or "enter maintenance" command on a node which is already dead, then it should immediately go to the DECOMMISSIONED or IN_MAINTENANCE state. As the node is already dead, there is no way to replicate its containers in a controlled way, and hence the decommission process does not need to run.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-5153

How was this patch tested?

New unit tests

@GlenGeng-awx
Contributor

Hey @sodonnel. I have one concern. Can the check checkContainersReplicatedOnNode be safely ignored for a dead node? Will there be corner cases?

        if (status.isDecommissioning() || status.isEnteringMaintenance()) {
          if (checkPipelinesClosedOnNode(dn)
              // Ensure the DN has received and persisted the current maint
              // state.
              && status.getOperationalState()
                  == dn.getPersistedOpState()
              && checkContainersReplicatedOnNode(dn)) {

Another choice is, we can ignore the check

              && status.getOperationalState()
                  == dn.getPersistedOpState()

for a dead node, since its persisted state can no longer change, which would leave a dead node stuck in the queue forever. But we would still keep the check for the replicas on that dead node. What do you think?

@sodonnel
Contributor Author

@GlenGeng The change here is to handle a node which is already dead before decommission starts. This change will make it go to decommissioned immediately and not enter the normal workflow at all.

If a node goes dead while it is decommissioning (i.e. already in the workflow), the workflow is aborted via the shouldContinueWorkflow(...) method: the node is placed back into IN_SERVICE + DEAD, and SCM then handles it as a dead node accordingly.

I don't think a node which goes dead will be stuck as decommissioning forever due to the above check.
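The abort behaviour described above can be sketched as follows. This is a hypothetical, simplified illustration: the enum values mirror the states discussed in this thread, but the class and method shapes are illustrative and do not match the actual SCM code.

```java
// Hypothetical sketch of the mid-workflow abort described above; the enum
// and method names are illustrative, not the actual SCM implementation.
public class WorkflowSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED }
  enum Health { HEALTHY, DEAD }

  // If a node dies while decommissioning, the workflow aborts and the node
  // reverts to IN_SERVICE; SCM then handles it as an ordinary dead node,
  // so it is never stuck in DECOMMISSIONING forever.
  static OpState shouldContinueWorkflow(OpState op, Health health) {
    if (health == Health.DEAD && op == OpState.DECOMMISSIONING) {
      return OpState.IN_SERVICE; // aborted: node is now IN_SERVICE + DEAD
    }
    return op; // healthy node stays in the workflow
  }

  public static void main(String[] args) {
    System.out.println(
        shouldContinueWorkflow(OpState.DECOMMISSIONING, Health.DEAD));
    // prints IN_SERVICE
  }
}
```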

@fapifta
Contributor

fapifta commented Apr 28, 2021

@GlenGeng I think if a node is already dead and we start decommission, the node immediately goes to DECOMMISSIONED, so the check is skipped due to this condition: if (status.isDecommissioning() || status.isEnteringMaintenance()). Since the node never enters the replication workflow, we are fine: container replication for an already-dead node has to be handled when the node goes dead, and we cannot do anything with a dead node in the decommission flow anyway.

@sodonnel I think the changes proposed are good, +1

@GlenGeng-awx
Contributor

Thanks @sodonnel and @fapifta for the explanation. The change LGTM, and it has my +1.

Since we have shouldContinueWorkflow(), a dead node in START_MAINTENANCE or DECOMMISSIONING will abort the workflow directly, so I wonder which situation this fix applies to?

@GlenGeng-awx
Contributor

I wonder which situation this fix applies to?

It seems the problem is that a dead node cannot go through the workflow, and thus can never enter IN_MAINTENANCE or DECOMMISSIONED?

@sodonnel
Contributor Author

@GlenGeng

I wonder which situation this fix applies to?

The fix is for the case where someone decommissions or puts into maintenance a node that is already dead. In that case, we cannot do it gracefully, and the node is already handled as dead in SCM. Therefore we treat it as a no-op and move it directly to the end state. There is no point in adding it to the monitor and tracking it there. This also follows what HDFS does in the same scenario.

Glen also mentioned on slack:

What if there is an expiry time attached to the maintenance command? If we don't track the node in the monitor, then how can we expire maintenance?

This is somewhat of an edge case. However, the solution (as already implemented in the code) is to set the maintenance end time to 0 (no end time). The only way to get the node back to IN_SERVICE is then to recommission it, which is the same as for a decommissioned node.
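The fast path and the expiry handling described in this comment can be sketched together. This is an illustrative sketch only: the class, method, and parameter names are hypothetical and do not match the actual NodeDecommissionManager code.

```java
// Illustrative sketch of the dead-node fast path described above; names
// are hypothetical, not the actual NodeDecommissionManager code.
public class DeadNodeFastPath {
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
      ENTERING_MAINTENANCE, IN_MAINTENANCE }
  enum Health { HEALTHY, DEAD }

  // An already-dead node skips the admin monitor entirely and jumps
  // straight to the terminal state, since graceful replication of its
  // containers is impossible and SCM already handles it as dead.
  static OpState startDecommission(Health health) {
    return health == Health.DEAD
        ? OpState.DECOMMISSIONED    // no-op: go directly to the end state
        : OpState.DECOMMISSIONING;  // normal workflow via the monitor
  }

  // For maintenance on a dead node, the end time is set to 0 (no expiry);
  // only an explicit recommission brings the node back to IN_SERVICE.
  static long maintenanceEndTime(Health health, long requestedEndMillis) {
    return health == Health.DEAD ? 0L : requestedEndMillis;
  }

  public static void main(String[] args) {
    System.out.println(startDecommission(Health.DEAD));
    // prints DECOMMISSIONED
  }
}
```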

@sodonnel sodonnel merged commit a920f25 into apache:master Apr 28, 2021
errose28 added a commit to errose28/ozone that referenced this pull request May 4, 2021
…ing-upgrade-master-merge2

* upstream/master: (56 commits)
  HDDS-2212. Genconf tool should generate config files for secure clust… (apache#1788)
  HDDS-5166. Remove duplicate assignment of OZONE_OPTS for freon and sh (apache#2195)
  Revert "HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)"
  HDDS-4983. Display key offset for each block in command key info (apache#2051)
  HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)
  HDDS-4585. Support bucket acl operation in S3g (apache#1701)
  HDDS-5153. Decommissioning a dead node should complete immediately (apache#2190)
  HDDS-5147. Intermittent test failure in TestContainerDeletionChoosingPolicy#testRandomChoosingPolicy (apache#2188)
  HDDS-5152. Fix Suggested leader in Client. (apache#2189)
  HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT (apache#2184)
  HDDS-4515. Datanodes should be able to persist and load CRL (apache#2181)
  HDDS-5060. [SCM HA Security] Make InterSCM grpc channel secure. (apache#2187)
  HDDS-5051. Ensure failover to suggested leader if any for NotLeaderException. (apache#2141)
  HDDS-5127. Fix getServiceList when SCM HA is enabled (apache#2173)
  HDDS-4889. Add simple CI check for docs (apache#2156)
  HDDS-5131. Use timeout in github actions (apache#2176)
  HDDS-5103. Fix Install Snapshot Mechanism in SCMStateMachine. (apache#2155)
  HDDS-5124. Use OzoneConsts.OZONE_TIME_ZONE instead of "GMT" (apache#2166)
  HDDS-5047. Refactor Pipeline to use ReplicationConfig instead of factor/type (apache#2096)
  HDDS-5083. Bump version of common-compress (apache#2139)
  ...

Conflicts:
	hadoop-hdds/common/pom.xml
	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java
	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManager.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMStorageConfig.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestSCMNodeManager.java
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/MiniOzoneClusterImpl.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMStorage.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
	hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconStorageContainerManagerFacade.java
errose28 added a commit to errose28/ozone that referenced this pull request May 13, 2021
…k-in-auth

* HDDS-3698-nonrolling-upgrade: (57 commits)
  Fix compilation errors after merge
  Update javassist in recon pom
  Fix changes introduced in merge that failed TestSCMNodeManager upgrade tests
  Fix checkstyle
  Fix intermittent test failure TestSCMNodeManager#testSetNodeOpStateAndCommandFired after merge
  Skip scm init default layout version in TestOzoneConfigurationFields
  ...
sodonnel pushed a commit to sodonnel/hadoop-ozone that referenced this pull request May 25, 2021
sodonnel added a commit that referenced this pull request Jun 3, 2021
bharatviswa504 pushed a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
…pache#2190)

(cherry picked from commit a920f25)
Change-Id: Ia236d49bdc4ab66b2debf9cc29d0fd4030931817
bharatviswa504 pushed a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
…lete immediately (apache#2190)"

This reverts commit 34e1a39.

Change-Id: I81a28d3ac7ce63f9810c5fa285216671d7e6f440
