Conversation

@bharatviswa504 bharatviswa504 commented Feb 24, 2020

What changes were proposed in this pull request?

This is happening because the pipeline scrubber removed the pipeline: it closed the pipeline, removed it from the DB, and triggered close-container commands to set the containers to CLOSING. If SCM is restarted before the close container command is handled and the container state moves to CLOSING, the issue below can happen.

This can also happen in other scenarios, for example when safeModeHandler calls finalizeAndDestroyPipeline and SCM is then restarted.

The root cause is that the pipeline has been removed from the DB while the container is still in OPEN state; when SCM then tries to get the pipeline during startup, it crashes with a PipelineNotFoundException.

2020-02-21 13:57:34,888 [main] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.addContainerToPipeline(PipelineStateMap.java:110)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.addContainerToPipeline(PipelineStateManager.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.addContainerToPipeline(SCMPipelineManager.java:309)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.loadExistingContainers(SCMContainerManager.java:121)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.<init>(SCMContainerManager.java:107)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:412)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:283)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:215)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:612)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:142)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:117)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:66)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:42)
    at picocli.CommandLine.execute(CommandLine.java:1173)
    at picocli.CommandLine.access$800(CommandLine.java:141)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:55)
2020-02-21 13:57:34,892 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down StorageContainerManager at om-ha-1.vpc.cloudera.com/10.65.51.49
************************************************************/

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3066

How was this patch tested?

Thank You @nandakumar131 for the offline discussion.
Existing tests. Deployed the fix on a cluster, and SCM was able to boot up.

2020-02-24 12:02:12,531 [main] WARN org.apache.hadoop.hdds.scm.container.SCMContainerManager: Found a Container ContainerInfo{id=3, state=OPEN, pipelineID=PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225, stateEnterTime=1159646, owner=om2} which is in OPEN state with out a pipeline PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225. Triggering Close Container.
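
To make the described behaviour easier to follow, here is a rough sketch of the shape of the fix: while loading existing containers at startup, an OPEN container whose pipeline can no longer be found is moved towards CLOSING instead of crashing SCM. This is only an illustration; the method signature, the way containers are iterated, and the helper names are assumptions on my part rather than the exact SCMContainerManager code (assume it sits inside SCMContainerManager with the usual Ozone imports).

// Sketch only (assumed to sit inside SCMContainerManager): re-associate
// loaded containers with their pipelines, and finalize any OPEN container
// whose pipeline no longer exists in the pipeline DB.
void loadExistingContainers(List<ContainerInfo> containers,
    PipelineManager pipelineManager) throws IOException {
  for (ContainerInfo container : containers) {
    try {
      if (container.getState() == HddsProtos.LifeCycleState.OPEN) {
        pipelineManager.addContainerToPipeline(
            container.getPipelineID(), container.containerID());
      }
    } catch (PipelineNotFoundException ex) {
      // The pipeline was already removed from the DB (e.g. by the scrubber)
      // before the datanodes processed the close container command. Move the
      // container to CLOSING directly; the replication manager will later
      // send close commands to the datanodes.
      // updateContainerState(..) refers to the helper quoted later in this
      // review, which also persists the state change to the container DB.
      updateContainerState(container.containerID(),
          HddsProtos.LifeCycleEvent.FINALIZE, true);
    }
  }
}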

Contributor

The pipeline scrubber only destroys pipelines that have been in ALLOCATED state for too long, and the container loader only calls addContainerToPipeline if the container is in OPEN state.

If I understand correctly, we should not have an OPEN container assigned to an ALLOCATED pipeline.
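
Just to illustrate what I mean, the scrub condition is something along these lines; the method name and the timeout parameter are assumptions of mine, not the actual scrubber API.

// Illustration only: a pipeline is eligible for scrubbing when it has been
// stuck in ALLOCATED state for longer than the configured timeout.
static boolean eligibleForScrub(Pipeline pipeline, Duration allocatedTimeout) {
  return pipeline.getPipelineState() == Pipeline.PipelineState.ALLOCATED
      && Duration.between(pipeline.getCreationTimestamp(), Instant.now())
          .compareTo(allocatedTimeout) > 0;
}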

Contributor Author

When SCM is restarted, all pipelines are loaded in ALLOCATED state; they are moved to OPEN state only when pipeline reports are received from the DNs.
So here the scrubber removed those pipelines while the container is still in OPEN state.

2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying pipeline:Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z]
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z] moved to CLOSED state

The above is the SCM log from the first restart.

Now, after the 2nd restart, the pipeline is no longer in the DB, but the container is still in OPEN state.

Contributor Author

I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.

Contributor Author

As mentioned in the JIRA, another scenario where this can happen:

This can also happen in other scenarios, for example when safeModeHandler calls finalizeAndDestroyPipeline and SCM is then restarted.

We remove the pipeline from the DB while the container can still be in OPEN state (this can happen because the close container command has been triggered but not yet processed). When SCM restarts, we can end up in this scenario.

Contributor

bq. I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.
Agree, we should have a special state to indicate that the pipeline is in this condition. We can fix it in a separate JIRA.

Contributor Author

Making scrubbing start only after safe mode is handled in PR #605.
I see that we can do it without introducing a new state.

Contributor

I think there is value in keeping the scrubber running in safe mode. Without it, any pipeline created/restored during safe mode will be stuck there forever if any issue is hit during pipeline creation. This prevents new pipelines from being created to exit safe mode.

E.g., when datanodes restart during safe-mode pipeline creation, before the pipeline report has changed the SCM pipeline state from ALLOCATED to OPEN.

Contributor Author
@bharatviswa504 bharatviswa504 Feb 26, 2020

If old pipelines are reported, they are only accounted for in the safe-mode rule calculation (because the pipeline count is read from the pipeline DB during startup).
Once we are out of safe mode, we trigger pipeline creation.

So, to come out of safe mode, the pipelines which have already been created need to be reported according to the configured percentage; once they are, we can come out of safe mode.

The main reason not to run the scrubber in safe mode is that if it closes pipelines for which the datanodes have not yet reported, we shall never come out of safe mode. To avoid this kind of scenario, I think running the scrubber in safe mode is not correct.
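
To make the concern concrete, here is a simplified version of the safe-mode pipeline check I have in mind; the method name and formula are assumptions of mine, not the actual Ozone implementation.

// Simplified illustration of the safe-mode pipeline rule discussed above.
static boolean pipelineRuleSatisfied(int reportedPipelines,
    int pipelinesInDbAtStartup, double requiredPercent) {
  // If un-reported pipelines are scrubbed away during safe mode,
  // reportedPipelines can never reach the threshold and SCM stays in
  // safe mode forever.
  return reportedPipelines >= requiredPercent * pipelinesInDbAtStartup;
}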

Contributor Author
@bharatviswa504 bharatviswa504 Feb 26, 2020

The fix for not running the scrubber in safe mode is in HDDS-3072; we can discuss that further on #605.
If this is fine, we can get this in.

Contributor

bq. The main reason not to run the scrubber in safe mode is that if it closes pipelines for which the datanodes have not yet reported, we shall never come out of safe mode.

Multi-raft allows an additional pipeline to be created on top of an existing one if the existing one is not functional. To a new OM client wanting to write, there is no difference between a pipeline that was loaded and reported and one that was created and reported. If those loaded-but-not-reported pipelines are not working, we should use the scrubber so they can be recreated and reported. Agreed, we can discuss this on #605.

@xiaoyuyao
Contributor

+1, Let's discuss the scrubber issue in #605.

Comment on lines 132 to 141
// Not firing CLOSE_CONTAINER event because CloseContainer event
// handler is not registered by the time when we come
// here. So, we are calling update Container state to set
// container state to CLOSING, and later replication manager takes care
// of send close commands to datanode to close containers on the
// datanode.

// Skipping pipeline to container removal because, we got a
// pipelineNotFoundException when adding container to
// pipeline. So, we can only update container state.
Contributor

We don't need this comment. Since SCMContainerManager is the one taking this decision, it should not fire an event to close the container; it is logical to directly update the container state.

It is very explicit that we are performing this in the case of PipelineNotFoundException, so there is no need to repeat it.

Contributor Author

Done.

Comment on lines +142 to +143
updateContainerState(container.containerID(),
HddsProtos.LifeCycleEvent.FINALIZE, true);
Contributor

We don't have to introduce a new updateContainerState method. Since we are doing this inside the constructor, we don't need any lock.

We can directly call containerStateManager.updateContainerState(container.containerID(), HddsProtos.LifeCycleEvent.FINALIZE)

Contributor Author
@bharatviswa504 bharatviswa504 Feb 27, 2020

This call is done this way so that the state change is also reflected in the container DB.
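
In other words, the overload does roughly the following (a sketch, assuming containerStateManager is the existing field of SCMContainerManager, while containerStore is only a stand-in name for the container DB handle):

// Sketch only: perform the in-memory state transition and, when requested,
// persist the updated ContainerInfo so the change survives the next restart.
private ContainerInfo updateContainerState(ContainerID containerID,
    HddsProtos.LifeCycleEvent event, boolean persistToDb) throws IOException {
  ContainerInfo updated =
      containerStateManager.updateContainerState(containerID, event);
  if (persistToDb) {
    containerStore.put(containerID, updated); // hypothetical DB handle name
  }
  return updated;
}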

@nandakumar131
Contributor

/retest

@bharatviswa504
Contributor Author

Thank You @nandakumar131 and @xiaoyuyao for the review.

@bharatviswa504 bharatviswa504 merged commit b441954 into apache:master Feb 28, 2020
asfgit pushed a commit that referenced this pull request Mar 2, 2020