Conversation

@bharatviswa504 bharatviswa504 commented Feb 24, 2020

What changes were proposed in this pull request?

This is happening because the pipeline scrubber removed the pipeline: it closed the pipeline, removed it from the DB, and triggered close-container commands to set the containers to CLOSING. If SCM is restarted before the close container command is handled and the container state moves to CLOSING, the issue below can happen.

This can also happen in other scenarios, for example when safeModeHandler calls finalizeAndDestroyPipeline and SCM is then restarted.

The root cause is that the pipeline has been removed from the DB while the container is still in OPEN state; when SCM then tries to get the pipeline during startup, it crashes with a PipelineNotFoundException.

2020-02-21 13:57:34,888 [main] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.addContainerToPipeline(PipelineStateMap.java:110)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.addContainerToPipeline(PipelineStateManager.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.addContainerToPipeline(SCMPipelineManager.java:309)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.loadExistingContainers(SCMContainerManager.java:121)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.<init>(SCMContainerManager.java:107)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:412)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:283)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:215)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:612)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:142)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:117)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:66)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:42)
    at picocli.CommandLine.execute(CommandLine.java:1173)
    at picocli.CommandLine.access$800(CommandLine.java:141)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:55)
2020-02-21 13:57:34,892 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down StorageContainerManager at om-ha-1.vpc.cloudera.com/10.65.51.49
************************************************************/

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3066

How was this patch tested?

Thank You @nandakumar131 for the offline discussion.
Existing tests. Deployed the fix on a cluster, and SCM was able to boot up.

2020-02-24 12:02:12,531 [main] WARN org.apache.hadoop.hdds.scm.container.SCMContainerManager: Found a Container ContainerInfo{id=3, state=OPEN, pipelineID=PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225, stateEnterTime=1159646, owner=om2} which is in OPEN state with out a pipeline PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225. Triggering Close Container.
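
To make the described behaviour easier to follow, here is a rough sketch of the shape of the fix: while loading existing containers at startup, an OPEN container whose pipeline can no longer be found is moved towards CLOSING instead of crashing SCM. This is only an illustration; the method signature, the way containers are iterated, and the helper names are assumptions on my part rather than the exact SCMContainerManager code (assume it sits inside SCMContainerManager with the usual Ozone imports).

// Sketch only (assumed to sit inside SCMContainerManager): re-associate
// loaded containers with their pipelines, and finalize any OPEN container
// whose pipeline no longer exists in the pipeline DB.
void loadExistingContainers(List<ContainerInfo> containers,
    PipelineManager pipelineManager) throws IOException {
  for (ContainerInfo container : containers) {
    try {
      if (container.getState() == HddsProtos.LifeCycleState.OPEN) {
        pipelineManager.addContainerToPipeline(
            container.getPipelineID(), container.containerID());
      }
    } catch (PipelineNotFoundException ex) {
      // The pipeline was already removed from the DB (e.g. by the scrubber)
      // before the datanodes processed the close container command. Move the
      // container to CLOSING directly; the replication manager will later
      // send close commands to the datanodes.
      // updateContainerState(..) refers to the helper quoted later in this
      // review, which also persists the state change to the container DB.
      updateContainerState(container.containerID(),
          HddsProtos.LifeCycleEvent.FINALIZE, true);
    }
  }
}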

Contributor

The pipeline scrubber only destroys pipelines that have been in ALLOCATED state for too long, and the container loader only calls addContainerToPipeline if the container is in OPEN state.

If I understand correctly, we should not have an OPEN container assigned to an ALLOCATED pipeline.
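
Just to illustrate what I mean, the scrub condition is something along these lines; the method name and the timeout parameter are assumptions of mine, not the actual scrubber API.

// Illustration only: a pipeline is eligible for scrubbing when it has been
// stuck in ALLOCATED state for longer than the configured timeout.
static boolean eligibleForScrub(Pipeline pipeline, Duration allocatedTimeout) {
  return pipeline.getPipelineState() == Pipeline.PipelineState.ALLOCATED
      && Duration.between(pipeline.getCreationTimestamp(), Instant.now())
          .compareTo(allocatedTimeout) > 0;
}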

Contributor Author

When SCM is restarted, all pipelines are loaded in ALLOCATED state; they are moved to OPEN state only when pipeline reports are received from the DNs.
So here the scrubber removed those pipelines while the container is still in OPEN state.

2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying pipeline:Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z]
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z] moved to CLOSED state

The above is the SCM log from the first restart.

Now, after the 2nd restart, the pipeline is no longer in the DB, but the container is still in OPEN state.

Contributor Author

I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.

Contributor Author

As mentioned in the JIRA, another scenario where this can happen:

This can also happen in other scenarios, for example when safeModeHandler calls finalizeAndDestroyPipeline and SCM is then restarted.

We remove the pipeline from the DB while the container can still be in OPEN state (this can happen because the close container command has been triggered but not yet processed). When SCM restarts, we can end up in this scenario.

Contributor

bq. I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.
Agree, we should have a special state to indicate that the pipeline is in this condition. We can fix it in a separate JIRA.

Contributor Author

Making scrubbing start only after safe mode is handled in PR #605.
I see that we can do it without introducing a new state.

Contributor

I think there is value in keeping the scrubber running in safe mode. Without it, any pipeline created/restored during safe mode will be stuck there forever if any issue is hit during pipeline creation. This prevents new pipelines from being created to exit safe mode.

E.g., when datanodes restart during safe-mode pipeline creation, before the pipeline report has changed the SCM pipeline state from ALLOCATED to OPEN.

Contributor Author
@bharatviswa504 bharatviswa504 Feb 26, 2020

If old pipelines are reported, they are only accounted for in the safe-mode rule calculation (because the pipeline count is read from the pipeline DB during startup).
Once we are out of safe mode, we trigger pipeline creation.

So, to come out of safe mode, the pipelines which have already been created need to be reported according to the configured percentage; once they are, we can come out of safe mode.

The main reason not to run the scrubber in safe mode is that if it closes pipelines for which the datanodes have not yet reported, we shall never come out of safe mode. To avoid this kind of scenario, I think running the scrubber in safe mode is not correct.
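
To make the concern concrete, here is a simplified version of the safe-mode pipeline check I have in mind; the method name and formula are assumptions of mine, not the actual Ozone implementation.

// Simplified illustration of the safe-mode pipeline rule discussed above.
static boolean pipelineRuleSatisfied(int reportedPipelines,
    int pipelinesInDbAtStartup, double requiredPercent) {
  // If un-reported pipelines are scrubbed away during safe mode,
  // reportedPipelines can never reach the threshold and SCM stays in
  // safe mode forever.
  return reportedPipelines >= requiredPercent * pipelinesInDbAtStartup;
}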

Contributor Author
@bharatviswa504 bharatviswa504 Feb 26, 2020

The fix for not running the scrubber in safe mode is in HDDS-3072; we can discuss that further on #605.
If this is fine, we can get this in.

Contributor

bq. The main reason not to run the scrubber in safe mode is that if it closes pipelines for which the datanodes have not yet reported, we shall never come out of safe mode.

Multi-raft allows an additional pipeline to be created on top of an existing one if the existing one is not functional. To a new OM client wanting to write, there is no difference between a pipeline that was loaded and reported and one that was created and reported. If those loaded-but-not-reported pipelines are not working, we should use the scrubber so they can be recreated and reported. Agreed, we can discuss this on #605.

@xiaoyuyao
Contributor

+1, Let's discuss the scrubber issue in #605.

Comment on lines 132 to 141
// Not firing CLOSE_CONTAINER event because CloseContainer event
// handler is not registered by the time when we come
// here. So, we are calling update Container state to set
// container state to CLOSING, and later replication manager takes care
// of send close commands to datanode to close containers on the
// datanode.

// Skipping pipeline to container removal because, we got a
// pipelineNotFoundException when adding container to
// pipeline. So, we can only update container state.
Contributor

We don't need this comment. Since SCMContainerManager is the one taking this decision, it should not fire an event to close the container; it is logical to directly update the container state.

It is very explicit that we are performing this in the case of PipelineNotFoundException, so there is no need to repeat it.

Contributor Author

Done.

Comment on lines +142 to +143
updateContainerState(container.containerID(),
HddsProtos.LifeCycleEvent.FINALIZE, true);
Contributor

We don't have to introduce a new updateContainerState method. Since we are doing this inside the constructor, we don't need any lock.

We can directly call containerStateManager.updateContainerState(container.containerID(), HddsProtos.LifeCycleEvent.FINALIZE)

Contributor Author
@bharatviswa504 bharatviswa504 Feb 27, 2020

This call is done this way so that the state change is also reflected in the container DB.
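
In other words, the overload does roughly the following (a sketch, assuming containerStateManager is the existing field of SCMContainerManager, while containerStore is only a stand-in name for the container DB handle):

// Sketch only: perform the in-memory state transition and, when requested,
// persist the updated ContainerInfo so the change survives the next restart.
private ContainerInfo updateContainerState(ContainerID containerID,
    HddsProtos.LifeCycleEvent event, boolean persistToDb) throws IOException {
  ContainerInfo updated =
      containerStateManager.updateContainerState(containerID, event);
  if (persistToDb) {
    containerStore.put(containerID, updated); // hypothetical DB handle name
  }
  return updated;
}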

@nandakumar131
Contributor

/retest

@bharatviswa504
Contributor Author

Thank You @nandakumar131 and @xiaoyuyao for the review.

@bharatviswa504 bharatviswa504 merged commit b441954 into apache:master Feb 28, 2020
asfgit pushed a commit that referenced this pull request Mar 2, 2020