HDDS-3066. SCM crash during loading containers to DB. #596
Conversation
The pipeline scrubber only destroys pipelines that have been in the ALLOCATED state for too long, and the container loader only calls addContainerToPipeline if the container is in the OPEN state.
If I understand correctly, we should not have an OPEN container assigned to an ALLOCATED pipeline.
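For context, a minimal sketch of the loading path being discussed. The call chain follows the stack trace in the description, but the body is illustrative rather than the actual SCMContainerManager code; containerStore and listContainers are hypothetical stand-ins for the container-DB iteration done at startup.

// Illustrative sketch only, not the committed code.
private void loadExistingContainers() throws IOException {
  for (ContainerInfo container : containerStore.listContainers()) {
    if (container.getState() == HddsProtos.LifeCycleState.OPEN) {
      // Throws PipelineNotFoundException if the scrubber has already
      // destroyed the ALLOCATED pipeline and removed it from the pipeline DB.
      pipelineManager.addContainerToPipeline(
          container.getPipelineID(), container.containerID());
    }
  }
}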
When SCM is restarted, all pipelines will be in the ALLOCATED state; they are moved to the OPEN state only when pipeline reports are received from the DNs.
So, here the scrubber removed those pipelines while the container is in the OPEN state.
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying pipeline:Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z]
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z] moved to CLOSED state
The above is the SCM log from the first restart.
Now, after the 2nd restart, the pipeline is not in the DB, but the container is in the OPEN state.
I am also thinking the scrubber should not run and delete pipelines until we are out of safe mode; I will try to do that in a new jira.
As mentioned in the Jira, another scenario where this can happen:
This can also happen when safeModeHandler calls finalizeAndDestroyPipeline and then SCM restarts.
We remove the pipeline from the DB while the container can still be in an OPEN state (this can happen because the close container command has been triggered but not yet processed). When SCM restarts, we can end up in this scenario.
bq. And also I am thinking scrubber should not come and delete pipelines until we are out of SafeMode, will try to do that in a new jira.
Agreed, we should have a special state to indicate this condition of the pipeline. We can fix it in a separate JIRA.
Starting the scrubber only after safe mode is handled in PR #605.
I see that we can do this without introducing a new state.
I think there is value in keeping the scrubber running in safe mode. Without it, any pipeline created/restored during safe mode will stay stuck there forever if an issue is hit during pipeline creation, which prevents new pipelines from being created to exit safe mode.
E.g., datanodes restart during safe-mode pipeline creation, before the pipeline report has moved the SCM pipeline state from ALLOCATED to OPEN.
Old pipelines, if reported, are only accounted for in the safe mode rule calculation (because the pipeline count is read from the pipeline DB during startup).
Once we are out of safe mode, we trigger pipeline creation.
So, to come out of safe mode, the pipelines which have already been created just need to be reported according to the configured percentage.
The main reason not to run the scrubber in safe mode is that if it closes pipelines whose datanodes have not yet reported, we shall never come out of safe mode. To avoid this kind of scenario, I think running the scrubber in safe mode is not correct.
bq. The main purpose not to run scrub in safe mode is if it is closing the pipelines where datanodes have still not reported, we shall never come out of safe mode.
Multi-raft allows additional pipelines to be created on top of existing ones if they are not functional. To a new OM client writing data, there is no difference between a pipeline that was loaded and reported and one that was created and reported. If the loaded-but-not-reported pipelines are not working, we should let the scrubber remove them so they can be recreated and reported. Agreed, we can discuss this on #605.
+1, Let's discuss the scrubber issue in #605.
// Not firing CLOSE_CONTAINER event because CloseContainer event
// handler is not registered by the time when we come
// here. So, we are calling update Container state to set
// container state to CLOSING, and later replication manager takes care
// of send close commands to datanode to close containers on the
// datanode.

// Skipping pipeline to container removal because, we got a
// pipelineNotFoundException when adding container to
// pipeline. So, we can only update container state.
We don't need this comment. Since SCMContainerManager is the one taking this decision, it should not fire an event to close the container; it is logical to update the container state directly.
It is already explicit that we are doing this in case of PipelineNotFoundException, so there is no need to repeat it.
Done.
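For reference, a hedged sketch of the startup handling that the quoted comments above and the updateContainerState call below belong to. The catch clause and log text are inferred from the stack trace and the startup log in the description, not copied from the committed change.

try {
  pipelineManager.addContainerToPipeline(
      container.getPipelineID(), container.containerID());
} catch (PipelineNotFoundException ex) {
  // An OPEN container whose pipeline was already scrubbed from the DB:
  // move it to CLOSING directly instead of firing CLOSE_CONTAINER, since
  // the event handler is not registered at this point in startup.
  LOG.warn("Found a Container {} which is in OPEN state with out a "
      + "pipeline {}. Triggering Close Container.",
      container, container.getPipelineID());
  // FINALIZE transitions the container from OPEN to CLOSING; the
  // replication manager later sends close commands to the datanodes.
  updateContainerState(container.containerID(),
      HddsProtos.LifeCycleEvent.FINALIZE, true);
}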
updateContainerState(container.containerID(),
    HddsProtos.LifeCycleEvent.FINALIZE, true);
We don't have to introduce a new updateContainerState method. Since we are doing this inside the constructor, we don't need any lock.
We can directly call containerStateManager.updateContainerState(container.containerID(), HddsProtos.LifeCycleEvent.FINALIZE).
This call is made so that the state change is also reflected in the container DB.
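In other words, the wrapper exists so the new CLOSING state is persisted rather than kept only in memory. A rough illustration of that idea; the store access and serialization calls below are placeholders assumed for the sketch, not the actual SCMContainerManager fields or the patch itself.

// Assumed sketch: update the in-memory state machine (OPEN -> CLOSING on
// FINALIZE), then write the updated ContainerInfo back to the container DB
// so the state survives the next SCM restart.
ContainerInfo updated = containerStateManager.updateContainerState(
    container.containerID(), HddsProtos.LifeCycleEvent.FINALIZE);
containerStore.put(updated.containerID().getBytes(),
    updated.getProtobuf().toByteArray());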
/retest
Thank You @nandakumar131 and @xiaoyuyao for the review.
What changes were proposed in this pull request?
This is happening because the pipeline scrubber removed a pipeline: it closed the pipeline, removed it from the DB, and triggered close-container commands to move its containers to CLOSING. If SCM is restarted before the close container command is handled and the container state changes to CLOSING, the issue below can happen.
This can also happen in other scenarios, for example when safeModeHandler calls finalizeAndDestroyPipeline and SCM is restarted.
The root cause is that the pipeline has been removed from the DB while the container is still in the OPEN state; when we try to get the pipeline during container loading, SCM crashes with a PipelineNotFoundException.
2020-02-21 13:57:34,888 [main] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.addContainerToPipeline(PipelineStateMap.java:110)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.addContainerToPipeline(PipelineStateManager.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.addContainerToPipeline(SCMPipelineManager.java:309)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.loadExistingContainers(SCMContainerManager.java:121)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.<init>(SCMContainerManager.java:107)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:412)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:283)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:215)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:612)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:142)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:117)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:66)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:42)
    at picocli.CommandLine.execute(CommandLine.java:1173)
    at picocli.CommandLine.access$800(CommandLine.java:141)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:55)
2020-02-21 13:57:34,892 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down StorageContainerManager at om-ha-1.vpc.cloudera.com/10.65.51.49
************************************************************/
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3066
How was this patch tested?
Thank You @nandakumar131 for the offline discussion.
Existing tests. Deployed the fix on the cluster, and SCM was able to boot up.
2020-02-24 12:02:12,531 [main] WARN org.apache.hadoop.hdds.scm.container.SCMContainerManager: Found a Container ContainerInfo{id=3, state=OPEN, pipelineID=PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225, stateEnterTime=1159646, owner=om2} which is in OPEN state with out a pipeline PipelineID=afb60e8a-0a69-410a-8699-d2a75e053225. Triggering Close Container.