HDDS-3066. SCM crash during loading containers to DB. #596
Changes from all commits: 4a7c8ff, 85ce437, b499ab4, 7571b84, c4895bf, 591a129, cecc04c, 34ee16f, c5ee13a, f8d64d2
@@ -30,6 +30,7 @@
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationFactor;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationType;
+import org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException;
 import org.apache.hadoop.hdds.server.ServerUtils;
 import org.apache.hadoop.ozone.OzoneConsts;
 import org.apache.hadoop.hdds.utils.BatchOperation;

@@ -85,7 +86,8 @@ public class SCMContainerManager implements ContainerManager {
    * @throws IOException on Failure.
    */
   public SCMContainerManager(final Configuration conf,
-      PipelineManager pipelineManager) throws IOException {
+      PipelineManager pipelineManager)
+      throws IOException {

     final File containerDBPath = getContainerDBPath(conf);
     final int cacheSize = conf.getInt(OZONE_SCM_DB_CACHE_SIZE_MB,

@@ -117,9 +119,17 @@ private void loadExistingContainers() throws IOException {
           ContainerInfoProto.PARSER.parseFrom(entry.getValue()));
       Preconditions.checkNotNull(container);
       containerStateManager.loadContainer(container);
-      if (container.getState() == LifeCycleState.OPEN) {
-        pipelineManager.addContainerToPipeline(container.getPipelineID(),
-            ContainerID.valueof(container.getContainerID()));
+      try {
+        if (container.getState() == LifeCycleState.OPEN) {
+          pipelineManager.addContainerToPipeline(container.getPipelineID(),
+              ContainerID.valueof(container.getContainerID()));
+        }
+      } catch (PipelineNotFoundException ex) {
+        LOG.warn("Found a Container {} which is in {} state with pipeline {} " +
+            "that does not exist. Closing Container.", container,
+            container.getState(), container.getPipelineID());
+        updateContainerState(container.containerID(),
+            HddsProtos.LifeCycleEvent.FINALIZE, true);
Comment on lines +131 to +132

Contributor:
We don't have to introduce a new method to … We can directly call …

Contributor (Author):
This call is being done so that the state change is also reflected in the container DB.
       }
     }
   }
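To illustrate the author's point above about why the loader goes through updateContainerState instead of only touching the in-memory state: the public path also writes the updated ContainerInfo back to the container DB, so the FINALIZE survives the next SCM restart. A minimal, hypothetical sketch (stand-in classes, not the actual Ozone types):

```java
import java.util.HashMap;
import java.util.Map;

class ContainerStatePersistenceSketch {

  enum LifeCycleState { OPEN, CLOSING, CLOSED }

  // Stand-in for the in-memory ContainerStateManager.
  static final Map<Long, LifeCycleState> IN_MEMORY = new HashMap<>();

  // Stand-in for the persistent container DB behind containerStore.
  static final Map<Long, LifeCycleState> CONTAINER_DB = new HashMap<>();

  // Updates only the in-memory view; the change is lost on the next restart.
  static void updateInMemoryOnly(long containerId, LifeCycleState newState) {
    IN_MEMORY.put(containerId, newState);
  }

  // Mirrors what the public updateContainerState path does at the end:
  // update memory, then persist the updated container info.
  static void updateAndPersist(long containerId, LifeCycleState newState) {
    IN_MEMORY.put(containerId, newState);
    CONTAINER_DB.put(containerId, newState); // like containerStore.put(dbKey, ...)
  }

  public static void main(String[] args) {
    IN_MEMORY.put(1L, LifeCycleState.OPEN);
    CONTAINER_DB.put(1L, LifeCycleState.OPEN);

    updateInMemoryOnly(1L, LifeCycleState.CLOSING);
    System.out.println("DB after in-memory-only update: " + CONTAINER_DB.get(1L)); // still OPEN

    updateAndPersist(1L, LifeCycleState.CLOSING);
    System.out.println("DB after persisted update: " + CONTAINER_DB.get(1L));      // CLOSING
  }
}
```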
@@ -323,6 +333,15 @@ public HddsProtos.LifeCycleState updateContainerState(
       ContainerID containerID, HddsProtos.LifeCycleEvent event)
       throws IOException {
     // Should we return the updated ContainerInfo instead of LifeCycleState?
+    return updateContainerState(containerID, event, false);
+  }
+
+  private HddsProtos.LifeCycleState updateContainerState(
+      ContainerID containerID, HddsProtos.LifeCycleEvent event,
+      boolean skipPipelineToContainerRemove)
+      throws IOException {
+    // Should we return the updated ContainerInfo instead of LifeCycleState?
     lock.lock();
     try {
       final ContainerInfo container = containerStateManager

@@ -331,10 +350,13 @@ public HddsProtos.LifeCycleState updateContainerState(
       containerStateManager.updateContainerState(containerID, event);
       final LifeCycleState newState = container.getState();

-      if (oldState == LifeCycleState.OPEN && newState != LifeCycleState.OPEN) {
-        pipelineManager
-            .removeContainerFromPipeline(container.getPipelineID(),
-                containerID);
+      if (!skipPipelineToContainerRemove) {
+        if (oldState == LifeCycleState.OPEN &&
+            newState != LifeCycleState.OPEN) {
+          pipelineManager
+              .removeContainerFromPipeline(container.getPipelineID(),
+                  containerID);
+        }
       }
       final byte[] dbKey = Longs.toByteArray(containerID.getId());
       containerStore.put(dbKey, container.getProtobuf().toByteArray());

@@ -350,7 +372,6 @@ public HddsProtos.LifeCycleState updateContainerState(
     }
   }

-
   /**
    * Update deleteTransactionId according to deleteTransactionMap.
    *

The pipeline scrubber only destroys pipelines that have been in ALLOCATED state for too long, and the container loader only calls addContainerToPipeline if the container is in OPEN state.
If I understand correctly, we should not have an OPEN container assigned to an ALLOCATED pipeline.
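For reference, a rough sketch of the scrubber rule being described (illustrative names and timeout, not the real SCMPipelineManager API): only pipelines stuck in ALLOCATED beyond a timeout are destroyed, so OPEN pipelines are never scrubbed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class PipelineScrubberSketch {

  enum PipelineState { ALLOCATED, OPEN, CLOSED }

  static class Pipeline {
    final String id;
    final PipelineState state;
    final Instant creationTimestamp;

    Pipeline(String id, PipelineState state, Instant creationTimestamp) {
      this.id = id;
      this.state = state;
      this.creationTimestamp = creationTimestamp;
    }
  }

  // Returns the pipelines the scrubber would destroy: only those that have
  // stayed in ALLOCATED longer than the configured timeout.
  static List<Pipeline> scrubCandidates(List<Pipeline> pipelines,
      Duration allocatedTimeout, Instant now) {
    List<Pipeline> stale = new ArrayList<>();
    for (Pipeline p : pipelines) {
      boolean stuckInAllocated = p.state == PipelineState.ALLOCATED
          && Duration.between(p.creationTimestamp, now)
              .compareTo(allocatedTimeout) > 0;
      if (stuckInAllocated) {
        stale.add(p);
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    List<Pipeline> pipelines = new ArrayList<>();
    pipelines.add(new Pipeline("p1", PipelineState.ALLOCATED,
        now.minus(Duration.ofMinutes(10)))); // stale ALLOCATED, gets destroyed
    pipelines.add(new Pipeline("p2", PipelineState.OPEN,
        now.minus(Duration.ofMinutes(10)))); // OPEN, never scrubbed
    List<Pipeline> toDestroy =
        scrubCandidates(pipelines, Duration.ofMinutes(5), now);
    System.out.println("Pipelines to destroy: " + toDestroy.size()); // 1
  }
}
```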
When SCM is restarted, all pipelines are loaded in ALLOCATED state; they are moved to OPEN state only once pipeline reports are received from the datanodes. So here the scrubber removed those pipelines while the containers were still in OPEN state.
SCM log from the first restart:
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying pipeline:Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z]
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z] moved to CLOSED state
Now, after the 2nd restart, the pipeline is no longer in the DB but the container is still in OPEN state.
Also, I am thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.
As mentioned in the JIRA, another scenario where this can happen: safeModeHandler calls finalizeAndDestroyPipeline and then SCM is restarted. The pipeline is removed from the DB while the container can still be in OPEN state (because the close-container command has been triggered but not yet processed). After the SCM restart we can end up in this scenario.
bq. Also, I am thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.
Agree; we should have a special state to indicate that condition of the pipeline. We can fix it in a separate JIRA.
Starting the scrubber only after safe mode is handled in PR #605.
I see that we can do this without introducing a new state.
I think there is value in keeping the scrubber running during safe mode. Without it, any pipeline created or restored during safe mode will be stuck there forever if an issue is hit during pipeline creation, which prevents new pipelines from being created to exit safe mode.
E.g., when datanodes restart during safe-mode pipeline creation, before the pipeline report has moved the SCM pipeline state from ALLOCATED to OPEN.
Only the old pipelines are accounted for in the safe mode rule calculation, because the pipeline count is read from the pipeline DB during startup. Once we are out of safe mode, we trigger pipeline creation.
So, to come out of safe mode, the pipelines that were already created must be reported according to the configured percentage.
The main reason not to run the scrubber in safe mode is that if it closes pipelines whose datanodes have not yet reported, we shall never come out of safe mode. To avoid this kind of scenario, I think running the scrubber in safe mode is not correct.
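To make the concern concrete, here is a hypothetical sketch of a percentage-based pipeline safe mode rule of the kind described (class, field names, and the 0.9 threshold are illustrative, not the real SCM safe mode rule classes): because the denominator is fixed from the pipeline DB at startup, scrubbing pipelines before their datanodes report can leave the rule permanently unsatisfied.

```java
class SafeModePipelineRuleSketch {

  private final int pipelinesInDbAtStartup; // read once from the pipeline DB
  private final double reportedThreshold;   // configured percentage, e.g. 0.9
  private int reportedPipelines;            // grows as datanodes report

  SafeModePipelineRuleSketch(int pipelinesInDbAtStartup,
      double reportedThreshold) {
    this.pipelinesInDbAtStartup = pipelinesInDbAtStartup;
    this.reportedThreshold = reportedThreshold;
  }

  void onPipelineReport() {
    reportedPipelines++;
  }

  boolean canExitSafeMode() {
    if (pipelinesInDbAtStartup == 0) {
      return true; // nothing to wait for
    }
    return (double) reportedPipelines / pipelinesInDbAtStartup
        >= reportedThreshold;
  }

  public static void main(String[] args) {
    SafeModePipelineRuleSketch rule = new SafeModePipelineRuleSketch(10, 0.9);
    for (int i = 0; i < 8; i++) {
      rule.onPipelineReport();
    }
    // Only 8 of 10 pipelines reported: if the scrubber destroyed the other two
    // before their datanodes reported, this rule would never be satisfied.
    System.out.println("Can exit safe mode: " + rule.canExitSafeMode()); // false
  }
}
```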
The fix for not running the scrubber in safe mode is in HDDS-3072; we can discuss that further on #605.
If this is fine, we can get this in.
bq. The main reason not to run the scrubber in safe mode is that if it closes pipelines whose datanodes have not yet reported, we shall never come out of safe mode.
Multi-raft allows additional pipelines to be created on top of existing ones if they are not functional. To a new OM client writing data, there is no difference between a pipeline that was loaded and reported versus one that was created and reported. If those loaded-but-not-reported pipelines are not working, we should use the scrubber so they can be recreated and reported. Agreed, we can discuss this on #605.