@@ -30,6 +30,7 @@
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationFactor;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationType;
+import org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException;
 import org.apache.hadoop.hdds.server.ServerUtils;
 import org.apache.hadoop.ozone.OzoneConsts;
 import org.apache.hadoop.hdds.utils.BatchOperation;
@@ -85,7 +86,8 @@ public class SCMContainerManager implements ContainerManager {
    * @throws IOException on Failure.
    */
   public SCMContainerManager(final Configuration conf,
-      PipelineManager pipelineManager) throws IOException {
+      PipelineManager pipelineManager)
+      throws IOException {

     final File containerDBPath = getContainerDBPath(conf);
     final int cacheSize = conf.getInt(OZONE_SCM_DB_CACHE_SIZE_MB,
@@ -117,9 +119,17 @@ private void loadExistingContainers() throws IOException {
           ContainerInfoProto.PARSER.parseFrom(entry.getValue()));
       Preconditions.checkNotNull(container);
       containerStateManager.loadContainer(container);
-      if (container.getState() == LifeCycleState.OPEN) {
-        pipelineManager.addContainerToPipeline(container.getPipelineID(),
-            ContainerID.valueof(container.getContainerID()));
+      try {
+        if (container.getState() == LifeCycleState.OPEN) {

Contributor commented:

The pipeline scrubber only destroys pipelines that have been in ALLOCATED state for too long, and the container loader only calls addContainerToPipeline if the container is in OPEN state.

If I understand correctly, we should not have an OPEN container assigned to an ALLOCATED pipeline.
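
To make that rule concrete, here is a minimal sketch of the scrubbing behavior described above, assuming simplified names (scrubAllocatedPipelines, its parameters, and the finalizeAndDestroyPipeline helper are illustrative, not the actual SCMPipelineManager code):

    import java.io.IOException;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Hedged sketch: only pipelines stuck in ALLOCATED past a timeout are
    // destroyed. OPEN pipelines (the only ones the container loader attaches
    // containers to) are never scrubbed.
    void scrubAllocatedPipelines(List<Pipeline> pipelines,
        Duration allocatedTimeout) throws IOException {
      final Instant cutoff = Instant.now().minus(allocatedTimeout);
      for (Pipeline pipeline : pipelines) {
        if (pipeline.getPipelineState() == Pipeline.PipelineState.ALLOCATED
            && pipeline.getCreationTimestamp().isBefore(cutoff)) {
          finalizeAndDestroyPipeline(pipeline);
        }
      }
    }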

Contributor Author commented:

When SCM is restarted, all pipelines are loaded in ALLOCATED state; they are moved to OPEN state only after pipeline reports are received from the datanodes. So here the scrubber removed those pipelines while the containers were still in OPEN state.

2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying pipeline:Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z]
2020-02-20 12:42:18,947 [RatisPipelineUtilsThread] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: 35dff62d-9bfa-449b-b6e8-6f00cc8c1b6e, Nodes: 53fc2e1a-73da-4ae7-8725-9cc23ac6c393{ip: 10.65.54.245, host: om-ha-3.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}2346b987-3126-48b8-b2d2-e8244cb2e0ae{ip: 10.65.51.168, host: om-ha-2.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}45987d8b-4bfd-4ccc-bf2f-224bcf5b0dcd{ip: 10.65.51.49, host: om-ha-1.vpc.cloudera.com, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:null, CreationTimestamp2020-02-20T03:59:02.043Z] moved to CLOSED state

The SCM log above is from the first restart.

After the second restart, the pipeline is no longer in the DB, but the container is still in OPEN state.

Contributor Author commented:

I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.

Contributor Author commented:

As mentioned in the JIRA, there is another scenario where this can happen: SafeModeHandler calls finalizeAndDestroyPipeline and SCM is then restarted.

The pipeline is removed from the DB while the container can still be in OPEN state (a close-container command has been triggered but not yet processed). After the SCM restart, we end up in this scenario.

Contributor commented:

> I am also thinking the scrubber should not delete pipelines until we are out of safe mode; I will try to do that in a new JIRA.

Agreed, we should have a special state to indicate that condition of the pipeline. We can fix it in a separate JIRA.

Contributor Author commented:

Starting the scrubber only after safe mode is handled in PR #605.
I see that we can do it without introducing a new state.

Contributor commented:

I think there is value in keeping the scrubber running in safe mode. Without it, any pipeline created or restored during safe mode will be stuck there forever if any issue is hit during pipeline creation, which prevents new pipelines from being created to exit safe mode.

E.g., when datanodes restart during safe-mode pipeline creation, before the pipeline report has moved the SCM pipeline state from ALLOCATED to OPEN.

Contributor Author (@bharatviswa504) commented on Feb 26, 2020:

Old pipelines, if reported, are counted only toward the safe-mode rule calculation (the pipeline count is read from the pipeline DB during startup). Once we are out of safe mode, we trigger new pipeline creation.

So to come out of safe mode, the pipelines that have already been created just need to be reported according to the configured percentage.

The main reason not to run the scrubber in safe mode is that if it closes pipelines whose datanodes have not yet reported, we shall never come out of safe mode. To avoid that scenario, I think running the scrubber in safe mode is not correct.
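
A simplified sketch of the safe-mode accounting being argued here (method and parameter names are assumptions for illustration, not the actual safe-mode rule implementation):

    // Hedged sketch: only pipelines loaded from the pipeline DB at startup
    // are counted. If the scrubber destroys them before the datanodes
    // report, this ratio can never reach the threshold and SCM never exits
    // safe mode.
    static boolean pipelineRuleSatisfied(int pipelinesLoadedFromDb,
        int pipelinesReported, double configuredPercent) {
      if (pipelinesLoadedFromDb == 0) {
        return true; // fresh cluster: nothing to wait for
      }
      return (double) pipelinesReported / pipelinesLoadedFromDb
          >= configuredPercent;
    }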

Contributor Author (@bharatviswa504) commented on Feb 26, 2020:

The fix for not running the scrubber in safe mode is in HDDS-3072; we can discuss that further on #605.
If this is fine, we can get this PR in.

Contributor commented:

> The main reason not to run the scrubber in safe mode is that if it closes pipelines whose datanodes have not yet reported, we shall never come out of safe mode.

Multi-raft allows additional pipelines to be created on top of existing ones if those are not functional. To a new OM client that wants to write, there is no difference between a pipeline that was loaded and reported and one that was created and reported. If the loaded-but-not-reported pipelines are not working, we should use the scrubber so they can be recreated and reported. Agreed, we can discuss this on #605.

+          pipelineManager.addContainerToPipeline(container.getPipelineID(),
+              ContainerID.valueof(container.getContainerID()));
+        }
+      } catch (PipelineNotFoundException ex) {
+        LOG.warn("Found a Container {} which is in {} state with pipeline {} " +
+            "that does not exist. Closing Container.", container,
+            container.getState(), container.getPipelineID());
+        updateContainerState(container.containerID(),
+            HddsProtos.LifeCycleEvent.FINALIZE, true);

Comment on lines +131 to +132

Contributor commented:

We don't have to introduce a new updateContainerState method. Since we are doing this inside a constructor, we don't need any lock.

We can directly call containerStateManager.updateContainerState(container.containerID(), HddsProtos.LifeCycleEvent.FINALIZE).

Contributor Author (@bharatviswa504) commented on Feb 27, 2020:

This call is made through the new overload so that the state change is also reflected in the container DB.
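
Roughly, the difference between the two options (a sketch assembled from this PR's diff of updateContainerState further down, not verbatim SCM code):

    // Reviewer's suggestion: in-memory state transition only.
    containerStateManager.updateContainerState(container.containerID(),
        HddsProtos.LifeCycleEvent.FINALIZE);

    // The private overload used in this PR additionally persists the new
    // state, so the FINALIZE survives the next SCM restart:
    final byte[] dbKey = Longs.toByteArray(container.containerID().getId());
    containerStore.put(dbKey, container.getProtobuf().toByteArray());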

       }
     }
   }
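
For context on what FINALIZE does in the catch block above, a hedged summary of the container lifecycle transition (my reading of the HddsProtos state machine, not quoted from the Ozone source):

    // OPEN --FINALIZE--> CLOSING --CLOSE--> CLOSED
    // Firing FINALIZE parks the orphaned container in CLOSING so the normal
    // close path can complete it, instead of leaving it OPEN on a pipeline
    // that no longer exists.
    static HddsProtos.LifeCycleState afterFinalize(HddsProtos.LifeCycleState s) {
      if (s == HddsProtos.LifeCycleState.OPEN) {
        return HddsProtos.LifeCycleState.CLOSING;
      }
      throw new IllegalArgumentException("FINALIZE only applies to OPEN here");
    }
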
@@ -323,6 +333,15 @@ public HddsProtos.LifeCycleState updateContainerState(
       ContainerID containerID, HddsProtos.LifeCycleEvent event)
       throws IOException {
     // Should we return the updated ContainerInfo instead of LifeCycleState?
+    return updateContainerState(containerID, event, false);
+  }
+
+
+  private HddsProtos.LifeCycleState updateContainerState(
+      ContainerID containerID, HddsProtos.LifeCycleEvent event,
+      boolean skipPipelineToContainerRemove)
+      throws IOException {
+    // Should we return the updated ContainerInfo instead of LifeCycleState?
     lock.lock();
     try {
       final ContainerInfo container = containerStateManager
@@ -331,10 +350,13 @@ public HddsProtos.LifeCycleState updateContainerState(
       containerStateManager.updateContainerState(containerID, event);
       final LifeCycleState newState = container.getState();

-      if (oldState == LifeCycleState.OPEN && newState != LifeCycleState.OPEN) {
-        pipelineManager
-            .removeContainerFromPipeline(container.getPipelineID(),
-                containerID);
+      if (!skipPipelineToContainerRemove) {
+        if (oldState == LifeCycleState.OPEN &&
+            newState != LifeCycleState.OPEN) {
+          pipelineManager
+              .removeContainerFromPipeline(container.getPipelineID(),
+                  containerID);
+        }
       }
       final byte[] dbKey = Longs.toByteArray(containerID.getId());
       containerStore.put(dbKey, container.getProtobuf().toByteArray());
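
A short usage sketch of the two entry points after this change (call sites are illustrative):

    // Public path, unchanged for callers: detaches the container from its
    // pipeline when it leaves OPEN state.
    scmContainerManager.updateContainerState(containerID,
        HddsProtos.LifeCycleEvent.FINALIZE);

    // Private path, reachable only from loadExistingContainers(), where the
    // pipeline is already gone and removeContainerFromPipeline would fail:
    // updateContainerState(containerID, HddsProtos.LifeCycleEvent.FINALIZE,
    //     true);
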
@@ -350,7 +372,6 @@ public HddsProtos.LifeCycleState updateContainerState(
     }
   }

-
   /**
    * Update deleteTransactionId according to deleteTransactionMap.
    *