@duongkame duongkame commented Jan 6, 2023

What changes were proposed in this pull request?

SCM should allow adding a container to a CLOSED pipeline, because the pipeline's state can change while the container-creation transaction is still waiting to be processed by SCM.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7738

How was this patch tested?

Unit tests.
Standard CI: https://github.com/duongkame/ozone/actions/runs/3857497185/jobs/6574962174

@duongkame duongkame marked this pull request as ready for review January 6, 2023 18:49
@duongkame
Contributor Author

@szetszwo @errose28 @aswinshakil please have a look. Some CI integration steps are failing, but they pass when run locally on my laptop and in the branch CI, so I guess they're just flaky. It would be nice if one of you could retry them individually.

@sodonnel
Contributor

sodonnel commented Jan 6, 2023

Does the SCM terminate on the active SCM, or is this on the follower SCMs?

If we allow an open container on a closed pipeline, what will close the open container? The normal close flow is triggered when either the container fills up and the DN triggers a close, or the pipeline is closed and it triggers a close to all containers on the pipeline.

I am also wondering, what happens to a container which is allocated on SCM, but never gets anything written to it. It will never get replicas on a DN, and hence will never have any replicas reported. Will it get cleaned up or will it hang around forever?

@duongkame
Contributor Author

Thanks for having a look, @sodonnel.

Does the SCM terminate on the active SCM, or is this on the follower SCMs?

The same transactions are replayed on all SCMs and result in the same errors, which prevent SCM from starting up.

If we allow an open container on a closed pipeline, what will close the open container? The normal close flow is triggered when either the container fills up and the DN triggers a close, or the pipeline is closed and it triggers a close to all containers on the pipeline.

I think such containers will be closed by the pipeline scrubber, which periodically scans and closes containers associated with closed pipelines.
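To make the scrubbing idea above concrete, here is a minimal sketch of a single scan pass that closes open containers whose pipeline is already closed. It is not Ozone's actual scrubber: the class `ScrubberSketch`, its `State` enum, and all method names are hypothetical, and the periodic scheduling is left out so the pass stays deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a scrubber pass; not Ozone code. */
class ScrubberSketch {
  enum State { OPEN, CLOSED }

  // containerId -> pipelineId the container was allocated on
  private final Map<Long, String> containerPipeline = new LinkedHashMap<>();
  private final Map<Long, State> containerState = new LinkedHashMap<>();
  private final Map<String, State> pipelineState = new LinkedHashMap<>();

  void addPipeline(String pipelineId, State state) {
    pipelineState.put(pipelineId, state);
  }

  void addOpenContainer(long containerId, String pipelineId) {
    containerPipeline.put(containerId, pipelineId);
    containerState.put(containerId, State.OPEN);
  }

  State stateOf(long containerId) {
    return containerState.get(containerId);
  }

  /** Runs one scan and returns the IDs of containers closed by this pass. */
  List<Long> scrubOnce() {
    List<Long> closedNow = new ArrayList<>();
    for (Map.Entry<Long, String> entry : containerPipeline.entrySet()) {
      long containerId = entry.getKey();
      boolean pipelineClosed =
          pipelineState.get(entry.getValue()) == State.CLOSED;
      if (pipelineClosed && containerState.get(containerId) == State.OPEN) {
        containerState.put(containerId, State.CLOSED);
        closedNow.add(containerId);
      }
    }
    return closedNow;
  }
}
```

In this model, an open container that ends up on a closed pipeline is picked up by the next pass, and a second pass finds nothing left to close. Whether the real scrubber behaves this way for such containers is exactly what the comment above is unsure about.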

I am also wondering, what happens to a container which is allocated on SCM, but never gets anything written to it. It will never get replicas on a DN, and hence will never have any replicas reported. Will it get cleaned up or will it hang around forever?

I'm not sure about this. Basically, I can't find any process that cleans up empty containers, and it looks like a container can only be removed via the admin CLI.

Alternatively, SCM could just reject the transaction (throwing a non-terminating exception) and move on. However, I'm not sure about the consequences.
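For comparison, the "reject and move on" alternative could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ReplaySketch`, `RejectedTransactionException`, and the method names are hypothetical and do not correspond to SCM's real transaction handling.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of replaying transactions without terminating; not Ozone code. */
class ReplaySketch {
  /** Hypothetical exception type marking a transaction that is skipped, not fatal. */
  static class RejectedTransactionException extends Exception {
    RejectedTransactionException(String message) {
      super(message);
    }
  }

  private final Set<String> closedPipelines = new HashSet<>();
  private final List<Long> applied = new ArrayList<>();
  private final List<Long> rejected = new ArrayList<>();

  void closePipeline(String pipelineId) {
    closedPipelines.add(pipelineId);
  }

  private void apply(long containerId, String pipelineId)
      throws RejectedTransactionException {
    if (closedPipelines.contains(pipelineId)) {
      throw new RejectedTransactionException(
          "Cannot add container " + containerId
              + " to closed pipeline " + pipelineId);
    }
    applied.add(containerId);
  }

  /** Replays one transaction; a rejection is recorded instead of propagated. */
  void replay(long containerId, String pipelineId) {
    try {
      apply(containerId, pipelineId);
    } catch (RejectedTransactionException e) {
      rejected.add(containerId);
    }
  }

  List<Long> getApplied() {
    return applied;
  }

  List<Long> getRejected() {
    return rejected;
  }
}
```

The sketch only shows that replay continues past the rejected transaction. It says nothing about what should happen to the rejected container afterwards, which is the open question raised above.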

Contributor

@szetszwo szetszwo left a comment

@duongkame , thanks a lot for working on this!

  • We should print a WARN message when it happens; see below.
  • Also, let's keep the addContainerToPipelineSCMStart method for now so that it is easier to back port this change. We may do the code refactoring later.
+++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineStateMap.java
@@ -106,9 +106,10 @@ void addContainerToPipeline(PipelineID pipelineID, ContainerID containerID)
 
     Pipeline pipeline = getPipeline(pipelineID);
     if (pipeline.isClosed()) {
-      throw new IOException(String
-          .format("Cannot add container to pipeline=%s in closed state",
-              pipelineID));
+      LOG.warn("Adding container {} to pipeline={} in CLOSED state."
+          + "  This happens only for some exceptional cases."
+          + "  Check for the previous exceptions.",
+          containerID, pipelineID);
     }
     pipeline2container.get(pipelineID).add(containerID);
   }
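The effect of the suggested change can be tried in isolation with a small stand-in for the state map. This is a sketch under assumptions, not the real `PipelineStateMap`: `PipelineMapSketch`, its string pipeline IDs, the `java.util.logging` logger, and the warning counter are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

/** Hypothetical, simplified stand-in for a pipeline-to-container map. */
class PipelineMapSketch {
  private static final Logger LOG = Logger.getLogger("PipelineMapSketch");

  private final Map<String, Boolean> closed = new HashMap<>();
  private final Map<String, Set<Long>> pipeline2container = new HashMap<>();
  private int warnings = 0;

  void addPipeline(String pipelineId) {
    closed.put(pipelineId, false);
    pipeline2container.put(pipelineId, new TreeSet<>());
  }

  void closePipeline(String pipelineId) {
    if (!closed.containsKey(pipelineId)) {
      throw new IllegalArgumentException("Unknown pipeline " + pipelineId);
    }
    closed.put(pipelineId, true);
  }

  /** Warns instead of throwing when the pipeline is already closed. */
  void addContainerToPipeline(String pipelineId, long containerId) {
    Set<Long> containers = pipeline2container.get(pipelineId);
    if (containers == null) {
      throw new IllegalArgumentException("Unknown pipeline " + pipelineId);
    }
    if (closed.get(pipelineId)) {
      warnings++;
      LOG.warning("Adding container " + containerId
          + " to pipeline=" + pipelineId + " in CLOSED state.");
    }
    containers.add(containerId);
  }

  Set<Long> getContainers(String pipelineId) {
    return pipeline2container.get(pipelineId);
  }

  int getWarnings() {
    return warnings;
  }
}
```

Closing a pipeline between two `addContainerToPipeline` calls models the race described in the PR: the second call logs one warning and still records the container, where the pre-change code would have thrown an `IOException`.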

@duongkame
Contributor Author

  • We should print a WARN message when it happens; see below.
  • Also, let's keep the addContainerToPipelineSCMStart method for now so that it is easier to back port this change. We may do the code refactoring later.

Thanks for the suggestions, @szetszwo. I've made the updates.
Once this is merged I'll submit another PR to clean up the code.

Contributor

@szetszwo szetszwo left a comment

+1 the change looks good.

@szetszwo szetszwo merged commit 2eb5805 into apache:master Jan 7, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jan 9, 2023
* master: (176 commits)
  HDDS-7726. EC: Enhance datanode reconstruction log message (apache#4155)
  HDDS-7739. EC: Increase the information in the RM sending command log message (apache#4153)
  HDDS-7652. Volume Quota not enforced during write when bucket quota is not set (apache#4124)
  HDDS-7628. Intermittent failure in TestOzoneContainerWithTLS (apache#4142)
  HDDS-7695. EC metrics related to replication commands don't add up (apache#4152)
  HDDS-7729. EC: ECContainerReplicaCount should handle pending delete of unhealthy replicas (apache#4146)
  HDDS-7738. SCM terminates when adding container to a closed pipeline (apache#4154)
  HDDS-7243. Remove RequestFeatureValidator from echoRPC method which supports only ValidationCondition.OLDER_CLIENT_REQUESTS (apache#4051)
  HDDS-7708. No check for certificate duration config scenarios. (apache#4149)
  HDDS-7727. EC: SCM unregistered event handler for DatanodeCommandCountUpdated (apache#4147)
  HDDS-7606. Add SCM HA support in intellij run (apache#4058)
  HDDS-7666. EC: Unrecoverable EC containers with some remaining replicas may block decommissioning (apache#4118)
  HDDS-7339. Implement Certificate renewal task for services (apache#3982)
  HDDS-7696. MisReplicationHandler does not consider QUASI_CLOSED replicas as sources (apache#4144)
  HDDS-7714. Docker cluster ozone-om-ha fails during docker-compose up (apache#4137)
  HDDS-7716. Log read requests rejected with permission denied in OM audit (apache#4136)
  HDDS-7588. Intermittent failure in TestObjectStoreWithLegacyFS#testFlatKeyStructureWithOBS (apache#4040)
  HDDS-7633. Compile error with Java 11: package com.sun.jmx.mbeanserver is not visible (apache#4077)
  HDDS-7648. Add a servername tag in UGI metrics. (apache#4094)
  HDDS-7564. Update Ozone version after 1.3.0 release (apache#4115)
  ...
@duongkame duongkame deleted the HDDS-7738 branch April 12, 2025 00:12