-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-9479. Pipeline close doesn't wait for containers to be closed. #5604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -87,10 +87,21 @@ public static Codec<Pipeline> getCodec() { | |
| // suggested leader id with high priority | ||
| private final UUID suggestedLeaderId; | ||
|
|
||
| private final Instant stateEnterTime; | ||
|
|
||
| /** | ||
| * The immutable properties of pipeline object is used in | ||
| * ContainerStateManager#getMatchingContainerByPipeline to take a lock on | ||
| * the container allocations for a particular pipeline. | ||
| * <br><br> | ||
| * Since the Pipeline class is immutable, if we want to change the state of | ||
| * the Pipeline we should create a new Pipeline object with the new state. | ||
| * Make sure that you set the value of <i>creationTimestamp</i> properly while | ||
| * creating the new Pipeline object. | ||
| * <br><br> | ||
| * There is no need to worry about the value of <i>stateEnterTime</i> as it's | ||
| * set to <i>Instant.now</i> when you crate the Pipeline object as part of | ||
| * state change. | ||
| */ | ||
| private Pipeline(PipelineID id, | ||
| ReplicationConfig replicationConfig, PipelineState state, | ||
|
|
@@ -102,6 +113,7 @@ private Pipeline(PipelineID id, | |
| this.creationTimestamp = Instant.now(); | ||
| this.suggestedLeaderId = suggestedLeaderId; | ||
| this.replicaIndexes = new HashMap<>(); | ||
| this.stateEnterTime = Instant.now(); | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -140,6 +152,10 @@ public Instant getCreationTimestamp() { | |
| return creationTimestamp; | ||
| } | ||
|
|
||
| public Instant getStateEnterTime() { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO, need set closedState time when close is triggered for pipeline close, else this time is same as creation time as initialized in constructor.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sumitagrawl the Since
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @nandakumar131
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, it is a little strange that creationTime and stateEnterTime are both basically hardcoded to "now". Feels like we should only need one of them, as creation time isn't really "correct". If the pipeline changes state from open to closed, the creation time is going to change too, due to its immutable nature. I can see how this could be confusing to someone in the future.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sodonnel, I completely understand the confusion here. We should refactor the code for better understandability. With the current code, it's easy to introduce bugs in future. We should re-write the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I missed that the builder sets creation time. I guess this is OK as the constructor is private, so the only way to create the object is via the builder. I noticed that the setter is public: We could make that private as the inner Builder class can call the private methods. This is not really in scope for this PR though. |
||
| return stateEnterTime; | ||
| } | ||
|
|
||
| /** | ||
| * Return the suggested leaderId which has a high priority among DNs of the | ||
| * pipeline. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -115,9 +115,14 @@ NavigableSet<ContainerID> getContainersInPipeline(PipelineID pipelineID) | |
|
|
||
| void openPipeline(PipelineID pipelineId) throws IOException; | ||
|
|
||
| @Deprecated | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why leave this deprecated on around? This is not an external facing API, so nothing should be using it except tests. It would probably be better to just remove it now.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True, we can remove this method completely. I just didn't want to have large changes in a single PR if possible, it will become hard for the reviewers. We will have to change another 15+ classes to remove this method, I will create a follow-up Jira to make this change. |
||
| void closePipeline(Pipeline pipeline, boolean onTimeout) | ||
| throws IOException; | ||
|
|
||
| void closePipeline(PipelineID pipelineID) throws IOException; | ||
|
|
||
| void deletePipeline(PipelineID pipelineID) throws IOException; | ||
|
|
||
| void closeStalePipelines(DatanodeDetails datanodeDetails); | ||
|
|
||
| void scrubPipelines() throws IOException; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -489,33 +489,46 @@ private void closeContainersForPipeline(final PipelineID pipelineId) | |
| * put pipeline in CLOSED state. | ||
| * @param pipeline - ID of the pipeline. | ||
| * @param onTimeout - whether to remove pipeline after some time. | ||
| * @throws IOException | ||
| * @throws IOException throws exception in case of failure | ||
| * @deprecated Do not use this method, onTimeout is not honored. | ||
| */ | ||
| @Override | ||
| @Deprecated | ||
| public void closePipeline(Pipeline pipeline, boolean onTimeout) | ||
| throws IOException { | ||
| PipelineID pipelineID = pipeline.getId(); | ||
| throws IOException { | ||
| closePipeline(pipeline.getId()); | ||
| } | ||
|
|
||
| /** | ||
| * Move the Pipeline to CLOSED state. | ||
| * @param pipelineID ID of the Pipeline to be closed | ||
| * @throws IOException In case of exception while closing the Pipeline | ||
| */ | ||
| public void closePipeline(PipelineID pipelineID) throws IOException { | ||
| HddsProtos.PipelineID pipelineIDProtobuf = pipelineID.getProtobuf(); | ||
| // close containers. | ||
| closeContainersForPipeline(pipelineID); | ||
|
|
||
| if (!pipeline.isClosed()) { | ||
| if (!getPipeline(pipelineID).isClosed()) { | ||
| acquireWriteLock(); | ||
| try { | ||
| stateManager.updatePipelineState(pipelineIDProtobuf, | ||
| HddsProtos.PipelineState.PIPELINE_CLOSED); | ||
| } finally { | ||
| releaseWriteLock(); | ||
| } | ||
| LOG.info("Pipeline {} moved to CLOSED state", pipeline); | ||
| LOG.info("Pipeline {} moved to CLOSED state", pipelineID); | ||
| } | ||
|
|
||
| metrics.removePipelineMetrics(pipelineID); | ||
|
|
||
| if (!onTimeout) { | ||
| // close pipeline right away. | ||
| removePipeline(pipeline); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Deletes the Pipeline for the given PipelineID. | ||
| * @param pipelineID ID of the Pipeline to be deleted | ||
| * @throws IOException In case of exception while deleting the Pipeline | ||
| */ | ||
| public void deletePipeline(PipelineID pipelineID) throws IOException { | ||
| removePipeline(getPipeline(pipelineID)); | ||
| } | ||
|
|
||
| /** close the pipelines whose nodes' IPs are stale. | ||
|
|
@@ -535,9 +548,10 @@ public void closeStalePipelines(DatanodeDetails datanodeDetails) { | |
| pipelinesWithStaleIpOrHostname.size()); | ||
| pipelinesWithStaleIpOrHostname.forEach(p -> { | ||
| try { | ||
| LOG.info("Closing the stale pipeline: {}", p.getId()); | ||
| closePipeline(p, false); | ||
| LOG.info("Closed the stale pipeline: {}", p.getId()); | ||
| final PipelineID id = p.getId(); | ||
| LOG.info("Closing the stale pipeline: {}", id); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: I am not sure we need this log message. We have a message below for "closed" and then exception handler has a message for "failed to close".
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed. |
||
| closePipeline(id); | ||
| deletePipeline(id); | ||
sodonnel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| } catch (IOException e) { | ||
| LOG.error("Closing the stale pipeline failed: {}", p, e); | ||
| } | ||
|
|
@@ -568,26 +582,34 @@ public void scrubPipelines() throws IOException { | |
| ScmConfigKeys.OZONE_SCM_PIPELINE_ALLOCATED_TIMEOUT, | ||
| ScmConfigKeys.OZONE_SCM_PIPELINE_ALLOCATED_TIMEOUT_DEFAULT, | ||
| TimeUnit.MILLISECONDS); | ||
| long pipelineDeleteTimoutInMills = conf.getTimeDuration( | ||
| ScmConfigKeys.OZONE_SCM_PIPELINE_DESTROY_TIMEOUT, | ||
| ScmConfigKeys.OZONE_SCM_PIPELINE_DESTROY_TIMEOUT_DEFAULT, | ||
| TimeUnit.MILLISECONDS); | ||
|
|
||
| List<Pipeline> candidates = stateManager.getPipelines(); | ||
|
|
||
| for (Pipeline p : candidates) { | ||
| final PipelineID id = p.getId(); | ||
| // scrub pipelines who stay ALLOCATED for too long. | ||
| if (p.getPipelineState() == Pipeline.PipelineState.ALLOCATED && | ||
| (currentTime.toEpochMilli() - p.getCreationTimestamp() | ||
| .toEpochMilli() >= pipelineScrubTimeoutInMills)) { | ||
|
|
||
| LOG.info("Scrubbing pipeline: id: {} since it stays at ALLOCATED " + | ||
| "stage for {} mins.", p.getId(), | ||
| "stage for {} mins.", id, | ||
| Duration.between(currentTime, p.getCreationTimestamp()) | ||
| .toMinutes()); | ||
| closePipeline(p, false); | ||
| closePipeline(id); | ||
| deletePipeline(id); | ||
| } | ||
| // scrub pipelines who stay CLOSED for too long. | ||
| if (p.getPipelineState() == Pipeline.PipelineState.CLOSED) { | ||
| if (p.getPipelineState() == Pipeline.PipelineState.CLOSED && | ||
| (currentTime.toEpochMilli() - p.getStateEnterTime().toEpochMilli()) | ||
| >= pipelineDeleteTimoutInMills) { | ||
| LOG.info("Scrubbing pipeline: id: {} since it stays at CLOSED stage.", | ||
| p.getId()); | ||
| closeContainersForPipeline(p.getId()); | ||
| removePipeline(p); | ||
| deletePipeline(id); | ||
| } | ||
| // If a datanode is stopped and then SCM is restarted, a pipeline can get | ||
| // stuck in an open state. For Ratis, provided some other DNs that were | ||
|
|
@@ -599,8 +621,7 @@ public void scrubPipelines() throws IOException { | |
| if (isOpenWithUnregisteredNodes(p)) { | ||
| LOG.info("Scrubbing pipeline: id: {} as it has unregistered nodes", | ||
| p.getId()); | ||
| closeContainersForPipeline(p.getId()); | ||
| closePipeline(p, true); | ||
| closePipeline(id); | ||
| } | ||
| } | ||
| } | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to compare against the configured clock in
SCMContainerManagerWould it be possible to get the creation timestamp from the same clock here? Currently it should be fine but it would be more resistant to bugs due to refactoring in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, let me create a separate Jira for handling this.