
Conversation

@bharatviswa504
Contributor

What changes were proposed in this pull request?

  1. Start scrubbing only once we are out of safe mode.
  2. Remove the pipeline cleanup logic, since the scrubber already takes care of it.
  3. Call trigger pipeline creation once after the safe mode wait time elapses (a minimal sketch of this flow follows the list).
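
A minimal sketch of that ordering, assuming a stand-in PipelineManager interface (the method names scrubAllocatedPipelines and triggerPipelineCreation are assumptions for illustration, not necessarily the exact API in this patch):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /**
     * Sketch of the proposed ordering: once SCM leaves safe mode, wait for
     * the configured time and only then scrub stale pipelines and trigger
     * pipeline creation.
     */
    public class SafeModeExitSketch {

      /** Stand-in for the real PipelineManager; method names are assumptions. */
      interface PipelineManager {
        void setSafeModeStatus(boolean inSafeMode);
        void scrubAllocatedPipelines();
        void triggerPipelineCreation();
      }

      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      void onSafeModeExit(PipelineManager pipelineManager, long waitAfterExitMillis) {
        // Let managers know SCM is out of safe mode right away.
        pipelineManager.setSafeModeStatus(false);

        // Defer scrubbing and creation, giving datanodes time to report
        // pipelines that already exist.
        scheduler.schedule(() -> {
          pipelineManager.scrubAllocatedPipelines();
          pipelineManager.triggerPipelineCreation();
        }, waitAfterExitMillis, TimeUnit.MILLISECONDS);
      }
    }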

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-3072

How was this patch tested?

Existing tests should cover this. I will see if I can add an integration test.

@ChenSammi
Contributor

@bharatviswa504 thanks for reporting this and working on it.
Currently, by default, SCM exits safe mode once 10% of pipelines are ready, so the scrubber still has a chance to close pipelines unnecessarily. To mitigate this, we can raise the SCM safe mode exit threshold, say to 90% of pipelines ready before exiting safe mode, and have the scrubber wait an extra few minutes before it starts working, if we don't want to introduce a new pipeline state.

@bharatviswa504
Contributor Author

@bharatviswa504 thanks for reporting this and working on it.
Currently, by default, SCM exits safe mode once 10% of pipelines are ready, so the scrubber still has a chance to close pipelines unnecessarily. To mitigate this, we can raise the SCM safe mode exit threshold, say to 90% of pipelines ready before exiting safe mode, and have the scrubber wait an extra few minutes before it starts working, if we don't want to introduce a new pipeline state.

The HealthySafeMode rule's purpose is that when we come out of safe mode we have at least a few pipelines, so that writes will succeed. That is the reason for the 10% pipeline threshold.

The OneReplicaSafeMode rule's purpose is that when we come out of safe mode at least one datanode has reported from each pipeline, so that reads will succeed. The threshold here is 90%.

The container rule requires that all closed containers are reported; its threshold is 0.99.

And once we are out of safe mode, we wait for hdds.scm.wait.time.after.safemode.exit and then trigger pipeline creation. (Previously we used to clean up pipelines in the ALLOCATED state here; since the scrubber does the same thing, I removed that and call trigger pipeline creation instead.)
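
Putting the three thresholds just described together, an illustrative check might look like the following (the constant names are mine, not the actual SCM safe mode rule classes):

    /**
     * Illustrative combination of the thresholds described above.
     */
    final class SafeModeThresholdsSketch {
      // Writes succeed: at least a few pipelines ready.
      static final double HEALTHY_PIPELINE_PCT = 0.10;
      // Reads succeed: one datanode reported per pipeline.
      static final double ONE_REPLICA_PIPELINE_PCT = 0.90;
      // Closed containers reported.
      static final double CLOSED_CONTAINER_PCT = 0.99;

      static boolean canExitSafeMode(double healthyPipelineRatio,
          double oneReplicaReportedRatio, double closedContainerReportedRatio) {
        return healthyPipelineRatio >= HEALTHY_PIPELINE_PCT
            && oneReplicaReportedRatio >= ONE_REPLICA_PIPELINE_PCT
            && closedContainerReportedRatio >= CLOSED_CONTAINER_PCT;
      }
    }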

So, with all these rules, pipelines that are in a good state should mostly have been reported already. But if all containers are in the open state, there is still a chance.
Also, we can change the value to a larger one as suggested, but closing down pipelines and recreating them is fine, right? Do you see any issue here?

One thing we can do here is change the allocated timeout from 5 minutes to a larger value.
In a working system, we close a pipeline when a node becomes stale or dead, or when a DN sends a pipeline action; we then close that pipeline and recreate it if we can. So in this case we are fine even though I have changed the ozone.scm.pipeline.allocated.timeout value to a larger value. I see that the real use of this timeout is during startup.
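
For reference, an illustrative form of the scrub check being discussed, assuming the timeout is compared against the pipeline's creation time (stand-in types, not the real SCM classes):

    import java.time.Duration;
    import java.time.Instant;

    /**
     * Illustrative scrub check: a pipeline that has stayed in ALLOCATED
     * longer than ozone.scm.pipeline.allocated.timeout is closed so it can
     * be recreated.
     */
    final class AllocatedPipelineScrubSketch {

      enum PipelineState { ALLOCATED, OPEN, CLOSED }

      static boolean shouldScrub(PipelineState state, Instant creationTime,
          Duration allocatedTimeout, Instant now) {
        return state == PipelineState.ALLOCATED
            && Duration.between(creationTime, now).compareTo(allocatedTimeout) > 0;
      }
    }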

Not sure if I am missing something here.

Contributor


We can skip this step, as most likely it's still in safe mode at that time.

Contributor Author

@bharatviswa504 Mar 4, 2020


This is here so that all the managers and the protocol server that need the safeModeStatus get the value accordingly.

    isInSafeMode.set(safeModeStatus.getSafeModeStatus());
    scmClientProtocolServer.setSafeModeStatus(isInSafeMode.get());
    scmBlockManager.setSafeModeStatus(isInSafeMode.get());
    scmPipelineManager.setSafeModeStatus(isInSafeMode.get());

With the current code we don't need it, but it will help in the future if some manager has not read HDDS_SCM_SAFEMODE_ENABLED and is instead waiting for this call to set the initial safe mode status before taking specific actions.
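
A hedged example of that future case, with an illustrative manager class (not an existing Ozone class) that defers its work until the initial status is pushed:

    /**
     * Illustrative manager that does not read HDDS_SCM_SAFEMODE_ENABLED itself
     * and waits for the initial safe mode status to be pushed before acting.
     */
    final class LazySafeModeAwareManager {

      // null until SCMSafeModeManager pushes the initial status.
      private volatile Boolean inSafeMode;

      void setSafeModeStatus(boolean status) {
        this.inSafeMode = status;
        if (!status) {
          startBackgroundWork();
        }
      }

      private void startBackgroundWork() {
        // Actions that must only run once SCM is known to be out of safe mode.
      }
    }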

Contributor


Thanks @bharatviswa504 for the detailed explanation.
When we add triggerPipelineCreation() here, we'd better remove the pipelineManager.startPipelineCreator(); in SCMSafeModeManager#exitSafeMode; otherwise the wait time has no effect on the pipeline scrubber in PipelineManager.
Also, I would suggest moving scmPipelineManager.setSafeModeStatus(isInSafeMode.get()); into the safeModeExitThread run.

Contributor Author


pipelineManager.startPipelineCreator(); is there in exitSafeMode so that, once we are out of safe mode, a periodic task is scheduled to create pipelines.

triggerPipelineCreation() schedules a task only if no pipeline creator run is in progress when the call happens. Previously we used to destroy pipelines here; I replaced that with a call to trigger pipeline creation.

We also want to run the scrubber after safe mode exit, but I think we should be fine scrubbing pipelines after the additional safe mode wait time. Addressed this.
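
A rough sketch of that distinction, with illustrative bodies (not the real PipelineManager implementation):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    /**
     * startPipelineCreator sets up a periodic creation task, while
     * triggerPipelineCreation fires a single run only if one is not
     * already in progress.
     */
    final class PipelineCreatorSketch {

      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();
      private final AtomicBoolean creationInProgress = new AtomicBoolean(false);

      void startPipelineCreator(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(this::createPipelinesOnce,
            0, intervalMillis, TimeUnit.MILLISECONDS);
      }

      void triggerPipelineCreation() {
        scheduler.execute(this::createPipelinesOnce);
      }

      private void createPipelinesOnce() {
        if (!creationInProgress.compareAndSet(false, true)) {
          return;  // a creation run is already happening
        }
        try {
          // create pipelines until no more can be formed
        } finally {
          creationInProgress.set(false);
        }
      }
    }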

@arp7
Contributor

arp7 commented Mar 5, 2020

A bunch of freon unit tests failed. Could they be related to this patch?

@bharatviswa504
Contributor Author

bharatviswa504 commented Mar 5, 2020

A bunch of freon unit tests failed. Could they be related to this patch?

[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 3
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 43:16 min
[INFO] Finished at: 2020-03-04T20:33:13Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There was a timeout or other error in the fork -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

I have observed the same in a couple of other PRs. Not related to this patch.

@bharatviswa504
Contributor Author

@ChenSammi Addressed the review comments. This is a blocker for the 0.5 release, and it is the only blocker right now for 0.5. Please help with the review.

@ChenSammi
Contributor

+1. Thanks @bharatviswa504 for fixing the issue.

@ChenSammi ChenSammi merged commit 252f56d into apache:master Mar 6, 2020
asfgit pushed a commit that referenced this pull request Mar 6, 2020
@bharatviswa504
Contributor Author

Thank You @ChenSammi for the review.
