
Conversation

@bharatviswa504
Contributor

What changes were proposed in this pull request?

  1. Start scrubbing only once we are out of safe mode.
  2. Remove the pipeline cleanup logic, since the scrubber already takes care of it.
  3. Call trigger pipeline creation once after the safe mode wait time elapses (a minimal sketch of this flow follows the list).
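
A minimal sketch of that ordering, assuming a stand-in PipelineManager interface (the method names scrubAllocatedPipelines and triggerPipelineCreation are assumptions for illustration, not necessarily the exact API in this patch):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /**
     * Sketch of the proposed ordering: once SCM leaves safe mode, wait for
     * the configured time and only then scrub stale pipelines and trigger
     * pipeline creation.
     */
    public class SafeModeExitSketch {

      /** Stand-in for the real PipelineManager; method names are assumptions. */
      interface PipelineManager {
        void setSafeModeStatus(boolean inSafeMode);
        void scrubAllocatedPipelines();
        void triggerPipelineCreation();
      }

      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      void onSafeModeExit(PipelineManager pipelineManager, long waitAfterExitMillis) {
        // Let managers know SCM is out of safe mode right away.
        pipelineManager.setSafeModeStatus(false);

        // Defer scrubbing and creation, giving datanodes time to report
        // pipelines that already exist.
        scheduler.schedule(() -> {
          pipelineManager.scrubAllocatedPipelines();
          pipelineManager.triggerPipelineCreation();
        }, waitAfterExitMillis, TimeUnit.MILLISECONDS);
      }
    }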

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-3072

How was this patch tested?

Existing tests should cover this. I will see if I can add an integration test.

@ChenSammi
Contributor

@bharatviswa504 thanks for reporting this and working on it.
Currently, by default, SCM exits safe mode once 10% of pipelines are ready, so the scrubber still has a chance to close pipelines unnecessarily. To mitigate this, we can raise the SCM safe mode exit threshold, say to 90% of pipelines ready before exiting safe mode, and have the scrubber wait an extra few minutes before it starts working, if we don't want to introduce a new pipeline state.

@bharatviswa504
Contributor Author

@bharatviswa504 thanks for reporting this and working on it.
Currently, by default, SCM exits safe mode once 10% of pipelines are ready, so the scrubber still has a chance to close pipelines unnecessarily. To mitigate this, we can raise the SCM safe mode exit threshold, say to 90% of pipelines ready before exiting safe mode, and have the scrubber wait an extra few minutes before it starts working, if we don't want to introduce a new pipeline state.

The HealthySafeMode rule's purpose is that when we come out of safe mode we have at least a few pipelines, so that writes will succeed. That is the reason for the 10% pipeline threshold.

The OneReplicaSafeMode rule's purpose is that when we come out of safe mode at least one datanode has reported from each pipeline, so that reads will succeed. The threshold here is 90%.

The container rule requires that all closed containers are reported; its threshold is 0.99.

And once we are out of safe mode, we wait for hdds.scm.wait.time.after.safemode.exit and then trigger pipeline creation. (Previously we used to clean up pipelines in the ALLOCATED state here; since the scrubber does the same thing, I removed that and call trigger pipeline creation instead.)
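
Putting the three thresholds just described together, an illustrative check might look like the following (the constant names are mine, not the actual SCM safe mode rule classes):

    /**
     * Illustrative combination of the thresholds described above.
     */
    final class SafeModeThresholdsSketch {
      // Writes succeed: at least a few pipelines ready.
      static final double HEALTHY_PIPELINE_PCT = 0.10;
      // Reads succeed: one datanode reported per pipeline.
      static final double ONE_REPLICA_PIPELINE_PCT = 0.90;
      // Closed containers reported.
      static final double CLOSED_CONTAINER_PCT = 0.99;

      static boolean canExitSafeMode(double healthyPipelineRatio,
          double oneReplicaReportedRatio, double closedContainerReportedRatio) {
        return healthyPipelineRatio >= HEALTHY_PIPELINE_PCT
            && oneReplicaReportedRatio >= ONE_REPLICA_PIPELINE_PCT
            && closedContainerReportedRatio >= CLOSED_CONTAINER_PCT;
      }
    }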

So, with all these rules, pipelines that are in a good state should mostly have been reported already. But if all containers are in the open state, there is still a chance.
Also, we can change the value to a larger one as suggested, but closing down pipelines and recreating them is fine, right? Do you see any issue here?

One thing we can do here is change the allocated timeout from 5 minutes to a larger value.
In a working system, we close a pipeline when a node becomes stale or dead, or when a DN sends a pipeline action; we then close that pipeline and recreate it if we can. So in this case we are fine even though I have changed the ozone.scm.pipeline.allocated.timeout value to a larger value. I see that the real use of this timeout is during startup.
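
For reference, an illustrative form of the scrub check being discussed, assuming the timeout is compared against the pipeline's creation time (stand-in types, not the real SCM classes):

    import java.time.Duration;
    import java.time.Instant;

    /**
     * Illustrative scrub check: a pipeline that has stayed in ALLOCATED
     * longer than ozone.scm.pipeline.allocated.timeout is closed so it can
     * be recreated.
     */
    final class AllocatedPipelineScrubSketch {

      enum PipelineState { ALLOCATED, OPEN, CLOSED }

      static boolean shouldScrub(PipelineState state, Instant creationTime,
          Duration allocatedTimeout, Instant now) {
        return state == PipelineState.ALLOCATED
            && Duration.between(creationTime, now).compareTo(allocatedTimeout) > 0;
      }
    }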

Not sure if I am missing something here.

Contributor


We can skip this step, as most likely it's still in safe mode at that time.

Contributor Author

@bharatviswa504 Mar 4, 2020


This is here so that all the managers and the protocol server that need the safeModeStatus get the value accordingly.

    isInSafeMode.set(safeModeStatus.getSafeModeStatus());
    scmClientProtocolServer.setSafeModeStatus(isInSafeMode.get());
    scmBlockManager.setSafeModeStatus(isInSafeMode.get());
    scmPipelineManager.setSafeModeStatus(isInSafeMode.get());

With the current code we don't need it, but it will help in the future if some manager has not read HDDS_SCM_SAFEMODE_ENABLED and is instead waiting for this call to set the initial safe mode status before taking specific actions.
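
A hedged example of that future case, with an illustrative manager class (not an existing Ozone class) that defers its work until the initial status is pushed:

    /**
     * Illustrative manager that does not read HDDS_SCM_SAFEMODE_ENABLED itself
     * and waits for the initial safe mode status to be pushed before acting.
     */
    final class LazySafeModeAwareManager {

      // null until SCMSafeModeManager pushes the initial status.
      private volatile Boolean inSafeMode;

      void setSafeModeStatus(boolean status) {
        this.inSafeMode = status;
        if (!status) {
          startBackgroundWork();
        }
      }

      private void startBackgroundWork() {
        // Actions that must only run once SCM is known to be out of safe mode.
      }
    }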

Contributor


Thanks @bharatviswa504 for the detailed explanation.
When we add triggerPipelineCreation() here, we'd better remove the pipelineManager.startPipelineCreator(); in SCMSafeModeManager#exitSafeMode; otherwise the wait time has no effect on the pipeline scrubber in PipelineManager.
Also, I would suggest moving scmPipelineManager.setSafeModeStatus(isInSafeMode.get()); into the safeModeExitThread run.

Contributor Author


pipelineManager.startPipelineCreator(); is there in exitSafeMode so that, once we are out of safe mode, a periodic task is scheduled to create pipelines.

triggerPipelineCreation() schedules a task only if no pipeline creator run is in progress when the call happens. Previously we used to destroy pipelines here; I replaced that with a call to trigger pipeline creation.

We also want to run the scrubber after safe mode exit, but I think we should be fine scrubbing pipelines after the additional safe mode wait time. Addressed this.
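
A rough sketch of that distinction, with illustrative bodies (not the real PipelineManager implementation):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    /**
     * startPipelineCreator sets up a periodic creation task, while
     * triggerPipelineCreation fires a single run only if one is not
     * already in progress.
     */
    final class PipelineCreatorSketch {

      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();
      private final AtomicBoolean creationInProgress = new AtomicBoolean(false);

      void startPipelineCreator(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(this::createPipelinesOnce,
            0, intervalMillis, TimeUnit.MILLISECONDS);
      }

      void triggerPipelineCreation() {
        scheduler.execute(this::createPipelinesOnce);
      }

      private void createPipelinesOnce() {
        if (!creationInProgress.compareAndSet(false, true)) {
          return;  // a creation run is already happening
        }
        try {
          // create pipelines until no more can be formed
        } finally {
          creationInProgress.set(false);
        }
      }
    }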

@arp7
Contributor

arp7 commented Mar 5, 2020

A bunch of freon unit tests failed. Could they be related to this patch?

@bharatviswa504
Contributor Author

bharatviswa504 commented Mar 5, 2020

A bunch of freon unit tests failed. Could they be related to this patch?

[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 3
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 43:16 min
[INFO] Finished at: 2020-03-04T20:33:13Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There was a timeout or other error in the fork -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

I have observed the same in a couple of other PRs. Not related to this patch.

@bharatviswa504
Contributor Author

@ChenSammi Addressed the review comments. This is a blocker for the 0.5 release, and it is the only blocker right now for 0.5. Please help with the review.

@ChenSammi
Contributor

+1. Thanks @bharatviswa504 for fixing the issue.

@ChenSammi ChenSammi merged commit 252f56d into apache:master Mar 6, 2020
asfgit pushed a commit that referenced this pull request Mar 6, 2020
@bharatviswa504
Contributor Author

Thank You @ChenSammi for the review.
