-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-3072. SCM scrub pipeline should be started after coming out of safe mode. #605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cdeec02 to
9fe0ae4
Compare
|
@bharatviswa504 thanks for reporting this and working on it. |
HealthySafeMode rule purpose is when we come out of safe mode, have at least few pipelines, so that writes will succeed. So, that is the reason for 10% of pipelines. OneReplicaSafeMode rule purpose is when we come out of safe mode have at least one data node reported from each pipeline, so that reads will succeed. We have 90% as threshold. Container Rule is all closed containers are reported, threshold is 0.99. And once after coming out of safe mode, we wait for hdds.scm.wait.time.after.safemode.exit and then trigger pipeline creation. (Before we used to cleanup pipelines in the allocated state, as scrubber is doing the same removed that and called trigger piipeline creation) So, with all these rules mostly pipelines if they are in a good state, they should have been reported. But if we have containers all in open state, then there is a chance still. I see we can do a thing here, have allocated time out change from 5 mts -> larger value. Not sure if I am missing something here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save this step as most likely it's still in safe mode at the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is there here, so that in all Managers and protocolServer where we need safeModeStatus will get the value accordingly.
isInSafeMode.set(safeModeStatus.getSafeModeStatus());
scmClientProtocolServer.setSafeModeStatus(isInSafeMode.get());
scmBlockManager.setSafeModeStatus(isInSafeMode.get());
scmPipelineManager.setSafeModeStatus(isInSafeMode.get());
With the current code, we don't need it but in the future, if some manager has not read the HDDS_SCM_SAFEMODE_ENABLED and waiting for this to set the initial safemode status and then take specific actions it will help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bharatviswa504 for the detail explanation.
When add triggerPipelineCreation() here, we'd better remove the pipelineManager.startPipelineCreator(); in SCMSafeModeManager#exitSafeMode, otherwise, the time wait actually doesn't has effect on pipeline scrubber in PipelineManager.
Also, I would suggest move the scmPipelineManager.setSafeModeStatus(isInSafeMode.get()); into the safeModeExitThread run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pipelineManager.startPipelineCreator(); is there in exitSafeMode so that once after we are out of safeMode schedule a periodic task to create pipelines.
triggerPipelineCreation will schedule a task if no pipeline creator is running when the call happened, as previously here we used to destroy pipelines, now replaced it with a call to trigger pipeline creation.
And also we want to run scrubber after safe mode exit, but I think we should be fine to scrub pipelines after an additional safeModewait time. Addressed this.
96c35a6 to
a247b0e
Compare
|
A bunch of freon unit tests failed. Could they be related to this patch? |
[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 3 I have observed the same in a couple of other PR's. Not related to this patch. |
|
@ChenSammi Addressed review comments. It is a blocker for 0.5 release, and this is the only blocker right now for 0.5. Pls help in review. |
|
+1. Thanks @bharatviswa504 for fix the issue. |
|
Thank You @ChenSammi for the review. |
What changes were proposed in this pull request?
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3072
How was this patch tested?
Existing tests should cover this. I will see if I can add integration test.