-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-2034. Async RATIS pipeline creation and destroy through heartbeat commands #29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lokeshj1703
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenSammi Thanks for updating the PR! I have few minor comments.
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/SCMPipelineManager.java
Outdated
Show resolved
Hide resolved
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/SCMPipelineManager.java
Outdated
Show resolved
Hide resolved
...e/hadoop/ozone/container/common/statemachine/commandhandler/ClosePipelineCommandHandler.java
Outdated
Show resolved
Hide resolved
...er-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/OneReplicaPipelineSafeModeRule.java
Outdated
Show resolved
Hide resolved
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/SCMPipelineManager.java
Outdated
Show resolved
Hide resolved
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/SCMPipelineManager.java
Outdated
Show resolved
Hide resolved
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
lokeshj1703
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenSammi Thanks for updating the PR! The changes look good to me. I have few minor comments. Can you please verify if the test failures are related.
...er-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/OneReplicaPipelineSafeModeRule.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/block/TestBlockManager.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/block/TestBlockManager.java
Outdated
Show resolved
Hide resolved
...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java
Outdated
Show resolved
Hide resolved
|
Fix failed TestSCMPipelineManager.java. Other failed UT and integration test are either not relevant or passed in local. |
|
@ChenSammi Thanks for updating the PR! There is a checkstyle issue in the latest CI run and TestStorageContainerManager fails consistently with the patch. I also posted some comments here #29 (review). Can you please check those? |
|
New update per the review comments. Remove RATIS ONE factor pipeline from HealthyPipelineSafeModeRule is a very fundamental change, so a lot of UT is updated. |
|
/retest since the integration acceptance result is missing. |
|
I rerun the failed TestScmSafeMode multiple times locally. All are passed. |
|
/retest |
|
@ChenSammi Thanks for updating the PR! The changes looks good to me. I have just one comment on the changes in TestBlockManager. +1 o.w. |
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/SCMPipelineManager.java
Outdated
Show resolved
Hide resolved
|
@ChenSammi Thanks for updating the PR! Can you also please address the comment in #29 (comment)? |
|
Hi @lokeshj1703, do you know why to run acceptance test locally? It seems they are all running bash script. How can you know the failures are block allocation related? |
|
@ChenSammi In the Acceptance test results, for the failed tests you can click on the Log tab. |
lokeshj1703
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenSammi Thanks for updating the PR! The changes look good to me. The acceptance test failure passes in my local machine. I think there are a few conflicting changes as HDDS-1868 was committed. I have few minor comments on testlib.sh and there is a checkstyle failure. Other than that the PR looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be 30 as per comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just follow the wait_for_datanodes() function in this testlib.sh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to increment SECONDS variable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do it in a follow up JIRA. There are other places need to change.
… heartbeat commands.
|
@lokeshj1703 , thanks for review the code. I'm rebasing the code based on master. Since HDDS-1868 is committed, there are a lot conflictes. A lot previously succeed integration tests are failed due the master code change. if there is no big issues, I hope we can commit it to master as soon as possible. Minor issues can be handled in follow up JIRAs. Thanks again for your time. |
|
Checked 3 failed integration tests.
|
|
Hi @elek , the ci/acceptance failed with all test passed. Do you know what exactly the failure is? |
|
Thanks @ChenSammi for the update. The latest change LGTM, +1. I will merge it shortly to master. Opened HDDS-2491 and HDDS-2492 for the test issues that are not related to this change. |
|
Thanks @anuengineer @xiaoyuyao @lokeshj1703 for review the patch. |
|
This patch introduced a new unit test failure (TestDeadNodeHandler) (#202 shows that the problem disappears with reverting this commit) |
|
It also introduced acceptance test failure: This line has an extra Fix: - status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command'"`
+ status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command"` |
|
I will look into it. It's probably the same reason as HDDS-2491
<https://issues.apache.org/jira/browse/HDDS-2491>
…On Sun, Nov 17, 2019 at 12:38 AM Elek, Márton ***@***.***> wrote:
This patch introduced a new unit test failure (TestDeadNodeHandler) (#202
<#202> shows that the problem
disappears with reverting this commit)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#29>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEW7SPSBRTXIZB2BEMSDZZLQUAOZDANCNFSM4JAY6RSA>
.
|
|
Hi, The intention of this rule was when we come out of safeMode, we have at least few pipelines with type Ratis and 3. But with this patch, that logic is changed. HealthyPipelineSafeModeRule.java If we consider both 3, and 1 node pipelines, then also we shall be in problem. The scenario is in a cluster it has 10 3 node pipelines in the cluster, and now if we consider both 3 and 1 to increment currentHealthyPipelineCount, there can be a case, when from each pipeline one Datanode is reported, and we increment the currentHealthyPipelineCount, and we might reach the threshold. But after coming out of safeMode there will be no 3 pipeline nodes in the cluster and writes will fail. |
|
@bharatviswa504 the issue mentioned above has been reported and worked on with HDDS-2497. |
|
Just copy the comments from @elek and @swagle on HDDS-2034 in case you miss it @ChenSammi . Note, HDDS-2497 I opened a few days back are also observed by @swagle and @bharatviswa504 that could relate to the acceptance test failures. elekMarton Elek added a comment - 8 hours ago
Siddharth Wagle added a comment - 1 hour ago - edited
|
|
Hi @swagle Example Scenario: |
|
I think we only need to count the RATIS-3 pipeline in the HealthyPipelineSafeMode rule when ozone.replication is configured to be 3. RATIS-1 pipeline does not be counted in HealthyPipelineSafeMode. This way, we are consistent with existing behavior, assuming one single node pipeline per node reported implicitly. |
|
Yes @xiaoyuyao |
NOTICE
Previous Hadoop trunk PR link:
apache/hadoop#1650