
Conversation

@ChenSammi
Contributor

NOTICE

Previous Hadoop trunk PR link:
apache/hadoop#1650

@nandakumar131 changed the title from "Hdds 2034" to "HDDS-2034. Async RATIS pipeline creation and destroy through heartbeat commands" Oct 15, 2019
Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! I have a few minor comments.

Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! The changes look good to me. I have a few minor comments. Can you please verify whether the test failures are related?

@ChenSammi
Contributor Author

ChenSammi commented Oct 25, 2019

Fixed the failing TestSCMPipelineManager.java. The other failed unit tests and integration tests are either not relevant or pass locally.

@xiaoyuyao @lokeshj1703

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! There is a checkstyle issue in the latest CI run and TestStorageContainerManager fails consistently with the patch. I also posted some comments here #29 (review). Can you please check those?

@ChenSammi
Contributor Author

New update per the review comments. Removing the RATIS factor ONE pipeline from HealthyPipelineSafeModeRule is a very fundamental change, so a lot of unit tests were updated.

@xiaoyuyao
Contributor

/retest since the integration acceptance result is missing.

@ChenSammi
Contributor Author

I reran the failed TestScmSafeMode multiple times locally. All passed.

@xiaoyuyao and @lokeshj1703

@lokeshj1703
Contributor

/retest

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! The changes look good to me. I have just one comment on the changes in TestBlockManager. +1 otherwise.
The acceptance test results were not showing up. I have retriggered them.

@ChenSammi
Contributor Author

@lokeshj1703 and @xiaoyuyao

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! Can you also please address the comment in #29 (comment)?
Also, there are a few block allocation failures in the acceptance tests. Can you please check whether those are related?

@ChenSammi
Contributor Author

Hi @lokeshj1703, do you know how to run the acceptance tests locally? It seems they all run bash scripts. How can you tell that the failures are related to block allocation?

@lokeshj1703
Contributor

@ChenSammi In the acceptance test results, for the failed tests you can click on the Log tab.
After building Ozone you can cd to the directory hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/. It contains all the test suites as subdirectories. You can execute the tests in a subdirectory using its test.sh file.
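
For example, a minimal local run could look like the sketch below (the build command and the "ozone" suite name are assumptions for illustration; the compose path is the one mentioned above):

    # Build the Ozone distribution first (typical Maven build, tests skipped for speed).
    mvn clean install -DskipTests

    # Each subdirectory under compose/ is an acceptance test suite.
    cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/

    # Run one suite by executing its test.sh; "ozone" is a hypothetical suite
    # name here -- pick any subdirectory that exists in your build.
    cd ozone
    ./test.sh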

Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! The changes look good to me. The acceptance test that failed passes on my local machine. I think there are a few conflicting changes since HDDS-1868 was committed. I have a few minor comments on testlib.sh and there is a checkstyle failure. Other than that the PR looks good to me.

Contributor

Should this be 30 as per comments?

Contributor Author

@ChenSammi Nov 13, 2019

I just followed the wait_for_datanodes() function in this testlib.sh.

Contributor

We need to increment the SECONDS variable here.

Contributor Author

Will do it in a follow-up JIRA. There are other places that need to change.
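
For context, a minimal sketch of the kind of bounded wait loop being discussed (a hypothetical helper, not the actual testlib.sh code): the point is that the counter has to be incremented inside the loop, otherwise the wait never times out.

    # Hypothetical bounded-wait helper for illustration only.
    wait_for_condition() {
      local max_seconds=$1
      local check_command=$2
      local waited=0

      while [[ $waited -lt $max_seconds ]]; do
        if eval "$check_command"; then
          return 0                   # condition met
        fi
        sleep 2
        waited=$((waited + 2))       # without this increment the loop never ends
      done
      return 1                       # timed out
    }

    # Example: wait up to 30 seconds for a marker file to appear.
    wait_for_condition 30 '[[ -f /tmp/ready ]]'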

@ChenSammi
Contributor Author

ChenSammi commented Nov 13, 2019

@lokeshj1703, thanks for reviewing the code. I'm rebasing the code on master. Since HDDS-1868 is committed, there are a lot of conflicts. A lot of previously passing integration tests now fail due to the master code changes. If there are no big issues, I hope we can commit this to master as soon as possible; minor issues can be handled in follow-up JIRAs. Thanks again for your time.

@xiaoyuyao

@ChenSammi
Contributor Author

ChenSammi commented Nov 14, 2019

Checked the 3 failed integration tests.

  1. TestSCMSafeModeWithPipelineRules
    Not relevant.
    The root cause is a failure to exit safe mode. The current pipeline open condition (HDDS-1868) is that 3 datanode reports have been received and one datanode has marked itself as leader. In this failure case the leader election succeeds, but XceiverServerRatis#handleLeaderChangedNotification is not called within the next 3 minutes, so cluster.waitForClusterToBeReady() times out.
    The question is: is this leader change notification reliable? What is the expected latency between the leader election succeeding and the notification being sent?

  2. TestSCMPipelineManager
    Not relevant.
    The failure is caused by the newly introduced TestSCMPipelineManager#testPipelineOpenOnlyWhenLeaderReported, which doesn't close pipelineManager at the end. It's better to fix it in a new JIRA.

  3. TestOzoneManagerHA
    Not relevant.

@xiaoyuyao @lokeshj1703

@ChenSammi
Contributor Author

Hi @elek, the ci/acceptance check failed even though all tests passed. Do you know what exactly the failure is?

@xiaoyuyao
Contributor

Thanks @ChenSammi for the update. The latest change LGTM, +1. I will merge it to master shortly.

Opened HDDS-2491 and HDDS-2492 for the test issues that are not related to this change.

@xiaoyuyao merged commit 89d11ad into apache:master Nov 15, 2019
@ChenSammi
Contributor Author

Thanks @anuengineer @xiaoyuyao @lokeshj1703 for reviewing the patch.

elek added a commit to elek/ozone that referenced this pull request Nov 16, 2019
@elek
Member

elek commented Nov 16, 2019

This patch introduced a new unit test failure (TestDeadNodeHandler); #202 shows that the problem disappears when this commit is reverted.

@adoroszlai
Contributor

adoroszlai commented Nov 16, 2019

It also introduced an acceptance test failure:

https://github.com/elek/ozone-ci-03/blob/56c7473f77185592af9a0dc53a636ab1cdb47433/pr/pr-hdds-2291-fnk79/acceptance/output.log#L1286-L1288

This line has an extra ' near the end:

https://github.com/apache/hadoop-ozone/blob/1b72718dcab7f83ebdac67b6242c729f03a8f103/hadoop-ozone/dist/src/main/compose/testlib.sh#L97

Fix:

-         status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command'"`
+         status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command"`
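
For illustration, a minimal reproduction of why the stray quote breaks the command (hypothetical commands, not taken from testlib.sh): the trailing single quote becomes part of the string passed to bash -c, leaving an unterminated quote inside the child shell.

    bash -c "echo hello'"   # fails with: unexpected EOF while looking for matching quote
    bash -c "echo hello"    # prints: hello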

elek added a commit to elek/ozone that referenced this pull request Nov 17, 2019
@ChenSammi
Contributor Author

ChenSammi commented Nov 19, 2019 via email

adoroszlai referenced this pull request in adoroszlai/ozone Nov 19, 2019
elek added a commit that referenced this pull request Nov 19, 2019
@bharatviswa504
Contributor

bharatviswa504 commented Nov 19, 2019

Hi,
I have one comment on the changes in HealthyPipelineSafeModeRule.
In HealthyPipelineSafeModeRule, the thresholdCount on a freshly installed cluster with one datanode will now be 1. So, with a single-node cluster, we will never come out of safe mode, as during process() we only check for 3-node pipelines.

The intention of this rule was that when we come out of safe mode, we have at least a few pipelines of type RATIS with factor THREE. But with this patch, that logic is changed.

HealthyPipelineSafeModeRule.java

L79:    
int pipelineCount = pipelineManager.getPipelines(
        HddsProtos.ReplicationType.RATIS, HddsProtos.ReplicationFactor.THREE,
        Pipeline.PipelineState.OPEN).size() +
        pipelineManager.getPipelines(HddsProtos.ReplicationType.RATIS,
            HddsProtos.ReplicationFactor.THREE,
            Pipeline.PipelineState.ALLOCATED).size();

    // This value will be zero when pipeline count is 0.
    // On a fresh installed cluster, there will be zero pipelines in the SCM
    // pipeline DB.
    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));

L125:
    if (pipeline.getType() == HddsProtos.ReplicationType.RATIS &&
        pipeline.getFactor() == HddsProtos.ReplicationFactor.THREE) {
      getSafeModeMetrics().incCurrentHealthyPipelinesCount();
      currentHealthyPipelineCount++;
    }

Even if we consider both 3-node and 1-node pipelines, we would still have a problem. Suppose a cluster has ten 3-node pipelines. If both factor THREE and factor ONE pipelines increment currentHealthyPipelineCount, there can be a case where only one datanode from each pipeline has reported, yet currentHealthyPipelineCount still reaches the threshold. After coming out of safe mode there would then be no open 3-node pipelines in the cluster, and writes would fail.

@xiaoyuyao
Contributor

@bharatviswa504 the issue mentioned above has already been reported and is being worked on in HDDS-2497.

@xiaoyuyao
Contributor

Just copying the comments from @elek and @swagle on HDDS-2034 here in case you missed them, @ChenSammi. Note that HDDS-2497, which I opened a few days back, tracks an issue also observed by @swagle and @bharatviswa504 that could relate to the acceptance test failures.

Marton Elek added a comment - 8 hours ago

Sorry, I need to revert this commit. We fixed two problems introduced by this patch with temporary fixes (TestDeadNodeHandler was failing, and the acceptance testlib.sh contained a syntax error), but even after those fixes the acceptance tests hang indefinitely. (We tried it with and without the patch multiple times.)

It might be caused by some integration problem rather than by the patch itself, but as everything works well without the patch I will revert it to get back to a green baseline. Let's reopen the pull request (or upload a new one) to get a full build before the re-merge.

Siddharth Wagle added a comment - 1 hour ago - edited

Hi Sammi Chen, I believe HDDS-2497 is related to this change, since we enabled the pipeline check for safe mode exit. For 1-node RATIS, HealthyPipelineSafeModeRule.process does not check for the 1-node pipeline. It should be an easy fix to go along with this patch; assigning that Jira to you. Thanks.

@bharatviswa504
Contributor

Hi @swagle,
Considering 1-node pipelines in the HealthyPipelineSafeModeRule will cause the following issue.

Example scenario:
Even if we consider both 3-node and 1-node pipelines, we would still have a problem. Suppose a cluster has ten 3-node pipelines. If both factor THREE and factor ONE pipelines increment currentHealthyPipelineCount, there can be a case where only one datanode from each pipeline has reported, yet currentHealthyPipelineCount still reaches the threshold. After coming out of safe mode there would then be no open 3-node pipelines in the cluster, and writes would fail.

@xiaoyuyao
Contributor

I think we only need to count RATIS-3 pipelines in the HealthyPipelineSafeModeRule when ozone.replication is configured to be 3. RATIS-1 pipelines should not be counted in HealthyPipelineSafeModeRule. This way, we are consistent with the existing behavior, implicitly assuming one single-node pipeline per reported node.

@bharatviswa504
Contributor

bharatviswa504 commented Nov 19, 2019

Yes @xiaoyuyao,
Thanks for the offline discussion. The patch does not consider 1-node RATIS pipelines when computing pipelineCount. My bad in reading the code.

elek added a commit that referenced this pull request Nov 20, 2019
…through heartbeat commands (#29)""

This reverts commit dcfe5f3.
@ChenSammi deleted the HDDS-2034 branch February 20, 2023 03:35