
Conversation

@ChenSammi
Contributor

NOTICE

Previous Hadoop trunk PR link:
apache/hadoop#1650

@nandakumar131 changed the title from "Hdds 2034" to "HDDS-2034. Async RATIS pipeline creation and destroy through heartbeat commands" Oct 15, 2019
Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! I have a few minor comments.

Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! The changes look good to me. I have a few minor comments. Can you please verify whether the test failures are related?

@ChenSammi
Contributor Author

ChenSammi commented Oct 25, 2019

Fixed the failing TestSCMPipelineManager.java. The other failed unit tests and integration tests are either not relevant or pass locally.

@xiaoyuyao @lokeshj1703

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! There is a checkstyle issue in the latest CI run and TestStorageContainerManager fails consistently with the patch. I also posted some comments here #29 (review). Can you please check those?

@ChenSammi
Contributor Author

New update per the review comments. Removing the RATIS factor ONE pipeline from HealthyPipelineSafeModeRule is a very fundamental change, so a lot of unit tests were updated.

@xiaoyuyao
Contributor

/retest since the integration acceptance result is missing.

@ChenSammi
Contributor Author

I reran the failed TestScmSafeMode multiple times locally. All passed.

@xiaoyuyao and @lokeshj1703

@lokeshj1703
Contributor

/retest

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! The changes look good to me. I have just one comment on the changes in TestBlockManager. +1 otherwise.
The acceptance test results were not showing up. I have retriggered them.

@ChenSammi
Contributor Author

@lokeshj1703 and @xiaoyuyao

@lokeshj1703
Contributor

@ChenSammi Thanks for updating the PR! Can you also please address the comment in #29 (comment)?
Also, there are a few block allocation failures in the acceptance tests. Can you please check whether those are related?

@ChenSammi
Contributor Author

Hi @lokeshj1703, do you know how to run the acceptance tests locally? It seems they all run bash scripts. How can you tell that the failures are related to block allocation?

@lokeshj1703
Contributor

@ChenSammi In the acceptance test results, for the failed tests you can click on the Log tab.
After building Ozone you can cd to the directory hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/. It contains all the test suites as subdirectories. You can execute the tests in a subdirectory using its test.sh file.
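
For example, a minimal local run could look like the sketch below (the build command and the "ozone" suite name are assumptions for illustration; the compose path is the one mentioned above):

    # Build the Ozone distribution first (typical Maven build, tests skipped for speed).
    mvn clean install -DskipTests

    # Each subdirectory under compose/ is an acceptance test suite.
    cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/

    # Run one suite by executing its test.sh; "ozone" is a hypothetical suite
    # name here -- pick any subdirectory that exists in your build.
    cd ozone
    ./test.sh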

Contributor

@lokeshj1703 left a comment

@ChenSammi Thanks for updating the PR! The changes look good to me. The acceptance test that failed passes on my local machine. I think there are a few conflicting changes since HDDS-1868 was committed. I have a few minor comments on testlib.sh and there is a checkstyle failure. Other than that the PR looks good to me.

Contributor

Should this be 30 as per comments?

Contributor Author

@ChenSammi Nov 13, 2019

I just followed the wait_for_datanodes() function in this testlib.sh.

Contributor

We need to increment the SECONDS variable here.

Contributor Author

Will do it in a follow-up JIRA. There are other places that need to change.
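
For context, a minimal sketch of the kind of bounded wait loop being discussed (a hypothetical helper, not the actual testlib.sh code): the point is that the counter has to be incremented inside the loop, otherwise the wait never times out.

    # Hypothetical bounded-wait helper for illustration only.
    wait_for_condition() {
      local max_seconds=$1
      local check_command=$2
      local waited=0

      while [[ $waited -lt $max_seconds ]]; do
        if eval "$check_command"; then
          return 0                   # condition met
        fi
        sleep 2
        waited=$((waited + 2))       # without this increment the loop never ends
      done
      return 1                       # timed out
    }

    # Example: wait up to 30 seconds for a marker file to appear.
    wait_for_condition 30 '[[ -f /tmp/ready ]]'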

@ChenSammi
Contributor Author

ChenSammi commented Nov 13, 2019

@lokeshj1703, thanks for reviewing the code. I'm rebasing the code on master. Since HDDS-1868 is committed, there are a lot of conflicts. A lot of previously passing integration tests now fail due to the master code changes. If there are no big issues, I hope we can commit this to master as soon as possible; minor issues can be handled in follow-up JIRAs. Thanks again for your time.

@xiaoyuyao

@ChenSammi
Contributor Author

ChenSammi commented Nov 14, 2019

Checked the 3 failed integration tests.

  1. TestSCMSafeModeWithPipelineRules
    Not relevant.
    The root cause is a failure to exit safe mode. The current pipeline open condition (HDDS-1868) is that 3 datanode reports have been received and one datanode has marked itself as leader. In this failure case the leader election succeeds, but XceiverServerRatis#handleLeaderChangedNotification is not called within the next 3 minutes, so cluster.waitForClusterToBeReady() times out.
    The question is: is this leader change notification reliable? What is the expected latency between the leader election succeeding and the notification being sent?

  2. TestSCMPipelineManager
    Not relevant.
    The failure is caused by the newly introduced TestSCMPipelineManager#testPipelineOpenOnlyWhenLeaderReported, which doesn't close pipelineManager at the end. It's better to fix it in a new JIRA.

  3. TestOzoneManagerHA
    Not relevant.

@xiaoyuyao @lokeshj1703

@ChenSammi
Contributor Author

Hi @elek, the ci/acceptance check failed even though all tests passed. Do you know what exactly the failure is?

@xiaoyuyao
Contributor

Thanks @ChenSammi for the update. The latest change LGTM, +1. I will merge it to master shortly.

Opened HDDS-2491 and HDDS-2492 for the test issues that are not related to this change.

@xiaoyuyao merged commit 89d11ad into apache:master Nov 15, 2019
@ChenSammi
Contributor Author

Thanks @anuengineer @xiaoyuyao @lokeshj1703 for reviewing the patch.

elek added a commit to elek/ozone that referenced this pull request Nov 16, 2019
@elek
Member

elek commented Nov 16, 2019

This patch introduced a new unit test failure (TestDeadNodeHandler); #202 shows that the problem disappears when this commit is reverted.

@adoroszlai
Contributor

adoroszlai commented Nov 16, 2019

It also introduced an acceptance test failure:

https://github.com/elek/ozone-ci-03/blob/56c7473f77185592af9a0dc53a636ab1cdb47433/pr/pr-hdds-2291-fnk79/acceptance/output.log#L1286-L1288

This line has an extra ' near the end:

https://github.com/apache/hadoop-ozone/blob/1b72718dcab7f83ebdac67b6242c729f03a8f103/hadoop-ozone/dist/src/main/compose/testlib.sh#L97

Fix:

-         status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command'"`
+         status=`docker-compose -f "${compose_file}" exec -T scm bash -c "kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab && $command"`
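
For illustration, a minimal reproduction of why the stray quote breaks the command (hypothetical commands, not taken from testlib.sh): the trailing single quote becomes part of the string passed to bash -c, leaving an unterminated quote inside the child shell.

    bash -c "echo hello'"   # fails with: unexpected EOF while looking for matching quote
    bash -c "echo hello"    # prints: hello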

elek added a commit to elek/ozone that referenced this pull request Nov 17, 2019
@ChenSammi
Contributor Author

ChenSammi commented Nov 19, 2019 via email

adoroszlai referenced this pull request in adoroszlai/ozone Nov 19, 2019
elek added a commit that referenced this pull request Nov 19, 2019
@bharatviswa504
Contributor

bharatviswa504 commented Nov 19, 2019

Hi,
I have one comment on the changes in HealthyPipelineSafeModeRule.
In HealthyPipelineSafeModeRule, the thresholdCount on a freshly installed cluster with one datanode will now be 1. So, with a single-node cluster, we will never come out of safe mode, as during process() we only check for 3-node pipelines.

The intention of this rule was that when we come out of safe mode, we have at least a few pipelines of type RATIS with factor THREE. But with this patch, that logic is changed.

HealthyPipelineSafeModeRule.java

L79:    
int pipelineCount = pipelineManager.getPipelines(
        HddsProtos.ReplicationType.RATIS, HddsProtos.ReplicationFactor.THREE,
        Pipeline.PipelineState.OPEN).size() +
        pipelineManager.getPipelines(HddsProtos.ReplicationType.RATIS,
            HddsProtos.ReplicationFactor.THREE,
            Pipeline.PipelineState.ALLOCATED).size();

    // This value will be zero when pipeline count is 0.
    // On a fresh installed cluster, there will be zero pipelines in the SCM
    // pipeline DB.
    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));

L125:
    if (pipeline.getType() == HddsProtos.ReplicationType.RATIS &&
        pipeline.getFactor() == HddsProtos.ReplicationFactor.THREE) {
      getSafeModeMetrics().incCurrentHealthyPipelinesCount();
      currentHealthyPipelineCount++;
    }

Even if we consider both 3-node and 1-node pipelines, we would still have a problem. Suppose a cluster has ten 3-node pipelines. If both factor THREE and factor ONE pipelines increment currentHealthyPipelineCount, there can be a case where only one datanode from each pipeline has reported, yet currentHealthyPipelineCount still reaches the threshold. After coming out of safe mode there would then be no open 3-node pipelines in the cluster, and writes would fail.

@xiaoyuyao
Contributor

@bharatviswa504 the issue mentioned above has already been reported and is being worked on in HDDS-2497.

@xiaoyuyao
Contributor

Just copying the comments from @elek and @swagle on HDDS-2034 here in case you missed them, @ChenSammi. Note that HDDS-2497, which I opened a few days back, tracks an issue also observed by @swagle and @bharatviswa504 that could relate to the acceptance test failures.

Marton Elek added a comment - 8 hours ago

Sorry, I need to revert this commit. We fixed two problems introduced by this patch with temporary fixes (TestDeadNodeHandler was failing, and the acceptance testlib.sh contained a syntax error), but even after those fixes the acceptance tests hang indefinitely. (We tried it with and without the patch multiple times.)

It might be caused by some integration problem rather than by the patch itself, but as everything works well without the patch I will revert it to get back to a green baseline. Let's reopen the pull request (or upload a new one) to get a full build before the re-merge.

Siddharth Wagle added a comment - 1 hour ago - edited

Hi Sammi Chen, I believe HDDS-2497 is related to this change, since we enabled the pipeline check for safe mode exit. For 1-node RATIS, HealthyPipelineSafeModeRule.process does not check for the 1-node pipeline. It should be an easy fix to go along with this patch; assigning that Jira to you. Thanks.

@bharatviswa504
Contributor

Hi @swagle,
Considering 1-node pipelines in the HealthyPipelineSafeModeRule will cause the following issue.

Example scenario:
Even if we consider both 3-node and 1-node pipelines, we would still have a problem. Suppose a cluster has ten 3-node pipelines. If both factor THREE and factor ONE pipelines increment currentHealthyPipelineCount, there can be a case where only one datanode from each pipeline has reported, yet currentHealthyPipelineCount still reaches the threshold. After coming out of safe mode there would then be no open 3-node pipelines in the cluster, and writes would fail.

@xiaoyuyao
Contributor

I think we only need to count RATIS-3 pipelines in the HealthyPipelineSafeModeRule when ozone.replication is configured to be 3. RATIS-1 pipelines should not be counted in HealthyPipelineSafeModeRule. This way, we are consistent with the existing behavior, implicitly assuming one single-node pipeline per reported node.

@bharatviswa504
Contributor

bharatviswa504 commented Nov 19, 2019

Yes @xiaoyuyao,
Thanks for the offline discussion. The patch does not consider 1-node RATIS pipelines when computing pipelineCount. My bad in reading the code.

elek added a commit that referenced this pull request Nov 20, 2019
…through heartbeat commands (#29)""

This reverts commit dcfe5f3.
@ChenSammi deleted the HDDS-2034 branch February 20, 2023 03:35