elek (Member) commented Nov 28, 2019

What changes were proposed in this pull request?

After HDDS-2034 (or possibly even before), pipeline creation (or the status transition from ALLOCATED to OPEN) requires at least one pipeline report from each of the pipeline's datanodes. This means the cluster might not be usable even if it is out of safe mode AND there are at least three datanodes.

This makes all of the acceptance tests unstable.

For example, in this run:

scm_1         | 2019-11-28 11:22:54,401 INFO pipeline.RatisPipelineProvider: Send pipeline:PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb create command to datanode 548f146f-2166-440a-b9f1-83086591ae26
scm_1         | 2019-11-28 11:22:54,402 INFO pipeline.RatisPipelineProvider: Send pipeline:PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb create command to datanode dccee7c4-19b3-41b8-a3f7-b47b0ed45f6c
scm_1         | 2019-11-28 11:22:54,404 INFO pipeline.RatisPipelineProvider: Send pipeline:PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb create command to datanode 47dbb8e4-bbde-4164-a798-e47e8c696fb5
scm_1         | 2019-11-28 11:22:54,405 INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 8dc4aeb6-5ae2-46a0-948d-287c97dd81fb, Nodes: 548f146f-2166-440a-b9f1-83086591ae26{ip: 172.24.0.10, host: ozoneperf_datanode_3.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}dccee7c4-19b3-41b8-a3f7-b47b0ed45f6c{ip: 172.24.0.5, host: ozoneperf_datanode_1.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}47dbb8e4-bbde-4164-a798-e47e8c696fb5{ip: 172.24.0.2, host: ozoneperf_datanode_2.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED]
scm_1         | 2019-11-28 11:22:56,975 INFO pipeline.PipelineReportHandler: Pipeline THREE PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb reported by 548f146f-2166-440a-b9f1-83086591ae26{ip: 172.24.0.10, host: ozoneperf_datanode_3.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}
scm_1         | 2019-11-28 11:22:58,018 INFO pipeline.PipelineReportHandler: Pipeline THREE PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb reported by dccee7c4-19b3-41b8-a3f7-b47b0ed45f6c{ip: 172.24.0.5, host: ozoneperf_datanode_1.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}
scm_1         | 2019-11-28 11:23:01,871 INFO pipeline.PipelineReportHandler: Pipeline THREE PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb reported by 548f146f-2166-440a-b9f1-83086591ae26{ip: 172.24.0.10, host: ozoneperf_datanode_3.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}
scm_1         | 2019-11-28 11:23:02,817 INFO pipeline.PipelineReportHandler: Pipeline THREE PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb reported by 548f146f-2166-440a-b9f1-83086591ae26{ip: 172.24.0.10, host: ozoneperf_datanode_3.ozoneperf_default, networkLocation: /default-rack, certSerialId: null}
scm_1         | 2019-11-28 11:23:02,847 INFO pipeline.PipelineReportHandler: Pipeline THREE PipelineID=8dc4aeb6-5ae2-46a0-948d-287c97dd81fb reported by dccee7c4-19b3-41b8-a3f7-b47b0ed45f6c{ip: 172.24.0.5, host: ozoneperf_datanode_1.ozoneperf_default, networkLocation: /default-rack, certSerialId: null} 

As you can see, the pipeline is created, but the cluster is not usable because the pipeline has not yet been reported back by datanode_2:

scm_1         | 2019-11-28 11:23:13,879 WARN block.BlockManagerImpl: Pipeline creation failed for type:RATIS factor:THREE. Retrying get pipelines call once.
scm_1         | org.apache.hadoop.hdds.scm.pipeline.InsufficientDatanodesException: Cannot create pipeline of factor 3 using 0 nodes.

The quick fix is to configure all of the compose clusters to wait until at least one pipeline is available. This can be done by adjusting the required number of datanodes, from which the number of required healthy pipelines is derived:

// We only care about THREE replica pipeline
int minHealthyPipelines = minDatanodes /
    HddsProtos.ReplicationFactor.THREE_VALUE; 
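
For illustration, here is a minimal, self-contained version of that calculation (the class and method names are hypothetical, not the actual SCM safemode rule code; the local THREE_VALUE constant stands in for HddsProtos.ReplicationFactor.THREE_VALUE):

// Minimal sketch (hypothetical class, not the actual SCM rule): derive the
// number of healthy THREE pipelines required from the configured minimum
// datanode count (hdds.scm.safemode.min.datanode).
public class MinHealthyPipelinesSketch {

  // Stands in for HddsProtos.ReplicationFactor.THREE_VALUE (== 3).
  private static final int THREE_VALUE = 3;

  static int minHealthyPipelines(int minDatanodes) {
    // Integer division: fewer than 3 datanodes means 0 required pipelines.
    return minDatanodes / THREE_VALUE;
  }

  public static void main(String[] args) {
    System.out.println(minHealthyPipelines(1)); // 0 -> no pipeline wait
    System.out.println(minHealthyPipelines(3)); // 1 -> wait for one THREE pipeline
  }
}

So with the safemode minimum set to 3 datanodes, the cluster must report at least one healthy THREE-replica pipeline before the tests start.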

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2646

How was this patch tested?

If something is wrong, the acceptance tests will fail. We need a green run from the CI.

elek changed the title from "Hdds 2646" to "HDDS-2646. Start acceptance tests only if at least one THREE pipeline is available" on Nov 28, 2019
elek (Member, Author) commented Nov 28, 2019

@ChenSammi you are more experienced in this area. Can you please review this approach/patch?

OZONE-SITE.XML_ozone.metadata.dirs=/data/metadata
OZONE-SITE.XML_ozone.scm.client.address=scm
OZONE-SITE.XML_ozone.replication=3
OZONE-SITE.XML_hdds.scm.safemode.min.datanode=3
Contributor:

Not very familiar with docker-compose. Where do we tell docker-compose to start three datanodes with all these configurations?

elek (Member, Author):

test.sh calls start_docker_env from testlib.sh, which calls docker-compose scale datanode=3.

Unfortunately there is no easy way to define the expected number of containers in docker-compose.yaml. (There is a deploy/replicas option, but it's available only for Docker Swarm, not for docker-compose.)
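
Roughly, the sequence looks like this (a simplified sketch, not the verbatim testlib.sh):

# Simplified sketch of what start_docker_env does (not the verbatim script):
# bring up the cluster, then scale the datanode service to three instances.
docker-compose up -d
docker-compose scale datanode=3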

Contributor:

I see. Thanks for the explanation.

adoroszlai (Contributor) left a comment:

Thanks @elek for working on fixing acceptance test flakiness.

OZONE-SITE.XML_ozone.replication=3
OZONE-SITE.XML_hdds.datanode.dir=/data/hdds
OZONE-SITE.XML_hdds.profiler.endpoint.enabled=true
OZONE-SITE.XML_hdds.scm.safemode.min.datanode=3
Contributor:

This will make it impossible to use these environments with a single datanode without modifying the config locally.

I would like to propose an alternative solution (sketched below):

  1. define this config in the environment section in docker-compose.yaml using a variable that defaults to 1:
    - "OZONE-SITE.XML_hdds.scm.safemode.min.datanode=${SAFEMODE_MIN_DATANODES:-1}"
  2. set the variable to 3 in testlib.sh:
    export SAFEMODE_MIN_DATANODES=3
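
The docker-compose.yaml fragment could look roughly like this (only the environment entry is from the proposal; the surrounding service definition is an assumption for illustration):

# Illustrative docker-compose.yaml fragment: the safemode minimum defaults
# to 1 and can be raised by exporting SAFEMODE_MIN_DATANODES before startup.
services:
  datanode:
    environment:
      - "OZONE-SITE.XML_hdds.scm.safemode.min.datanode=${SAFEMODE_MIN_DATANODES:-1}"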

elek (Member, Author):

Yes, this is the problem that I tried to describe in #238. I am fine with the suggested approach, but it makes the definition more complex.

What I am thinking is to create a simple compose definition which can work with one datanode (and there we can also adjust the replication factor and the S3 storage type).

Since almost all of the tested functionality requires datanode=3, it seems to be enough to have a single cluster which can work with one datanode...

Contributor:

OK, I'm fine with the hard-coded values in order to get the acceptance tests into good shape. We can refine it later. Until then, the config can be edited locally if needed.

elek (Member, Author):

During an offline conversation I came to understand your use case: in some cases it can be useful to make the compose folder usable with just one datanode (e.g. when the UI, Recon, or shell scripts are tested).

I pushed a new commit to experiment with your proposal.

anuengineer merged commit 0ff53ef into apache:master on Dec 4, 2019
anuengineer (Contributor) commented:

@elek and @adoroszlai, thanks for explaining this patch and the next patch in the pipeline to me. I appreciate it. I have committed this patch to master. @ChenSammi, thanks for the review.

elek pushed a commit that referenced this pull request Dec 5, 2019
elek (Member, Author) commented Dec 5, 2019

@adoroszlai found a small typo. Fixed it with an addendum commit (6c1a9ff).
