HDDS-5219. Limit number of bad volumes by dfs.datanode.failed.volumes.tolerated #2243
Conversation
Force-pushed 005f3b0 to 34cf431
Hi @bshashikant @ChenSammi, the config support is added as you suggested, please help review, thanks~
DFSConfigKeysLegacy is a deprecated class. Since Ozone was spun out from Hadoop as a new project, we should avoid using HDFS configuration keys. We can define a similar key for Ozone.
@ChenSammi oh, then I shall add this key in some other place.
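For illustration, an Ozone-side key could be defined along these lines (the constant and key names below are a sketch, not necessarily what was merged; the -1 default follows the value agreed later in this thread):

```java
// Hypothetical Ozone-native replacement for the deprecated HDFS key.
public final class HddsConfigKeys {
  public static final String HDDS_DATANODE_FAILED_VOLUMES_TOLERATED_KEY =
      "hdds.datanode.failed.volumes.tolerated";
  public static final int HDDS_DATANODE_FAILED_VOLUMES_TOLERATED_DEFAULT = -1;

  private HddsConfigKeys() {
    // Constants only; no instances.
  }
}
```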
The indentation is incorrect.
will fix it.
What's the difference between the if and else branches here? Could you give a little explanation?
This logic is borrowed from HDFS: when 'tolerated' is set to -1, it means an unlimited number of bad volumes is allowed, but we should still have at least one good volume left.
Otherwise, the number of bad volumes must be less than or equal to the tolerated value.
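A compact sketch of that check (method and parameter names here are illustrative, not the exact patch):

```java
// -1 means "tolerate any number of failed volumes, as long as at least
// one good volume remains"; any other value is a hard upper limit.
private boolean hasEnoughVolumes(int goodVolumes, int failedVolumes,
    int failedVolumesTolerated) {
  if (failedVolumesTolerated == -1) {
    // Unlimited failures tolerated, but we still need one good volume.
    return goodVolumes >= 1;
  }
  // Limited tolerance: the failed count must stay within the limit.
  return failedVolumes <= failedVolumesTolerated;
}
```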
Can you add some comments here?
@bshashikant sure, will copy the lines above into code comments.
A more detailed error message is preferred here, since we know the exact reason.
sure, I'll put a clearer message here.
Prefer a more detailed error message, such as how many volumes are configured and how many have failed.
sure.
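For illustration, a message carrying the counts the reviewer asked for might look like this (the exception type and variable names are assumptions, not the actual patch):

```java
throw new IOException(String.format(
    "Too many failed volumes: %d volume(s) configured, %d failed, but "
        + "only %d failure(s) tolerated by "
        + "dfs.datanode.failed.volumes.tolerated.",
    configuredVolumeCount, failedVolumeCount, failedVolumesTolerated));
```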
@guihecheng, thanks for the work. The config needs to be definable in ozone-site.xml and should be well documented in ozone-default.xml.
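One way to meet both requirements is Ozone's annotation-driven configuration, where the ozone-default.xml entry is generated from the annotation itself. A sketch, assuming the key is added to the datanode's configuration class (key name, default, and description here are illustrative):

```java
@Config(key = "failed.volumes.tolerated",
    defaultValue = "-1",
    type = ConfigType.INT,
    tags = { ConfigTag.DATANODE },
    description = "The number of volumes that are allowed to fail before "
        + "the datanode stops offering service. Set to -1 to tolerate an "
        + "unlimited number of failed volumes, as long as at least one "
        + "good volume remains.")
private int failedVolumesTolerated = -1;
```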
Force-pushed b00c2a1 to fc40473
I feel we should make the default -1 rather than zero.
@bshashikant here I borrowed the default 0 from HDFS to stay aligned, but I also think -1 is reasonable.
With -1 as the default, we introduce no behavioral change to existing clusters when a few (1~2) volumes fail.
Will update, thanks.
bshashikant left a comment
Looks good, with a minor suggestion.
@bshashikant updated, thanks
…ing-upgrade-master-merge

* upstream/master: (76 commits)
  HDDS-5280. Make XceiverClientManager creation when necessary in ContainerOperationClient (apache#2289)
  HDDS-5272. Make ozonefs.robot execution repeatable (apache#2280)
  HDDS-5123. Use the pre-created apache/ozone-testkrb5 image during secure acceptance tests (apache#2165)
  HDDS-4993. Add guardrail for reserved buffer size when DN reads a chunk (apache#2058)
  HDDS-4936. Change ozone groupId from org.apache.hadoop to org.apache.ozone (apache#2018)
  HDDS-4043. allow deletion from Trash directory without -skipTrash option (apache#2110)
  HDDS-4927. Determine over and under utilized datanodes in Container Balancer. (apache#2230)
  HDDS-5273. Handle unsecure cluster convert to secure cluster for SCM. (apache#2281)
  HDDS-5158. Add documentation for SCM HA Security. (apache#2205)
  HDDS-5275. Datanode Report Publisher publishes one extra report after DN shutdown (apache#2283)
  HDDS-5241. SCM UI should have leader/follower and Primordial SCM information (apache#2260)
  HDDS-5219. Limit number of bad volumes by dfs.datanode.failed.volumes.tolerated. (apache#2243)
  HDDS-5252. PipelinePlacementPolicy filter out datanodes with not enough space. (apache#2271)
  HDDS-5191. Increase default pvc storage size (apache#2219)
  HDDS-5073. Use ReplicationConfig on client side (apache#2136)
  HDDS-5250. Build integration tests with Maven cache (apache#2269)
  HDDS-5236. Require block token for more operations (apache#2254)
  HDDS-5266 Misspelt words in S3MultipartUploadCommitPartRequest.java line 202 (apache#2279)
  HDDS-5249. Race Condition between Full and Incremental Container Reports (apache#2268)
  HDDS-5142. Make generic streaming client/service for container re-replication, data read, scm/om snapshot download (apache#2256)
  ...

Conflicts:
  hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java
  hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java
  hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto
  hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
  hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
  hadoop-ozone/dist/src/main/compose/testlib.sh
  hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestStorageContainerManager.java
  hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
  hadoop-ozone/ozone-manager/pom.xml
  hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
  hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/utils/OzoneManagerRatisUtils.java
  hadoop-ozone/s3gateway/pom.xml
What changes were proposed in this pull request?
The datanode has the config 'dfs.datanode.failed.volumes.tolerated', but it currently has no effect; this change makes it work.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5219
How was this patch tested?
a new integration-test case
an extended unit test
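For reference, the extended unit test might be shaped roughly like this (createVolumeSet, volumeRoot, failVolume, and hasEnoughVolumes are hypothetical stand-ins for the real fixtures, and JUnit asserts are assumed statically imported):

```java
@Test
public void testFailedVolumesTolerated() throws Exception {
  OzoneConfiguration conf = new OzoneConfiguration();
  // Tolerate at most one failed volume.
  conf.setInt("dfs.datanode.failed.volumes.tolerated", 1);

  // Hypothetical helper: a volume set backed by 3 data volumes.
  MutableVolumeSet volumeSet = createVolumeSet(conf, 3);

  volumeSet.failVolume(volumeRoot(0));  // one failure: within the limit
  assertTrue(volumeSet.hasEnoughVolumes());

  volumeSet.failVolume(volumeRoot(1));  // two failures: over the limit
  assertFalse(volumeSet.hasEnoughVolumes());
}
```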