HDFS-16456. EC: Decommission a rack with only on dn will fail when the rack number is equal with replication #4126

Merged
tasanuma merged 4 commits into apache:trunk from lfxy:feature/HDFS-16456
Apr 14, 2022

@lfxy lfxy commented Mar 31, 2022

HDFS-16456

In the following scenario, decommissioning fails with the reason TOO_MANY_NODES_ON_RACK:

  1. An EC policy is enabled, such as RS-6-3-1024k.
  2. The number of racks in the cluster is less than or equal to the replication number (9).
  3. A rack has only one DN, and that DN is decommissioned.

The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means at most one datanode can be chosen from each rack.

int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
This line is executed with totalNumOfReplicas=9 and numOfRacks=9.
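
To make the arithmetic concrete, here is a minimal standalone sketch of the formula quoted above. It is not the Hadoop source: the class name `MaxNodesPerRackSketch` is hypothetical, and the second call (8 racks) is only an assumption about what the limit would be if the rack left without a usable DN were not counted.

```java
// Standalone sketch of the formula quoted above; the class name is hypothetical.
final class MaxNodesPerRackSketch {

  // Same integer arithmetic as the quoted line: ceil(totalNumOfReplicas / numOfRacks).
  static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
    return (totalNumOfReplicas - 1) / numOfRacks + 1;
  }

  public static void main(String[] args) {
    // RS-6-3 writes 9 internal blocks; 9 racks are counted, one of them unusable.
    System.out.println(maxNodesPerRack(9, 9)); // prints 1
    // Assumption: if only the 8 usable racks were counted, the limit would relax.
    System.out.println(maxNodesPerRack(9, 8)); // prints 2
  }
}
```

With a limit of 1 and only 8 usable racks, at most 8 targets can be chosen for 9 blocks, which matches the failure described here.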

When we decommission a DN that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so the code never falls back to chooseEvenlyFromRemainingRacks().

During decommission, after targets are chosen, verifyBlockPlacement() returns a total rack count that still includes the invalid rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also causes the decommission to fail.
public boolean isPlacementPolicySatisfied() { return requiredRacks <= currentRacks || currentRacks >= totalRacks; }
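
The predicate above can be tried in isolation. The sketch below copies only the boolean expression into a hypothetical class `PlacementStatusSketch`. The input values are assumptions chosen to mirror the scenario (9 required racks, blocks spread over 8 racks), not values taken from a real cluster.

```java
// Standalone copy of the quoted predicate; the class name and inputs are illustrative.
final class PlacementStatusSketch {

  static boolean isPlacementPolicySatisfied(int requiredRacks, int currentRacks, int totalRacks) {
    return requiredRacks <= currentRacks || currentRacks >= totalRacks;
  }

  public static void main(String[] args) {
    // Total still counts the rack whose only DN is being decommissioned.
    System.out.println(isPlacementPolicySatisfied(9, 8, 9)); // prints false
    // Assumption: the invalid rack is excluded from the total.
    System.out.println(isPlacementPolicySatisfied(9, 8, 8)); // prints true
  }
}
```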
Based on the above, the fix makes the following changes:

  1. In startDecommission() and stopDecommission(), also update numOfRacks in the NetworkTopology class. Otherwise choosing targets may fail because maxNodesPerRack is too small, and even if choosing targets succeeds, isPlacementPolicySatisfied() still returns false and the decommission fails.
  2. In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), wrap the first chooseOnce() call in a try/catch as well; otherwise an exception prevents the fallback to chooseEvenlyFromRemainingRacks().
  3. In verifyBlockPlacement(), exclude invalid racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data reconstruction fails.
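
The try/catch fallback in item 2 can be sketched as a toy model. Everything in this block is hypothetical and heavily simplified: `FallbackSketch`, `NotEnoughTargets` (an unchecked stand-in for NotEnoughReplicasException), and the "raise the per-rack limit by one" fallback are illustrations only and do not reproduce the real chooseEvenlyFromRemainingRacks() logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of "catch the first-pass failure, then fall back"; not Hadoop code.
final class FallbackSketch {

  /** Hypothetical unchecked stand-in for NotEnoughReplicasException. */
  static final class NotEnoughTargets extends RuntimeException {
    NotEnoughTargets(String message) { super(message); }
  }

  // Adds rack names to `chosen` (one entry per chosen node) until `wanted` is
  // reached, taking at most maxNodesPerRack nodes per rack; throws if it cannot.
  static void chooseOnce(Map<String, Integer> usableNodesPerRack, int wanted,
      int maxNodesPerRack, List<String> chosen) {
    for (Map.Entry<String, Integer> rack : usableNodesPerRack.entrySet()) {
      int already = Collections.frequency(chosen, rack.getKey());
      int limit = Math.min(maxNodesPerRack, rack.getValue());
      while (chosen.size() < wanted && already < limit) {
        chosen.add(rack.getKey());
        already++;
      }
    }
    if (chosen.size() < wanted) {
      throw new NotEnoughTargets("chose " + chosen.size() + " of " + wanted);
    }
  }

  static List<String> chooseTargets(Map<String, Integer> usableNodesPerRack,
      int wanted, int maxNodesPerRack) {
    List<String> chosen = new ArrayList<>();
    try {
      chooseOnce(usableNodesPerRack, wanted, maxNodesPerRack, chosen);
    } catch (NotEnoughTargets e) {
      // Without this catch, the failure would propagate and no fallback would run.
      // Simplified fallback: allow one more node per rack for the remainder.
      chooseOnce(usableNodesPerRack, wanted, maxNodesPerRack + 1, chosen);
    }
    return chosen;
  }

  public static void main(String[] args) {
    Map<String, Integer> racks = new LinkedHashMap<>();
    racks.put("/rack0", 2); // one rack has a spare node
    for (int i = 1; i < 8; i++) {
      racks.put("/rack" + i, 1);
    }
    List<String> targets = chooseTargets(racks, 9, 1);
    System.out.println(targets.size());                           // prints 9
    System.out.println(Collections.frequency(targets, "/rack0")); // prints 2
  }
}
```

In this model the first pass stops at 8 targets because the per-rack limit is 1. Only the caught exception lets the second pass place the ninth target on the rack that still has a free node.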

lfxy commented Mar 31, 2022

@tasanuma Please review, thank you.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 1s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 11m 38s Maven dependency ordering for branch
+1 💚 mvninstall 23m 13s trunk passed
+1 💚 compile 23m 2s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 20m 15s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 3m 37s trunk passed
+1 💚 mvnsite 3m 25s trunk passed
+1 💚 javadoc 2m 29s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 3m 32s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 1s trunk passed
+1 💚 shadedclient 26m 33s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 29s Maven dependency ordering for patch
+1 💚 mvninstall 2m 39s the patch passed
+1 💚 compile 24m 19s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 24m 19s the patch passed
+1 💚 compile 21m 51s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 21m 51s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 33s /results-checkstyle-root.txt root: The patch generated 1 new + 155 unchanged - 0 fixed = 156 total (was 155)
+1 💚 mvnsite 3m 21s the patch passed
+1 💚 javadoc 2m 25s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 3m 32s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 18s the patch passed
+1 💚 shadedclient 23m 55s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 10s hadoop-common in the patch passed.
-1 ❌ unit 431m 53s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 12s The patch does not generate ASF License warnings.
668m 1s
Reason Tests
Failed junit tests hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/artifact/out/Dockerfile
GITHUB PR #4126
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 03872bc8785a 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 77ef654
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/testReport/
Max. process+thread count 2772 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 0s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 12m 34s Maven dependency ordering for branch
+1 💚 mvninstall 23m 15s trunk passed
+1 💚 compile 22m 54s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 20m 10s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 3m 38s trunk passed
+1 💚 mvnsite 3m 25s trunk passed
+1 💚 javadoc 2m 30s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 3m 31s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 5m 58s trunk passed
+1 💚 shadedclient 23m 33s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 29s Maven dependency ordering for patch
+1 💚 mvninstall 2m 17s the patch passed
+1 💚 compile 22m 15s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 22m 15s the patch passed
+1 💚 compile 20m 2s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 20m 2s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 3m 36s the patch passed
+1 💚 mvnsite 3m 19s the patch passed
+1 💚 javadoc 2m 27s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 3m 36s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 18s the patch passed
+1 💚 shadedclient 24m 11s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 49s hadoop-common in the patch passed.
-1 ❌ unit 420m 34s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 11s The patch does not generate ASF License warnings.
650m 45s
Reason Tests
Failed junit tests hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes
hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/2/artifact/out/Dockerfile
GITHUB PR #4126
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 9a71fa74327b 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 372cf01
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/2/testReport/
Max. process+thread count 2533 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

lfxy commented Apr 1, 2022

@tasanuma The failed UTs don't seem to be related to this patch. Please help check.

@tasanuma tasanuma left a comment

@lfxy Thanks for creating the PR. I did some tests with this PR in my test cluster, and it worked well. I left some review comments about typos.

And I have one more question. There are some clusterMap.getNumOfRacks() calls in BlockPlacementPolicyDefault. Do we need to update them as well?

lfxy commented Apr 7, 2022

@tasanuma Yes, I think clusterMap.getNumOfRacks() in BlockPlacementPolicyDefault should also be updated, because only non-empty racks make sense.
I have also fixed the other code as you suggested. Thank you!

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 12m 32s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 16m 3s Maven dependency ordering for branch
+1 💚 mvninstall 26m 14s trunk passed
+1 💚 compile 26m 33s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 compile 22m 52s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 3m 56s trunk passed
+1 💚 mvnsite 3m 25s trunk passed
+1 💚 javadoc 2m 27s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 3m 35s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 19s trunk passed
+1 💚 shadedclient 24m 36s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 28s Maven dependency ordering for patch
+1 💚 mvninstall 2m 23s the patch passed
+1 💚 compile 22m 59s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javac 22m 59s the patch passed
+1 💚 compile 21m 29s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 21m 29s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 3m 40s the patch passed
+1 💚 mvnsite 3m 15s the patch passed
+1 💚 javadoc 2m 18s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 3m 27s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 7m 4s the patch passed
+1 💚 shadedclient 25m 24s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 10s hadoop-common in the patch passed.
+1 💚 unit 242m 29s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 6s The patch does not generate ASF License warnings.
501m 6s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/4/artifact/out/Dockerfile
GITHUB PR #4126
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 33711d5ec556 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 662323a
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/4/testReport/
Max. process+thread count 3325 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@tasanuma tasanuma left a comment

Thanks for updating the PR. +1.
I will merge this PR next week if there are no other reviews.

tasanuma commented Apr 8, 2022

@surendralilhore Please comment if you have any concerns.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 55s Maven dependency ordering for branch
+1 💚 mvninstall 26m 29s trunk passed
+1 💚 compile 26m 43s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 compile 23m 9s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 3m 45s trunk passed
+1 💚 mvnsite 3m 38s trunk passed
+1 💚 javadoc 2m 43s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 3m 26s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 45s trunk passed
+1 💚 shadedclient 24m 45s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 29s Maven dependency ordering for patch
+1 💚 mvninstall 2m 17s the patch passed
+1 💚 compile 23m 30s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javac 23m 30s the patch passed
+1 💚 compile 22m 23s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 22m 23s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 3m 40s the patch passed
+1 💚 mvnsite 3m 23s the patch passed
+1 💚 javadoc 2m 21s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 3m 31s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 40s the patch passed
+1 💚 shadedclient 25m 0s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 27s hadoop-common in the patch passed.
-1 ❌ unit 429m 35s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 15s The patch does not generate ASF License warnings.
697m 5s
Reason Tests
Failed junit tests hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/3/artifact/out/Dockerfile
GITHUB PR #4126
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 97de28519e31 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 662323a
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/3/testReport/
Max. process+thread count 2633 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@tasanuma tasanuma merged commit cee8c62 into apache:trunk Apr 14, 2022
@tasanuma

Merged. Thanks for your contribution, @lfxy!

lfxy commented Apr 16, 2022

@tasanuma Thank you for your review and for the many useful suggestions.

@lfxy lfxy deleted the feature/HDFS-16456 branch April 18, 2022 02:42
jojochuang pushed a commit to jojochuang/hadoop that referenced this pull request May 20, 2022
…e rack number is equal with replication (apache#4126)

(cherry picked from commit cee8c62)

 Conflicts:
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java

Change-Id: Id5f937c25d87ae48f3ccabecf8b0c5feac7ca496
(cherry picked from commit dd79aee635fdc61648e0c87bea1560dc35aee053)
jojochuang added a commit that referenced this pull request May 26, 2022
…e rack number is equal with replication (#4126) (#4304)

(cherry picked from commit cee8c62)

 Conflicts:
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java

(cherry picked from commit dd79aee635fdc61648e0c87bea1560dc35aee053)

Co-authored-by: caozhiqiang <lfxy@163.com>
Reviewed-by: Takanobu Asanuma <tasanuma@apache.org>
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
LiuGuH pushed a commit to LiuGuH/hadoop that referenced this pull request Mar 26, 2024