Conversation

@ChenSammi
@jojochuang jojochuang left a comment


So basically this makes the delete block operation asynchronous, moving it to another thread.
From a very quick look at the PR, it looks reasonable to me.
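A minimal sketch of the pattern being described (not the actual Ozone code; the class and method names here are illustrative): the heartbeat/dispatcher thread only enqueues the delete command, while a dedicated single-thread executor drains the queue and does the slow RocksDB work, so the heartbeat is never blocked.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncDeleteSketch {
  private final Queue<String> deleteCommandQueue = new ConcurrentLinkedQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Called from the heartbeat/dispatcher thread: O(1), non-blocking.
  public void handle(String deleteCmd) {
    deleteCommandQueue.offer(deleteCmd);
    executor.execute(this::drainQueue);
  }

  // Runs on the executor thread; does the slow persistence work.
  private void drainQueue() {
    String cmd;
    while ((cmd = deleteCommandQueue.poll()) != null) {
      processCmd(cmd);
    }
  }

  private void processCmd(String cmd) {
    // placeholder for the real work: open RocksDB, persist the transaction
  }

  public int pending() {
    return deleteCommandQueue.size();
  }

  public void shutdown() {
    executor.shutdown();
    try {
      executor.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    AsyncDeleteSketch handler = new AsyncDeleteSketch();
    handler.handle("txn-1");
    handler.handle("txn-2");
    handler.shutdown();
    System.out.println(handler.pending()); // 0: the worker drained the queue
  }
}
```

A single worker thread is enough here because the goal is to unblock the caller, not to parallelize the RocksDB writes themselves.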

Contributor:

This is hard to read. Can you move this Callable out?

Contributor Author:

Sure. I refactored the code.

ChenSammi commented Jul 16, 2021

[image: RocksDB cache metrics from one DN]

This is the RocksDB cache metrics data captured from one DN. It costs about 200ms to open a RocksDB instance to persist the block deletion transaction command. Based on the current default configuration, if the cache misses, one DN can handle at most 60s / 200ms = 300 RocksDB opens within 60s, which is the SCM block deleting command generation interval.
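The back-of-the-envelope bound in this comment can be written out directly (the 200ms open cost and 60s interval are the values quoted above, not re-measured here):

```java
public class ThroughputBound {
  // Upper bound on cache-miss RocksDB opens a DN can perform
  // within one SCM command-generation interval.
  public static long maxOpensPerInterval(long intervalMs, long openCostMs) {
    return intervalMs / openCostMs;
  }

  public static void main(String[] args) {
    // 60s interval, ~200ms per open
    System.out.println(maxOpensPerInterval(60_000, 200)); // prints 300
  }
}
```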

@jojochuang

This is yet another reason why we should reconsider the 1-db-per-container design :)

ChenSammi commented Jul 20, 2021

@jojochuang , do you mean 1-db-per-volume design? We think the same way.


@jojochuang jojochuang left a comment


just one comment. The rest looks good to me.

Contributor:

Oh one more thing. We should do
Thread.currentThread().interrupt();
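This is the standard idiom for handling InterruptedException in worker code: catching the exception clears the thread's interrupt flag, so the catch block should call Thread.currentThread().interrupt() to restore it, letting callers further up the stack still observe the interruption. A small self-contained illustration (the sleepQuietly helper is hypothetical, not from the PR):

```java
import java.util.concurrent.TimeUnit;

public class InterruptIdiom {
  // Returns true if the sleep completed, false if it was interrupted.
  public static boolean sleepQuietly(long millis) {
    try {
      TimeUnit.MILLISECONDS.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      // Re-set the flag instead of swallowing the interruption.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    Thread.currentThread().interrupt();   // arrange: a pending interrupt
    boolean completed = sleepQuietly(10); // sleep() throws immediately
    System.out.println(completed);            // false
    System.out.println(Thread.interrupted()); // true: the flag was restored
  }
}
```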

@ChenSammi

Hi @adoroszlai, do we need to do something to free disk space when "Warning: ForkStarter IOException: java.io.IOException: No space left on device" is found?


@jojochuang jojochuang left a comment


+1 from me. Waiting for Attila's review.

while (!deleteCommandQueues.isEmpty()) {
  DeleteCmdInfo cmd = deleteCommandQueues.poll();
  try {
    processCmd(cmd.getCmd(), cmd.getContainer(), cmd.getContext(),
Contributor:

Thanks for the work! NIT: can we use DeleteCmdInfo as the only parameter of processCmd?
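A hypothetical sketch of the refactor being suggested, with simplified stand-ins for the real Ozone types: instead of passing the command, container, and context as separate arguments, group them into the DeleteCmdInfo holder and pass only that.

```java
public class DeleteCmdRefactor {
  // Simplified stand-ins for the real Ozone types.
  static class DeleteBlocksCommand { }
  static class OzoneContainer { }
  static class StateContext { }

  // Parameter-object holder, mirroring the getters used in the diff above.
  static class DeleteCmdInfo {
    private final DeleteBlocksCommand cmd;
    private final OzoneContainer container;
    private final StateContext context;

    DeleteCmdInfo(DeleteBlocksCommand cmd, OzoneContainer container,
        StateContext context) {
      this.cmd = cmd;
      this.container = container;
      this.context = context;
    }

    DeleteBlocksCommand getCmd() { return cmd; }
    OzoneContainer getContainer() { return container; }
    StateContext getContext() { return context; }
  }

  // After the refactor: one parameter instead of several.
  static String processCmd(DeleteCmdInfo info) {
    return "processed " + (info.getCmd() != null ? "cmd" : "nothing");
  }
}
```

The parameter object keeps the processCmd signature stable if more per-command state is added later.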

Contributor Author:

Sure.

@ChenSammi

Thanks @jojochuang and @JacksonYao287 for the code review. Also thanks @adoroszlai for taking care of the "no space left" issue.

@ChenSammi ChenSammi merged commit f9ec7e9 into apache:master Aug 10, 2021

cluster.shutdownHddsDatanode(keyLocation.getPipeline().getFirstNode());

waitForReplicaCount(containerID, 2, cluster);

@adoroszlai adoroszlai Aug 10, 2021


Why was this removed? Without this, the test may give false negative: replica count may be 3 both initially and after re-replication, and we need to distinguish between the two.

Contributor Author:

Because the delete thread pool shutdown takes time; by the time it finishes, the replication manager has already re-replicated the container. So the wait for 2 replicas randomly fails because there are already 3 replicas.

@ChenSammi

> TestBlockDeletion timed out twice on master since this commit.
>
> * https://github.com/elek/ozone-build-results/blob/master/2021/08/10/9545/it-ozone/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion.txt
>
> * https://github.com/elek/ozone-build-results/blob/master/2021/08/10/9555/it-ozone/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion.txt
>
> I think this should be fixed.

Sure. I will take a look at the timeout issue.

@adoroszlai

> Sure. I will take a look at the timeout issue.

Thanks. Filed HDDS-5605 for it. HDDS-5606 is also related, but it started happening earlier (a few weeks ago).

errose28 added a commit to errose28/ozone that referenced this pull request Aug 12, 2021
* master:
  HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret (apache#2518)
  HDDS-5608. Fix wrong command in ugrade doc (apache#2524)
  HDDS-5000. Run CI checks selectively (apache#2479)
  HDDS-4929. Select target datanodes and containers to move for Container Balancer (apache#2441)
  HDDS-5283. getStorageSize cast to int can cause issue (apache#2303)
  HDDS-5449 Recon namespace summary 'du' information should return replicated size of a key (apache#2489)
  HDDS-5558. vUnit invocation unit() may produce NPE (apache#2513)
  HDDS-5531. For Link Buckets avoid showing metadata. (apache#2502)
  HDDS-5549. Add 1.1 to supported versions in security policy (apache#2519)
  HDDS-5555. remove pipeline manager v1 code (apache#2511)
  HDDS-5546.OM Service ID change causes OM startup failure. (apache#2512)
  HDDS-5360. DN failed to process all delete block commands in one heartbeat interval (apache#2420)
  HDDS-5021. dev-support Dockerfile is badly outdated (apache#2480)
errose28 added a commit to errose28/ozone that referenced this pull request Oct 5, 2021
…one heartbeat interval (apache#2420)"

This reverts commit f9ec7e9.

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java
