HDDS-8024. When readChunk from a datanode fails, retry other datanodes. #4336

szetszwo · 2023-03-01T22:22:57Z

What changes were proposed in this pull request?

When reading a chunk from a datanode fails, exclude that datanode and retry with the other datanodes.
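
As a rough illustration of the idea (not the code in this patch), the failover can be sketched as below. `ReadFailoverSketch`, `ChunkReader`, and the use of plain strings for datanodes are all hypothetical names chosen for the example; the real client works with pipelines and XceiverClient objects.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "exclude the failed datanode, try the next one". */
final class ReadFailoverSketch {

  /** Hypothetical stand-in for a call that reads one chunk from one datanode. */
  interface ChunkReader {
    byte[] readChunk(String datanode) throws IOException;
  }

  /**
   * Tries each replica in order. A datanode that fails is recorded as
   * excluded and the next replica is tried. Fails only if every replica fails.
   */
  static byte[] readWithFailover(List<String> replicas, ChunkReader reader)
      throws IOException {
    List<String> excluded = new ArrayList<>();
    IOException lastFailure = null;
    for (String datanode : replicas) {
      try {
        return reader.readChunk(datanode);
      } catch (IOException e) {
        excluded.add(datanode);   // do not come back to this datanode
        lastFailure = e;
      }
    }
    throw new IOException("All replicas failed: " + excluded, lastFailure);
  }
}
```

With replicas `dn1`, `dn2`, `dn3` and `dn1` failing, the read is served by `dn2` and `dn3` is never contacted.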

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8024

How was this patch tested?

Updated TestHSync

jojochuang

Thanks. This is a good fix and looks good to me!

Prior to this change, BlockInputStream defined a retry policy that retries three times, with a 1 ms delay before each retry.
After this change, reading a block could make up to 9 attempts (assuming 3 replicas): r1, r2, r3, r1, r2, r3, r1, r2, r3.

It might be a good idea to extend the delay between retrying r3 and r1, in case of a cluster-wide restart. In fact, we should consider testing such cases so that clients don't fail prematurely.
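
To make the suggestion concrete, here is a minimal sketch, assuming a two-level loop: failover across replicas happens immediately within a round, and a configurable pause is applied only between rounds (that is, between r3 and the next r1). The class, interface, and parameter names are hypothetical and the pause value is left to the caller; this is not how BlockInputStream is actually structured.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.List;

/** Hypothetical sketch: rounds over all replicas, pausing between rounds. */
final class RoundRetrySketch {

  /** Hypothetical stand-in for reading from a single replica. */
  interface ReplicaReader {
    byte[] read(String replica) throws IOException;
  }

  static byte[] readWithRounds(List<String> replicas, ReplicaReader reader,
      int maxRounds, long pauseBetweenRoundsMs) throws IOException {
    IOException lastFailure = null;
    for (int round = 0; round < maxRounds; round++) {
      if (round > 0) {
        // Pause only after a full pass (r1, r2, r3) has failed, so that a
        // cluster-wide restart has some time to complete.
        try {
          Thread.sleep(pauseBetweenRoundsMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new InterruptedIOException("Interrupted while waiting to retry");
        }
      }
      for (String replica : replicas) {
        try {
          return reader.read(replica);
        } catch (IOException e) {
          lastFailure = e;   // fail over to the next replica without delay
        }
      }
    }
    throw new IOException("Read failed after " + maxRounds + " rounds over "
        + replicas.size() + " replicas", lastFailure);
  }
}
```

With 3 rounds and 3 replicas this makes at most 9 attempts, in the order r1, r2, r3, r1, r2, r3, r1, r2, r3, and sleeps twice.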

Also, for the case where the block token expires (https://issues.apache.org/jira/browse/HDDS-7930), trying another replica will not help. However, the retry policy at the BlockInputStream level will address this, and a client is supposed to hit block token expiration only rarely.

Just a suggestion: can you update the jira subject so that it is easier for the community to tell this is a behavior change, not just fixing flaky tests?

Other than that I am +1.

szetszwo · 2023-03-04T01:04:05Z

@jojochuang , thanks a lot for reviewing this! Updated the title.

adoroszlai · 2023-03-04T19:15:43Z

@szetszwo thanks for the improvement. Sorry for being late, but I tried this fix in repeated CI runs and it didn't seem to fix the problem.

56 out of 300 iterations failed:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4326162668

szetszwo · 2023-03-04T23:12:57Z

@adoroszlai , thanks a lot for testing it!

	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.fs.ozone.TestHSync.runTestHSync(TestHSync.java:201)
	at org.apache.hadoop.fs.ozone.TestHSync.runTestHSync(TestHSync.java:154)
	at org.apache.hadoop.fs.ozone.TestHSync.testO3fsHSync(TestHSync.java:122)

https://github.com/adoroszlai/hadoop-ozone/actions/runs/4326162668/jobs/7553254946#step:5:6237

The line numbers do not match the latest code. I guess it was testing the old code?

adoroszlai · 2023-03-05T08:23:16Z

The line numbers do not match the latest code. I guess it was testing the old code?

No, it was testing the PR, you can find diff here: master...adoroszlai:hadoop-ozone:HDDS-8024-repeat

Line numbers in TestHSync are off because a01676a changed that file on master in the meantime.

szetszwo · 2023-03-06T19:17:38Z

Line numbers in TestHSync are off because a01676a changed that file on master in the meantime.

@adoroszlai , a01676a is HDDS-8029 which is BEFORE this commit 3bbb574 as shown below. Anyway, could you run the test again with the latest code?

commit 3bbb5742c5f494d940ced1553e5772eecfb6398c
Author: Tsz-Wo Nicholas Sze <[email protected]>
Date:   Fri Mar 3 17:03:45 2023 -0800

    HDDS-8024. When readChunk from a datanode fails, retry other datanodes. (#4336)

commit 88292e8a446d183d78f74aefda2b72d3e02ab2f7
Author: Stephen O'Donnell <[email protected]>
Date:   Fri Mar 3 21:00:45 2023 +0000

    HDDS-8075. ECReconstructionCoordinatorTask.runTask should catch Exception (#4342)

commit a432438c363622d287fb4ac3524394fa2ec5fc13
Author: djordje-mijatovic <[email protected]>
Date:   Fri Mar 3 21:33:47 2023 +0100

    HDDS-7210. Missing open containers show up as "Closing" on the container report. (#4207)

commit a01676a9afccf4d7c4c80066d999a087973a9def
Author: Wei-Chiu Chuang <[email protected]>
Date:   Fri Mar 3 11:14:41 2023 -0800

    HDDS-8029. [hsync] Outputstream in encrypted buckets do not return the correct stream capabilities. (#4316)

adoroszlai · 2023-03-06T20:10:53Z

The line numbers do not match the latest code. I guess it was testing the old code?

Line numbers in TestHSync are off because a01676a changed that file on master in the meantime.

a01676a is HDDS-8029 which is BEFORE this commit 3bbb574

The lines added by a01676a were not present in your PR, as it was based on a prior state of master:

* 4afa68e460 (master) HDDS-7816. Add DataNode list to the SCM WebUI (#4289)
* ...
* 3bbb5742c5 HDDS-8024. When readChunk from a datanode fails, retry other datanodes. (#4336)
* ...
* a01676a9af HDDS-8029. [hsync] Outputstream in encrypted buckets do not return the correct stream capabilities. (#4316)
* ...
| * 6fa8fc1291 (HDDS-8024-repeat) TEST 10x30x TestHSync,ITestOzoneContractCreate
| * 91958b3e0f (HDDS-8024) Fix test failures.
| * 67a8b10fdb HDDS-8024. Intermittent inconsistent read in HSync tests.
|/  
* 84f1523d24 HDDS-7869. Log configuration on component startup. (#4271)

Now that the PR has been merged and has become 3bbb574 on master, the line numbers are different, due to the presence of a01676a. But that doesn't mean I was testing code without the fix.

Now I have launched a new run with current master: https://github.com/adoroszlai/hadoop-ozone/actions/runs/4347182297

adoroszlai · 2023-03-06T21:07:31Z

new run with current master: https://github.com/adoroszlai/hadoop-ozone/actions/runs/4347182297

This one failed, too.

Example:

Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.159 s <<< FAILURE! - in org.apache.hadoop.fs.ozone.TestHSync
org.apache.hadoop.fs.ozone.TestHSync.testO3fsHSync  Time elapsed: 3.85 s  <<< ERROR!
java.io.IOException: Inconsistent read for blockID=conID: 1 locID: 111677748019200001 bcsId: 0 length=97 position=64 numBytesToRead=33 numBytesRead=-1
	at org.apache.hadoop.ozone.client.io.KeyInputStream.checkPartBytesRead(KeyInputStream.java:175)
	at org.apache.hadoop.hdds.scm.storage.MultipartInputStream.readWithStrategy(MultipartInputStream.java:97)
	at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:54)
	at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:64)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.fs.ozone.TestHSync.runTestHSync(TestHSync.java:211)
	at org.apache.hadoop.fs.ozone.TestHSync.runTestHSync(TestHSync.java:164)
	at org.apache.hadoop.fs.ozone.TestHSync.testO3fsHSync(TestHSync.java:132)

szetszwo · 2023-03-06T23:51:59Z

@adoroszlai , thanks for the update. Sorry that I may have misunderstood how the test was run.

This one failed, too.

The line numbers in this stack trace match the current code. Let me take a look.

szetszwo · 2023-03-07T02:09:15Z

@adoroszlai found that getBlock also does not retry other datanodes. Submitted #4357

szetszwo force-pushed the HDDS-8024 branch from 7c6d7be to 5bffcba Compare March 2, 2023 02:19

HDDS-8024. Intermittent inconsistent read in HSync tests.

67a8b10

szetszwo force-pushed the HDDS-8024 branch from 5bffcba to 67a8b10 Compare March 2, 2023 02:21

Fix test failures.

91958b3

szetszwo requested review from adoroszlai and jojochuang March 2, 2023 19:43

jojochuang approved these changes Mar 4, 2023

szetszwo changed the title ~~HDDS-8024. Intermittent inconsistent read in HSync tests.~~ HDDS-8024. When readChunk from a datanode fails, retry other datanodes. Mar 4, 2023

szetszwo merged commit 3bbb574 into apache:master Mar 4, 2023

jojochuang added the hbase HBase on Ozone support label Jan 23, 2024

ivandika3 mentioned this pull request Jan 31, 2024

HDDS-8090. When getBlock from a datanode fails, retry other datanodes. #4357

Merged
