HDDS-11261. Fix old block info cache when ec reconstruction failed. #7020
HDDS-11261. Refresh cached BlockInfo when EC reconstruction failed
How to reproduce:
1. Create a cluster with 9 datanodes.
2. On a datanode, create a volume (`ozone sh volume create /data`).
3. Create a bucket with EC replication rs-6-3-1024k (`ozone sh bucket create data/test-bucket --type EC --replication rs-6-3-1024k`).
4. Create a file, e.g. 10 MB (`fallocate -l 10M small_file_1`).
5. Put the file into the bucket (`ozone sh key put data/test-bucket/small_file_1 small_file_1 --type EC --replication rs-6-3-1024k`).
6. Disable 4 datanodes.
7. Try to get the file from the bucket (`ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1`).
8. You will get "There are insufficient datanodes to read the EC block". This is expected: with rs-6-3, at least 6 datanodes are needed to read the data.
9. Re-enable 1 datanode and try to get the file again as quickly as possible.
10. You will again get "There are insufficient datanodes to read the EC block". This is not expected, since 6 datanodes are now available.
11. Even a minute later, getting the file still fails with the same error.

(The full command sequence is consolidated into a runnable sketch after this list.)
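For convenience, here are the steps above as a single script. The `ozone sh` and `fallocate` commands are copied verbatim from the list; the node stop/start lines are assumptions that depend on your docker-compose service names:

```bash
# Reproduction sketch. Assumes a docker-compose Ozone cluster with 9 datanodes;
# the datanode service names below are placeholders.
ozone sh volume create /data
ozone sh bucket create data/test-bucket --type EC --replication rs-6-3-1024k
fallocate -l 10M small_file_1
ozone sh key put data/test-bucket/small_file_1 small_file_1 --type EC --replication rs-6-3-1024k

docker-compose stop datanode4 datanode5 datanode6 datanode7    # disable 4 datanodes
ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1    # expected failure: insufficient datanodes

docker-compose start datanode4                                 # re-enable 1 datanode (6 now available)
ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1    # still fails: stale cached block locations (the bug)
```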
I reproduced it via docker-compose, with fixed datanode IP addresses (this is important: docker-compose can reassign IP addresses if they are not pinned).
Why does this happen? The getKey path in Ozone Manager has a cache of block locations, and in this case that cache is stale. When we retry the read again and again, OM keeps returning a list of 5 nodes instead of 6.
I solved it by recreating the blockReader with a new blockReader built from the refreshFunction.
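To illustrate the idea, here is a minimal sketch, not the actual patch: the types (`BlockLocations`, `ReaderFactory`, `RefreshingEcBlockReader`) are hypothetical stand-ins for Ozone's block location info and EC block reader, not its real API. On a failed read, the cached locations are discarded, the refresh function re-fetches them, and the reader is rebuilt before retrying:

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Supplier;

/**
 * Sketch of the fix: on a failed EC read, discard the cached block
 * locations, re-fetch them via the refresh function, and rebuild the
 * reader before retrying, instead of reusing the stale location list.
 */
public class RefreshingEcBlockReader {

  /** Stand-in for the block location info returned by OM. */
  record BlockLocations(List<String> datanodes) { }

  /** Stand-in for a block reader bound to one fixed set of locations. */
  interface BlockReader {
    int read(byte[] buf) throws IOException;
  }

  /** Builds a fresh reader from a set of block locations. */
  interface ReaderFactory {
    BlockReader create(BlockLocations locations);
  }

  private final Supplier<BlockLocations> refreshFunction;
  private final ReaderFactory readerFactory;
  private BlockLocations cachedLocations;
  private BlockReader reader;

  RefreshingEcBlockReader(Supplier<BlockLocations> refreshFunction,
                          ReaderFactory readerFactory,
                          BlockLocations initialLocations) {
    this.refreshFunction = refreshFunction;
    this.readerFactory = readerFactory;
    this.cachedLocations = initialLocations;
    this.reader = readerFactory.create(initialLocations);
  }

  int read(byte[] buf) throws IOException {
    try {
      return reader.read(buf);
    } catch (IOException insufficientNodes) {
      // Before the fix: the retry reused cachedLocations, so a datanode
      // that came back online was never seen and the read kept failing.
      // The fix: refresh the locations and recreate the reader.
      cachedLocations = refreshFunction.get();
      reader = readerFactory.create(cachedLocations);
      return reader.read(buf);
    }
  }
}
```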
Jira link: https://issues.apache.org/jira/browse/HDDS-11261
This patch was manually tested in a docker-compose cluster.
I tried to write a robot test, but I did not find a way to stop reads from specific nodes (I put a file and then disabled 4 of 9 nodes to trigger the error, using RS-6-3-1024k replication).