@juncevich
Contributor

HDDS-11261. Refresh cached BlockInfo when EC reconstruction failed

How to reproduce:

Create a cluster with 9 datanodes.
On a datanode, create a volume (ozone sh volume create /data)
Create a bucket with EC replication rs-6-3-1024k (ozone sh bucket create data/test-bucket --type EC --replication rs-6-3-1024k)
Create a test file, e.g. 10 MB (fallocate -l 10M small_file_1)
Put the file into the bucket (ozone sh key put data/test-bucket/small_file_1 small_file_1 --type EC --replication rs-6-3-1024k)
Disable 4 nodes.
Try to get the file from the bucket (ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1)
You will get "There are insufficient datanodes to read the EC block". This is expected: at least 6 nodes must be available.
Enable 1 node and, as quickly as possible, try to get the file again.
You will get "There are insufficient datanodes to read the EC block". This is not expected: 6 nodes are now available.
You can try to get the file a minute later and you will get this error again.
I reproduced it with docker-compose, using fixed node IP addresses (this is important, because docker compose can change the IP addresses if they are not fixed).
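
The expected and unexpected results in the steps above follow from simple EC arithmetic: with rs-6-3, any 6 of the 9 replicas are enough to read or reconstruct a block. A minimal Java sketch of that check, assuming one replica per datanode in the 9-node cluster; the class and method names are hypothetical and are not part of the Ozone API:

```java
// Hypothetical helper, not Ozone code: illustrates the rs-6-3 availability rule.
final class EcReadCheck {
    static final int DATA = 6;    // data blocks in rs-6-3-1024k
    static final int PARITY = 3;  // parity blocks in rs-6-3-1024k

    // A block group is readable when at least DATA of the DATA + PARITY
    // replicas are reachable (assumption: one replica per datanode).
    static boolean canRead(int totalNodes, int downNodes) {
        int available = totalNodes - downNodes;
        return available >= DATA;
    }

    public static void main(String[] args) {
        int total = DATA + PARITY; // 9 datanodes
        System.out.println("4 nodes down: readable = " + canRead(total, 4));
        System.out.println("3 nodes down: readable = " + canRead(total, 3));
    }
}
```

With 4 nodes down only 5 replicas remain, so the first error is legitimate; with 3 nodes down 6 replicas are reachable, so the second error should not occur.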

Why does it happen? The getKey command in Ozone Manager uses a cache, and in this case the cache is stale. When we try to get the file again and again, OM returns a list of 5 nodes instead of 6.

I solved it by recreating the blockReader with the blockReader obtained from refreshFunction.
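
The general idea (try the cached block locations, and on failure refresh them and retry) can be sketched as below. This is only an illustration under stated assumptions, not the code in this patch: `RefreshingReader`, `readFrom`, and the node names are hypothetical, and the `Supplier` merely stands in for refreshFunction.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch, not the actual Ozone classes or the code in this patch.
final class RefreshingReader {
    static final int REQUIRED = 6; // data blocks needed for rs-6-3

    private List<String> cachedNodes;              // possibly stale location list
    private final Supplier<List<String>> refresh;  // stands in for refreshFunction

    RefreshingReader(List<String> cachedNodes, Supplier<List<String>> refresh) {
        this.cachedNodes = cachedNodes;
        this.refresh = refresh;
    }

    // Simulated read: fails when fewer than REQUIRED locations are known.
    private static String readFrom(List<String> nodes) {
        if (nodes.size() < REQUIRED) {
            throw new IllegalStateException(
                "There are insufficient datanodes to read the EC block");
        }
        return "data read from " + nodes.size() + " nodes";
    }

    // Try the cached locations first; on failure refresh them once and retry.
    String read() {
        try {
            return readFrom(cachedNodes);
        } catch (IllegalStateException e) {
            cachedNodes = refresh.get();
            return readFrom(cachedNodes);
        }
    }

    public static void main(String[] args) {
        List<String> stale = List.of("dn1", "dn2", "dn3", "dn4", "dn5");
        List<String> fresh = List.of("dn1", "dn2", "dn3", "dn4", "dn5", "dn6");
        RefreshingReader reader = new RefreshingReader(stale, () -> fresh);
        System.out.println(reader.read());
    }
}
```

The sketch retries only once, so a read still fails if the refreshed list is also too short.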

Jira link: https://issues.apache.org/jira/browse/HDDS-11261

This patch was manually tested in a docker-compose cluster.
I tried to write a robot test, but I didn't find a way to stop reading from nodes (I tried to put a file and then disable 4 of 9 nodes to get the error. I used RS-6-3-1024k replication).

@errose28
Contributor

errose28 commented Aug 2, 2024

Hi @juncevich, thanks for investigating this. Is this the same issue as HDDS-11209/#6974?

@juncevich
Contributor Author

Hi, @errose28! It looks almost the same, but the solution is different.

@juncevich juncevich closed this Aug 29, 2024
@juncevich juncevich deleted the HDDS-11261-refresh-block-info-cache-when-ec-reconstruction-fail branch August 29, 2024 12:02