HDDS-11261. Fix old block info cache when ec reconstruction failed. #7020
HDDS-11261. Refresh cached BlockInfo when EC reconstruction failed
How to reproduce:
1. Create a cluster with 9 datanodes.
2. On a datanode, create a volume (`ozone sh volume create /data`).
3. Create a bucket with EC replication rs-6-3-1024k (`ozone sh bucket create data/test-bucket --type EC --replication rs-6-3-1024k`).
4. Create a file, e.g. 10 MB (`fallocate -l 10M small_file_1`).
5. Put the file into the bucket (`ozone sh key put data/test-bucket/small_file_1 small_file_1 --type EC --replication rs-6-3-1024k`).
6. Disable 4 datanodes.
7. Try to get the file from the bucket (`ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1`).
8. You will get "There are insufficient datanodes to read the EC block". This is expected: with rs-6-3, at least 6 datanodes are needed to read the data.
9. Re-enable 1 datanode and try to get the file again as quickly as possible.
10. You will again get "There are insufficient datanodes to read the EC block". This is not expected, since 6 datanodes are now available.
11. Even a minute later, getting the file still fails with the same error.

(The full command sequence is consolidated into a runnable sketch after this list.)
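For convenience, here are the steps above as a single script. The `ozone sh` and `fallocate` commands are copied verbatim from the list; the node stop/start lines are assumptions that depend on your docker-compose service names:

```bash
# Reproduction sketch. Assumes a docker-compose Ozone cluster with 9 datanodes;
# the datanode service names below are placeholders.
ozone sh volume create /data
ozone sh bucket create data/test-bucket --type EC --replication rs-6-3-1024k
fallocate -l 10M small_file_1
ozone sh key put data/test-bucket/small_file_1 small_file_1 --type EC --replication rs-6-3-1024k

docker-compose stop datanode4 datanode5 datanode6 datanode7    # disable 4 datanodes
ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1    # expected failure: insufficient datanodes

docker-compose start datanode4                                 # re-enable 1 datanode (6 now available)
ozone sh key get /data/test-bucket/small_file_1 /tmp/sm_1_1    # still fails: stale cached block locations (the bug)
```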
I reproduced it via docker-compose, with fixed datanode IP addresses (this is important: docker-compose can reassign IP addresses if they are not pinned).
Why does this happen? The getKey path in Ozone Manager has a cache of block locations, and in this case that cache is stale. When we retry the read again and again, OM keeps returning a list of 5 nodes instead of 6.
I solved it by recreating the blockReader with a new blockReader built from the refreshFunction.
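To illustrate the idea, here is a minimal sketch, not the actual patch: the types (`BlockLocations`, `ReaderFactory`, `RefreshingEcBlockReader`) are hypothetical stand-ins for Ozone's block location info and EC block reader, not its real API. On a failed read, the cached locations are discarded, the refresh function re-fetches them, and the reader is rebuilt before retrying:

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Supplier;

/**
 * Sketch of the fix: on a failed EC read, discard the cached block
 * locations, re-fetch them via the refresh function, and rebuild the
 * reader before retrying, instead of reusing the stale location list.
 */
public class RefreshingEcBlockReader {

  /** Stand-in for the block location info returned by OM. */
  record BlockLocations(List<String> datanodes) { }

  /** Stand-in for a block reader bound to one fixed set of locations. */
  interface BlockReader {
    int read(byte[] buf) throws IOException;
  }

  /** Builds a fresh reader from a set of block locations. */
  interface ReaderFactory {
    BlockReader create(BlockLocations locations);
  }

  private final Supplier<BlockLocations> refreshFunction;
  private final ReaderFactory readerFactory;
  private BlockLocations cachedLocations;
  private BlockReader reader;

  RefreshingEcBlockReader(Supplier<BlockLocations> refreshFunction,
                          ReaderFactory readerFactory,
                          BlockLocations initialLocations) {
    this.refreshFunction = refreshFunction;
    this.readerFactory = readerFactory;
    this.cachedLocations = initialLocations;
    this.reader = readerFactory.create(initialLocations);
  }

  int read(byte[] buf) throws IOException {
    try {
      return reader.read(buf);
    } catch (IOException insufficientNodes) {
      // Before the fix: the retry reused cachedLocations, so a datanode
      // that came back online was never seen and the read kept failing.
      // The fix: refresh the locations and recreate the reader.
      cachedLocations = refreshFunction.get();
      reader = readerFactory.create(cachedLocations);
      return reader.read(buf);
    }
  }
}
```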
Jira link: https://issues.apache.org/jira/browse/HDDS-11261
This patch was manually tested in a docker-compose cluster.
I tried to write a robot test, but I did not find a way to stop reads from specific nodes (I put a file and then disabled 4 of 9 nodes to trigger the error, using RS-6-3-1024k replication).