Fail shard if IndexShard#storeStats runs into an IOException #29008
Comments
Pinging @elastic/es-distributed
@bleskes won't we keep trying to allocate shards to the emptiest mount point (i.e. the one with the failure) and almost never assign to the other (working) mount points?
@clintongormley we will first try another node. If we end up retrying the node, we will indeed probably use the same mount point. If it fails immediately, we will only try 5 times and stop. If the failure is more subtle, we will probably rinse and repeat later. I agree this is not ideal, but it is consistent with how we deal with other failures. I think what you mean is a bigger problem: tracking node/path health and avoiding assigning shards to an unhealthy path in general. That one is a way bigger fish.
Does #16745 address
I don't think that other issue addresses what Simon's talking about there. The PR you linked is just a node launch start up check, is it not? That's not really the same thing as a path reachable check on shard allocation. Inre
But isn't that exactly what #18279 is about? I'm actually a little unclear why you closed that ticket. Can you share your thinking a bit more and/or revisit #18279 and whether it makes sense to leave that ticket open as a high hanging fruit?
@bleskes As a beginner, if no one is working on this, I would like to try and take it. I hope it will improve my understanding and give me experience with the commit/review process.
@bleskes In storeStats, I added a line that fails the shard with a message and passes along the IOException reference. Let me know if anything is wrong or missing here.
@EvanV I responded on the other issue. @milan15 the direction looks good. Can you open a PR so we can iterate there? I would love to have some testing, but I wanted to first do some research to give you guidance. You'd need some kind of mock directory implementation that throws exceptions. If you open a PR, I can offer help there if needed.
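For readers following the thread, here is a minimal, self-contained sketch of the direction discussed above: catch the IOException in the stats call, mark the shard as failed, and still surface the error to the caller. It is not the actual Elasticsearch code. `StatsSource`, `SketchShard`, and their methods are hypothetical stand-ins for the real store and `IndexShard`, and the lambda in the usage note plays the role of the exception-throwing mock directory suggested for testing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the store whose stats call may hit a disk error.
interface StatsSource {
    long sizeInBytes() throws IOException;
}

// Hypothetical, simplified shard. This is NOT the real IndexShard API.
class SketchShard {
    private final StatsSource store;
    // Holds the first failure reason, or null while the shard is healthy.
    private final AtomicReference<String> failureReason = new AtomicReference<>();

    SketchShard(StatsSource store) {
        this.store = store;
    }

    long storeStats() {
        try {
            return store.sizeInBytes();
        } catch (IOException e) {
            // Instead of only bubbling the exception up, fail the shard first.
            failShard("failing shard because of an exception during storeStats", e);
            throw new UncheckedIOException(e);
        }
    }

    void failShard(String reason, Exception cause) {
        // Record only the first failure; later failures keep the original reason.
        failureReason.compareAndSet(null, reason + ": " + cause.getMessage());
    }

    boolean isFailed() {
        return failureReason.get() != null;
    }

    String failureReason() {
        return failureReason.get();
    }
}
```

A test along the lines suggested above could pass `() -> { throw new IOException("disk gone"); }` as the `StatsSource`, call `storeStats()`, and check both that an `UncheckedIOException` is thrown and that `isFailed()` returns true afterwards.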
If that happens something is totally wrong. We currently just bubble up the exception to the stats caller.
Relates to #18279 (comment)