Fail shard if IndexShard#storeStats runs into an IOException #29008

Closed
bleskes opened this issue Mar 13, 2018 · 8 comments
Assignees
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. good first issue low hanging fruit help wanted adoptme resiliency

Comments

@bleskes
Contributor

bleskes commented Mar 13, 2018

If that happens, something is seriously wrong with the shard's store. Currently we just bubble the exception up to the stats caller.

Relates to #18279 (comment)

bleskes added good first issue low hanging fruit help wanted adoptme resiliency :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Mar 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@clintongormley
Contributor

clintongormley commented Mar 13, 2018

@bleskes won't we keep trying to allocate shards to the emptiest mount point (i.e. the one with the failure) and almost never assign them to the other (working) mount points?

@bleskes
Contributor Author

bleskes commented Mar 13, 2018

@clintongormley we will first try another node. If we end up retrying the same node, we will indeed probably use the same mount point. If the allocation fails immediately, we will only try 5 times and then stop. If the failure is more subtle, we will probably rinse and repeat later. I agree this is not ideal, but it is consistent with how we deal with other failures. I think what you mean is a bigger problem - tracking node/path health and avoiding assigning shards to it in general. That one is a way bigger fish.

@evanvolgas

evanvolgas commented Mar 13, 2018

Does #16745 address

> we should check if we can write on the datapath before we allocate

mentioned here #18279 (comment)?

I don't think that other issue addresses what Simon is talking about there. The PR you linked is just a check at node startup, is it not? That's not really the same thing as a path-reachable check at shard allocation time.

In re:

> I think what you mean is a bigger problem - tracking node/path health and avoiding assigning shards to it in general. That one is a way bigger fish.

But isn't that exactly what #18279 is about? I'm actually a little unclear why you closed that ticket. Can you share your thinking a bit more and/or revisit #18279 and whether it makes sense to leave that ticket open as a high hanging fruit?

@milan15
Contributor

milan15 commented Mar 13, 2018

@bleskes As a beginner, I would like to take this if no one else is working on it. I hope it will help me understand the codebase better and give me experience with the commit/review process.

@milan15
Contributor

milan15 commented Mar 15, 2018

@bleskes in storeStats, I added a line that fails the shard with a message and passes along the IOException reference. Let me know if anything is wrong or missing here.
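For readers following along, the change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Elasticsearch code: `Store`, `Shard`, `failShard`, `isFailed`, and the exception messages are all hypothetical stand-ins chosen for the example. It assumes the intended behaviour is "mark the shard as failed, then still surface the error to the stats caller".

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class StoreStatsSketch {

    /** Hypothetical stand-in for the shard's store; not the Elasticsearch Store class. */
    interface Store {
        long sizeInBytes() throws IOException;
    }

    /** Hypothetical stand-in for IndexShard, reduced to what the example needs. */
    static class Shard {
        private final Store store;
        private String failureReason;   // null while the shard is healthy
        private Exception failureCause;

        Shard(Store store) {
            this.store = store;
        }

        /** Records that the shard has failed, together with the triggering exception. */
        void failShard(String reason, Exception cause) {
            this.failureReason = reason;
            this.failureCause = cause;
        }

        boolean isFailed() {
            return failureReason != null;
        }

        Exception failureCause() {
            return failureCause;
        }

        /** Returns the store size; on an IOException, fails the shard and rethrows. */
        long storeStats() {
            try {
                return store.sizeInBytes();
            } catch (IOException e) {
                // Assumption: an I/O error here means the store is unusable,
                // so the shard is failed instead of only reporting the error.
                failShard("failing shard because of exception during storeStats", e);
                throw new UncheckedIOException("io exception while building store stats", e);
            }
        }
    }

    public static void main(String[] args) {
        Shard healthy = new Shard(() -> 1024L);
        System.out.println("healthy size = " + healthy.storeStats());

        Shard broken = new Shard(() -> {
            throw new IOException("simulated disk failure");
        });
        try {
            broken.storeStats();
        } catch (UncheckedIOException e) {
            System.out.println("shard failed = " + broken.isFailed());
        }
    }
}
```

The caller still sees an exception, so existing error handling on the stats path keeps working; the only new effect in this sketch is that the shard is also marked as failed.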

@bleskes
Contributor Author

bleskes commented Mar 15, 2018

@EvanV I responded on the other issue.

@milan15 the direction looks good. Can you open a PR so we can iterate there? I would love to have some testing, but I wanted to do some research first so I can give you guidance. You'd need some kind of mock directory implementation that throws exceptions. If you open a PR, I can offer help there if needed.
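As a rough illustration of the "mock directory that throws exceptions" idea, here is a minimal sketch of a delegating wrapper with a failure switch. Everything in it is an assumption made for the example: the `Directory` interface below is a tiny hypothetical abstraction, not Lucene's `Directory` API, and `InMemoryDirectory`, `FailingDirectory`, and `totalSize` are invented names. A real test would more likely wrap an actual Lucene directory.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class FailingDirectorySketch {

    /** Hypothetical minimal directory abstraction; not Lucene's Directory API. */
    interface Directory {
        String[] listAll() throws IOException;

        long fileLength(String name) throws IOException;
    }

    /** Simple in-memory directory that only tracks file names and lengths. */
    static class InMemoryDirectory implements Directory {
        private final Map<String, Long> files = new LinkedHashMap<>();

        void addFile(String name, long length) {
            files.put(name, length);
        }

        @Override
        public String[] listAll() {
            return files.keySet().toArray(new String[0]);
        }

        @Override
        public long fileLength(String name) throws IOException {
            Long length = files.get(name);
            if (length == null) {
                throw new FileNotFoundException(name);
            }
            return length;
        }
    }

    /** Wrapper that delegates every call unless the failure switch is on. */
    static class FailingDirectory implements Directory {
        private final Directory delegate;
        private volatile boolean failing;

        FailingDirectory(Directory delegate) {
            this.delegate = delegate;
        }

        void setFailing(boolean failing) {
            this.failing = failing;
        }

        private void maybeFail(String operation) throws IOException {
            if (failing) {
                throw new IOException("simulated failure in " + operation);
            }
        }

        @Override
        public String[] listAll() throws IOException {
            maybeFail("listAll");
            return delegate.listAll();
        }

        @Override
        public long fileLength(String name) throws IOException {
            maybeFail("fileLength");
            return delegate.fileLength(name);
        }
    }

    /** Stats-like computation that touches the directory and can hit an IOException. */
    static long totalSize(Directory dir) throws IOException {
        long total = 0;
        for (String name : dir.listAll()) {
            total += dir.fileLength(name);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InMemoryDirectory base = new InMemoryDirectory();
        base.addFile("segments_1", 100L);
        FailingDirectory dir = new FailingDirectory(base);
        System.out.println("size = " + totalSize(dir));
        dir.setFailing(true);
        try {
            totalSize(dir);
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A test built on this pattern would compute stats once with the switch off, turn it on, and then assert both that the stats call throws and that the shard ends up failed.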

@milan15
Contributor

milan15 commented Mar 15, 2018

@bleskes I opened it here: #29078. Thanks for offering your help.

andrershov self-assigned this Jul 19, 2018
andrershov added a commit that referenced this issue Jul 23, 2018
Fail shard if IndexShard#storeStats runs into an IOException. Closes #29008
(cherry picked from commit 33f11e6)

6 participants