Prevent allocating shards to broken nodes #18417
Comments
@ywelsch I think we can approach this from multiple directions.
I think in all of these cases simplicity wins over complex state... my $0.05
Today, if a shard fails during the initialization phase due to misconfiguration, broken disks, missing analyzers, plugins that are not installed, etc., Elasticsearch keeps trying to initialize, or rather allocate, that shard. In the worst case this ends in an endless allocation loop. To prevent this loop and all its side effects, such as spamming log files over and over again, this commit adds an allocation decider that stops allocating a shard that has failed to allocate more than N times in a row. The number of retries can be configured via `index.allocation.max_retry`; its default is `5`. Once the setting is updated, shards with fewer failures than the number set per index will be allowed to allocate again. Internally we maintain a counter on the UnassignedInfo that is reset to `0` once the shard has been started. Relates to elastic#18417
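The counting logic in the commit message can be illustrated with a minimal, self-contained sketch. This is not the Elasticsearch implementation: the class `MaxRetrySketch`, its method names, and the use of a plain string as shard identifier are hypothetical, and the counter is kept in a map here rather than on `UnassignedInfo`. It assumes the reading that a shard may allocate only while its consecutive failure count is below the configured maximum.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a max-retry allocation decider. */
final class MaxRetrySketch {
    /** Consecutive failed allocation attempts per shard (assumption: keyed by a string id). */
    private final Map<String, Integer> failedAllocations = new HashMap<>();

    /** Plays the role of index.allocation.max_retry; the commit message gives 5 as the default. */
    private int maxRetry;

    MaxRetrySketch(int maxRetry) {
        this.maxRetry = maxRetry;
    }

    /** Record one more failed allocation attempt for the shard. */
    void onAllocationFailed(String shardId) {
        failedAllocations.merge(shardId, 1, Integer::sum);
    }

    /** A successful start resets the counter to 0. */
    void onShardStarted(String shardId) {
        failedAllocations.remove(shardId);
    }

    /** Raising the limit lets shards with fewer failures than the new value allocate again. */
    void updateMaxRetry(int newMaxRetry) {
        this.maxRetry = newMaxRetry;
    }

    /** Allow allocation only while the failure count is below the limit. */
    boolean canAllocate(String shardId) {
        return failedAllocations.getOrDefault(shardId, 0) < maxRetry;
    }
}
```

With a limit of 5, five consecutive calls to `onAllocationFailed` for the same shard make `canAllocate` return `false`; raising the limit or a successful start makes it return `true` again.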
Pinging @elastic/es-distributed
We have another, non-trivial instance of this in shard fetching. When it hard fails on a node (rather than succeeding by finding a broken copy), we currently redo the fetching. This is an easy way around networking issues but can be poisonous on disk failures (for example).
We would rather remove the broken node from the cluster than accept failed allocation(s).
Allocating shards to a node can fail for various reasons. When an allocation fails, we currently ignore the node for that shard during the next allocation round. However, this means that:
This is particularly bad if the node is permanently broken, leading to a never-ending series of failed allocations. Ultimately this affects the stability of the cluster.