No timeouts in co-ordinator to primary and primary to replica interactions during indexing #46994
Comments
Pinging @elastic/es-distributed
This sounds similar to #40326. In particular, if one node's IO is running so slowly that it is taking tens of minutes to complete an indexing request then maybe we should be considering that node as completely unhealthy and removing it from the cluster. Failing the individual indexing request sounds like an overly weak response: this will trigger recoveries that may only increase the IO pressure on that node. See also #18417. As I asked in #40326, why is such slowness not triggering timeouts from the IO subsystem on the host?
This isn't true; the coordinating node will also start rejecting requests when it reaches capacity.
Failing the node without gathering sufficient metrics (aggregated over some period to remove outliers) might take longer. For the primary-to-replica interaction we can set timeouts based on some factor (typically 2x to 3x) of what it took the primary to process the request, and bail out by failing the shard. Failing the node could be done more authoritatively with more metrics, maybe? For log analytics, when only a few shards of a few (maybe one) indices were causing an outage, would it make more sense to fail that shard first and let better health checks fail the node instead? @DaveCTurner thoughts?
I think failing a replica if it takes just 2-3x longer than the primary to process an indexing request would be disastrous for cluster stability.
I think I have not been clear. I understand that a small GC pause on an otherwise fast request could also fail the replica. What I actually propose is the max of a high timeout and 2x to 3x of the time taken on the primary. While it might also be better not to fail the shard on the first request, it could add value if we do it after a few requests exhibiting this pattern. It's better than waiting for hours for a bulk to finish, which is what happens today.
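A minimal sketch of the rule proposed in this thread, purely illustrative: the class and constant names below are hypothetical and do not correspond to anything in Elasticsearch. It only captures the arithmetic, i.e. a replica write is considered slow once it exceeds both a generous absolute floor and a multiple of the primary's observed processing time, and only repeated slow requests would lead to failing the shard.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of the proposed replica timeout rule: the replica write
 * is only treated as slow once it has exceeded both a generous absolute floor
 * AND a multiple of the time the primary needed for the same bulk, so a brief
 * GC pause on a fast request cannot trip it on its own.
 */
final class ReplicaWriteDeadline {

    private static final Duration FLOOR = Duration.ofMinutes(2); // assumed "high timeout" floor
    private static final double PRIMARY_MULTIPLIER = 3.0;        // the "2x to 3x" factor

    /** Deadline = max(high floor, multiplier * time taken on the primary). */
    static Duration deadlineFor(Duration primaryTookTime) {
        Duration scaled = Duration.ofNanos(
            (long) (primaryTookTime.toNanos() * PRIMARY_MULTIPLIER));
        return scaled.compareTo(FLOOR) > 0 ? scaled : FLOOR;
    }

    /** Only fail the shard after several requests in a row exhibit the slow pattern. */
    static boolean shouldFailReplica(int consecutiveSlowRequests, int threshold) {
        return consecutiveSlowRequests >= threshold;
    }
}
```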
> The thing to understand here is …
I think this is the fundamental point. If a node is taking hours to process a bulk then it is surely faulty and cannot reasonably remain in the cluster. But as I keep on asking, why is this node waiting for hours to perform IO? What happened to all the other timeouts elsewhere in the system? "Waiting for hours" is very much not how local disks are expected to behave.
Closing this for lack of response on a question that was repeatedly asked. @Bukhtawar, if you have the requested information, please add it in a comment and we can look at re-opening this issue.
Elasticsearch does not have any timeouts/deadlines for the remote interactions in the indexing path. This causes resource wastage on ES nodes for requests that are retried on the client. In addition, even though primaries and replicas have gating mechanisms to prevent resource wastage via thread pools and queues, the co-ordinating node just keeps on accepting requests and hogs a huge amount of memory.
I understand that choosing the right timeout value is tricky for indexing, especially in the primary-replica interaction, since a failure (timeout) results in a shard allocation failure, which in turn causes data movement across nodes.
In synchronous replication systems like ES, such failures are handled via strong health checks, since there is no other way of eliminating a node. ES only has a ping-based health check and none of the more advanced adaptive metrics such as bulk request successes/failures/timeouts, or network congestion/TCP retransmits for that matter.
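As a purely illustrative sketch of what such adaptive metrics could look like, assuming a hypothetical sliding window over bulk outcomes (none of these class names exist in Elasticsearch): a node would only be considered unhealthy once the failure/timeout ratio stays high across a whole window, which filters out one-off outliers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window aggregator for bulk outcomes on a node,
 * intended as an input signal to a richer health check.
 */
final class BulkOutcomeWindow {

    enum Outcome { SUCCESS, FAILURE, TIMEOUT }

    private final int windowSize;
    private final Deque<Outcome> outcomes = new ArrayDeque<>();

    BulkOutcomeWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    synchronized void record(Outcome outcome) {
        outcomes.addLast(outcome);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // keep only the most recent window
        }
    }

    /** True once the window is full and most requests in it failed or timed out. */
    synchronized boolean looksUnhealthy(double badRatioThreshold) {
        if (outcomes.size() < windowSize) {
            return false;
        }
        long bad = outcomes.stream().filter(o -> o != Outcome.SUCCESS).count();
        return (double) bad / outcomes.size() >= badRatioThreshold;
    }
}
```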
I am proposing to add a deadline for the entire lifecycle of an indexing request. It can be a very defensive timeout to start with. I have noticed requests in our system succeeding after 30 minutes due to slow disks/network congestion across 2 nodes. In addition, we should strengthen the ES health checks by taking indexing request outcomes, bulk indexing queue depths, etc. into consideration.
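A rough sketch of the lifecycle-deadline idea, assuming the in-flight bulk is represented as a `CompletableFuture`; this is hypothetical wiring, not how Elasticsearch currently plumbs its transport layer: one deadline is attached when the co-ordinating node accepts the bulk, and whatever stage is still running when it expires completes exceptionally instead of holding memory for tens of minutes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch of a defensive, end-to-end deadline for a bulk request.
 */
final class BulkLifecycleDeadline {

    /**
     * Attach a single deadline covering the whole request lifecycle.
     * orTimeout completes the future with a TimeoutException once the deadline
     * passes, so client-side retries no longer leave the original work pinned
     * on the co-ordinating node indefinitely.
     */
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> bulkInFlight,
                                                 long deadlineMinutes) {
        return bulkInFlight.orTimeout(deadlineMinutes, TimeUnit.MINUTES);
    }

    /** Distinguish a deadline expiry from other failures when handling the result. */
    static boolean isDeadlineExceeded(Throwable t) {
        return t instanceof TimeoutException;
    }
}
```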
Thoughts?