No timeouts in co-ordinator to primary and primary to replica interactions during indexing #46994
Comments
Pinging @elastic/es-distributed
This sounds similar to #40326. In particular, if one node's IO is running so slowly that it is taking tens of minutes to complete an indexing request then maybe we should be considering that node as completely unhealthy and removing it from the cluster. Failing the individual indexing request sounds like an overly weak response: this will trigger recoveries that may only increase the IO pressure on that node. See also #18417. As I asked in #40326, why is such slowness not triggering timeouts from the IO subsystem on the host?
This isn't true; the coordinating node will also start rejecting requests when it reaches capacity.
Failing the node without gathering sufficient metrics (aggregated over some period to remove outliers) might take longer. For the primary-to-replica interaction we can set timeouts based on some factor (typically 2x to 3x) of what it took the primary to process the request, and bail out by failing the shard. Failing the node could be done more authoritatively with more metrics, maybe? For log analytics, when only a few shards of a few (maybe one) indices were causing an outage, would it make more sense to fail that shard first and let better health checks fail the node instead? @DaveCTurner thoughts?
I think failing a replica if it takes just 2-3x longer than the primary to process an indexing request would be disastrous for cluster stability.
I think I have not been clear. I understand that a small GC pause on an otherwise fast request could also fail the replica. What I actually propose is the max of a high timeout and 2x to 3x of the time taken on the primary. While it might also be better not to fail the shard on the first request, it could add value if we do it after a few requests exhibiting this pattern. It's better than waiting for hours for a bulk to finish, which is what happens today.
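A minimal sketch of the rule proposed in this thread, purely illustrative: the class and constant names below are hypothetical and do not correspond to anything in Elasticsearch. It only captures the arithmetic, i.e. a replica write is considered slow once it exceeds both a generous absolute floor and a multiple of the primary's observed processing time, and only repeated slow requests would lead to failing the shard.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of the proposed replica timeout rule: the replica write
 * is only treated as slow once it has exceeded both a generous absolute floor
 * AND a multiple of the time the primary needed for the same bulk, so a brief
 * GC pause on a fast request cannot trip it on its own.
 */
final class ReplicaWriteDeadline {

    private static final Duration FLOOR = Duration.ofMinutes(2); // assumed "high timeout" floor
    private static final double PRIMARY_MULTIPLIER = 3.0;        // the "2x to 3x" factor

    /** Deadline = max(high floor, multiplier * time taken on the primary). */
    static Duration deadlineFor(Duration primaryTookTime) {
        Duration scaled = Duration.ofNanos(
            (long) (primaryTookTime.toNanos() * PRIMARY_MULTIPLIER));
        return scaled.compareTo(FLOOR) > 0 ? scaled : FLOOR;
    }

    /** Only fail the shard after several requests in a row exhibit the slow pattern. */
    static boolean shouldFailReplica(int consecutiveSlowRequests, int threshold) {
        return consecutiveSlowRequests >= threshold;
    }
}
```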
> The thing to understand here is …
I think this is the fundamental point. If a node is taking hours to process a bulk then it is surely faulty and cannot reasonably remain in the cluster. But as I keep on asking, why is this node waiting for hours to perform IO? What happened to all the other timeouts elsewhere in the system? "Waiting for hours" is very much not how local disks are expected to behave.
Closing this for lack of response on a question that was repeatedly asked. @Bukhtawar, if you have the requested information, please add it in a comment and we can look at re-opening this issue.
Elasticsearch does not have any timeouts/deadlines for the remote interactions in the indexing path. This causes resource wastage on ES nodes for requests that are retried on the client. In addition, even though primaries and replicas have gating mechanisms to prevent resource wastage via thread pools and queues, the co-ordinating node just keeps on accepting requests and hogs a huge amount of memory.
I understand that choosing the right timeout value is tricky for indexing, especially in the primary-replica interaction, since a failure (timeout) results in a shard allocation failure, which in turn causes data movement across nodes.
In synchronous replication systems like ES, such failures are handled via strong health checks, since there is no other way of eliminating a node. ES only has a ping-based health check and none of the more advanced adaptive metrics such as bulk request successes/failures/timeouts, or network congestion/TCP retransmits for that matter.
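As a purely illustrative sketch of what such adaptive metrics could look like, assuming a hypothetical sliding window over bulk outcomes (none of these class names exist in Elasticsearch): a node would only be considered unhealthy once the failure/timeout ratio stays high across a whole window, which filters out one-off outliers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window aggregator for bulk outcomes on a node,
 * intended as an input signal to a richer health check.
 */
final class BulkOutcomeWindow {

    enum Outcome { SUCCESS, FAILURE, TIMEOUT }

    private final int windowSize;
    private final Deque<Outcome> outcomes = new ArrayDeque<>();

    BulkOutcomeWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    synchronized void record(Outcome outcome) {
        outcomes.addLast(outcome);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // keep only the most recent window
        }
    }

    /** True once the window is full and most requests in it failed or timed out. */
    synchronized boolean looksUnhealthy(double badRatioThreshold) {
        if (outcomes.size() < windowSize) {
            return false;
        }
        long bad = outcomes.stream().filter(o -> o != Outcome.SUCCESS).count();
        return (double) bad / outcomes.size() >= badRatioThreshold;
    }
}
```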
I am proposing to add a deadline for the entire lifecycle of an indexing request. It can be a very defensive timeout to start with. I have noticed requests in our system succeeding after 30 minutes due to slow disks/network congestion across 2 nodes. In addition, we should strengthen the ES health checks by taking indexing request outcomes, bulk indexing queue depths, etc. into consideration.
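A rough sketch of the lifecycle-deadline idea, assuming the in-flight bulk is represented as a `CompletableFuture`; this is hypothetical wiring, not how Elasticsearch currently plumbs its transport layer: one deadline is attached when the co-ordinating node accepts the bulk, and whatever stage is still running when it expires completes exceptionally instead of holding memory for tens of minutes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch of a defensive, end-to-end deadline for a bulk request.
 */
final class BulkLifecycleDeadline {

    /**
     * Attach a single deadline covering the whole request lifecycle.
     * orTimeout completes the future with a TimeoutException once the deadline
     * passes, so client-side retries no longer leave the original work pinned
     * on the co-ordinating node indefinitely.
     */
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> bulkInFlight,
                                                 long deadlineMinutes) {
        return bulkInFlight.orTimeout(deadlineMinutes, TimeUnit.MINUTES);
    }

    /** Distinguish a deadline expiry from other failures when handling the result. */
    static boolean isDeadlineExceeded(Throwable t) {
        return t instanceof TimeoutException;
    }
}
```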
Thoughts?