
No timeouts in co-ordinator to primary and primary to replica interactions during indexing #46994

Closed
itiyamas opened this issue Sep 24, 2019 · 8 comments
Labels
:Distributed Indexing/Distributed, feedback_needed

Comments

@itiyamas

Elasticsearch does not have any timeouts/deadlines for the remote interactions in the indexing path. This wastes resources on ES nodes for requests that the client has already retried. In addition, even though primaries and replicas have gating mechanisms (thread pools and queues) to prevent resource wastage, the co-ordinating node just keeps on accepting requests and hogs a huge amount of memory.

I understand that choosing the right timeout value is tricky for indexing, especially in the primary-replica interaction, since a failure (timeout) results in a shard allocation failure, which in turn results in data movement across nodes.

In synchronous replication systems like ES, such failures are handled via strong health checks, since there is no other way of eliminating a node. ES just has a ping-based health check and none of the more advanced adaptive metrics such as bulk request successes/failures/timeouts, or network congestion/TCP retransmits for that matter.

I am proposing to add a deadline for the entire lifecycle of the indexing request. It can be a very defensive timeout to start with. I have noticed requests in our system succeeding after 30 minutes due to slow disk/network congestion across 2 nodes. In addition to this, we should strengthen the ES health checks by taking the indexing requests, bulk indexing queues etc. into consideration.
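A minimal sketch of what such a per-request deadline could look like, assuming a hypothetical helper class (this is not existing Elasticsearch code, just an illustration of the idea):

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch only (not an existing Elasticsearch API): a deadline attached to an
 * indexing request when the co-ordinating node accepts it, checked before each remote hop
 * (co-ordinator to primary, primary to replica).
 */
final class IndexingDeadline {
    private final long deadlineNanos;

    private IndexingDeadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Start a deadline of the given duration from now, e.g. a defensive 10 minutes. */
    static IndexingDeadline startingNow(long timeout, TimeUnit unit) {
        return new IndexingDeadline(System.nanoTime() + unit.toNanos(timeout));
    }

    /** Remaining budget for downstream hops; zero or negative means stop forwarding. */
    long remainingMillis() {
        return TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
    }

    boolean expired() {
        return remainingMillis() <= 0;
    }
}
```

Once `expired()` returns true, the co-ordinating node would fail the request back to the client with a retryable error instead of forwarding it, and the remaining budget would be carried along on the primary-to-replica hop.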

Thoughts?

@itiyamas itiyamas changed the title No timeouts in co-ordinator to primary and primary to replica interactions No timeouts in co-ordinator to primary and primary to replica interactions during indexing Sep 24, 2019
@alpar-t alpar-t added the :Distributed Indexing/Distributed label Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

This sounds similar to #40326. In particular, if one node's IO is running so slowly that it is taking tens of minutes to complete an indexing request then maybe we should be considering that node as completely unhealthy and removing it from the cluster. Failing the individual indexing request sounds like an overly weak response: this will trigger recoveries that may only increase the IO pressure on that node. See also #18417.

As I asked in #40326, why is such slowness not triggering timeouts from the IO subsystem on the host?

the co-ordinating node just keeps on accepting requests

This isn't true; the coordinating node will also start rejecting requests when it reaches capacity.

@Bukhtawar
Contributor

Failing the node without gathering sufficient metrics (aggregated over some period to remove outliers) might take longer. For the primary-to-replica interaction, we can set a timeout based on some factor (typically 2x to 3x) of the time the primary took to process the request, and bail out of the request by failing the shard. Failing the node could be done more authoritatively with more metrics, maybe? For log analytics, when only a few shards of a few (maybe 1) indices were causing an outage, would it make more sense to fail that shard first and let better health checks fail the node instead? @DaveCTurner Thoughts??

@DaveCTurner
Contributor

I think failing a replica if it takes just 2-3x longer than the primary to process an indexing request would be disastrous for cluster stability.

@Bukhtawar
Contributor

I think I have not been clear. I understand that a small GC pause on an otherwise fast request could also fail the replica. What I actually propose is the maximum of a high timeout and 2x to 3x of the time taken on the primary. While it might also be better not to fail the shard on the first request, it could add value if we do it after a few requests exhibit this pattern. It's better than waiting for hours for a bulk to finish, which is what happens today.
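As a rough sketch of that policy (the class and values below are illustrative assumptions, not existing Elasticsearch settings), the replica-write timeout would be the larger of a generous fixed floor and a small multiple of the time the primary actually took:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the proposal above: the replica-write budget is
 * max(high fixed floor, factor * time taken on the primary).
 */
final class ReplicaTimeoutPolicy {
    private final long floorMillis;   // deliberately high floor, e.g. 60s (assumed value)
    private final double factor;      // e.g. 2x to 3x of the primary's processing time

    ReplicaTimeoutPolicy(long floorMillis, double factor) {
        this.floorMillis = floorMillis;
        this.factor = factor;
    }

    long timeoutMillisFor(long primaryTookMillis) {
        return Math.max(floorMillis, (long) (factor * primaryTookMillis));
    }

    public static void main(String[] args) {
        ReplicaTimeoutPolicy policy = new ReplicaTimeoutPolicy(TimeUnit.SECONDS.toMillis(60), 3.0);
        // A fast primary (200ms) still gets the 60s floor, so a brief GC pause cannot fail the replica.
        System.out.println(policy.timeoutMillisFor(200));     // 60000
        // A slow primary (10 minutes) raises the replica budget proportionally.
        System.out.println(policy.timeoutMillisFor(600_000)); // 1800000
    }
}
```

The floor is what keeps a momentary pause on a fast request from tripping the timeout; the multiplier only matters once the primary itself is already slow.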

@Bukhtawar
Contributor

The things to understand here are:

  1. Failing a single shard fast is better than failing the entire node, precisely because of the recovery overhead node failure might trigger. This can be an important use case for log analytics. When to fail fast is something that can be tuned/configured/observed (a few sampled requests); see the sketch after this list.
  2. Failing a node would need some observable metrics to say authoritatively that there are obvious issues with the node, and that removing it is actually better even at the cost of the recovery overhead. This might take longer to decide.
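A minimal sketch of point 1, assuming a hypothetical tracker and a tunable threshold (neither exists in Elasticsearch today): fail the shard copy only after several consecutive replica writes exceed the computed timeout, rather than on the first slow request.

```java
/**
 * Hypothetical sketch: count consecutive replica writes that exceeded the timeout
 * and only signal a shard failure once a configurable threshold is reached.
 */
final class SlowReplicaTracker {
    private final int threshold;          // assumed tunable, e.g. 3 consecutive slow writes
    private int consecutiveSlowWrites;

    SlowReplicaTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Returns true once enough consecutive writes have been slow to justify failing the shard. */
    synchronized boolean recordWrite(boolean exceededTimeout) {
        consecutiveSlowWrites = exceededTimeout ? consecutiveSlowWrites + 1 : 0;
        return consecutiveSlowWrites >= threshold;
    }
}
```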

@DaveCTurner
Contributor

It's better than waiting for hours on bulk to finish which is what it is today.

I think this is the fundamental point. If a node is taking hours to process a bulk then it is surely faulty and cannot reasonably remain in the cluster. But as I keep on asking, why is this node waiting for hours to perform IO? What happened to all the other timeouts elsewhere in the system? "Waiting for hours" is very much not how local disks are expected to behave.

@astefan
Contributor

astefan commented Dec 5, 2019

Closing this for lack of response on a question that was repeatedly asked. @Bukhtawar if you have the requested information please add it in a comment and we can look at re-opening this issue.

@astefan astefan closed this as completed Dec 5, 2019