Support speculative / failover querying to peers #954

Closed
shanson7 opened this issue Jun 27, 2018 · 1 comment

Comments

@shanson7
Collaborator

Many operations in MetricTank reach out to all peers to assemble a complete dataset. In many cases, this means we end up waiting on a small subset of peers that respond significantly more slowly than the rest.

Note: shard group = Set of nodes handling the same partitions
Speculative querying involves issuing additional queries to shard group peers when it seems that some instances are slow to respond. You can see a (very) rough cut of this here
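For illustration, here is a (very) rough sketch of the idea in Go. This is not the code from the branch linked above; Result, Peer, ShardGroup, Query, and the threshold parameter are stand-ins I'm using to show the shape of the fan-out: query one peer per shard group, and once a configurable fraction of groups has answered, re-issue the stragglers against an alternate peer in the same shard group.

```go
package cluster

import "context"

// Sketch-only stand-ins for the real MetricTank types: a Peer is anything we
// can query over HTTP, and a shard group is a set of peers handling the same
// partitions (here reduced to a primary and one alternate).
type Result struct{ Body []byte }

type Peer interface {
	Query(ctx context.Context) (Result, error)
}

type ShardGroup struct {
	Primary, Alternate Peer
}

// speculativeFanout queries the primary peer of every shard group. Once
// `threshold` (e.g. 0.95) of the groups have answered, it re-issues the
// still-outstanding groups against their alternate peer; the first successful
// response per group wins. A failed primary also fails over to the alternate
// immediately. Timeouts are expected to surface as errors via ctx.
func speculativeFanout(ctx context.Context, groups []ShardGroup, threshold float64) ([]Result, error) {
	type resp struct {
		group int
		res   Result
		err   error
	}
	out := make(chan resp, 2*len(groups)) // at most 2 requests per group

	ask := func(i int, p Peer) {
		go func() {
			r, err := p.Query(ctx)
			out <- resp{group: i, res: r, err: err}
		}()
	}
	for i, g := range groups {
		ask(i, g.Primary)
	}

	results := make([]*Result, len(groups))
	failures := make([]int, len(groups))
	speculated := make([]bool, len(groups))
	done := 0

	for done < len(groups) {
		r := <-out
		switch {
		case results[r.group] != nil:
			// the other peer already answered this group; drop the duplicate
		case r.err != nil:
			failures[r.group]++
			if failures[r.group] == 2 {
				return nil, r.err // both peers for this group failed
			}
			if !speculated[r.group] {
				// the primary failed before we speculated: fail over now
				speculated[r.group] = true
				ask(r.group, groups[r.group].Alternate)
			}
		default:
			results[r.group] = &r.res
			done++
			// enough groups have answered: speculate on the stragglers
			if float64(done) >= threshold*float64(len(groups)) {
				for i := range groups {
					if results[i] == nil && !speculated[i] {
						speculated[i] = true
						ask(i, groups[i].Alternate)
					}
				}
			}
		}
	}

	final := make([]Result, len(groups))
	for i, r := range results {
		final[i] = *r
	}
	return final, nil
}
```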

Yesterday, on our cluster (120 shard groups, ~50 render requests / sec):

- Completed 441,342,750 find_by_tag queries
- Issued 4,808,309 speculative find_by_tag queries (1.1%)
- 4,574,610 speculative queries came back faster than the non-speculative (95.1%)

So, at the cost of issuing 1.1% more HTTP queries to peers, 95.1% of those speculative queries returned their shard group's data more quickly than the original request (sorry, I don't have good data on how much more quickly).

The main parameter to tune speculative querying is the percentage of responses to receive before issuing speculative requests (set to 95% on our cluster). Higher values reduce the number of speculative queries, at the cost of possibly waiting longer than necessary when there are multiple slow peers.
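To make the threshold concrete, here is a tiny helper (mine, not MetricTank's; it belongs with the sketch above and just needs "math" imported there). With our 120 shard groups and a 95% threshold, speculative requests start once 114 groups have answered, covering the remaining 6.

```go
// speculateAfter returns how many shard-group responses must arrive before
// speculative requests go out to the remaining groups.
// e.g. speculateAfter(120, 0.95) == 114, so the last 6 groups would each
// get a speculative query. (Add "math" to the imports of the sketch above.)
func speculateAfter(totalGroups int, threshold float64) int {
	n := int(math.Ceil(threshold * float64(totalGroups)))
	if n > totalGroups {
		n = totalGroups
	}
	return n
}
```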

Some open questions after my initial round of implementation:

  1. The current implementation still handles the local shard group locally. Should that be updated to allow speculative querying for the local node's shard group? Right now, speculative querying doesn't work well if the local node is undergoing GC.
  • If so, should the local query be an HTTP request or some function passed in?

  2. The current implementation assumes that all members of a particular shard group handle exactly the same partitions, i.e. there is no scenario where peer A handles partitions 0-5 and peer B handles partitions 3-7. The existing MembersForQuery function seems to somewhat handle this, but I'm not sure it has been tested in the wild (e.g. deduping the overlap?). Should this be formalized, or should MembersForSpeculativeQuery do something similar to MembersForQuery?

  3. Metrics! I figure it would be useful to know:
    a) How many speculative queries were issued.
    b) How many times speculation improved response time (not necessarily how many speculative queries were faster than their non-speculative counterpart, as I have today). Are there others?

  4. Right now I just abandon the outstanding requests. I imagine I should cancel them properly (see the sketch after this list).
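On point 4, one possible pattern in Go (a sketch only, reusing the hypothetical Peer/Result types from the sketch above, not the code in the branch) is to give the racing requests a shared cancellable context and cancel it as soon as one response wins, so the slower peer's HTTP request is actually torn down instead of being abandoned. For brevity both requests are fired immediately here; the same cancellation applies when the alternate is only fired after the speculation threshold is hit.

```go
// raceGroup queries two peers of one shard group and cancels whichever
// request loses the race. Peer.Query must honour ctx cancellation (e.g. by
// building its HTTP request with http.NewRequestWithContext).
func raceGroup(ctx context.Context, primary, alternate Peer) (Result, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tears down the outstanding (losing) request on return

	type resp struct {
		res Result
		err error
	}
	out := make(chan resp, 2)
	query := func(p Peer) {
		r, err := p.Query(ctx)
		out <- resp{r, err}
	}
	go query(primary)
	go query(alternate)

	var firstErr error
	for i := 0; i < 2; i++ {
		r := <-out
		if r.err == nil {
			return r.res, nil // first success wins; defer cancel() stops the other
		}
		if firstErr == nil {
			firstErr = r.err
		}
	}
	return Result{}, firstErr
}
```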

@shanson7 mentioned this issue Jun 28, 2018
@shanson7
Collaborator Author

Fixed by #956
