Support speculative / failover querying to peers #954

Closed
shanson7 opened this issue Jun 27, 2018 · 1 comment

Comments

@shanson7
Collaborator

Many operations in MetricTank reach out to all peers to assemble a complete dataset. In many cases, this means we end up waiting on a small subset of peers that respond significantly more slowly than the rest.

Note: shard group = Set of nodes handling the same partitions
Speculative querying involves issuing additional queries to shard group peers when it seems that some instances are slow to respond. You can see a (very) rough cut of this here
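For illustration, here is a (very) rough sketch of the idea in Go. This is not the code from the branch linked above; Result, Peer, ShardGroup, Query, and the threshold parameter are stand-ins I'm using to show the shape of the fan-out: query one peer per shard group, and once a configurable fraction of groups has answered, re-issue the stragglers against an alternate peer in the same shard group.

```go
package cluster

import "context"

// Sketch-only stand-ins for the real MetricTank types: a Peer is anything we
// can query over HTTP, and a shard group is a set of peers handling the same
// partitions (here reduced to a primary and one alternate).
type Result struct{ Body []byte }

type Peer interface {
	Query(ctx context.Context) (Result, error)
}

type ShardGroup struct {
	Primary, Alternate Peer
}

// speculativeFanout queries the primary peer of every shard group. Once
// `threshold` (e.g. 0.95) of the groups have answered, it re-issues the
// still-outstanding groups against their alternate peer; the first successful
// response per group wins. A failed primary also fails over to the alternate
// immediately. Timeouts are expected to surface as errors via ctx.
func speculativeFanout(ctx context.Context, groups []ShardGroup, threshold float64) ([]Result, error) {
	type resp struct {
		group int
		res   Result
		err   error
	}
	out := make(chan resp, 2*len(groups)) // at most 2 requests per group

	ask := func(i int, p Peer) {
		go func() {
			r, err := p.Query(ctx)
			out <- resp{group: i, res: r, err: err}
		}()
	}
	for i, g := range groups {
		ask(i, g.Primary)
	}

	results := make([]*Result, len(groups))
	failures := make([]int, len(groups))
	speculated := make([]bool, len(groups))
	done := 0

	for done < len(groups) {
		r := <-out
		switch {
		case results[r.group] != nil:
			// the other peer already answered this group; drop the duplicate
		case r.err != nil:
			failures[r.group]++
			if failures[r.group] == 2 {
				return nil, r.err // both peers for this group failed
			}
			if !speculated[r.group] {
				// the primary failed before we speculated: fail over now
				speculated[r.group] = true
				ask(r.group, groups[r.group].Alternate)
			}
		default:
			results[r.group] = &r.res
			done++
			// enough groups have answered: speculate on the stragglers
			if float64(done) >= threshold*float64(len(groups)) {
				for i := range groups {
					if results[i] == nil && !speculated[i] {
						speculated[i] = true
						ask(i, groups[i].Alternate)
					}
				}
			}
		}
	}

	final := make([]Result, len(groups))
	for i, r := range results {
		final[i] = *r
	}
	return final, nil
}
```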

Yesterday, on our cluster (120 shard groups, ~50 render requests / sec):

- Completed 441,342,750 find_by_tag queries
- Issued 4,808,309 speculative find_by_tag queries (1.1%)
- 4,574,610 speculative queries came back faster than the non-speculative (95.1%)

So, at the cost of issuing 1.1% more HTTP queries to peers, 95.1% of those speculative queries returned their shard group's data more quickly than the original request (sorry, I don't have good data on how much more quickly).

The main parameter to tune speculative querying is the percentage of responses to receive before issuing speculative requests (set to 95% on our cluster). Higher values reduce the number of speculative queries, at the cost of possibly waiting longer than necessary when there are multiple slow peers.
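To make the threshold concrete, here is a tiny helper (mine, not MetricTank's; it belongs with the sketch above and just needs "math" imported there). With our 120 shard groups and a 95% threshold, speculative requests start once 114 groups have answered, covering the remaining 6.

```go
// speculateAfter returns how many shard-group responses must arrive before
// speculative requests go out to the remaining groups.
// e.g. speculateAfter(120, 0.95) == 114, so the last 6 groups would each
// get a speculative query. (Add "math" to the imports of the sketch above.)
func speculateAfter(totalGroups int, threshold float64) int {
	n := int(math.Ceil(threshold * float64(totalGroups)))
	if n > totalGroups {
		n = totalGroups
	}
	return n
}
```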

Some open questions after my initial round of implementation:

  1. The current implementation still handles the local shard group locally. Should that be updated to allow speculative querying for the local node's shard group? Right now, speculative querying doesn't work well if the local node is undergoing GC.
  • If so, should the local query be an HTTP request or some function passed in?

  2. The current implementation assumes that all members of a particular shard group handle exactly the same partitions, i.e. there is no scenario where peer A handles partitions 0-5 and peer B handles partitions 3-7. The existing MembersForQuery function seems to somewhat handle this, but I'm not sure it has been tested in the wild (e.g. deduping the overlap?). Should this be formalized, or should MembersForSpeculativeQuery do something similar to MembersForQuery?

  3. Metrics! I figure it would be useful to know:
    a) How many speculative queries were issued.
    b) How many times speculation improved response time (not necessarily how many speculative queries were faster than their non-speculative counterpart, as I have today). Are there others?

  4. Right now I just abandon the outstanding requests. I imagine I should cancel them properly (see the sketch after this list).
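On point 4, one possible pattern in Go (a sketch only, reusing the hypothetical Peer/Result types from the sketch above, not the code in the branch) is to give the racing requests a shared cancellable context and cancel it as soon as one response wins, so the slower peer's HTTP request is actually torn down instead of being abandoned. For brevity both requests are fired immediately here; the same cancellation applies when the alternate is only fired after the speculation threshold is hit.

```go
// raceGroup queries two peers of one shard group and cancels whichever
// request loses the race. Peer.Query must honour ctx cancellation (e.g. by
// building its HTTP request with http.NewRequestWithContext).
func raceGroup(ctx context.Context, primary, alternate Peer) (Result, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tears down the outstanding (losing) request on return

	type resp struct {
		res Result
		err error
	}
	out := make(chan resp, 2)
	query := func(p Peer) {
		r, err := p.Query(ctx)
		out <- resp{r, err}
	}
	go query(primary)
	go query(alternate)

	var firstErr error
	for i := 0; i < 2; i++ {
		r := <-out
		if r.err == nil {
			return r.res, nil // first success wins; defer cancel() stops the other
		}
		if firstErr == nil {
			firstErr = r.err
		}
	}
	return Result{}, firstErr
}
```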

@shanson7 mentioned this issue Jun 28, 2018
@shanson7
Collaborator Author

Fixed by #956
