Many operations in MetricTank reach out to all peers to assemble a complete dataset. In many cases this means we end up waiting on a small subset of peers that are significantly slower than the rest.
Note: shard group = set of nodes handling the same partitions.
Speculative querying involves issuing additional queries to shard group peers when it seems that some instances are slow to respond. You can see a (very) rough cut of this here.
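To make the pattern concrete, here's a minimal Go sketch of the idea. None of these names (Peer, queryPeer, querySpeculative) come from the actual code; it just shows the shape of "ask one member of the shard group, and if it looks slow, ask another and take whichever answers first":

```go
// Rough sketch only: "Peer", "queryPeer" etc. are made-up stand-ins, not
// types or functions from the MetricTank codebase.
package speculate

import "context"

type Peer struct {
	Name       string
	ShardGroup int
}

type result struct {
	peer Peer
	data string
	err  error
}

// querySpeculative asks `primary` for its shard group's data. If the caller
// signals (by closing `speculate`) that primary is taking too long relative
// to the other shard groups, the same request is also sent to `backup`, and
// whichever response arrives first wins.
func querySpeculative(ctx context.Context, primary, backup Peer,
	speculate <-chan struct{},
	queryPeer func(context.Context, Peer) (string, error)) result {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon whichever request is still in flight

	results := make(chan result, 2)
	launch := func(p Peer) {
		go func() {
			data, err := queryPeer(ctx, p)
			results <- result{peer: p, data: data, err: err}
		}()
	}

	launch(primary)
	select {
	case r := <-results:
		return r // primary answered before we had a reason to speculate
	case <-speculate:
		launch(backup) // primary looks slow: try another member of its shard group
		return <-results
	}
}
```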
Yesterday, on our cluster (120 shard groups, ~50 render requests / sec):
Completed 441,342,750 find_by_tag queries
Issued 4,808,309 speculative find_by_tag queries (1.1%)
4,574,610 speculative queries came back faster than their non-speculative counterparts (95.1%)
So, at the cost of issuing 1.1% more HTTP queries to peers, 95.1% of those speculative queries delivered their shard group's data more quickly (sorry, I don't have good data on how much more quickly).
The main tuning parameter for speculative querying is the percentage of responses that must come back before a speculative request is issued (set to 95% on our cluster). Higher values reduce the number of speculative queries, at the cost of possibly waiting longer than necessary when there are multiple slow peers.
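To make the 95% setting concrete, here's a hypothetical helper showing how that percentage turns into a trigger count; with our 120 shard groups it works out to waiting for 114 responses before speculating against the (at most 6) remaining peers:

```go
package speculate

import "math"

// Hypothetical helper: how many shard-group responses to wait for before
// issuing speculative requests. With threshold 0.95 and 120 shard groups
// this returns 114, so speculation only targets the last (at most) 6 peers.
func speculationTrigger(totalShardGroups int, threshold float64) int {
	return int(math.Ceil(threshold * float64(totalShardGroups)))
}
```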
Some open questions after my initial round of implementation:
The current implementation still handles the local shard group locally. Should that be updated to allow speculative querying for the local node's shard group? Right now, speculative querying doesn't work well if the local node is undergoing GC.
If so, should the local query be an HTTP request or some function passed in?
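As a sketch of the "function passed in" option (names are made up, not from the codebase): if both the local and remote paths sit behind a small interface, the speculation logic wouldn't need to care which one it is talking to, and the local shard group could be speculated against like any other:

```go
package speculate

import "context"

// Hypothetical abstraction: the speculation logic only ever sees
// ShardQuerier, so a local in-process call and a remote HTTP call become
// interchangeable.
type ShardQuerier interface {
	FindByTag(ctx context.Context, query string) ([]string, error)
}

// remoteQuerier would issue the query over HTTP to a peer.
type remoteQuerier struct{ baseURL string }

func (r remoteQuerier) FindByTag(ctx context.Context, query string) ([]string, error) {
	// build and send the HTTP request to r.baseURL (omitted in this sketch)
	return nil, nil
}

// localQuerier calls straight into the local index, no HTTP round trip.
type localQuerier struct{ find func(string) []string }

func (l localQuerier) FindByTag(ctx context.Context, query string) ([]string, error) {
	return l.find(query), nil
}
```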
The current implementation assumes that all members of a particular shard group handle exactly the same partitions, i.e. there is no scenario where peer A handles partitions 0-5 and peer B handles partitions 3-7. It seems like the current MembersForQuery function somewhat handles this, but I'm not sure whether it's been tested in the wild (e.g. does it dedupe the overlap?). Is this something that should be formalized, or should MembersForSpeculativeQuery do something similar to MembersForQuery?
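If we did want to formalize it, one hypothetical way to handle overlap is to plan the query per partition rather than per shard group, e.g. something like:

```go
package speculate

import "fmt"

// Hypothetical planner for the overlap case: given peers that may cover
// overlapping partition ranges, assign each required partition to exactly
// one peer so nothing is fetched (or merged) twice. Go map iteration order
// is random, so ties between overlapping peers are broken arbitrarily here;
// a real implementation would want a deterministic rule.
func planPartitions(required []int32, peerPartitions map[string][]int32) (map[string][]int32, error) {
	owner := map[int32]string{} // partition -> the single peer chosen to serve it
	for peer, parts := range peerPartitions {
		for _, p := range parts {
			if _, taken := owner[p]; !taken {
				owner[p] = peer
			}
		}
	}

	plan := map[string][]int32{} // peer -> partitions it should be asked for
	for _, p := range required {
		peer, ok := owner[p]
		if !ok {
			return nil, fmt.Errorf("no peer covers partition %d", p)
		}
		plan[peer] = append(plan[peer], p)
	}
	return plan, nil
}
```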
Metrics! I figure it would be useful to know:
a) How many speculative queries were issued.
b) How many times speculation improved response time (not necessarily how many speculative queries were faster than their non-speculative counterparts, which is what I measure today). Are there others?
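For (a) and (b) the counters themselves are trivial; roughly this (using bare atomics just for illustration; presumably it would go through the existing stats plumbing):

```go
package speculate

import "sync/atomic"

var (
	specIssued uint64 // (a) speculative queries issued
	specWon    uint64 // (b) times the speculative response is the one we actually used
)

func onSpeculativeIssued() { atomic.AddUint64(&specIssued, 1) } // at launch time
func onSpeculativeWin()    { atomic.AddUint64(&specWon, 1) }    // only when the speculative reply wins
```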
Right now I just abandon the outstanding requests. I imagine that I should cancel them properly.
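A sketch of what proper cancellation could look like (assuming the peer queries are plain HTTP requests): share a cancellable context across the replica requests and cancel it as soon as a winner arrives, so the Go http client aborts the losers rather than letting them run to completion:

```go
package speculate

import (
	"context"
	"io"
	"net/http"
)

// Hypothetical example of proper cancellation: every replica request shares a
// cancellable context, and as soon as one replica answers, cancel() tears the
// others down instead of leaving them running. Error handling is kept minimal
// for the sketch.
func queryFirst(parent context.Context, replicaURLs []string) ([]byte, error) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // cancels whichever requests are still outstanding

	type answer struct {
		body []byte
		err  error
	}
	answers := make(chan answer, len(replicaURLs))

	for _, u := range replicaURLs {
		go func(u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
			if err != nil {
				answers <- answer{nil, err}
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				answers <- answer{nil, err}
				return
			}
			defer resp.Body.Close()
			body, err := io.ReadAll(resp.Body)
			answers <- answer{body, err}
		}(u)
	}

	a := <-answers // first answer wins; the deferred cancel() cleans up the rest
	return a.body, a.err
}
```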