This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Transparent peer retry for certain peer status codes #985

Closed
shanson7 opened this issue Aug 9, 2018 · 10 comments

@shanson7
Collaborator

shanson7 commented Aug 9, 2018

Sometimes a single peer will go offline due to kafka lag and return a 503 in the midst of a query. This means that until all peers get the status change update (which we have seen take a second), querying this peer fails the request. With 2 peers per shard group, that means 50% of queries fail during this interval, even when there is a completely healthy peer waiting for requests. It would be better to transparently retry the query against another replica of the same shard group when a peer returns a 503.

Now, I think codes like 500 and 400 etc. should fail the request, since we don't really know why that might happen and shouldn't risk propagating it.
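
For illustration, a minimal Go sketch of the behavior being proposed (hypothetical names and wiring, not metrictank's actual cluster code): fall through to the next replica of the shard group only when a peer answers 503, and return any other status as-is.

```go
// Hypothetical sketch of transparent peer retry; not metrictank's real API.
package peerretry

import (
	"errors"
	"fmt"
	"net/http"
)

// peer is a stand-in for one replica in a shard group.
type peer struct{ baseURL string }

func (p peer) query(path string) (*http.Response, error) {
	return http.Get(p.baseURL + path)
}

// queryShardGroup tries each replica in turn, but only falls through to the
// next one on a 503 (the peer is temporarily not ready). Any other status,
// including 400 and 500, is returned to the caller unchanged.
func queryShardGroup(replicas []peer, path string) (*http.Response, error) {
	var lastErr error
	for _, p := range replicas {
		resp, err := p.query(path)
		if err != nil {
			lastErr = err // transport error: try the next replica
			continue
		}
		if resp.StatusCode == http.StatusServiceUnavailable {
			resp.Body.Close()
			lastErr = fmt.Errorf("peer %s returned 503", p.baseURL)
			continue // retryable: this peer is catching up, try another one
		}
		return resp, nil // success or a non-retryable error: don't mask it
	}
	if lastErr == nil {
		lastErr = errors.New("no replicas configured")
	}
	return nil, fmt.Errorf("all replicas unavailable: %w", lastErr)
}
```

A caller would use something like queryShardGroup(replicasForShard, path) in place of querying a single fixed peer, so a briefly unready replica no longer fails the whole request.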

@Dieterbe
Contributor

Dieterbe commented Aug 9, 2018

makes sense

@Dieterbe
Contributor

how commonly does this happen for you though? an instance exceeding its max-priority and marking itself as non-ready should be a rare occurrence, not something that should happen during regular operation.
how many msgs/s does an instance consume and what kind of kafka lag and metrictank priorities are you seeing? mind sharing a dashboard snapshot?

@shanson7
Collaborator Author

It was happening a couple of times per day. I just recently bumped our max-priority to 30 and that silenced it effectively. Each instance consumes only 40k/sec but is capable of much more. It seems that during some hiccups (possibly even on the kafka side) it hits that 10 second threshold. At 30, I haven't seen it happen.

But this could also be the beginning of flagging retryable errors in various scenarios. I'm open to discussion.

@Dieterbe
Contributor

Dieterbe commented Aug 10, 2018

  1. the failure you're describing is the logic in api/middleware/cluster.go, right? so we currently use 503 for that. per wikipedia:

503 Service Unavailable
The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.

so, it makes sense to retry a 503 onto a different replica. solving your root problem makes this less of an issue, but doesn't rule it out, so i'm still in favor of retrying, but i think we should try to tackle your root problem first.

  2. max-priority 30 is not elegant. running up to 30 seconds behind is not pretty (charts can lag significantly, alerts become hard to reason about and to set up). it seems worthwhile to dig into what's causing those hiccups and stop them from happening. is your kafka cluster healthy throughout these hiccups or are you seeing leader changes or stuff like that? how do you monitor kafka? I wonder if it's GC related (either in MT or in the kafka broker)

  3. note that the priority stuff is more of a heuristic/estimate than exact science. i recently added a /priority http GET endpoint that gives the breakdown of why it computed a certain priority score. this may help to diagnose. but dashboard snapshots help too. let me know if i can help troubleshoot this.
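
To make point 1 concrete, here is a simplified, hypothetical sketch of the idea behind that middleware (illustrative names, not the actual api/middleware/cluster.go code): once the node's priority exceeds max-priority it considers itself non-ready and answers queries with a 503, i.e. a temporary condition that a caller could retry against another replica.

```go
// Illustrative sketch only; metrictank's actual middleware differs.
package peerretry

import "net/http"

// node models the readiness decision discussed above.
type node struct {
	priority    int // heuristic estimate of how far ingestion lags, in seconds
	maxPriority int // e.g. the default of 10, or 30 as bumped in this thread
}

func (n *node) ready() bool {
	return n.priority <= n.maxPriority
}

// NodeReady rejects queries with 503 while the node is catching up,
// rather than serving potentially stale data.
func NodeReady(n *node, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !n.ready() {
			http.Error(w, "node not ready", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```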

@shanson7
Collaborator Author

  1. Yep.
  2. It really is just a blip. It seems to happen for a matter of only seconds. As the instance is capable of ingesting data far faster than it actually is, once it kicks back in it eats up the lag quite quickly. I also don't think that a tiny lag of up to 30s is that bad.

So, based on your comments I started digging into what was happening during that time. I determined it was unlikely to be kafka, since it was usually only one instance at a time. I tried to see what else was happening and it looks like it's the index pruning that is causing the 10-ish second delay.

@Dieterbe
Contributor

Likely #608

@shanson7
Collaborator Author

So, the priority jumps seem to be happening quite often. They don't always seem to align with pruning or even a significant jump in lag (that may just be because our metric publication interval and the prio check aren't lining up well).

But whatever is causing it is highly transient. It (somehow) lags by almost 100 seconds but then is completely ok. It seems to be related to expensive find_by_tag operations as well, but it's hard to judge how much time is spent waiting on the lock vs holding the lock.

@shanson7
Collaborator Author

But I think the heuristic is too sensitive or misaligned, because the values definitely spike far too quickly.

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@shanson7
Collaborator Author

shanson7 commented Apr 6, 2020

I believe this has been addressed by #1430 (which will retry failed requests to other peers) and #1670 (which will use speculative querying for getTargets).
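
For context, a rough, hypothetical sketch of what speculative querying means here (illustrative only, not the actual #1670 implementation): send the request to the first replica, and if it hasn't answered within a speculation window (or fails outright), launch the same request against another replica and take whichever succeeds first.

```go
// Illustrative sketch of speculative querying; not the actual #1670 code.
package peerretry

import (
	"context"
	"errors"
	"time"
)

type specResult struct {
	data []byte
	err  error
}

// querySpeculative fans out to additional replicas if the first one is slow
// (after `window`) or fails, and returns the first successful result.
func querySpeculative(ctx context.Context, replicas []string, window time.Duration,
	fetch func(ctx context.Context, replica string) ([]byte, error)) ([]byte, error) {

	if len(replicas) == 0 {
		return nil, errors.New("no replicas")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon slower in-flight requests once we have a winner

	results := make(chan specResult, len(replicas))
	launch := func(replica string) {
		go func() {
			data, err := fetch(ctx, replica)
			results <- specResult{data, err}
		}()
	}

	launch(replicas[0])
	next, pending := 1, 1
	var lastErr error

	timer := time.NewTimer(window)
	defer timer.Stop()

	for {
		select {
		case res := <-results:
			if res.err == nil {
				return res.data, nil
			}
			lastErr = res.err
			pending--
			if next < len(replicas) {
				// a request failed: immediately try the next replica
				launch(replicas[next])
				next++
				pending++
			} else if pending == 0 {
				return nil, lastErr
			}
		case <-timer.C:
			// the first replica is slow: speculatively query one more
			if next < len(replicas) {
				launch(replicas[next])
				next++
				pending++
			}
		}
	}
}
```

The tradeoff is some duplicated work on the extra replica in exchange for bounding tail latency when a single peer is slow or briefly unavailable.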

@shanson7 shanson7 closed this as completed Apr 6, 2020
Dieterbe added a commit that referenced this issue Oct 16, 2020
if the request is bad, there is no point retrying on a different replica
of the same shard. breaking the limit on one, will also break it on the
2nd. See also #985