This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Transparent peer retry for certain peer status codes #985

Closed
shanson7 opened this issue Aug 9, 2018 · 10 comments

@shanson7
Collaborator

shanson7 commented Aug 9, 2018

Sometimes a single peer will go offline due to kafka lag and return a 503 in the midst of a query. This means that until all peers get the status change update (which we have seen take a second), querying this peer fails the request. With 2 peers per shard group, that means 50% of queries fail during this interval, even when there is a completely healthy peer waiting for requests. It would be better to transparently retry the query against another replica of the same shard group when a peer returns a 503.

Now, I think codes like 500 and 400 etc. should fail the request, since we don't really know why that might happen and shouldn't risk propagating it.
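
For illustration, a minimal Go sketch of the behavior being proposed (hypothetical names and wiring, not metrictank's actual cluster code): fall through to the next replica of the shard group only when a peer answers 503, and return any other status as-is.

```go
// Hypothetical sketch of transparent peer retry; not metrictank's real API.
package peerretry

import (
	"errors"
	"fmt"
	"net/http"
)

// peer is a stand-in for one replica in a shard group.
type peer struct{ baseURL string }

func (p peer) query(path string) (*http.Response, error) {
	return http.Get(p.baseURL + path)
}

// queryShardGroup tries each replica in turn, but only falls through to the
// next one on a 503 (the peer is temporarily not ready). Any other status,
// including 400 and 500, is returned to the caller unchanged.
func queryShardGroup(replicas []peer, path string) (*http.Response, error) {
	var lastErr error
	for _, p := range replicas {
		resp, err := p.query(path)
		if err != nil {
			lastErr = err // transport error: try the next replica
			continue
		}
		if resp.StatusCode == http.StatusServiceUnavailable {
			resp.Body.Close()
			lastErr = fmt.Errorf("peer %s returned 503", p.baseURL)
			continue // retryable: this peer is catching up, try another one
		}
		return resp, nil // success or a non-retryable error: don't mask it
	}
	if lastErr == nil {
		lastErr = errors.New("no replicas configured")
	}
	return nil, fmt.Errorf("all replicas unavailable: %w", lastErr)
}
```

A caller would use something like queryShardGroup(replicasForShard, path) in place of querying a single fixed peer, so a briefly unready replica no longer fails the whole request.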

@Dieterbe
Contributor

Dieterbe commented Aug 9, 2018

makes sense

@Dieterbe
Contributor

how commonly does this happen for you though? an instance exceeding its max-priority and marking itself as non-ready should be a rare occurrence, not something that should happen during regular operation.
how many msgs/s does an instance consume and what kind of kafka lag and metrictank priorities are you seeing? mind sharing a dashboard snapshot?

@shanson7
Collaborator Author

It was happening a couple of times per day. I just recently bumped our max-priority to 30 and that silenced it effectively. Each instance consumes only 40k/sec but is capable of much more. It seems that during some hiccups (possibly even on the kafka side) it hits that 10 second threshold. At 30, I haven't seen it happen.

But this could also be the beginning of flagging retryable errors in various scenarios. I'm open to discussion.

@Dieterbe
Contributor

Dieterbe commented Aug 10, 2018

  1. the failure you're describing is the logic in api/middleware/cluster.go, right? so we currently use 503 for that. per wikipedia:

503 Service Unavailable
The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.

so, it makes sense to retry a 503 onto a different replica. solving your root problem makes this less of an issue, but doesn't rule it out, so i'm still in favor of retrying, but i think we should try to tackle your root problem first.

  2. max-priority 30 is not elegant. running up to 30 seconds behind is not pretty (charts can lag significantly, alerts become hard to reason about and to set up). it seems worthwhile to dig into what's causing those hiccups and stop them from happening. is your kafka cluster healthy throughout these hiccups or are you seeing leader changes or stuff like that? how do you monitor kafka? I wonder if it's GC related (either in MT or in the kafka broker)

  3. note that the priority stuff is more of a heuristic/estimate than exact science. i recently added a /priority http GET endpoint that gives the breakdown of why it computed a certain priority score. this may help to diagnose. but dashboard snapshots help too. let me know if i can help troubleshoot this.
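
To make point 1 concrete, here is a simplified, hypothetical sketch of the idea behind that middleware (illustrative names, not the actual api/middleware/cluster.go code): once the node's priority exceeds max-priority it considers itself non-ready and answers queries with a 503, i.e. a temporary condition that a caller could retry against another replica.

```go
// Illustrative sketch only; metrictank's actual middleware differs.
package peerretry

import "net/http"

// node models the readiness decision discussed above.
type node struct {
	priority    int // heuristic estimate of how far ingestion lags, in seconds
	maxPriority int // e.g. the default of 10, or 30 as bumped in this thread
}

func (n *node) ready() bool {
	return n.priority <= n.maxPriority
}

// NodeReady rejects queries with 503 while the node is catching up,
// rather than serving potentially stale data.
func NodeReady(n *node, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !n.ready() {
			http.Error(w, "node not ready", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```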

@shanson7
Collaborator Author

  1. Yep.
  2. It really is just a blip. It seems to happen for a matter of only seconds. As the instance is capable of ingesting data far faster than it actually is, once it kicks back in it eats up the lag quite quickly. I also don't think that a tiny lag of up to 30s is that bad.

So, based on your comments I started digging into what was happening during that time. I determined it was unlikely to be kafka, since it was usually only one instance at a time. I tried to see what else was happening and it looks like it's the index pruning that is causing the 10-ish second delay.

@Dieterbe
Contributor

Likely #608

@shanson7
Collaborator Author

So, the priority jumps seem to be happening quite often. They don't always seem to align with pruning or even a significant jump in lag (that may just be because our metric publication interval and the prio check aren't lining up well).

But whatever is causing it is highly transient. It (somehow) lags by almost 100 seconds but then is completely ok. It seems to be related to expensive find_by_tag operations as well, but it's hard to judge how much time is spent waiting on the lock vs holding the lock.

@shanson7
Collaborator Author

But I think the heuristic is too sensitive or misaligned, because the values definitely spike far too quickly.

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@shanson7
Collaborator Author

shanson7 commented Apr 6, 2020

I believe this has been addressed by #1430 (which will retry failed requests to other peers) and #1670 (which will use speculative querying for getTargets).
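
For context, a rough, hypothetical sketch of what speculative querying means here (illustrative only, not the actual #1670 implementation): send the request to the first replica, and if it hasn't answered within a speculation window (or fails outright), launch the same request against another replica and take whichever succeeds first.

```go
// Illustrative sketch of speculative querying; not the actual #1670 code.
package peerretry

import (
	"context"
	"errors"
	"time"
)

type specResult struct {
	data []byte
	err  error
}

// querySpeculative fans out to additional replicas if the first one is slow
// (after `window`) or fails, and returns the first successful result.
func querySpeculative(ctx context.Context, replicas []string, window time.Duration,
	fetch func(ctx context.Context, replica string) ([]byte, error)) ([]byte, error) {

	if len(replicas) == 0 {
		return nil, errors.New("no replicas")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon slower in-flight requests once we have a winner

	results := make(chan specResult, len(replicas))
	launch := func(replica string) {
		go func() {
			data, err := fetch(ctx, replica)
			results <- specResult{data, err}
		}()
	}

	launch(replicas[0])
	next, pending := 1, 1
	var lastErr error

	timer := time.NewTimer(window)
	defer timer.Stop()

	for {
		select {
		case res := <-results:
			if res.err == nil {
				return res.data, nil
			}
			lastErr = res.err
			pending--
			if next < len(replicas) {
				// a request failed: immediately try the next replica
				launch(replicas[next])
				next++
				pending++
			} else if pending == 0 {
				return nil, lastErr
			}
		case <-timer.C:
			// the first replica is slow: speculatively query one more
			if next < len(replicas) {
				launch(replicas[next])
				next++
				pending++
			}
		}
	}
}
```

The tradeoff is some duplicated work on the extra replica in exchange for bounding tail latency when a single peer is slow or briefly unavailable.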

@shanson7 shanson7 closed this as completed Apr 6, 2020
Dieterbe added a commit that referenced this issue Oct 16, 2020
if the request is bad, there is no point retrying on a different replica
of the same shard. breaking the limit on one, will also break it on the
2nd. See also #985