Transparent peer retry for certain peer status codes #985
Comments
makes sense
how commonly does this happen for you though? an instance exceeding its priority and marking itself as non-ready should be a rare occurrence, not something that should happen during regular operation.
It was happening a couple of times per day. I just recently bumped our But this could also be the beginning of flagging retryable errors in various scenarios as well. I'm open to discussion.
so, it makes sense to retry a 503 onto a different replica. solving your root problem makes this less of an issue, but doesn't rule this issue out, so i'm still in favor of the retrying but i think we should try to tackle your root problem first.
So, based on your comments I started digging into what was happening during that time. I determined it was unlikely to be kafka, since it was usually only one instance at a time. I tried to see what else was happening, and it looks like it's the index pruning that is causing the 10-ish second delay.
Likely #608 |
So, the priority jumps seem to be happening quite often. They don't always seem to align with pruning or even a significant jump in lag (that can just be that our metric publication interval and the prio check aren't lining up well). But whatever is causing it is highly transient. It (somehow) lags by almost 100 seconds but then is completely ok. It seems to be related to expensive find_by_tag operations as well, but it's hard to judge how much time is spent waiting on the lock vs holding the lock.
But I think the heuristic is too sensitive or misaligned, because the values definitely spike far too quickly. |
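For reference, here is a rough Go sketch of the kind of lag-based readiness heuristic being described above. The names (`Node`, `maxPriority`, `UpdatePriority`) are hypothetical stand-ins, not the project's actual code, and the real heuristic is more involved than this:

```go
// Hypothetical sketch of a lag-based readiness check; not the actual implementation.
package cluster

import "sync/atomic"

type Node struct {
	maxPriority int32 // configured ceiling; above this the node reports itself as not ready
	priority    int32 // current priority, roughly "how far behind kafka are we"
}

// UpdatePriority would be called periodically by whatever monitors consumer lag.
func (n *Node) UpdatePriority(lagSeconds int32) {
	atomic.StoreInt32(&n.priority, lagSeconds)
}

// IsReady is what the query path / cluster status endpoint would consult.
// While it returns false, queries hitting this peer come back as 503 until
// the other peers learn about the status change.
func (n *Node) IsReady() bool {
	return atomic.LoadInt32(&n.priority) <= n.maxPriority
}
```

If the lag sample feeding `UpdatePriority` is itself spiky (for example because the metric publication interval and the priority check are out of phase, as described above), a node can briefly flip to not-ready even though it recovers within seconds, which is exactly the window in which a transparent retry would help.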
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
if the request is bad, there is no point retrying on a different replica of the same shard. breaking the limit on one will also break it on the 2nd. See also #985
Sometimes a single peer will go offline due to kafka lag and return a 503 in the midst of a query. This means that until all peers get the status change update (which we have seen take a second), querying this peer fails the request. With 2 peers per shard group, that means 50% of queries fail during this interval, even when there is a completely healthy peer waiting for requests.
Now, I think codes like 500 and 400 should still fail the request, since we don't really know why they happened and shouldn't risk propagating the problem by retrying.
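To make the proposal concrete, here is a minimal Go sketch of what "transparent retry for certain status codes" could look like. `Peer` and `queryPeer` are hypothetical stand-ins for the real cluster types, not the project's API: only a 503 from a not-ready peer moves on to the next replica of the same shard group, while codes like 400 or 500 fail the request immediately.

```go
// Sketch only: illustrates the retry policy under discussion, with hypothetical
// Peer/queryPeer stand-ins for the real cluster types.
package cluster

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
)

type Peer struct{ Addr string }

// queryPeer sends the query to one peer and reports the HTTP status it got back.
func queryPeer(ctx context.Context, p Peer, path string) ([]byte, int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+p.Addr+path, nil)
	if err != nil {
		return nil, 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return body, resp.StatusCode, err
}

// retryable: only 503 ("peer took itself out of rotation", e.g. due to kafka lag)
// is worth sending to another replica of the same shard. A 400-class error (like
// a breached limit) would fail on the other replica too, and a 500 is an unknown
// failure we'd rather surface than mask.
func retryable(status int) bool {
	return status == http.StatusServiceUnavailable
}

// queryShardGroup tries the replicas of one shard group in turn, but only moves
// on to the next replica for retryable statuses or transport-level errors.
func queryShardGroup(ctx context.Context, replicas []Peer, path string) ([]byte, error) {
	var lastErr error
	for _, p := range replicas {
		body, status, err := queryPeer(ctx, p, path)
		if err != nil {
			lastErr = err
			continue // network-level failure: try the next replica
		}
		if status == http.StatusOK {
			return body, nil
		}
		if !retryable(status) {
			return nil, fmt.Errorf("peer %s returned %d", p.Addr, status)
		}
		lastErr = fmt.Errorf("peer %s not ready (503)", p.Addr)
	}
	if lastErr == nil {
		lastErr = errors.New("no replicas available")
	}
	return nil, lastErr
}
```

With two peers per shard group, a policy like this turns the "one peer briefly not ready" window from a roughly 50% query failure rate into a small added latency on the affected requests.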