Produce Latency Spikes Due To Race Condition When Brokers Are Scaled Down #746
Comments
Regarding the linked line: overall, I see the pain, but I don't think that (effectively) "retry spin looping randomly across the cluster until an official leader comes back" will solve the issue, and it may create other issues. Closing for now for issue tidiness, but let me know what you think / if we should reopen.
@twmb I think I did not explain myself correctly. The metadata response doesn't include the old leader, but it does include the new one. However, the produce request is not re-sent to the new leader until an eventual metadata refresh. I think it's a race condition where inflight requests sometimes end up in this state.
I understand now; I have a suspicion as to what this could be.
I believe #761 should improve this behavior.
Thank you! Within the next 24h I will test whether this PR resolves the issue 👍
Sorry for the delay in getting back to you. I encountered some blockers while testing, but I've since completed the tests and haven't experienced the issue anymore. It looks like the solution works! I have another question: some customers have reported similar behavior with consumers, where sudden leadership changes cause spikes of 10s (the old value for MetadataMinAge). Is there equivalent behavior for consumers, by any chance?
Sorry for the delay here -- let me know if there's some importance to merging and cutting a release. None of the current open issues are major, so I've been dragging my feet. There is no equivalent code on the consumers. The path for backing off is in the linked code; the slow metadata updates that wait for the min refresh can be seen in the linked lines where the slow waits are.
The producer and sink are not the consuming code paths. The metadata code path could be a culprit, but only if forcefully issued metadata requests fail 3x, at which point the code internally moves to slow retries. If you're able to capture debug logs around unexpected 10s delays while consuming, I could look into that -- but I think neither of us knows how to trigger it at the moment.
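For capturing those debug logs, here's a minimal sketch of enabling kgo's built-in debug logger (the seed address and the timestamp prefix below are placeholders, not recommendations):

```go
package main

import (
	"os"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Debug-level logging makes metadata refreshes, backoffs, and retries
	// around a leadership change visible; "broker:9092" is a placeholder.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("broker:9092"),
		kgo.WithLogger(kgo.BasicLogger(os.Stderr, kgo.LogLevelDebug, func() string {
			return time.Now().UTC().Format(time.RFC3339Nano) + " "
		})),
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()
}
```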
FWIW, implementing KIP-951 may help the consumer side of the equation.
When a Kafka Broker leaves the cluster, the franz-go client refreshes its metadata and detects the broker’s absence. This results in a closure of broker connections and errors for inflight produce requests. Sometimes, the error received in the produce request is an EOF or TCP connection closed error, which is considered retryable, and the request retries. However, there is a race condition where the error is errUnknownBroker (non-retryable), which requires metadata to be updated. When this happens, the requests do not retry to drain the sink until another metadata request is issued (when MetadataMinAge elapses). Consequently, the produce latency spikes to either MetadataMinAge or the produce request timeout, whichever occurs first.
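For reference, here is a minimal sketch of tuning the metadata refresh bounds with kgo client options (the seed addresses and durations are illustrative only); a smaller MetadataMinAge shortens the worst-case stall described above, at the cost of more metadata traffic:

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Placeholder brokers and durations. MetadataMinAge bounds how long a
	// produce request stuck on errUnknownBroker waits before the next
	// metadata refresh can rescue it; MetadataMaxAge bounds routine refreshes.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092", "broker-2:9092"),
		kgo.MetadataMinAge(2*time.Second),
		kgo.MetadataMaxAge(60*time.Second),
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// ... produce as usual, e.g. via client.ProduceSync ...
}
```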
If the request detected that the broker is unknown but we do know another broker we could send the request to instead, we could retry immediately 🤔 I wonder what you think the best solution for this is. I have a few ideas and I'm happy to send a PR, but I wanted to know your opinion first.
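To make the suggestion concrete, here is a rough Go sketch of the decision I have in mind. Apart from errUnknownBroker and the error classes mentioned above, everything here (the Cluster interface, the function and field names) is hypothetical and does not mirror franz-go's internals; it only illustrates the "kick a refresh and retry against a broker we still know" branch:

```go
package main

import (
	"errors"
	"io"
)

// errUnknownBroker stands in for the sentinel described above; the rest of
// this sketch is hypothetical scaffolding, not franz-go internals.
var errUnknownBroker = errors.New("unknown broker")

// Request is a stand-in for an inflight produce request.
type Request struct{ Partition int32 }

// Cluster is a hypothetical view of what the client currently knows.
type Cluster interface {
	LeaderFor(partition int32) (brokerID int32, ok bool) // leader per current metadata
	AnyKnownBroker() (brokerID int32, ok bool)           // any broker still known
	RequestMetadataRefresh()                             // wake the metadata loop now
}

// classifyProduceError sketches the two paths from the issue: transport errors
// (EOF / closed connection) retry immediately, while errUnknownBroker today
// parks the request until the next metadata refresh. The proposed change is
// the extra branch: if any broker is still known, retry there right away
// instead of waiting for MetadataMinAge to elapse.
func classifyProduceError(err error, req Request, c Cluster) (retryNow bool, target int32) {
	switch {
	case errors.Is(err, io.EOF):
		// Retryable transport error: re-send to whatever leader metadata names.
		if id, ok := c.LeaderFor(req.Partition); ok {
			return true, id
		}
	case errors.Is(err, errUnknownBroker):
		// Kick a refresh and, rather than waiting, retry against any broker
		// the client still knows about.
		c.RequestMetadataRefresh()
		if id, ok := c.AnyKnownBroker(); ok {
			return true, id
		}
	}
	return false, -1
}

func main() { _ = classifyProduceError }
```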