Produce Latency Spikes Due To Race Condition When Brokers Are Scaled Down #746

Closed
aratz-lasa opened this issue May 27, 2024 · 8 comments · Fixed by #761

aratz-lasa commented May 27, 2024

When a Kafka broker leaves the cluster, the franz-go client refreshes its metadata and detects the broker's absence. This closes the connections to that broker and fails the inflight produce requests with errors. Sometimes the error received by the produce request is an EOF or a TCP connection-closed error, which is considered retryable, and the request is retried. However, there is a race condition where the error is errUnknownBroker instead, which is non-retryable and requires the metadata to be updated first. When this happens, the requests are not retried (the sink is not drained) until another metadata request is issued, i.e. once MetadataMinAge elapses. Consequently, produce latency spikes to either MetadataMinAge or the produce request timeout, whichever occurs first.
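For context, a minimal sketch of the two client settings that bound the spike described above. This is only an illustration: the option names are the current kgo ones as I understand them, the broker address is a placeholder, and exact defaults depend on the franz-go version.

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Sketch only: these two settings bound how long a stuck batch can sit
	// when it hits the non-retryable errUnknownBroker path -- it is not
	// retried until the next metadata refresh (MetadataMinAge) or the
	// produce request times out, whichever comes first.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),         // placeholder address
		kgo.MetadataMinAge(5*time.Second),         // minimum gap between metadata refreshes
		kgo.ProduceRequestTimeout(10*time.Second), // the produce request timeout referenced above
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```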

If the request detects that the broker is unknown but we do know another broker we could send the request to instead, we could retry immediately 🤔 I wonder what you think the best solution is. I have a few ideas and I'm happy to send a PR, but I wanted to know your opinion first.

twmb (Owner) commented Jun 4, 2024

errUnknownBroker is returned when a request is issued to a broker number that is not known because it was not returned in a metadata response. If you are seeing it, it means the client literally just received a metadata response that did not include that broker.

In the linked line, errUnknownBroker triggers the client to update metadata again at the soonest opportunity -- but it does not spin loop. There is no guarantee (in fact, the odds are low) that leadership has actually transferred away from the broker that left the cluster. According to the metadata response, leadership has not transferred yet -- that's why the response said "the leader is broker 5" even though broker 5 was not included in the response. Issuing the request to a random broker likely won't help the situation. What I've seen more often is NOT_LEADER_FOR_PARTITION being returned for a longer period of time than a broker being completely missing from the metadata response.

So, overall, I see the pain, but I don't think that (effectively) "retry by spin looping randomly across the cluster until an official leader comes back" will solve the issue, and it may create other issues.
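To make the trade-off concrete, here is a small hypothetical sketch of gating the retry on the next metadata refresh rather than spin looping. None of this is franz-go's actual code; retryAfterRefresh, refreshed, and errUnknownBroker below are illustrative names only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnknownBroker = errors.New("unknown broker")

// retryAfterRefresh waits for the next metadata refresh before allowing a
// retry, bounded by the request timeout -- the alternative to retrying the
// batch against random brokers in a loop.
func retryAfterRefresh(refreshed <-chan struct{}, timeout time.Duration) error {
	select {
	case <-refreshed:
		return nil // new metadata arrived; re-resolve the leader and retry
	case <-time.After(timeout):
		return errUnknownBroker // give up and surface the original error
	}
}

func main() {
	refreshed := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate a metadata refresh landing
		close(refreshed)
	}()
	fmt.Println(retryAfterRefresh(refreshed, time.Second)) // prints <nil>
}
```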

Closing for now for issue tidiness but lmk what you think / if we should reopen.

twmb closed this as completed Jun 4, 2024
aratz-lasa (Author) commented:

@twmb I think I did not explain myself correctly. The metadata response does not include the old leader, but it does include the new one. However, the produce request is not re-sent to the new leader until an eventual metadata refresh. I think it's a race condition where inflight requests sometimes end up in this state.

twmb (Owner) commented Jun 4, 2024

I understand now, I have a suspicion as to what this could be.

twmb (Owner) commented Jun 10, 2024

I believe #761 should improve this behavior.

twmb added the "has pr" label Jun 10, 2024
aratz-lasa (Author) commented:

Thank you! Within the next 24 hours I will test whether this PR resolves the issue 👍

aratz-lasa (Author) commented:

Sorry for the delay in getting back to you. I encountered some blockers while testing, but I’ve since completed the tests and haven’t experienced the issue anymore. It looks like the solution works!

I have another question: some customers have reported similar behavior with consumers, where sudden leadership changes cause latency spikes of 10s (the old value for MetadataMinAge). Is there an equivalent behavior for consumers, by any chance?

twmb (Owner) commented Jul 10, 2024

Sorry for the delay here -- let me know if there's any urgency around merging and cutting a release. None of the currently open issues are major, so I've been dragging my feet.

There is no equivalent code on the consumer side. The paths for backing off are as follows:

Slow metadata updates that wait for the minimum refresh interval go through .triggerUpdateMetadata(...), whereas immediate forced updates that bypass the minimum wait go through .triggerUpdateMetadataNow(...).

The slow waits are:

producer.go
871:	cl.triggerUpdateMetadata(false, "reload trigger due to produce topic still not known")

metadata.go
262:				cl.triggerUpdateMetadata(true, fmt.Sprintf("re-updating metadata due to err: %s", err))
264:				cl.triggerUpdateMetadata(true, retryWhy.reason("re-updating due to inner errors"))

consumer.go
1708:		wait = s.c.cl.triggerUpdateMetadata(false, why) // avoid trigger if within refresh interval

source.go
763:		s.cl.triggerUpdateMetadata(false, fmt.Sprintf("opportunistic load during source backoff: %v", why)) // as good a time as any
902:				s.cl.triggerUpdateMetadata(false, why)

sink.go
202:	s.cl.triggerUpdateMetadata(false, "opportunistic load during sink backoff") // as good a time as any
332:				s.cl.triggerUpdateMetadata(false, "attempting to refresh broker list due to failed AddPartitionsToTxn requests")
1007:		s.cl.triggerUpdateMetadata(true, why)

The producer and sink call sites are not on the consuming code path.
The consumer and source call sites are on deliberately chosen slow paths as well (the errors there are entirely unknown-topic or unknown-partition errors).

The metadata code path could be a culprit, but only if forcefully issued metadata requests fail 3x, at which point the code internally moves to slow retries.

If you're able to capture debug logs around the unexpected 10s delays while consuming, I could look into that -- but I think neither of us knows how to trigger it at the moment.
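As a rough illustration of the slow-versus-forced distinction above (purely a sketch: the metadataTrigger type and its methods are invented for this example, and the real client internals differ):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metadataTrigger is an illustrative stand-in for the two trigger styles
// above: a "slow" trigger that respects a minimum age between refreshes,
// and a forced trigger that bypasses the minimum wait.
type metadataTrigger struct {
	mu      sync.Mutex
	last    time.Time
	minAge  time.Duration
	refresh func(why string)
}

// trigger refreshes only if minAge has elapsed (the triggerUpdateMetadata style).
func (m *metadataTrigger) trigger(why string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if time.Since(m.last) < m.minAge {
		return false // still within the refresh interval; skip
	}
	m.last = time.Now()
	m.refresh(why)
	return true
}

// triggerNow always refreshes (the triggerUpdateMetadataNow style).
func (m *metadataTrigger) triggerNow(why string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.last = time.Now()
	m.refresh(why)
}

func main() {
	t := &metadataTrigger{
		minAge:  5 * time.Second,
		refresh: func(why string) { fmt.Println("refreshing metadata:", why) },
	}
	t.trigger("opportunistic load during backoff")     // refreshes (first call)
	t.trigger("reload trigger, topic still not known") // skipped: within min age
	t.triggerNow("forced update after repeated errors") // refreshes immediately
}
```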

twmb (Owner) commented Jul 15, 2024

FWIW implementing KIP-951 may help the consumer side of the equation.

twmb closed this as completed in #761 (commit e62b402) Jul 29, 2024