Produce Latency Spikes Due To Race Condition When Brokers Are Scaled Down #746

Closed
aratz-lasa opened this issue May 27, 2024 · 8 comments · Fixed by #761

aratz-lasa commented May 27, 2024

When a Kafka broker leaves the cluster, the franz-go client refreshes its metadata and detects the broker's absence. This closes the connections to that broker and fails the inflight produce requests with errors. Sometimes the error received by the produce request is an EOF or a TCP connection-closed error, which is considered retryable, and the request is retried. However, there is a race condition where the error is errUnknownBroker instead, which is non-retryable and requires the metadata to be updated first. When this happens, the requests are not retried (the sink is not drained) until another metadata request is issued, i.e. once MetadataMinAge elapses. Consequently, produce latency spikes to either MetadataMinAge or the produce request timeout, whichever occurs first.
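For context, a minimal sketch of the two client settings that bound the spike described above. This is only an illustration: the option names are the current kgo ones as I understand them, the broker address is a placeholder, and exact defaults depend on the franz-go version.

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Sketch only: these two settings bound how long a stuck batch can sit
	// when it hits the non-retryable errUnknownBroker path -- it is not
	// retried until the next metadata refresh (MetadataMinAge) or the
	// produce request times out, whichever comes first.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),         // placeholder address
		kgo.MetadataMinAge(5*time.Second),         // minimum gap between metadata refreshes
		kgo.ProduceRequestTimeout(10*time.Second), // the produce request timeout referenced above
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```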

If the request detects that the broker is unknown but we do know another broker we could send the request to instead, we could retry immediately 🤔 I wonder what you think the best solution is. I have a few ideas and I'm happy to send a PR, but I wanted to know your opinion first.

twmb (Owner) commented Jun 4, 2024

errUnknownBroker is returned when a request is issued to a broker number that is not known because it was not returned in a metadata response. If you are seeing it, it means the client literally just received a metadata response that did not include that broker.

In the linked line, errUnknownBroker triggers the client to update metadata again at the soonest opportunity -- but it does not spin loop. There is no guarantee (in fact, the odds are low) that leadership has actually transferred away from the broker that left the cluster. According to the metadata response, leadership has not transferred yet -- that's why the response said "the leader is broker 5" even though broker 5 was not included in the response. Issuing the request to a random broker likely won't help the situation. What I've seen more often is NOT_LEADER_FOR_PARTITION being returned for a longer period of time than a broker being completely missing from the metadata response.

So, overall, I see the pain, but I don't think that (effectively) "retry by spin looping randomly across the cluster until an official leader comes back" will solve the issue, and it may create other issues.
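To make the trade-off concrete, here is a small hypothetical sketch of gating the retry on the next metadata refresh rather than spin looping. None of this is franz-go's actual code; retryAfterRefresh, refreshed, and errUnknownBroker below are illustrative names only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnknownBroker = errors.New("unknown broker")

// retryAfterRefresh waits for the next metadata refresh before allowing a
// retry, bounded by the request timeout -- the alternative to retrying the
// batch against random brokers in a loop.
func retryAfterRefresh(refreshed <-chan struct{}, timeout time.Duration) error {
	select {
	case <-refreshed:
		return nil // new metadata arrived; re-resolve the leader and retry
	case <-time.After(timeout):
		return errUnknownBroker // give up and surface the original error
	}
}

func main() {
	refreshed := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate a metadata refresh landing
		close(refreshed)
	}()
	fmt.Println(retryAfterRefresh(refreshed, time.Second)) // prints <nil>
}
```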

Closing for now for issue tidiness but lmk what you think / if we should reopen.

twmb closed this as completed Jun 4, 2024
aratz-lasa (Author) commented:

@twmb I think I did not explain myself correctly. The metadata response does not include the old leader, but it does include the new one. However, the produce request is not re-sent to the new leader until an eventual metadata refresh. I think it's a race condition where inflight requests sometimes end up in this state.

twmb (Owner) commented Jun 4, 2024

I understand now, I have a suspicion as to what this could be.

twmb (Owner) commented Jun 10, 2024

I believe #761 should improve this behavior.

twmb added the "has pr" label Jun 10, 2024
aratz-lasa (Author) commented:

Thank you! Within the next 24 hours I will test whether this PR resolves the issue 👍

aratz-lasa (Author) commented:

Sorry for the delay in getting back to you. I encountered some blockers while testing, but I’ve since completed the tests and haven’t experienced the issue anymore. It looks like the solution works!

I have another question: some customers have reported similar behavior with consumers, where sudden leadership changes cause latency spikes of 10s (the old value for MetadataMinAge). Is there an equivalent behavior for consumers, by any chance?

twmb (Owner) commented Jul 10, 2024

Sorry for the delay here -- let me know if there's any urgency around merging and cutting a release. None of the currently open issues are major, so I've been dragging my feet.

There is no equivalent code on the consumer side. The paths for backing off are as follows:

Slow metadata updates that wait for the minimum refresh interval go through .triggerUpdateMetadata(...), whereas immediate forced updates that bypass the minimum wait go through .triggerUpdateMetadataNow(...).

The slow waits are:

producer.go
871:	cl.triggerUpdateMetadata(false, "reload trigger due to produce topic still not known")

metadata.go
262:				cl.triggerUpdateMetadata(true, fmt.Sprintf("re-updating metadata due to err: %s", err))
264:				cl.triggerUpdateMetadata(true, retryWhy.reason("re-updating due to inner errors"))

consumer.go
1708:		wait = s.c.cl.triggerUpdateMetadata(false, why) // avoid trigger if within refresh interval

source.go
763:		s.cl.triggerUpdateMetadata(false, fmt.Sprintf("opportunistic load during source backoff: %v", why)) // as good a time as any
902:				s.cl.triggerUpdateMetadata(false, why)

sink.go
202:	s.cl.triggerUpdateMetadata(false, "opportunistic load during sink backoff") // as good a time as any
332:				s.cl.triggerUpdateMetadata(false, "attempting to refresh broker list due to failed AddPartitionsToTxn requests")
1007:		s.cl.triggerUpdateMetadata(true, why)

The producer and sink call sites are not on the consuming code path.
The consumer and source call sites are on deliberately chosen slow paths as well (the errors there are entirely unknown-topic or unknown-partition errors).

The metadata code path could be a culprit, but only if forcefully issued metadata requests fail 3x, at which point the code internally moves to slow retries.

If you're able to capture debug logs around the unexpected 10s delays while consuming, I could look into that -- but I think neither of us knows how to trigger it at the moment.
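As a rough illustration of the slow-versus-forced distinction above (purely a sketch: the metadataTrigger type and its methods are invented for this example, and the real client internals differ):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metadataTrigger is an illustrative stand-in for the two trigger styles
// above: a "slow" trigger that respects a minimum age between refreshes,
// and a forced trigger that bypasses the minimum wait.
type metadataTrigger struct {
	mu      sync.Mutex
	last    time.Time
	minAge  time.Duration
	refresh func(why string)
}

// trigger refreshes only if minAge has elapsed (the triggerUpdateMetadata style).
func (m *metadataTrigger) trigger(why string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if time.Since(m.last) < m.minAge {
		return false // still within the refresh interval; skip
	}
	m.last = time.Now()
	m.refresh(why)
	return true
}

// triggerNow always refreshes (the triggerUpdateMetadataNow style).
func (m *metadataTrigger) triggerNow(why string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.last = time.Now()
	m.refresh(why)
}

func main() {
	t := &metadataTrigger{
		minAge:  5 * time.Second,
		refresh: func(why string) { fmt.Println("refreshing metadata:", why) },
	}
	t.trigger("opportunistic load during backoff")     // refreshes (first call)
	t.trigger("reload trigger, topic still not known") // skipped: within min age
	t.triggerNow("forced update after repeated errors") // refreshes immediately
}
```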

twmb (Owner) commented Jul 15, 2024

FWIW implementing KIP-951 may help the consumer side of the equation.

twmb closed this as completed in #761 (commit e62b402) Jul 29, 2024