Default heartbeat timeout is too high #92
Good point, I'll add an exemption in the conn timeout overhead function to have the timeout be the heartbeat interval and to not have retries.
Actually, I'm not sure dropping the timeout to the heartbeat interval is such a good idea here. If the heartbeat returns an error, the member believes it has lost group membership and needs to rejoin. All partitions it was consuming are "lost", leading to a total stall in consuming. This is fine right now because the Request code retries internally, so any error return is definitely a bit of a hard error. Dropping the total request retry limit to the heartbeat interval makes the request susceptible to failure on temporary network flakiness.

If a non-leader loses its partitions and rejoins a stable group, Kafka will handle this without triggering a rebalance and the member will just get its old assignment back. The only downside here is that the member has a total consuming stall from the heartbeat failure to when it rejoins.

If a leader loses its partitions and rejoins, Kafka will trigger a rebalance for the group. If using cooperative-sticky (the franz-go default), this will also be pretty negligible, because all members will continue consuming and will just be reassigned what they were previously assigned. If not using cooperative-sticky, there will be a total stall in consuming for the group.

Kafka (MSK) should be able to fail over groups between brokers during routine maintenance, such that a group actually doesn't need to rebalance at all: members will heartbeat to broker 0, eventually when the broker is turning down it'll return INVALID_COORDINATOR, and then members will refresh the coordinator to speak to the new broker (at least, I hope this is how it works...).

I think a better upper bound on the heartbeat request would be the session timeout: if the client cannot heartbeat for the entire session timeout, then it can more accurately assume that it (the member itself) has now timed out. This would still result in relatively high latencies (45s), but it is less prone to the issues above. What do you think?

Also, what types of latencies were you experiencing? The consumer group is independent of the consuming itself. Was consuming stalled?
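For reference, a minimal sketch of the group timing knobs discussed here; the broker address, topic, and group name are placeholders, and the option names are taken from my reading of the kgo API rather than from this thread:

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// A group consumer whose session timeout and heartbeat interval bound
	// how long a failed heartbeat can go unnoticed before the member
	// assumes it has timed out of the group.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),    // placeholder broker address
		kgo.ConsumeTopics("events"),          // placeholder topic
		kgo.ConsumerGroup("example-group"),   // placeholder group name
		kgo.SessionTimeout(45*time.Second),   // the 45s upper bound discussed above
		kgo.HeartbeatInterval(3*time.Second), // how often heartbeats are sent
		kgo.Balancers(kgo.CooperativeStickyBalancer()), // franz-go's default balancer
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```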
Hey there. TBH I'm not 100% sure that the spikes we're experiencing during Kafka failover are always related to the heartbeat; it was our main guess. Let me describe what we're experiencing.

Besides the rest, we measure 2 latency
Most of the spikes related to reboots are
Here are some numbers:

Our staging setup is quite simple - 3 brokers, 1 topic with 3 partitions. We injected

I can imagine an alternative explanation as well. Let's imagine that during failover Node2 is down, and the current consumer tries to fetch data from Node2. It waits until all timeouts are gone and then asks to update metadata. To update metadata we actually have to make a new request to one of the brokers, and this broker is picked randomly. This FetchMetadataRequest hits Node2, which is down, also waits until all timeouts are gone, and then retries with another broker. Then it comes to Node1, the new metadata is fetched, and the consumer/producer can start to make requests to the surviving brokers.

If that sounds realistic then we can try to make sorta
I think there can be some timeout tuning in some places. Metadata could probably be lower, since if it fails, internally the only option is to retry anyway (rather than fail a group or something). I've also always wanted to make an improvement to which broker is chosen for metadata, but it's not the most obvious and has been lower priority.

Is it also possible that the producer side is stalled, because there are so many records backed up for the partitions that are currently not accepting writes? This could be checked with the
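One rough way to watch for a backed-up producer is to poll the client's buffered record count; this is a sketch that assumes a BufferedProduceRecords method exists on kgo.Client (an assumption on my part, and the function name is hypothetical):

```go
package example

import (
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// LogBufferedProduce periodically logs how many records are waiting in the
// client's produce buffer. A large, growing count during failover points at
// produce being stalled on partitions that are currently not accepting
// writes. BufferedProduceRecords is an assumption about the kgo.Client API.
func LogBufferedProduce(cl *kgo.Client) {
	for range time.Tick(5 * time.Second) {
		log.Printf("buffered produce records: %d", cl.BufferedProduceRecords())
	}
}
```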
* ConnTimeoutOverhead has been lowered from 20s to 10s. This will allow quicker detection of hung writes. For most purposes, this will be a fine timeout, and requests will retry if necessary. For extremely bad connections, users may need to raise this.
* RetryTimeout has been lowered from a default of 1m to 30s, with JoinGroup, SyncGroup, and HeartbeatRequest being exempted to the session interval by default. For requests, this will allow quicker detection of hung brokers and will not retry as long. Internally in the client, this *should* result in just extra logging in most cases. For groups, this will avoid trying to stay in a group longer than the group would be valid.

For #92.
As a group consumer, we expect 1 active join group request, and then 1 active sync group request. Both of these requests can hang. If a client wants to update metadata while either of these requests is hanging, and the metadata happens to choose the same connection, then the metadata request will be blocked. We do not want this.

If an admin issues a request that contains a TimeoutMillis field, it is possible that the request will take a while to execute. We do not want to be blocked if this is so.

To fix both of these scenarios, we add two new connection types: cxnGroup, which specifically handles Join and Sync, and cxnSlow, which handles anything with timeout millis.

For #92.
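A rough conceptual sketch of the idea described in this commit, with hypothetical names and simplified logic rather than the actual franz-go internals:

```go
package example

// Conceptual sketch only: the client keeps separate connections per "class"
// of request, so a slow JoinGroup/SyncGroup or a long admin request cannot
// block a metadata request that happens to target the same broker.
type cxnClass int

const (
	cxnNormal cxnClass = iota // metadata, fetch, produce, ...
	cxnGroup                  // JoinGroup and SyncGroup, which can hang by design
	cxnSlow                   // any request carrying a TimeoutMillis field
)

// classify picks which connection class a request should use.
func classify(apiKey int16, hasTimeoutMillis bool) cxnClass {
	const joinGroupKey, syncGroupKey = 11, 14 // Kafka API keys for JoinGroup/SyncGroup
	switch {
	case apiKey == joinGroupKey || apiKey == syncGroupKey:
		return cxnGroup
	case hasTimeoutMillis:
		return cxnSlow
	default:
		return cxnNormal
	}
}
```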
I've pushed a change that I hope helps address this issue: timeouts have been lowered from 1min to 30s (for heartbeat, actually only to 45s by default). More importantly, the connection overhead has been lowered to 10s. These two changes combined will make some failover a bit quicker.

I see above that you've actually lowered the conn timeout overhead to 1s -- this is fairly aggressive imo and doesn't allow room for unexpected yet temporary latency. 5s may be a bit safer. 1s may be a bit aggressive: if MSK is replying to a fetch request but the reply takes just over 1s, the client would cut the request, and then if that keeps happening, you'll have no progress.

For the FetchMetadataRequest scenario where requests hit Node2, the metadata can hang for 10s, but after the request is cut due to timeout, the client will choose the next available broker. So, this should only result in 10s of hang. As well, the client will now use a dedicated connection for join & sync. Both of these requests can take some time, so if a metadata request gets stacked on either of these, the metadata will be blocked. That should no longer be the case. I don't think this change actually helps this problem, but if it does, all the better.

I'm not sure I'd like to switch the broker selection strategy at the moment (last+1). In the happy path, it shouldn't help much, and in the unhappy path, we're bouncing around which broker to choose anyway. When metadata is good and things are stable, metadata is only refreshed once every 5 minutes. When things are failing, we have to choose a different broker again anyway.

Did you get a chance to check out the
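On the 1s-vs-5s overhead point above, a minimal sketch of setting a middle-ground value; the option name is assumed from the kgo API and the broker address is a placeholder:

```go
package example

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// NewClientWithOverhead builds a client with a 5s connection timeout
// overhead: a middle ground between the 10s default and the 1s value
// mentioned above, which risks repeatedly cutting replies that are merely
// slow. The broker address is a placeholder.
func NewClientWithOverhead() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConnTimeoutOverhead(5*time.Second),
	)
}
```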
I'm going to close the issue for now given the prior commits attempting to help address this. If you investigate further, or if the LeastBackupPartitioner helps, please let me know. This seems like something that, if we can't solve within the client itself, could be documented somewhere in the README / docs, but I'm not 100% sure what to document yet.
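For anyone trying that out, switching the producer to that partitioner would look roughly like this; it assumes the RecordPartitioner option and LeastBackupPartitioner constructor from the kgo API, with a placeholder broker address:

```go
package example

import "github.com/twmb/franz-go/pkg/kgo"

// NewLeastBackupClient builds a producer that routes records toward the
// partitions with the least buffered data, so one slow or failing partition
// backs up less produce. The broker address is a placeholder.
func NewLeastBackupClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.RecordPartitioner(kgo.LeastBackupPartitioner()),
	)
}
```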
I've tagged the above changes in v1.2.
We experienced high latencies during the standard Kafka failover process which happens on AWS MSK, and we were able to track it down to the issue where the heartbeat request hangs for too long with the default library settings.

We figured that the heartbeat request timeout is equal to the OverheadTimeout option, which is 20 seconds by default. Also, these requests are retriable, which means that a single heartbeat request, which is very time-sensitive, could stall consumption for some time. We tested our hypothesis with this simple change where we limit total heartbeat execution time to heartbeatInterval, and the situation improved drastically: GetStream@07531b5

For now we decided to lower the overhead timeout and some other timings to achieve acceptable failover latency, but I think the library should have a reasonable limit for the heartbeat request timeout, and it shouldn't be higher than heartbeatInterval.
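For illustration, the kind of bound described here could look roughly like the following; this is a conceptual sketch only, not the linked GetStream patch, and it assumes kmsg-level requests and that the client respects context cancellation:

```go
package example

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

// heartbeatOnce issues a single heartbeat bounded by the heartbeat interval,
// so one hung connection cannot stall the group for the full connection
// overhead plus retries. Illustration of the idea only, not the linked patch.
func heartbeatOnce(cl *kgo.Client, req *kmsg.HeartbeatRequest, interval time.Duration) (*kmsg.HeartbeatResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), interval)
	defer cancel()
	return req.RequestWith(ctx, cl)
}
```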