Default heartbeat timeout is too high #92

Closed
AnatolyRugalev opened this issue Oct 7, 2021 · 7 comments

Comments

@AnatolyRugalev

We experienced high latencies during the standard Kafka failover process that happens on AWS MSK, and we were able to track it down to heartbeat requests hanging for too long with the default library settings.

We figured out that the heartbeat request timeout is equal to the OverheadTimeout option, which is 20 seconds by default. These requests are also retriable, which means a single heartbeat request, which is very time-sensitive, could stall consumption for quite a while.

We tested our hypothesis with this simple change, which limits the total heartbeat execution time to heartbeatInterval, and the situation improved drastically: GetStream@07531b5

For now we decided to lower the overhead timeout and some other timings to achieve acceptable failover latency, but I think the library should have a reasonable limit on the heartbeat request timeout, and it shouldn't be higher than heartbeatInterval.
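
For illustration, a minimal sketch of the idea behind that change: bound the whole heartbeat attempt by the heartbeat interval so a hung connection cannot hold the group for the full overhead timeout. The `heartbeatOnce` helper and the 3-second interval are hypothetical stand-ins, not the library's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// heartbeatOnce is a hypothetical stand-in for issuing a single heartbeat
// request to the group coordinator.
func heartbeatOnce(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the coordinator answered
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	heartbeatInterval := 3 * time.Second

	// Bound the whole heartbeat attempt (internal retries included) by the
	// heartbeat interval instead of the much larger request-timeout overhead,
	// so a hung connection cannot stall the group loop for 20+ seconds.
	ctx, cancel := context.WithTimeout(context.Background(), heartbeatInterval)
	defer cancel()

	switch err := heartbeatOnce(ctx); {
	case errors.Is(err, context.DeadlineExceeded):
		fmt.Println("heartbeat timed out; refresh the coordinator and retry")
	case err != nil:
		fmt.Println("heartbeat failed:", err)
	default:
		fmt.Println("heartbeat ok")
	}
}
```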

AnatolyRugalev changed the title from "Heartbeat timeout is too high" to "Default heartbeat timeout is too high" on Oct 7, 2021
@twmb (Owner) commented Oct 7, 2021

Good point, I'll add an exemption in the conn timeout overhead function to have the timeout be the heartbeat interval and to not have retries.

@twmb (Owner) commented Oct 7, 2021

Actually, I'm not sure dropping the timeout to the heartbeat interval is such a good idea here. If the heartbeat returns an error, the member believes it has lost group membership and needs to rejoin. All partitions it was consuming are "lost", leading to a total stall in consuming. This is fine right now because the Request code retries internally, so any error that does get returned really is a hard error.

Dropping the total request retry limit to the heartbeat interval makes the request susceptible to failure on temporary network flakiness.

If a non-leader loses its partitions and rejoins a stable group, Kafka will handle this without triggering a rebalance and the member will just get its old assignment back. The only downside here is the member has a total consuming stall between the period of the heartbeat failure to when it rejoins.

If a leader loses its partitions and rejoins, Kafka will trigger a rebalance for the group. If using cooperative-sticky (the franz-go default), this will also be pretty negligible, because all members will continue consuming and will just be reassigned what they were previously assigned. If not using cooperative-sticky, there will be a total stall in consuming for the group.

Kafka (MSK) should be able to fail over groups between brokers during routine maintenance, such that a group actually doesn't need to rebalance at all: members heartbeat to broker 0, eventually the broker returns NOT_COORDINATOR when it is shutting down, and then members refresh the coordinator to speak to the new broker (at least, I hope this is how it works...).

I think a better upper bound on the heartbeat request would be the session timeout: if the client cannot heartbeat for the entire session timeout, then it can more accurately assume that it (the member itself) has now timed out. This would still result in relatively high latencies (45s), but is less prone to the issues above. What do you think?

Also what types of latencies were you experiencing? The consumer group is independent of the consuming itself. Was consuming stalled?
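
For reference, a hedged sketch of the group timings being discussed, using what I believe are the franz-go v1.x client options; the broker address, group, and topic are placeholders, and the durations shown are the defaults mentioned above (45s session timeout, 3s heartbeat interval).

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// The session timeout is the natural upper bound on how long a heartbeat
	// is worth retrying: past it, the broker considers the member dead anyway.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),           // placeholder broker
		kgo.ConsumerGroup("example-group"),          // placeholder group
		kgo.ConsumeTopics("example-topic"),          // placeholder topic
		kgo.SessionTimeout(45*time.Second),          // default session timeout
		kgo.HeartbeatInterval(3*time.Second),        // default heartbeat cadence
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```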

@nmerkulov commented Oct 8, 2021

Hey there!
@AnatolyRugalev and I are working together =)

TBH I'm not 100% sure that the spikes we're experiencing during Kafka failover are always related to the heartbeat. It was our main guess.

Let me describe what we're experiencing

Among other things, we measure two latencies - producer latency and e2e latency (I picked the idea up from here).

Most of the spikes related to reboots are in e2e latency. It doesn't happen every time, but we managed to reproduce it often enough - one out of 3-4 broker reboots results in this. The problem with those reboots is that we can't control the schedule or pause AWS MSK maintenance, and sometimes AWS only notifies us a couple of days in advance. So we want to make it as smooth as possible.

Here are some numbers:

* With the default config, one (or sometimes a few) consumer groups get stuck and don't consume anything for about a minute and a half. It results in ~1.5 minute e2e latency spikes.
* After we decreased the Dialer timeout to 2 seconds, RequestTimeoutOverhead to 1 second, and RetryTimeout to 6 seconds (see the sketch below), we got this under 30 seconds. But essentially it stays the same - once a broker is down, the consumer group stops consuming.
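
For clarity, roughly what that lowered configuration looks like - a sketch only: the broker address is a placeholder, and depending on the franz-go version the overhead option is named RequestTimeoutOverhead or ConnTimeoutOverhead. As noted later in the thread, 1s is probably too aggressive for production.

```go
package main

import (
	"net"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Aggressive timeouts: 2s dial, 1s request overhead, 6s retry timeout.
	dialer := &net.Dialer{Timeout: 2 * time.Second}
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		kgo.Dialer(dialer.DialContext),
		kgo.RequestTimeoutOverhead(1*time.Second),
		kgo.RetryTimeout(6*time.Second),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```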

Our staging setup is quite simple - 3 brokers, 1 topic with 3 partitions

We injected a debug logger into the client to grep some logs. We figured out that the group tried to connect to the rebooted broker for about a minute and a half, then a rebalance happened, and after that all the stuck messages got consumed. That's why we thought it might be related to the rebalance. Because, well, there are only 3 partitions: if one node is down, it means the leader broker for a partition has changed.

I can imagine an alternative explanation as well.

Let's imagine that during failover, Node2 is down, and the current consumer tries to fetch data from Node2. It waits until all timeouts are exhausted and then asks to update the metadata. To update the metadata we actually have to make a new request to one of the brokers, and that broker is picked randomly. This FetchMetadataRequest hits Node2, which is down, also waits until all timeouts are exhausted, and then retries with another broker. It then reaches Node1, the new metadata is fetched, and the consumer/producer can start making requests to the surviving brokers.

If that sounds realistic, then we can try to give each broker a sort of weight and increase it every time a request fails. To fetch metadata, we can then pick the broker with the least weight (see the sketch below).
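
To make the suggestion concrete, a toy sketch of that weighting idea. This is purely illustrative - it is not how franz-go picks brokers, and all names here are made up.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerPicker tracks a failure "weight" per broker: every failed request
// bumps the weight, and metadata refreshes prefer the lowest-weight broker.
type brokerPicker struct {
	mu      sync.Mutex
	weights map[string]int // broker address -> failure count
}

func newBrokerPicker(addrs []string) *brokerPicker {
	w := make(map[string]int, len(addrs))
	for _, a := range addrs {
		w[a] = 0
	}
	return &brokerPicker{weights: w}
}

// fail records a failed request against a broker.
func (p *brokerPicker) fail(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.weights[addr]++
}

// pick returns the broker with the fewest recorded failures.
func (p *brokerPicker) pick() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	best, bestWeight := "", int(^uint(0)>>1) // start at max int
	for addr, w := range p.weights {
		if w < bestWeight {
			best, bestWeight = addr, w
		}
	}
	return best
}

func main() {
	p := newBrokerPicker([]string{"node1:9092", "node2:9092", "node3:9092"})
	p.fail("node2:9092") // node2 is down; its weight grows
	fmt.Println("metadata broker:", p.pick())
}
```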

@twmb (Owner) commented Oct 11, 2021

I think there can be some timeout tuning in some places. Metadata could probably be lower, since if it fails, internally the only option is to retry anyway (rather than fail a group or something). I've also always wanted to improve which broker is chosen for metadata, but it's not the most obvious change and has been lower priority.

Is it also possible that the producer side is stalled, because there are so many records backed up for the partitions that are currently not accepting writes? This could be checked with the LeastBackupPartitioner.
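
For anyone following along, a minimal sketch of opting into that partitioner, assuming franz-go v1.x's RecordPartitioner option and LeastBackupPartitioner constructor; the broker address is a placeholder.

```go
package main

import (
	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// LeastBackupPartitioner steers new records away from partitions that
	// already have a large buffered backlog (e.g. partitions whose leader
	// is currently down), which helps check whether the producer stall is
	// caused by records backing up on unavailable partitions.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		kgo.RecordPartitioner(kgo.LeastBackupPartitioner()),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```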

twmb added a commit that referenced this issue Oct 16, 2021
* ConnTimeoutOverhead has been lowered from 20s to 10s. This will allow
quicker detection of hung writes. For most purposes, this will be a fine
timeout, and requests will retry if necessary. For extremely bad
connections, users may need to raise this.

* RetryTimeout has been lowered from a default of 1m to 30s, with
JoinGroup, SyncGroup, and HeartbeatRequest being exempted to the session
interval by default. For requests, this will allow quicker detection of
hung brokers and will not retry as long. Internally in the client, this
*should* result in just extra logging in most cases. For groups, this
will avoid trying to stay in a group longer than the group would be
valid.

For #92.
twmb added a commit that referenced this issue Oct 16, 2021
As a group consumer, we expect 1 active join group request, and then 1
active sync group request. Both of these requests can hang. If a client
wants to update metadata while either of these requests are hanging,
and the metadata happens to choose the same connection, then the
metadata request will be blocked. We do not want this.

If an admin issues a request that contains a TimeoutMillis field, it is
possible that the request will take a while to execute. We do not want
to be blocked if this is so.

To fix both of these scenarios, we add two new connection types:
cxnGroup, which specifically handles Join and Sync, and cxnSlow, which
handles anything with timeout millis.

For #92.
@twmb (Owner) commented Oct 16, 2021

I've pushed a change that I hope helps address this issue: retry timeouts have been lowered from 1min to 30s (for the heartbeat, it's actually 45s by default). More importantly, the connection overhead has been lowered to 10s. These two changes combined will make some failovers a bit quicker. I see above that you've lowered the conn timeout overhead to 1s -- that's fairly aggressive imo and doesn't leave room for unexpected yet temporary latency; 5s may be a bit safer. If MSK is replying to a fetch request but the reply takes just over 1s, the client would cut the request, and if that keeps happening, you'll make no progress.

For the FetchMetadataRequest scenario where requests hit Node2, the metadata can hang for 10s, but after the request is cut due to the timeout, the client will choose the next available broker. So this should only result in about 10s of hang.

As well, the client will now use a dedicated connection for join & sync. Both of these requests can take some time, so if a metadata request got stacked behind either of them on the same connection, the metadata would be blocked. That should no longer be the case. I don't think this change actually helps this particular problem, but if it does, all the better.

I'm not sure I'd like to switch the broker selection strategy at the moment (it's currently last+1). In the happy path, it shouldn't help much, and in the unhappy path, we're bouncing around which broker to choose anyway. When metadata is good and things are stable, metadata is only refreshed once every 5 minutes. When things are failing, we have to choose a different broker again anyway.

Did you get a chance to check out the LeastBackupPartitioner? It may be worth playing with, although it may lead to imbalance down the line.

@twmb (Owner) commented Oct 17, 2021

I'm going to close the issue for now given the prior commits attempting to help address this. If you investigate further, or if the LeastBackupPartitioner helps, please let me know. This seems like something that, if we can't solve within the client itself, could be documented somewhere in the README / docs, but I'm not 100% sure what to document yet.

twmb closed this as completed on Oct 17, 2021
@twmb (Owner) commented Oct 21, 2021

I've tagged the above changes in v1.2.
