Consumer intermittently hangs when committing offsets #1728
I upgraded a production box to 8c07925 and processes still intermittently hang. However, when the application calls
Raw logs of a failure:
Raw logs of a successful background heartbeat--note that the background thread both issues the request and handles the response:
Hmm, seems like out-of-order delivery of packets. As you can see above, you got the OffsetCommit response in the heartbeat thread. Maybe related to the change to async sends.
I started logging the thread names/IDs (should have done this earlier, but this code has some messy custom logging attached, so it wasn't easy to do), and now this feels even more weird. When this bug happens, the foreground thread starts receiving all responses, including heartbeat responses. But the root cause isn't the foreground thread taking a permanent lock, because the background heartbeat thread continues to emit heartbeat requests; it just doesn't receive the responses.
Interesting -- this error means that the connection's in-flight-requests queue is getting out of sync with the actual in-flight requests. FWIW, because the stack is asynchronous and communicates via futures, it should not matter which thread actually processes the response. In this case, it would be normal behavior for the heartbeat thread to "see" a commit response message and, if things were working properly, find the associated future and signal that it succeeded. Separately, the consumer thread may be blocking while waiting for the commit response, but it should in fact be blocking on the future, so once that future is completed it should unblock and continue.

Anyways, in this case I think the issue is in BrokerConnection._send: looking carefully, we have to make sure that once a request is added to the _protocol buffer, the response future is also added to the in-flight-requests list. Otherwise, if we add to the _protocol buffer but do not add to the IFR list, we will still end up sending the request when we drain the buffer, and when we get the response we will pull the next future from the IFR list; when it doesn't match up, we get this exception.

It is possible that some code path within conn._send allows adding a request to the _protocol buffer without adding a future to the IFR list, but I can't find one right now. The other way this could happen is simply multiple threads calling _send concurrently. Since we no longer lock the send path, two requests could come in at the same time and get placed (1) R1 -> _protocol buffer, (2) R2 -> _protocol buffer, (3) R2 -> IFR list, (4) R1 -> IFR list. Replace R1/R2 with heartbeat and commit requests and you probably get the log output / errors you posted above. If I'm right, we need to add some additional per-connection locking.

My locking philosophy so far has been to avoid locks in BrokerConnection and force the caller (KafkaClient) to manage concurrency issues. That worked fine when we just wrapped everything in the client lock, but since we are trying here to unblock multiple sends, I think we'll need to add a lock to the BrokerConnection instance. I can work on a patch for that now.

Thanks for testing this out and helping to debug!
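For illustration, here is a minimal sketch of the interleaving described above and the per-connection lock that would prevent it. The class and attribute names (Connection, _protocol_buffer, in_flight_requests) only mirror the discussion; this is a simplified model, not the actual kafka-python BrokerConnection code.

```python
import threading
from collections import deque


class Connection:
    """Simplified model of the send path discussed above (illustrative only)."""

    def __init__(self):
        self._protocol_buffer = deque()    # requests waiting to be written to the socket
        self.in_flight_requests = deque()  # futures, popped FIFO as responses arrive
        self._lock = threading.Lock()      # the per-connection lock proposed above

    def send(self, request, future):
        # Without the lock, two threads can interleave the two appends:
        #   T1: buffer.append(R1)    T2: buffer.append(R2)
        #   T2: ifr.append(F2)       T1: ifr.append(F1)
        # The broker answers R1 first, but the client pops F2 -> correlation mismatch.
        with self._lock:
            self._protocol_buffer.append(request)
            self.in_flight_requests.append(future)
        return future
```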
Interesting, I had assumed it meant the wrong callback was getting assigned to the OffsetRequest future. The logs seemed to indicate the response was getting correctly decoded/processed until shortly before the callback.

Thanks for clarifying. That makes sense, so the overall hypothesis is requests being mis-ordered due to concurrent sends.

I didn't realize the IFR list was used when determining how to decode a response -- I thought it was only used to cap the number of in-flight requests and to time out requests that had been in the queue too long. For response decoding, I assumed the client used the Kafka protocol API key to decode the rest of the response, then used the correlation ID to identify the existing request future and fire any callbacks, as well as to remove it from the IFR queue. So the response would be fully decoded/processed by the client without regard to any state in the IFR queue. I never actually stepped through the code; this just seemed the intuitive way to do it, so I could easily be overlooking something.

The problem I see with tying the knowledge of how to process a response to the request's position in the IFR queue is: what happens if we send two requests to the same broker and the broker returns the responses out of order? I'm not aware of any promise that the Kafka brokers make about returning responses in order to clients. For example, we might send an OffsetCommitRequest followed immediately by a HeartbeatRequest. The OffsetCommit typically must be replicated to 3 brokers, so it's likely to sit in purgatory on the broker, whereas the HeartbeatResponse can come back immediately. But if the OffsetCommit was added to the IFR first, won't that throw an exception when the HeartbeatResponse comes back first and the client tries to decode it as an OffsetCommitResponse?
Great analysis! But (unless this has changed recently) the protocol is ordered, so in this case the heartbeat response would be blocked at the broker until all prior responses were sent. The correlation id is also not required to be unique, so a client that sent two requests with the same id would not know which response was which...
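To make the ordering argument concrete, here is a small self-contained sketch of FIFO response matching: because the broker answers requests on a single connection strictly in order, the client can pop the oldest in-flight future and simply verify the correlation id. Names are illustrative and are not the actual kafka-python internals.

```python
from collections import deque


class CorrelationIdError(Exception):
    """Raised when the next response does not match the oldest in-flight request."""


def match_response(in_flight_requests, recv_correlation_id):
    # The broker processes requests on a connection one at a time and returns
    # responses in order, so the next response must belong to the request at
    # the head of the FIFO queue.
    correlation_id, future = in_flight_requests.popleft()
    if correlation_id != recv_correlation_id:
        # This is the mismatch seen in the logs above: the queue and the socket
        # got out of sync (e.g. via the unlocked concurrent send race).
        raise CorrelationIdError(
            'expected correlation id %d, got %d'
            % (correlation_id, recv_correlation_id))
    return future


# Usage sketch: requests 1 and 2 were sent in order, so responses arrive in order.
ifr = deque([(1, 'commit-future'), (2, 'heartbeat-future')])
assert match_response(ifr, 1) == 'commit-future'
assert match_response(ifr, 2) == 'heartbeat-future'
```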
I can confirm. The ordering is strict: the broker will only process one request at a time per connection. Quotas and throttling work on top of this design. To avoid this issue for heartbeats, the Java client uses another connection specifically for coordinator requests, so they do not conflict with fetches (not sure where commits are sent, though).
If you both say it, then it must be true. 😄 And yeah, now that you mention it, I remember reading the correlation ID is not guaranteed to be unique... so that wouldn't work... Fixed by #1733. |
There is some sort of deadlock that we are intermittently hitting within kafka-python when our applications call commit(). The consumer drops out of the group without the process actually dying, and the only fix is to restart the process. This is hurting us badly: we are having to restart multiple applications every six hours. However, I have not yet been able to reliably reproduce what's happening. Here is what I have so far.

The code is wrapped in a thin layer, so the pseudo-code/configs may be named slightly differently, but it should be relatively clear what's happening.
Kafka broker version: 1.0.1
kafka-python version: commit eed59ba (this is newer than 1.4.4, as I wanted to be sure we picked up the latest deadlock fix from a4f0cb8)
KafkaConsumer config:
Application pseudo-code:
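The original config and pseudo-code blocks did not survive in this copy of the issue. As a stand-in, here is a hedged sketch of the pattern described (explicit commit() after processing, auto-commit disabled); the topic, group, broker address, and process() helper are placeholders, not the reporter's actual code.

```python
from kafka import KafkaConsumer


def process(message):
    # Placeholder for the application's message handling.
    print(message.topic, message.partition, message.offset)


# Placeholder topic/group/broker names -- the real config is not reproduced here.
consumer = KafkaConsumer(
    'example-topic',
    group_id='example-group',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=False,  # offsets are committed explicitly below
)

for message in consumer:
    process(message)
    consumer.commit()  # synchronous commit -- the call that intermittently hangs
```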
I enabled debug logging for kafka-python, and noticed a difference between healthy and unhealthy logs.

Here are healthy logs:

Here are the stuck logs:

Diff'ing the two, it appears that something is breaking between the following two lines:
I am not sure whether the problem is in the main thread or the background heartbeat thread: the debug logs do not include the thread/threadID, so I cannot completely rule this out yet. I assume that if this happened, the main thread would have thrown an error timing out the offset commit request, but I have not yet confirmed that the code works that way.