
set socket timeout for the wake_w #1577

Merged: 2 commits from the fixhangforever branch merged into dpkp:master on Nov 10, 2018

Conversation

flaneur2020 (Contributor) commented Aug 17, 2018

hi Dana,

We're running kafka-python on an SNS site in China. Yesterday the Kafka host hung because of some kernel gotchas, and the application containers hung along with it, so the site was down until we restarted Kafka.

After some investigation, we found that the writing side of the wake_w socketpair is blocking and has no socket timeout. If the sender thread gets blocked and cannot execute wake_r.recv(), then wake_w.sendall(b'x') blocks the producer.send() method without any timeout. We have a max_block_ms config option that applies when the buffer is full or metadata is unavailable, but it does not yet apply to the wake_w send.

We'd prefer that the wake_w send fail fast with a timeout error instead of blocking the producer thread forever, so that a CircuitBreaker can save us from hanging all the containers.

thank you
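To illustrate the failure mode, here is a minimal standalone sketch (not kafka-python's code) of a wake_w/wake_r socketpair whose read side is never drained; without a timeout the write blocks indefinitely, while with settimeout() it fails fast:

import socket

wake_r, wake_w = socket.socketpair()
wake_r.setblocking(False)   # read side, normally drained by the sender thread
wake_w.settimeout(3.0)      # the proposed fix: bound how long a wakeup write may block

try:
    # Simulate repeated wakeup() calls while nothing drains wake_r: once the
    # socket buffers fill up, sendall() blocks, and with the timeout set it
    # raises socket.timeout instead of hanging the calling thread.
    while True:
        wake_w.sendall(b'x')
except socket.timeout:
    print('wakeup write timed out instead of blocking forever')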



flaneur2020 force-pushed the fixhangforever branch 6 times, most recently from 5736393 to 891f927 on August 17, 2018 08:52
@@ -198,6 +199,7 @@ def __init__(self, **configs):
self._bootstrap_fails = 0
self._wake_r, self._wake_w = socket.socketpair()
self._wake_r.setblocking(False)
self._wake_w.settimeout(self.config['max_block_ms'] / 1000.0)
A collaborator commented on the diff:


I remember we sometimes call setblocking(True) in the code. If I remember correctly, that operation resets the timeout. Could you make sure we don't do that on this socket?
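The interaction is easy to check in isolation: setblocking(True) is equivalent to settimeout(None) and therefore clears any earlier timeout, while a later settimeout() overrides an earlier setblocking() call. A quick standalone demonstration:

import socket

a, b = socket.socketpair()
b.setblocking(True)     # equivalent to settimeout(None)
b.settimeout(0.01)      # a later settimeout() overrides the blocking mode
print(b.gettimeout())   # 0.01
b.setblocking(True)     # calling setblocking(True) again would clear the timeout
print(b.gettimeout())   # None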

flaneur2020 (Author) replied Aug 21, 2018:


Yes, here are all the setblocking() calls in 1.4.3:

$ git grep 'setblocking'
kafka/client_async.py:        self._wake_r.setblocking(False)
kafka/conn.py:            self._sock.setblocking(False)
kafka/vendor/socketpair.py:        # setblocking(False) that prevents us from having to create a thread.
kafka/vendor/socketpair.py:                csock.setblocking(False)
kafka/vendor/socketpair.py:                csock.setblocking(True)

The csock in socketpair.py has setblocking(True) called on it before it is returned as self._wake_w in KafkaClient.__init__(), but setblocking() is never called on it again after that.

from kafka import KafkaProducer
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

KAFKA_CONFIG = {
    'batch_size': 163840,
    'linger_ms': 10000,  # enlarge the linger_ms to run into the batch_is_full condition
    'max_block_ms': 10,
    'reconnect_backoff_ms': 300,
}

producer = KafkaProducer(bootstrap_servers='kafka03.dev.rack.xxxxxxx.com:3390', **KAFKA_CONFIG)
i = 0

while True:
    producer.send('log.franz.test', b'some_message_bytes')
    i += 1
    if i % 1000 == 0:
        print('1000')

The script above reproduces the hang: while it is looping, close the Wi-Fi connection. The script then hangs, and pressing ^C gives this traceback:

[Screenshot of the ^C traceback, 2018-08-21 10:52 AM]

After setting this socket timeout, the script quits right after max_block_ms instead of hanging.

dpkp (Owner) commented Aug 31, 2018

Thanks for the PR! I have a few thoughts here:

(1) We should definitely add a timeout to the _wake_w socket! But I'm not sure that we should inherit max_block_ms. I would prefer that we use a different name in the client, like wakeup_block_ms, and have the KafkaProducer pass that config to KafkaClient (probably still based on max_block_ms). This will help if/when we also need to support this in the KafkaConsumer. I also think the default at the client level can and should be much lower than 60000.

(2) Are you expecting the KafkaProducer to raise an exception if there is a connection error? That is not our current design and I am hesitant to add that without a lot of thought. I would prefer to log a warning, the same as we do on socket.error

(3) In normal operation, though, the wake socket pair should be drained very frequently, and we should not see blocked writes. The fact that this happens at all is strange; it suggests that the background producer thread is blocked (or crashed) somewhere as well. I would love to figure out where that is and fix it, because that seems like the root cause. In your local testing, can you print sys._current_frames() when you detect a blocked socket, before raising the exception, and post the results? (You may also need to pass the results to traceback.print_stack().)
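For reference, dumping every thread's current stack at the moment the write blocks could look like the following generic sketch (not code from this PR):

import sys
import traceback

def dump_all_thread_stacks():
    # sys._current_frames() maps each thread id to its current stack frame;
    # printing them should show where the background sender thread is stuck.
    for thread_id, frame in sys._current_frames().items():
        print('Thread %s:' % thread_id)
        traceback.print_stack(frame)

# call dump_all_thread_stacks() right before raising the timeout error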

Thanks again for helping improve this library!

flaneur2020 (Author) commented Sep 3, 2018

Thank you for your advice!

(1) It's much better to separate the concerns between the wakeup timeout in KafkaClient and max_block_ms in KafkaProducer. I renamed max_block_ms in KafkaClient to wakeup_timeout_ms and lowered the default timeout to 3s.

(2) In fact this PR does not introduce any new exception for KafkaProducer. KafkaProducer.send already raises KafkaTimeoutError when it cannot fetch topic metadata or cannot obtain a memory buffer within max_block_ms; this PR adds a timeout to the wakeup() call, alongside those existing timeouts on fetching topic metadata and obtaining the memory buffer.

In my opinion, it's better to fail fast and fail loudly in production systems. A cascading failure caused by timeouts is hard to diagnose: the available backend instances are drained quickly, incoming requests queue up in the proxy, and the huge backlog of queued requests puts even more load on the backends. Even a timeout as small as 100ms can overload the backends quickly, because the timeout reduces the available computing resources. As far as we know, we can escape this with the Circuit Breaker pattern, which requires an exception to catch and break on (see the sketch after this comment).

(3) Agreed, the root cause might be that the sender thread hangs for some reason, so it cannot run the event loop in time to consume the wakeup() IO event.

In my understanding, the sender works in a separate thread to decouple the KafkaProducer from unpredictable network errors. That thread performs some synchronous operations, which is fine because it does not block the user threads, but those synchronous operations can keep the event loop from consuming the wakeup IO event. In my experiment, reducing request_timeout_ms (default 30s) helps shorten the blocked period in this case.
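As an illustration, the reproduction config from earlier could be run with a lowered request_timeout_ms; the 5000 below is an arbitrary example value, not a recommendation from the PR:

KAFKA_CONFIG = {
    'batch_size': 163840,
    'linger_ms': 10000,
    'max_block_ms': 10,
    'reconnect_backoff_ms': 300,
    'request_timeout_ms': 5000,  # default is 30000 (30s); a lower value bounds the sender's blocking send
}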

Here is the sender thread's backtrace as it prints:

ERROR:kafka.conn:Error sending ProduceRequest_v2(required_acks=1, timeout=10000, topics=[(topic='log.franz.test', partitions=[(partition=0, messages='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00(\x89\xd9r\xc5\x01\x00\x00\x00\x01e\x9e\xf0J\x89\xff\xff\xff\xff\x00\x00\x00\x12some_message_bytes\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00(T\xc1DG\x01\x00\x00\x00\x01e\x9e\xf0J\x8b\xff\xff\xff\xff\x00\x00\x00\x12some_message_b...')])]) to <BrokerConnection node_id=2 host=kafka02.dev.rack.xxxxxx.com:3390 <connected> [IPv4 ('10.100.18.2', 3390)]>
Traceback (most recent call last):
  File "/Users/liyazhou/code/kafka-python/kafka/conn.py", line 747, in _send
    total_bytes = self._send_bytes_blocking(data)
  File "/Users/liyazhou/code/kafka-python/kafka/conn.py", line 518, in _send_bytes_blocking
    sent_bytes = self._sock.send(data[total_sent:])
timeout: timed out

Regards.
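To make the circuit-breaker argument in (2) above concrete, a caller could catch the fast-failing timeout roughly like this (a sketch only; the circuit_breaker object is hypothetical and not part of kafka-python):

from kafka.errors import KafkaTimeoutError

def guarded_send(producer, topic, payload, circuit_breaker):
    try:
        future = producer.send(topic, payload)
        circuit_breaker.record_success()   # hypothetical hook
        return future
    except KafkaTimeoutError:
        # With a timeout on the wakeup socket, send() fails fast instead of
        # blocking the request thread; the breaker can trip and shed load
        # until Kafka recovers.
        circuit_breaker.record_failure()   # hypothetical hook
        raise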

jeffwidman (Collaborator) commented:

nudge @dpkp ^^

dpkp merged commit 1c0e894 into dpkp:master on Nov 10, 2018