
set socket timeout for the wake_w #1577

Merged: 2 commits from the fixhangforever branch merged into dpkp:master on Nov 10, 2018

Conversation

flaneur2020 (Contributor) commented Aug 17, 2018

hi Dana,

We're running kafka-python on an SNS site in China. Yesterday the Kafka host hung because of some kernel gotchas, and the application containers hung along with it, so the site was down until we restarted Kafka.

After some investigation, we found that the writing side of the wake_w socketpair is blocking and has no socket timeout. If the sender thread gets blocked and cannot execute wake_r.recv(), then wake_w.sendall(b'x') blocks the producer.send() method without any timeout. We have a max_block_ms config option that applies when the buffer is full or metadata is unavailable, but it does not yet apply to the wake_w send.

We'd prefer that the wake_w send fail fast with a timeout error instead of blocking the producer thread forever, so that a CircuitBreaker can save us from hanging all the containers.

thank you
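To illustrate the failure mode, here is a minimal standalone sketch (not kafka-python's code) of a wake_w/wake_r socketpair whose read side is never drained; without a timeout the write blocks indefinitely, while with settimeout() it fails fast:

import socket

wake_r, wake_w = socket.socketpair()
wake_r.setblocking(False)   # read side, normally drained by the sender thread
wake_w.settimeout(3.0)      # the proposed fix: bound how long a wakeup write may block

try:
    # Simulate repeated wakeup() calls while nothing drains wake_r: once the
    # socket buffers fill up, sendall() blocks, and with the timeout set it
    # raises socket.timeout instead of hanging the calling thread.
    while True:
        wake_w.sendall(b'x')
except socket.timeout:
    print('wakeup write timed out instead of blocking forever')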



flaneur2020 force-pushed the fixhangforever branch 6 times, most recently from 5736393 to 891f927 on August 17, 2018 08:52
@@ -198,6 +199,7 @@ def __init__(self, **configs):
self._bootstrap_fails = 0
self._wake_r, self._wake_w = socket.socketpair()
self._wake_r.setblocking(False)
self._wake_w.settimeout(self.config['max_block_ms'] / 1000.0)
A collaborator commented on the diff:


I remember we sometimes call setblocking(True) in the code. If I remember correctly, that operation resets the timeout. Could you make sure we don't do that on this socket?
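The interaction is easy to check in isolation: setblocking(True) is equivalent to settimeout(None) and therefore clears any earlier timeout, while a later settimeout() overrides an earlier setblocking() call. A quick standalone demonstration:

import socket

a, b = socket.socketpair()
b.setblocking(True)     # equivalent to settimeout(None)
b.settimeout(0.01)      # a later settimeout() overrides the blocking mode
print(b.gettimeout())   # 0.01
b.setblocking(True)     # calling setblocking(True) again would clear the timeout
print(b.gettimeout())   # None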

flaneur2020 (Author) replied Aug 21, 2018:


Yes, here are all the setblocking() calls in 1.4.3:

$ git grep 'setblocking'
kafka/client_async.py:        self._wake_r.setblocking(False)
kafka/conn.py:            self._sock.setblocking(False)
kafka/vendor/socketpair.py:        # setblocking(False) that prevents us from having to create a thread.
kafka/vendor/socketpair.py:                csock.setblocking(False)
kafka/vendor/socketpair.py:                csock.setblocking(True)

The csock in socketpair.py has setblocking(True) called on it before it is returned as self._wake_w in KafkaClient.__init__(), but setblocking() is never called on it again after that.

from kafka import KafkaProducer
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

KAFKA_CONFIG = {
    'batch_size': 163840,
    'linger_ms': 10000,  # enlarge the linger_ms to run into the batch_is_full condition
    'max_block_ms': 10,
    'reconnect_backoff_ms': 300,
}

producer = KafkaProducer(bootstrap_servers='kafka03.dev.rack.xxxxxxx.com:3390', **KAFKA_CONFIG)
i = 0

while True:
    producer.send('log.franz.test', b'some_message_bytes')
    i += 1
    if i % 1000 == 0:
        print('1000')

The script above reproduces the hang: while it is looping, close the Wi-Fi connection. The script then hangs, and pressing ^C gives this traceback:

[Screenshot of the ^C traceback, 2018-08-21 10:52 AM]

After setting this socket timeout, the script quits right after max_block_ms instead of hanging.

dpkp (Owner) commented Aug 31, 2018

Thanks for the PR! I have a few thoughts here:

(1) We should definitely add a timeout to the _wake_w socket! But I'm not sure that we should inherit max_block_ms. I would prefer that we use a different name in the client, like wakeup_block_ms, and have the KafkaProducer pass that config to KafkaClient (probably still based on max_block_ms). This will help if/when we also need to support this in the KafkaConsumer. I also think the default at the client level can and should be much lower than 60000.

(2) Are you expecting the KafkaProducer to raise an exception if there is a connection error? That is not our current design and I am hesitant to add that without a lot of thought. I would prefer to log a warning, the same as we do on socket.error

(3) In normal operation, though, the wake socket pair should be drained very frequently, and we should not see blocked writes. The fact that this happens at all is strange; it suggests that the background producer thread is blocked (or crashed) somewhere as well. I would love to figure out where that is and fix it, because that seems like the root cause. In your local testing, can you print sys._current_frames() when you detect a blocked socket, before raising the exception, and post the results? (You may also need to pass the results to traceback.print_stack().)
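For reference, dumping every thread's current stack at the moment the write blocks could look like the following generic sketch (not code from this PR):

import sys
import traceback

def dump_all_thread_stacks():
    # sys._current_frames() maps each thread id to its current stack frame;
    # printing them should show where the background sender thread is stuck.
    for thread_id, frame in sys._current_frames().items():
        print('Thread %s:' % thread_id)
        traceback.print_stack(frame)

# call dump_all_thread_stacks() right before raising the timeout error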

Thanks again for helping improve this library!

flaneur2020 (Author) commented Sep 3, 2018

Thank you for your advice!

(1) It's much better to separate the concerns between the wakeup timeout in KafkaClient and max_block_ms in KafkaProducer. I renamed max_block_ms in KafkaClient to wakeup_timeout_ms and lowered the default timeout to 3s.

(2) In fact this PR does not introduce any new exception for KafkaProducer. KafkaProducer.send already raises KafkaTimeoutError when it cannot fetch topic metadata or cannot obtain a memory buffer within max_block_ms; this PR adds a timeout to the wakeup() call, alongside those existing timeouts on fetching topic metadata and obtaining the memory buffer.

In my opinion, it's better to fail fast and fail loudly in production systems. A cascading failure caused by timeouts is hard to diagnose: the available backend instances are drained quickly, incoming requests queue up in the proxy, and the huge backlog of queued requests puts even more load on the backends. Even a timeout as small as 100ms can overload the backends quickly, because the timeout reduces the available computing resources. As far as we know, we can escape this with the Circuit Breaker pattern, which requires an exception to catch and break on (see the sketch after this comment).

(3) Agreed, the root cause might be that the sender thread hangs for some reason, so it cannot run the event loop in time to consume the wakeup() IO event.

In my understanding, the sender works in a separate thread to decouple the KafkaProducer from unpredictable network errors. That thread performs some synchronous operations, which is fine because it does not block the user threads, but those synchronous operations can keep the event loop from consuming the wakeup IO event. In my experiment, reducing request_timeout_ms (default 30s) helps shorten the blocked period in this case.
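As an illustration, the reproduction config from earlier could be run with a lowered request_timeout_ms; the 5000 below is an arbitrary example value, not a recommendation from the PR:

KAFKA_CONFIG = {
    'batch_size': 163840,
    'linger_ms': 10000,
    'max_block_ms': 10,
    'reconnect_backoff_ms': 300,
    'request_timeout_ms': 5000,  # default is 30000 (30s); a lower value bounds the sender's blocking send
}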

Here is the sender thread's backtrace as it prints:

ERROR:kafka.conn:Error sending ProduceRequest_v2(required_acks=1, timeout=10000, topics=[(topic='log.franz.test', partitions=[(partition=0, messages='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00(\x89\xd9r\xc5\x01\x00\x00\x00\x01e\x9e\xf0J\x89\xff\xff\xff\xff\x00\x00\x00\x12some_message_bytes\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00(T\xc1DG\x01\x00\x00\x00\x01e\x9e\xf0J\x8b\xff\xff\xff\xff\x00\x00\x00\x12some_message_b...')])]) to <BrokerConnection node_id=2 host=kafka02.dev.rack.xxxxxx.com:3390 <connected> [IPv4 ('10.100.18.2', 3390)]>
Traceback (most recent call last):
  File "/Users/liyazhou/code/kafka-python/kafka/conn.py", line 747, in _send
    total_bytes = self._send_bytes_blocking(data)
  File "/Users/liyazhou/code/kafka-python/kafka/conn.py", line 518, in _send_bytes_blocking
    sent_bytes = self._sock.send(data[total_sent:])
timeout: timed out

Regards.
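To make the circuit-breaker argument in (2) above concrete, a caller could catch the fast-failing timeout roughly like this (a sketch only; the circuit_breaker object is hypothetical and not part of kafka-python):

from kafka.errors import KafkaTimeoutError

def guarded_send(producer, topic, payload, circuit_breaker):
    try:
        future = producer.send(topic, payload)
        circuit_breaker.record_success()   # hypothetical hook
        return future
    except KafkaTimeoutError:
        # With a timeout on the wakeup socket, send() fails fast instead of
        # blocking the request thread; the breaker can trip and shed load
        # until Kafka recovers.
        circuit_breaker.record_failure()   # hypothetical hook
        raise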

jeffwidman (Collaborator) commented:

nudge @dpkp ^^

dpkp merged commit 1c0e894 into dpkp:master on Nov 10, 2018