producer.close hangs when the cluster is unreachable #268

Comments
Yeah, that's because we wait for the queue to be empty. I will take a shot at this and maybe it can be mitigated on our side. One thing that needs to happen, though, is an error callback trigger on failed delivery, for instrumentation reasons.
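For illustration, a minimal sketch (not from this thread) of what such instrumentation can look like with the producer-level delivery callback in rdkafka-ruby; it assumes a gem version where the delivery report exposes an error field, and the broker address is illustrative:

```ruby
# Sketch, not from the thread: assumes an rdkafka-ruby version whose
# DeliveryReport exposes an error field for failed/evicted deliveries.
config = { 'bootstrap.servers': 'localhost:9092' } # illustrative address

producer = Rdkafka::Config.new(config).producer

producer.delivery_callback = lambda do |report|
  # Fires for every delivery report, including messages evicted after
  # message.timeout.ms expires.
  puts "delivery failed: #{report.error}" if report.error
end
```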
💯 agree
Ok, after a set of tests: it's not a bug and it does not hang forever. It will close after 5 minutes, which is what you have as a default. This is consistent and expected.
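A hedged sketch (not from the thread) of bounding that wait by lowering message.timeout.ms; the broker address and the 10-second value are illustrative:

```ruby
# Sketch: the wait on close is bounded by message.timeout.ms, which defaults
# to 300000 ms (5 minutes) in librdkafka. Lowering it makes close give up on
# undeliverable messages sooner.
config = {
  'bootstrap.servers': 'localhost:9092', # illustrative address
  'message.timeout.ms': 10_000           # evict undelivered messages after 10s
}

producer = Rdkafka::Config.new(config).producer
producer.produce(topic: 'test', payload: 'test')

# With the cluster unreachable, close returns after roughly 10 seconds once
# the buffered message is evicted, instead of after 5 minutes.
producer.close
```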
On top of that, upon eviction from the queue a proper instrumentation call is being emitted. Example from waterdrop (as it has callbacks hooked):

```ruby
producer = WaterDrop::Producer.new

producer.setup do |config|
  config.deliver = true
  config.kafka = {
    'bootstrap.servers': 'localhost:9093',
    'request.required.acks': 1,
    'message.timeout.ms': 5_000
  }
end

producer.monitor.subscribe('error.occurred') do |event|
  error = event[:error]
  p "WaterDrop error occurred: #{error}"
end
```
```ruby
irb(main):029:0> producer.produce_async(topic: 'test', payload: 'test')
=> #<Rdkafka::Producer::DeliveryHandle:0x00007f41cf6156d0>
irb(main):030:0> "WaterDrop error occurred: Local: Broker transport failure (transport)"
```
When you close a producer whose messages were already evicted (but properly evicted ;)), it closes without waiting. Thus this works exactly as expected as it is now.
Expanded my explanation here: https://karafka.io/docs/FAQ/#why-does-waterdrop-hang-when-i-attempt-to-close-it
Thank you for your detailed explanation. I tried the following script in Google Cloud Shell and found an interesting thing.
Just curious, do you know where the extra 1s comes from ❓
After the message is evicted, there is still a delay in constructing the failed messages + reason and shipping them via the error callback, and then the delay in polling this and passing it on on the binding side. I cannot give you the exact location of this, though. Sorry.
Thanks for looking into this @mensfeld! I understand the semantics of rdkafka-ruby better now. IMO, it could still make sense to allow using separate timeouts during steady state ( |
@dmariassy but flush is something else. Flush bypasses the wait time for message sets to be constructed and then attempts to dispatch. A failed dispatch on flush still obeys the eviction policies and keeps the messages there. It's just that you do not wait for the retries to kick in in a blocking manner.
So it does not alter this behaviour. We could add the ability to drop the messages via purge, but then none of them would propagate to error callbacks on delivery eviction.
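For context, a hypothetical sketch of what such a purge-based escape hatch could look like; the `purge` call is assumed here for illustration and is not confirmed by this thread to exist in the gem:

```ruby
# Hypothetical sketch: assumes a `purge` binding over librdkafka's purge API
# were exposed on the producer object (an assumption, not confirmed above).
producer.produce(topic: 'test', payload: 'stuck message')

producer.purge # drop whatever is still buffered instead of waiting out message.timeout.ms
producer.close # returns quickly because the internal queue is now empty
```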
About that 😄 I'm playing around with flush and that's not what I'm seeing (but I might be missing something):

```ruby
config = {
  "bootstrap.servers": "localhost:9092",
  "linger.ms": 5000,
}

producer = Rdkafka::Config.new(config).producer
producer.produce(topic: "test", payload: "1")
producer.flush(1000)
sleep 1
```

The event never lands in the topic ☝🏻 Am I missing something? It appears that flush isn't transmitting message sets that aren't ready yet (given the linger.ms setting).
Gotcha. I see the tradeoff 🤔
Give it a ms to actually go where it should (as it's async and librdkafka uses multiple queues and buffers down the road):

```ruby
producer = Rdkafka::Config.new(config).producer
producer.produce(topic: "test-me", payload: "1")
sleep(0.001)
producer.flush(1000)
```

Then it goes where expected (few attempts):

This is why we use

So my guess here is that this message is not yet ready to be shipped. It may be that there are other things happening.
The question is: is linger.ms ignored? My experience shows it should not be, and your example confirms that (1ms < 5000ms). The real question is about the flush not flushing that one message. It may be something worth looking into, though if you close, it behaves as expected, and you always want to close.
I actually sleep for a second
So
Let me correct myself:
flush() is mainly used to wait for outstanding messages to be delivered before terminating the producer, it is not aimed to be used in a synchronous produce+flush cycle. See https://github.com/edenhill/librdkafka/wiki/FAQ#why-is-there-no-sync-produce-interface
You always want to use close. If you are implementing stuff like this, look at https://github.com/karafka/waterdrop + read my FAQ on that matter. There are several recommendations for production-grade, stable systems operations.
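For reference, a minimal sketch (not from the thread) of a synchronous produce in rdkafka-ruby via the delivery handle instead of flush; the 10-second timeout is illustrative:

```ruby
# Sketch: block on the delivery handle rather than calling flush.
handle = producer.produce(topic: 'test', payload: '1')
report = handle.wait(max_wait_timeout: 10) # seconds; raises if delivery does not complete in time
puts "delivered to partition #{report.partition} at offset #{report.offset}"
```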
Sleep between produce and flush ;)
I'm still not seeing the data. But the example works on my co-worker's machine, so it's possible that my local env is borked. Thanks for all your answers!
I do not know what you aim to achieve, but if you are looking into advanced cases with warranties, instrumentation and monitoring, support for forking, the ability to distinguish in-process producer instances, and other things like that, you may be better off with a higher-level abstraction like waterdrop. There are many edge cases there, and I would always be happy to accept help in other parts of the Ruby Kafka ecosystem :)
👋🏻 Hi - I'm not sure if this should be a question raised on librdkafka or here, so apologies if this is the wrong place to report this.
Expected behaviour

When producer.close is called, the producer will eventually either close all connections or give up. In either scenario, all producer threads should be cleaned up once close completes.

Actual behaviour
When close is called on a producer with a non-empty buffer, and the target cluster is unreachable, close will hang indefinitely. The producer thread is never properly terminated and cleaned up.

Steps to reproduce
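A minimal sketch of the scenario (not the reporter's original script; the broker address is illustrative and assumed unreachable):

```ruby
# Sketch: the broker address is illustrative and assumed unreachable.
config = { 'bootstrap.servers': 'unreachable-host:9092' }

producer = Rdkafka::Config.new(config).producer
producer.produce(topic: 'test', payload: 'test') # stays in the internal buffer

# Blocks until the buffer drains or message.timeout.ms evicts the message.
producer.close
```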