You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Apr 1, 2024. It is now read-only.
Describe the bug
When a topic is unloaded from a broker, a producer can attempt to reconnect before the topic is scheduled to a new broker but after the topic is unloaded. In some instances (discussed more below) this re-connection results in a producer that fails to create or error out and leaves a danging producerFuture that is never resolved or removed (unless the client terminates it's connection). If the topic ends up back on the same broker (either immediately or after another unload occurs from another broker) any producers that try and connect will immediately fail as the existing producerFuture is found and responds to the client with an error.
This apparent break in exception handling (or control flow in general) appears to cause the getOrCreateTopic at https://github.com/apache/pulsar/blob/f2afad353795f70c00910b6dcce5df8c62b94a67/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L902 to never return and the producerFuture to never be resolved. We see this on the client side by a timeout to create a producer in cases where it fails, in other instances the timeout doesn't occur and we instead get a proper error message to the client, indicating that the namespace bundle isn't owned. As best as I can tell, this doesn't have anything to do with the pendingTopicLoadingQueue as we don't see enough topics being loaded that would put us into needing to use that queue.
Now, with the dangling producerFuture in place, once the topic is re-scheduled back to this broker, it will be unable to ever create a producer.
I honestly have no idea why or how the call to getOrCreateTopic appears to never return, but we have about 10 instances we have validated with our logs (attached an instance below) that all show this same pattern and also have heap dumps (can share if desired) that also support this.
We have seen some other issues like apache#6054 (they haven't re-occurred often enough to gather enough data) but as best as we can tell it may be a related issue, where some call to load the topic never completes.
While there is likely some easy workarounds for this specific issue (such as putting a timeout on the producerFuture or by using the epoch in conjunction with the producerId for the producers map), this is really a spooky one, and I wonder if related issues could have a similar root cause!
To Reproduce
Since this appears to be down to a timing issue, it is hard to reliably reproduce, however, the following should work with enough tries:
Have a producer, producing to a topic with a low reconnect timeout
Unload the topic/bundle
Observe that the producer gets a timeout on creating a producer
Unload the topic/bundle again until it makes it back to the initial broker
Observe that you can't get a new producer
Expected behavior
We expect to be able to create a producer after an unload.
Logs
These aren't full logs, just the relevant lines for when a producerFuture is left behind
Bad Instance
Broker Logs
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.BrokerService - Created topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails - dedup is disabled\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] There are no replicated subscriptions on the topic\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.persistent.SubscribeRateLimiter - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] configured subscribe-dispatch rate at broker SubscribeRate{subscribeThrottlingRatePerConsumer=300, ratePeriodInSecond=30}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] configured TOPIC message-dispatch rate at broker DispatchRate{dispatchThrottlingRateInMsg=500000, dispatchThrottlingRateInByte=52428800, ratePeriodInSecond=1}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.AbstractTopic - Disabling publish throttling for persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.978 [bookkeeper-ml-workers-OrderedExecutor-2-0] WARN org.apache.pulsar.broker.service.AbstractTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] Error getting policies null and publish throttling will be disabled\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.380 [Thread-5203370] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.29.28:37282] Subscribing on topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails / canvas-cdc/clean-prod-iad/cdc-transform-71\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.143 [Thread-492744] WARN org.apache.pulsar.broker.service.BrokerService - Namespace bundle for topic (persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails) not served by this instance. Please redo the lookup. Request is denied: namespace=canvas-cdc/prod-iad\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.143 [Thread-492744] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.27.47:40828] persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails configured with schema false\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.073 [bookkeeper-ml-workers-OrderedExecutor-2-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] Topic closed\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.065 [Thread-492726] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.9.29.28:52860][persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails][canvas-cdc/clean-prod-iad/cdc-transform-71] Failed to create consumer: Topic is temporarily unavailable\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.065 [Thread-492726] WARN org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] Attempting to subscribe to a fenced topic\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.064 [Thread-492726] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.29.28:52860] Subscribing on topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails / canvas-cdc/clean-prod-iad/cdc-transform-71\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.024 [Thread-492716] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.27.47:40828][persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] Creating producer. producerId=73\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.629 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails][canvas-cdc/clean-prod-iad/cdc-transform-71] Successfully disconnected and closed subscription\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.629 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails][canvas-cdc/clean-prod-iad/cdc-transform-71] Successfully closed subscription [ManagedCursorImpl{ledger=canvas-cdc/prod-iad/persistent/cluster71-canvas.cluster67_shard_1072.thumbnails, name=canvas-cdc%2Fclean-prod-iad%2Fcdc-transform-71, ackPos=522049:0, readPos=522049:1}]\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.628 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails / canvas-cdc/clean-prod-iad/cdc-transform-71] All consumers removed. Subscription is disconnected\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.627 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - Removed consumer Consumer{subscription=PersistentSubscription{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails, name=canvas-cdc/clean-prod-iad/cdc-transform-71}, consumerId=229, consumerName=5525b, address=/10.9.29.28:52860} with pending 0 acks\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.627 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.Consumer - Disconnecting consumer: Consumer{subscription=PersistentSubscription{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails, name=canvas-cdc/clean-prod-iad/cdc-transform-71}, consumerId=229, consumerName=5525b, address=/10.9.29.28:52860}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.627 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.Producer - Disconnecting producer: Producer{topic=PersistentTopic{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails}, client=/10.9.27.47:40828, producerName=pulsar-prod-iad-199-193, producerId=73}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:15.627 [pulsar-web-33-11] INFO org.apache.pulsar.broker.service.BrokerService - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] Unloading topic\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
Client Logs
{"msg":"17:12:02.508 [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Reconnecting after connection was closed\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:12:02.406 [pulsar-client-io-1-1] WARN org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Could not get connection to broker: 14588 lookup request timedout after ms 30000 -- Will try again in 0.1 s\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:12:02.406 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Failed to create producer: 14588 lookup request timedout after ms 30000\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:15.970 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Creating producer on cnx [id: 0x37f41d66, L:/10.9.27.47:40828 - R:/10.9.36.202:6651]\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:15.968 [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Reconnecting after timeout\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:15.866 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.thumbnails] [pulsar-prod-iad-199-193] Closed connection [id: 0x37f41d66, L:/10.9.27.47:40828 - R:/10.9.36.202:6651] -- Will try again in 0.1 s\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
Good Instance
Broker Logs
{"msg":"17:11:17.631 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.BrokerService - Created topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories - dedup is disabled\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.631 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] There are no replicated subscriptions on the topic\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.630 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.persistent.SubscribeRateLimiter - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] configured subscribe-dispatch rate at broker SubscribeRate{subscribeThrottlingRatePerConsumer=300, ratePeriodInSecond=30}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.630 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] configured TOPIC message-dispatch rate at broker DispatchRate{dispatchThrottlingRateInMsg=500000, dispatchThrottlingRateInByte=52428800, ratePeriodInSecond=1}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.630 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.AbstractTopic - Disabling publish throttling for persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.630 [bookkeeper-ml-workers-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.AbstractTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] Error getting policies null and publish throttling will be disabled\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.369 [Thread-5203753] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.27.47:60270][persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] Creating producer. producerId=97\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:17.218 [ForkJoinPool.commonPool-worker-1] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.29.28:37282] Subscribing on topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories / canvas-cdc/clean-prod-iad/cdc-transform-71\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-2r8zr","container_name":"broker"},"docker":{"container_id":"6cc06889852e17619b2520562cdb2eba70b37d5e318210f0df33ee868378fe8d"}}
{"msg":"17:11:16.989 [Thread-493004] WARN org.apache.pulsar.broker.service.BrokerService - Namespace is being unloaded, cannot add topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.988 [Thread-493004] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.9.27.47:40828][persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] Creating producer. producerId=97\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.967 [bookkeeper-ml-workers-OrderedExecutor-7-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] Topic closed\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.880 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories][canvas-cdc/clean-prod-iad/cdc-transform-71] Successfully disconnected and closed subscription\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.880 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories][canvas-cdc/clean-prod-iad/cdc-transform-71] Successfully closed subscription [ManagedCursorImpl{ledger=canvas-cdc/prod-iad/persistent/cluster71-canvas.cluster67_shard_1072.group_categories, name=canvas-cdc%2Fclean-prod-iad%2Fcdc-transform-71, ackPos=521792:0, readPos=521792:1}]\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.880 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories / canvas-cdc/clean-prod-iad/cdc-transform-71] All consumers removed. Subscription is disconnected\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.880 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - Removed consumer Consumer{subscription=PersistentSubscription{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories, name=canvas-cdc/clean-prod-iad/cdc-transform-71}, consumerId=81, consumerName=5525b, address=/10.9.29.28:52860} with pending 0 acks\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.879 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.Consumer - Disconnecting consumer: Consumer{subscription=PersistentSubscription{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories, name=canvas-cdc/clean-prod-iad/cdc-transform-71}, consumerId=81, consumerName=5525b, address=/10.9.29.28:52860}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.879 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.Producer - Disconnecting producer: Producer{topic=PersistentTopic{topic=persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories}, client=/10.9.27.47:40828, producerName=pulsar-prod-iad-196-161, producerId=97}\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
{"msg":"17:11:16.879 [pulsar-web-33-2] INFO org.apache.pulsar.broker.service.BrokerService - [persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] Unloading topic\n","kube":{"namespace":"pulsar","pod_name":"pulsar-prod-broker-564968ff9c-zljhk","container_name":"broker"},"docker":{"container_id":"c2bc11c1db743603d128641575963a36d3c5ee2f6abca2994df97a7e8720359e"}}
Client Logs
{"msg":"17:11:17.727 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Created producer on cnx [id: 0xd2d7a3f2, L:/10.9.27.47:60270 - R:/10.9.47.166:6651]\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:17.365 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Creating producer on cnx [id: 0xd2d7a3f2, L:/10.9.27.47:60270 - R:/10.9.47.166:6651]\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:17.362 [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Reconnecting after connection was closed\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:17.262 [pulsar-client-io-1-1] WARN org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Could not get connection to broker: Namespace is being unloaded, cannot add topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories -- Will try again in 0.1 s\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:17.261 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Failed to create producer: Namespace is being unloaded, cannot add topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:17.261 [pulsar-client-io-1-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x37f41d66, L:/10.9.27.47:40828 - R:/10.9.36.202:6651] Received error from server: Namespace is being unloaded, cannot add topic persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:16.982 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Creating producer on cnx [id: 0x37f41d66, L:/10.9.27.47:40828 - R:/10.9.36.202:6651]\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:16.980 [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Reconnecting after timeout\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
{"msg":"17:11:16.880 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.group_categories] [pulsar-prod-iad-196-161] Closed connection [id: 0x37f41d66, L:/10.9.27.47:40828 - R:/10.9.36.202:6651] -- Will try again in 0.1 s\n","kube":{"namespace":"pulsar-cdc","pod_name":"pf-canvas-cdc-prod-iad-cluster71-0","container_name":"pulsarfunction"},"docker":{"container_id":"c4e6640923ec8e00f50ca85375c7b37c0de7b59491dd7cdb68800b84f8ac5740"}}
The most important thing to note here is just the difference in when we get a client exception (in the good case) and when we don't (in the bad case). We see this pattern pretty much universally, but, AFAICT from the code, there isn't any reason to believe that that the future shouldn't immediately resolve in both cases.
Additional context
We saw this in 2.4.x but are also seeing it in 2.5.0 using the official docker images. We also are connecting directly to the brokers (this occurs in a pulsar IO source). We see this occur when we have lots of load shedding when we are back-filling data into Pulsar and don't have as well distributed load.
The text was updated successfully, but these errors were encountered:
Original Issue: apache#6416
Describe the bug
When a topic is unloaded from a broker, a producer can attempt to reconnect before the topic is scheduled to a new broker but after the topic is unloaded. In some instances (discussed more below) this re-connection results in a producer that fails to create or error out and leaves a danging
producerFuture
that is never resolved or removed (unless the client terminates it's connection). If the topic ends up back on the same broker (either immediately or after another unload occurs from another broker) any producers that try and connect will immediately fail as the existing producerFuture is found and responds to the client with an error.In #5571, an epoch was added to help address some related issues, however, in this case, as seen in https://github.com/apache/pulsar/blob/f2afad353795f70c00910b6dcce5df8c62b94a67/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L864-L898, the
producerFuture
is retrieved based solely on theproducerId
which the client provides. Because of this, the epoch isn't useful to solve the problem.As to the details of why the producer fails initially, what we see is the following:
org.apache.pulsar.broker.service.BrokerService#checkTopicNsOwnership
(https://github.com/apache/pulsar/blob/f2afad353795f70c00910b6dcce5df8c62b94a67/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1197) and an exception is thrown (we see this warning in our logs)getOrCreateTopic
at https://github.com/apache/pulsar/blob/f2afad353795f70c00910b6dcce5df8c62b94a67/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L902 to never return and theproducerFuture
to never be resolved. We see this on the client side by a timeout to create a producer in cases where it fails, in other instances the timeout doesn't occur and we instead get a proper error message to the client, indicating that the namespace bundle isn't owned. As best as I can tell, this doesn't have anything to do with thependingTopicLoadingQueue
as we don't see enough topics being loaded that would put us into needing to use that queue.producerFuture
in place, once the topic is re-scheduled back to this broker, it will be unable to ever create a producer.I honestly have no idea why or how the call to
getOrCreateTopic
appears to never return, but we have about 10 instances we have validated with our logs (attached an instance below) that all show this same pattern and also have heap dumps (can share if desired) that also support this.We have seen some other issues like apache#6054 (they haven't re-occurred often enough to gather enough data) but as best as we can tell it may be a related issue, where some call to load the topic never completes.
While there is likely some easy workarounds for this specific issue (such as putting a timeout on the
producerFuture
or by using theepoch
in conjunction with theproducerId
for theproducers
map), this is really a spooky one, and I wonder if related issues could have a similar root cause!To Reproduce
Since this appears to be down to a timing issue, it is hard to reliably reproduce, however, the following should work with enough tries:
Expected behavior
We expect to be able to create a producer after an unload.
Logs
These aren't full logs, just the relevant lines for when a
producerFuture
is left behindBad Instance
Broker Logs
Client Logs
Good Instance
Broker Logs
Client Logs
The most important thing to note here is just the difference in when we get a client exception (in the good case) and when we don't (in the bad case). We see this pattern pretty much universally, but, AFAICT from the code, there isn't any reason to believe that that the future shouldn't immediately resolve in both cases.
Additional context
We saw this in 2.4.x but are also seeing it in 2.5.0 using the official docker images. We also are connecting directly to the brokers (this occurs in a pulsar IO source). We see this occur when we have lots of load shedding when we are back-filling data into Pulsar and don't have as well distributed load.
The text was updated successfully, but these errors were encountered: