In some instances, producers fail to create after a topic is scheduled onto previously used broker #6416
Labels
type/bug
The PR fixed a bug or issue reported a bug
Comments
/cc @codelipenghui since he introduced epoch to solve some of the create-producer timeouts.
We are facing the same issue in our environment when auto unload kicks in.
@skyrocknroll Which broker version are you using?
@codelipenghui 2.4.2
addisonj
pushed a commit
to instructure/pulsar
that referenced
this issue
Mar 6, 2020
See apache#6416. This change ensures that all futures within BrokerService have a guaranteed timeout. As stated in apache#6416, we see cases where loading or creating a topic appears to fail to resolve the future for unknown reasons. It appears that these futures *may* not be returning. This seems like a sane change to make to ensure that these futures finish. However, it still isn't understood under what conditions these futures may not be returning, so this fix is mostly a workaround for some underlying issues.
sijie
pushed a commit
that referenced
this issue
Mar 18, 2020
See #6416. This change ensures that all futures within BrokerService have a guaranteed timeout. As stated in #6416, we see cases where loading or creating a topic appears to fail to resolve the future for unknown reasons. It appears that these futures *may* not be returning. This seems like a sane change to make to ensure that these futures finish. However, it still isn't understood under what conditions these futures may not be returning, so this fix is mostly a workaround for some underlying issues. Co-authored-by: Addison Higham <[email protected]>
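The commit message above describes giving futures a guaranteed timeout. As a rough illustration of that idea only (not the code from the referenced commit), the sketch below fails a `CompletableFuture` with a `TimeoutException` if nothing completes it in time. The class name `FutureDeadlineSketch` and the helper `withDeadline` are hypothetical names chosen for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FutureDeadlineSketch {

    /** Hypothetical helper: fails the future with a TimeoutException unless it completes in time. */
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> future,
                                                 long timeout, TimeUnit unit,
                                                 ScheduledExecutorService scheduler) {
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            // No effect if the future has already completed on its own.
            future.completeExceptionally(new TimeoutException("future timed out"));
        }, timeout, unit);
        // Cancel the timer as soon as the future finishes, whatever the outcome.
        future.whenComplete((value, error) -> timer.cancel(false));
        return future;
    }

    /** Returns true if a future that nobody completes is failed by the deadline. */
    static boolean danglingFutureTimesOut() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> neverCompleted = new CompletableFuture<>();
            withDeadline(neverCompleted, 100, TimeUnit.MILLISECONDS, scheduler);
            try {
                neverCompleted.join();
                return false;
            } catch (CompletionException e) {
                return e.getCause() instanceof TimeoutException;
            }
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + danglingFutureTimesOut());
    }
}
```

On Java 9 and later, `CompletableFuture.orTimeout` offers the same behavior without a hand-rolled scheduler. Either way, a timeout only bounds how long a caller waits; it does not explain why the future was left unresolved.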
addisonj
pushed a commit
to instructure/pulsar
that referenced
this issue
May 5, 2020
Fixes apache#6872 Fixes apache#6416 If a client tries to create a producer on a topic that is currently unloading, we can get a `RuntimeException` from `BrokerService.checkTopicNsOwnership`, which is bubbled up through `topic.addProducer`. Because only `BrokerServiceException` is handled, the result is a future that never completes, and producers cannot be created if this topic is scheduled back to this broker.
codelipenghui
pushed a commit
that referenced
this issue
May 6, 2020
Fixes #6872 Fixes #6416 If a client tries to create a producer on a topic that is currently unloading, we can get a `RuntimeException` from `BrokerService.checkTopicNsOwnership`, which is bubbled up through `topic.addProducer`. Because only `BrokerServiceException` is handled, the result is a future that never completes, and producers cannot be created if this topic is scheduled back to this broker.
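The mechanism in the commit message above (an unchecked exception slipping past a handler that only catches one exception type) can be shown in isolation. The following is a minimal sketch, not broker code: `ServiceException` is a hypothetical stand-in for `BrokerServiceException`, and `addProducer` is a hypothetical stand-in for the call that performs the ownership check.

```java
import java.util.concurrent.CompletableFuture;

final class DanglingFutureSketch {

    /** Hypothetical stand-in for a broker-specific checked exception. */
    static final class ServiceException extends Exception {
        ServiceException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in: the ownership failure surfaces as an unchecked exception. */
    static String addProducer(String topic, boolean owned) throws ServiceException {
        if (topic.isEmpty()) {
            throw new ServiceException("invalid topic");
        }
        if (!owned) {
            throw new RuntimeException("namespace bundle not owned by this broker");
        }
        return "producer-for-" + topic;
    }

    /** Handles only ServiceException, so a RuntimeException leaves producerFuture incomplete. */
    static CompletableFuture<String> narrowHandling(boolean owned) {
        CompletableFuture<String> producerFuture = new CompletableFuture<>();
        CompletableFuture.completedFuture("my-topic").thenAccept(topic -> {
            try {
                producerFuture.complete(addProducer(topic, owned));
            } catch (ServiceException e) {
                producerFuture.completeExceptionally(e);
            }
            // A RuntimeException escapes this lambda and is captured only by the
            // future returned from thenAccept, which nobody observes.
        });
        return producerFuture;
    }

    /** Catches every exception, so producerFuture always completes one way or the other. */
    static CompletableFuture<String> broadHandling(boolean owned) {
        CompletableFuture<String> producerFuture = new CompletableFuture<>();
        CompletableFuture.completedFuture("my-topic").thenAccept(topic -> {
            try {
                producerFuture.complete(addProducer(topic, owned));
            } catch (Exception e) {
                producerFuture.completeExceptionally(e);
            }
        });
        return producerFuture;
    }

    public static void main(String[] args) {
        System.out.println("narrow handling done: " + narrowHandling(false).isDone());
        System.out.println("broad handling done: " + broadHandling(false).isDone());
    }
}
```

In the narrow variant the caller is left holding a future that is neither completed nor failed, which matches the symptom reported in this issue: no error reaches the client, and the client eventually times out.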
Describe the bug
When a topic is unloaded from a broker, a producer can attempt to reconnect after the topic is unloaded but before it is scheduled to a new broker. In some instances (discussed more below), this reconnection results in a producer that neither gets created nor errors out, and it leaves a dangling `producerFuture` that is never resolved or removed (unless the client terminates its connection). If the topic ends up back on the same broker (either immediately or after another unload occurs from another broker), any producers that try to connect will immediately fail, because the existing `producerFuture` is found and the broker responds to the client with an error.

In #5571, an epoch was added to help address some related issues. However, in this case, as seen in `pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java` (lines 864 to 898 in f2afad3), the `producerFuture` is retrieved based solely on the `producerId` that the client provides. Because of this, the epoch doesn't help solve the problem.

As to the details of why the producer fails initially, what we see is the following:
- `org.apache.pulsar.broker.service.BrokerService#checkTopicNsOwnership` (`pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java`, line 1197 in f2afad3)
- `pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java`, lines 655 to 665 in f2afad3
- The call to `getOrCreateTopic` at `pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java` (line 902 in f2afad3) appears to never return, which causes the `producerFuture` to never be resolved. We see this on the client side as a timeout when creating a producer in the cases where it fails. In other instances the timeout doesn't occur and the client instead gets a proper error message indicating that the namespace bundle isn't owned. As best as I can tell, this doesn't have anything to do with the `pendingTopicLoadingQueue`, as we don't see enough topics being loaded to need that queue.
- With the `producerFuture` in place, once the topic is re-scheduled back to this broker, it will be unable to ever create a producer.

I honestly have no idea why or how the call to `getOrCreateTopic` appears to never return, but we have validated about 10 instances with our logs (one instance is attached below) that all show this same pattern, and we also have heap dumps (can share if desired) that support this.

We have seen some other issues like #6054 (they haven't re-occurred often enough to gather enough data), but as best as we can tell they may be related, in that some call to load the topic never completes.
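To make the lookup problem described above concrete, here is a minimal sketch of a producer registry. It is not the `ServerCnx` code: the class `ProducerRegistrySketch`, its methods, and the `ProducerKey` type are hypothetical, and the second map only illustrates the idea raised in this issue of keying on the epoch together with the `producerId`, assuming the client sends a higher epoch on each reconnect attempt.

```java
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class ProducerRegistrySketch {

    /** Hypothetical composite key: producer id plus the epoch of the connection attempt. */
    static final class ProducerKey {
        final long producerId;
        final long epoch;

        ProducerKey(long producerId, long epoch) {
            this.producerId = producerId;
            this.epoch = epoch;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof ProducerKey)) {
                return false;
            }
            ProducerKey that = (ProducerKey) other;
            return producerId == that.producerId && epoch == that.epoch;
        }

        @Override
        public int hashCode() {
            return Objects.hash(producerId, epoch);
        }
    }

    private final ConcurrentHashMap<Long, CompletableFuture<String>> byId =
            new ConcurrentHashMap<>();
    private final ConcurrentHashMap<ProducerKey, CompletableFuture<String>> byIdAndEpoch =
            new ConcurrentHashMap<>();

    /** Returns false when an earlier, possibly never-resolved future blocks this producer id. */
    boolean registerById(long producerId) {
        return byId.putIfAbsent(producerId, new CompletableFuture<>()) == null;
    }

    /** Returns false only when the same producer id and the same epoch are already registered. */
    boolean registerByIdAndEpoch(long producerId, long epoch) {
        ProducerKey key = new ProducerKey(producerId, epoch);
        return byIdAndEpoch.putIfAbsent(key, new CompletableFuture<>()) == null;
    }

    public static void main(String[] args) {
        ProducerRegistrySketch registry = new ProducerRegistrySketch();
        registry.registerById(7L); // first attempt leaves a future that is never resolved
        System.out.println("retry keyed by id accepted: " + registry.registerById(7L));

        registry.registerByIdAndEpoch(7L, 0L);
        System.out.println("retry keyed by id and epoch accepted: "
                + registry.registerByIdAndEpoch(7L, 1L));
    }
}
```

Keying on the epoch lets a later attempt through, but the stale entry from the earlier epoch still sits in the map, so a real change would also need to remove or time out the old future.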
While there are likely some easy workarounds for this specific issue (such as putting a timeout on the `producerFuture` or using the `epoch` in conjunction with the `producerId` for the `producers` map), this is really a spooky one, and I wonder if related issues could have a similar root cause!

To Reproduce
Since this appears to be down to a timing issue, it is hard to reproduce reliably; however, the following should work with enough tries:
Expected behavior
We expect to be able to create a producer after an unload.
Logs
These aren't full logs, just the relevant lines for when a `producerFuture` is left behind.

Bad Instance
Broker Logs
Client Logs
Good Instance
Broker Logs
Client Logs
The most important thing to note here is the difference between when we get a client exception (the good case) and when we don't (the bad case). We see this pattern pretty much universally, but, AFAICT from the code, there isn't any reason to believe that the future shouldn't immediately resolve in both cases.
Additional context
We saw this in 2.4.x but are also seeing it in 2.5.0 using the official Docker images. We are also connecting directly to the brokers (this occurs in a Pulsar IO source). We see this occur when there is a lot of load shedding, which happens while we are back-filling data into Pulsar and the load is not as well distributed.