[BUG] Producer with name xxx is already connect to topic #13342
Comments
It looks like the same issue as #13289. /cc @aloyszhang |
Could you confirm whether #13428 fixed the problem? |
I don't think the problem is solved, because when the problem occurs, the I just built a test with the master branch (including #13428 ), and the repeat steps described in the issue can still be reproduced stably. |
The issue had no activity for 30 days, mark with Stale label. |
I'm working on it and some solutions need to be discussed. Once a reasonable solution is finalized, I'll fix it in time. |
I see this problem quite often in some environments. Do you have any news, @mattisonchao? |
Ah, we find that |
For this issue, it's better for the client to configure this value, and the broker will use the client's value. |
There were some bugs that caused the connection to stay open for long durations. |
@wenbingshen I think that this is not a bug that you are describing. It might be a way to reproduce "Producer with name xxx is already connect to topic", but if the steps are followed, I think that it's completely expected behavior. |
@lhotari Great, you get what I'm trying to say! I just finished the work at hand and would like to reply to @Technoboy about keepAliveIntervalSeconds=100. The reproduction steps change keepAliveIntervalSeconds only to make the problem easy to reproduce, so that everyone can understand the problem I want to describe. In the production environment, we never change the keepAliveIntervalSeconds configuration; it has always been the default 30s. Even so, we often see "Producer with name xxx is already connect to topic". |
These PRs look great, but I'm not sure whether they resolve #13061; that may need to be validated in a production environment.
If the steps to reproduce end up being an expected behavior, it may not be easy to reproduce the issue.
@eolivelli often encountered this problem in some environments too. |
Just want to post the explicit requirements that must be met for a producer to reconnect. Here is the producer logic: pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Producer.java Lines 175 to 181 in 7d2fdea
and here is where it's called: pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java Lines 956 to 972 in 7d2fdea
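A minimal sketch of the kind of check involved, assuming that a new producer may replace an old one only when it has the same name, topic, and producer id and arrives on the same connection. The classes `Cnx` and `ProducerInfo`, and the `epoch` comparison, are hypothetical names for illustration and are not the broker's actual code:

```java
import java.util.Objects;

// Hypothetical stand-in for a broker-side connection. Two reconnects from
// different client ports are two distinct Cnx objects.
final class Cnx {
    final String remoteAddress;

    Cnx(String remoteAddress) {
        this.remoteAddress = remoteAddress;
    }
}

// Hypothetical, simplified view of a producer registered on a topic.
final class ProducerInfo {
    final String name;
    final String topic;
    final long producerId;
    final long epoch; // assumed to increase on each reconnect attempt
    final Cnx cnx;

    ProducerInfo(String name, String topic, long producerId, long epoch, Cnx cnx) {
        this.name = name;
        this.topic = topic;
        this.producerId = producerId;
        this.epoch = epoch;
        this.cnx = cnx;
    }

    // The new producer may overwrite `other` only if everything matches,
    // it is on the very same connection, and it carries a newer epoch.
    boolean isSuccessorTo(ProducerInfo other) {
        return Objects.equals(name, other.name)
                && Objects.equals(topic, other.topic)
                && producerId == other.producerId
                && cnx == other.cnx
                && other.epoch < epoch;
    }
}

class SuccessorCheckDemo {
    public static void main(String[] args) {
        Cnx first = new Cnx("10.0.0.5:50001");
        Cnx second = new Cnx("10.0.0.5:50002"); // same client, new port
        ProducerInfo old = new ProducerInfo("p1", "my-topic", 1L, 0L, first);

        // Same connection, newer epoch: allowed to replace the old producer.
        System.out.println(new ProducerInfo("p1", "my-topic", 1L, 1L, first).isSuccessorTo(old));
        // New connection: not a successor, so the add would be rejected.
        System.out.println(new ProducerInfo("p1", "my-topic", 1L, 1L, second).isSuccessorTo(old));
    }
}
```

Under these assumptions, a client that reconnects from a new port is never treated as a successor while the old connection is still registered, which matches the error reported in this issue.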
This is an interesting question. The current design assumes that a producer is "a successor" to a former instance of a producer when it has the same connection. Given the way that producers and the
This raises another question for me. Should the producer use another name when it attempts to reconnect after certain failures, like a failed keep alive timeout? If the producer tried to connect to a broker and get a new producer name from the broker, it'd circumvent the issue here. It wouldn't work for overridden producer names on the client side. It would also lead to certain edge cases around potential duplicates in the produced messages (this would likely already happen when a keep alive fails). |
My fix for #12846 was in 2.8.2, and this issue is opened against 2.8.1. Also note that the go client started using I think we need to reproduce this issue for the go client against a newer broker version before we dig into this further. |
Closing this issue and marking as fixed for now since there is a likely fix and the issue hasn't been reproduced against more fixed versions of Pulsar. |
We are experiencing this quite frequently in our production environment, and each time it happens, it is a production incident for the on-call engineer. |
@Martin-Narvar What Pulsar version are you using? There are more recent fixes such as #21155 and #23123 which are available. |
Describe the bug
Master issue: #13061 apache/pulsar-client-go#676
Pulsar version: 2.8.1
After our investigation, this problem occurs when the ping/pong between the client and the server gradually drifts until the client decides that the connection is closed. The client's connection close operation then fails for network reasons, and the underlying network connection is not torn down, so the Pulsar broker is still waiting for the ping/pong to time out. Meanwhile, the client has already reused the same PartitionProducer, reconnected over the network (from a different port), and sent AddProducer to the Pulsar broker.
#11804: this PR rewrites the equals method of the Producer. As a result, when pulsar-client-go uses a different port to reconnect, the old producer cannot be removed, because the remoteAddress is verified by equals.
#12846: this PR removes equals and uses hashcode for the comparison. At this point, the old producer still cannot be removed.
This problem resolves itself once the Pulsar broker detects the ping/pong timeout, or detects that the channel is abnormal; the connection is then closed and the producer state is cleaned up. When the client sends AddProducer again, it recovers. During this period, however, the client reconnects and starts to add the producer, and the broker always reports an error: Producer with name is already connect to topic.
Therefore, I feel that the current protocol cannot fully prove whether the producer client can overwrite itself. It may be necessary to add some fields to prove: I am me.
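The sequence described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `TopicProducers`, `addProducer`, and `connectionClosed` are hypothetical names, and the registry compares the remote address directly rather than reproducing the broker's real equals/hashcode logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-topic registry: producer name -> remote address of the
// connection that registered it.
final class TopicProducers {
    private final Map<String, String> producers = new HashMap<>();

    // Returns true if the producer was registered (or re-registered from the
    // same address), false if the name is still held by another connection.
    boolean addProducer(String name, String remoteAddress) {
        String existing = producers.get(name);
        if (existing != null && !existing.equals(remoteAddress)) {
            return false; // "Producer with name ... is already connect to topic"
        }
        producers.put(name, remoteAddress);
        return true;
    }

    // Simulates the broker noticing the keep-alive timeout on a connection
    // and cleaning up every producer registered through it.
    void connectionClosed(String remoteAddress) {
        producers.values().removeIf(addr -> addr.equals(remoteAddress));
    }
}

class StaleProducerDemo {
    public static void main(String[] args) {
        TopicProducers topic = new TopicProducers();

        // Initial connection from port 50001.
        System.out.println(topic.addProducer("p1", "10.0.0.5:50001")); // true

        // The client believes the connection is closed and reconnects from a
        // new port, but the broker has not yet timed out the old connection.
        System.out.println(topic.addProducer("p1", "10.0.0.5:50002")); // false

        // The broker finally detects the ping/pong timeout and cleans up.
        topic.connectionClosed("10.0.0.5:50001");
        System.out.println(topic.addProducer("p1", "10.0.0.5:50002")); // true
    }
}
```

In this sketch the client can only recover after the broker has cleaned up the stale connection, which is the window in which the error is reported.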
To Reproduce
Steps to reproduce the behavior:
Producer with name is already connect to topic
Expected behavior
A clear and concise description of what you expected to happen.