[BUG] Producer with name xxx is already connect to topic #13342

wenbingshen · 2021-12-15T14:34:49Z

Describe the bug
Master issue: #13061 apache/pulsar-client-go#676
Pulsar version:2.8.1

After our investigation, this problem occurs when the ping/pong between the client and the server gradually deviates, until the client senses that the connection is closed, and the connection close operation fails due to network reasons, and the underlying network is not disconnected, resulting in pulsar The broker is still waiting for the ping/pong to time out, but the client has already used the same PartitionProducer, reconnected via the network (changed the port), and started AddProducer to the pulsar broker.

#11804, this PR rewrites the equals method of the Producer, resulting in that when different pulsar-client-go uses different port to reconnect, the old producer cannot be removed because the remoteAddress will be verified by equals:

if (producers.remove(producer.getProducerName(), producer)) {

#12846, this pr removes equals and will use hashcode for judgment. At this time, the
old producer cannot be removed.

This problem can be closed when the pulsar broker perceives ping/pong timeout, or the channel is abnormal, and the connection can be closed, and the producer state can be cleaned up. When the client AddProducer again, it can be restored; but during this period, the client reconnects and starts the add producer. The broker will always report an error: Producer with name is already connect to topic.

Therefore, I feel that the current protocol cannot fully prove whether the producer client can overwrite itself. It may be necessary to add some fields to prove: I am me

To Reproduce
Steps to reproduce the behavior:

Change broker keepAliveIntervalSeconds=100
You can choose a pulsar client in any language, such as pulsar-client-go or java and other clients
Use the client to send data to the pulsar server
Use a firewall to disconnect the network between the client and the broker. The time is maintained in 60s. After waiting for 60 seconds, close the firewall
Now, you can check the broker log, at this time you can see the error: Producer with name is already connect to topic

Expected behavior
A clear and concise description of what you expected to happen.

The text was updated successfully, but these errors were encountered:

wenbingshen · 2021-12-15T14:38:21Z

@codelipenghui @BewareMyPower

BewareMyPower · 2021-12-15T14:44:42Z

It looks like the same issue with #13289. /cc @aloyszhang

BewareMyPower · 2021-12-22T07:53:07Z

Could you confirm whether #13428 fixed the problem?

wenbingshen · 2021-12-23T03:54:40Z

Could you confirm whether #13428 fixed the problem?

I don't think the problem is solved, because when the problem occurs, the channelInactive method is not entered, that is, after the producer reconnect, the old connection still does not trigger channelInactive.

I just built a test with the master branch (including #13428 ), and the repeat steps described in the issue can still be reproduced stably.

github-actions · 2022-02-26T01:50:04Z

The issue had no activity for 30 days, mark with Stale label.

mattisonchao · 2022-05-29T02:05:22Z

I'm working on it and some solutions need to be discussed. Once a reasonable solution is finalized, I'll fix it in time.

eolivelli · 2022-05-29T18:14:53Z

I see this problem quite often is some environments.

Do you have news @mattisonchao ?

Technoboy- · 2022-05-31T01:51:02Z

Ah, we find that keepAliveIntervalSeconds exists on both client and server-side. The default value is 30s.
In your reproduce step, the server-side is 100s.
The broker will clean up the producer info when the channel is inactive. So it's about 100s to do this.
But client-side will close the channel in the 30s, and the firewall may cause the closure of the channel not to be sent, then the client will reconnect. In this case, the server-side has the producer info and will throw the above exception.

Technoboy- · 2022-05-31T02:01:30Z

For this issue, it's better for the client to config this value, and the broker will use the client value.

lhotari · 2022-06-02T07:11:07Z

There were some bugs that caused the connection to stay open for longs durations.
#15382 Closes the connection if a ping or pong message cannot be sent .
#15366 Fixes a proxy connection leak when inbound connection closes while connecting is in progress
Both of the above fixes address issues where the symptom is "Producer with name xxx is already connect to topic".

lhotari · 2022-06-02T07:14:10Z

To Reproduce Steps to reproduce the behavior:

Change broker keepAliveIntervalSeconds=100

You can choose a pulsar client in any language, such as pulsar-client-go or java and other clients

Use the client to send data to the pulsar server

Use a firewall to disconnect the network between the client and the broker. The time is maintained in 60s. After waiting for 60 seconds, close the firewall

Now, you can check the broker log, at this time you can see the error: Producer with name is already connect to topic

@wenbingshen I think that this is not a bug that you are describing. It might be a way to reproduce "Producer with name xxx is already connect to topic", but if the steps are followed, I think that it's completely expected behavior.

wenbingshen · 2022-06-02T08:05:37Z

@lhotari Great, you get what I'm trying to say! I just finished the work at hand and would like to reply to @Technoboy about keepAliveIntervalSeconds=100, the reproduction steps to change keepAliveIntervalSeconds is just to reproduce the problem easily, so that everyone can understand the problem I want to describe, and in the production environment, we never to change the keepAliveIntervalSeconds configuration, it has always been the default 30s, but even so, we often see "Producer with name xxx is already connect to topic".

wenbingshen · 2022-06-02T08:21:53Z

There were some bugs that caused the connection to stay open for longs durations. #15382 Closes the connection if a ping or pong message cannot be sent . #15366 Fixes a proxy connection leak when inbound connection closes while connecting is in progress Both of the above fixes address issues where the symptom is "Producer with name xxx is already connect to topic".

These PR looks great, but I'm not sure if this has been resolved #13061, may need to be validated in production environment.

think that this is not a bug that you are describing. It might be a way to reproduce "Producer with name xxx is already connect to topic", but if the steps are followed, I think that it's completely expected behavior.

If the steps to reproduce end up being an expected behavior, it may not be easy to reproduce the issue.
Perhaps it should be considered whether this expected behavior is correct, is there a way to replace and optimize.

I see this problem quite often is some environments.

@eolivelli often encountered this problem in some environments too.

michaeljmarshall · 2022-06-02T17:45:32Z

Just want to post the explicit requirements that must be met for a producer to reconnect. Here is the producer logic:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Producer.java

Lines 175 to 181 in 7d2fdea

    
           public boolean isSuccessorTo(Producer other) { 
        
               return Objects.equals(producerName, other.producerName) 
        
                       && Objects.equals(topic, other.topic) 
        
                       && producerId == other.producerId 
        
                       && Objects.equals(cnx, other.cnx) 
        
                       && other.getEpoch() < epoch; 
        
           }

and here is where it's called:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java

Lines 956 to 972 in 7d2fdea

    
           private void tryOverwriteOldProducer(Producer oldProducer, Producer newProducer) 
        
                   throws BrokerServiceException { 
        
               if (newProducer.isSuccessorTo(oldProducer) && !isUserProvidedProducerName(oldProducer) 
        
                       && !isUserProvidedProducerName(newProducer)) { 
        
                   oldProducer.close(false); 
        
                   if (!producers.replace(newProducer.getProducerName(), oldProducer, newProducer)) { 
        
                       // Met concurrent update, throw exception here so that client can try reconnect later. 
        
                       throw new BrokerServiceException.NamingException("Producer with name '" + newProducer.getProducerName() 
        
                               + "' replace concurrency error"); 
        
                   } else { 
        
                       handleProducerRemoved(oldProducer); 
        
                   } 
        
               } else { 
        
                   throw new BrokerServiceException.NamingException( 
        
                           "Producer with name '" + newProducer.getProducerName() + "' is already connected to topic"); 
        
               } 
        
           }

Therefore, I feel that the current protocol cannot fully prove whether the producer client can overwrite itself. It may be necessary to add some fields to prove: I am me

This is an interesting question. The current design assumes that a producer is "a successor" to a former instance of a producer when it has the same connection. Given the way that producers and the ServerCnx are integrated, I think we'd need to refactor the code to remove the connection from the isSuccessorTo logic. Perhaps we could achieve producer identity with some kind of UUID passed to the client during producer creation.

The broker will always report an error: Producer with name is already connect to topic.

This raises another question for me. Should the producer use another name when it attempts to reconnect after certain failures, like a failed keep alive timeout? If the producer tried to connect to a broker and get a new producer name from the broker, it'd circumvent the issue here. It wouldn't work for overridden producer names on the client side. It would also lead to certain edge cases around potential duplicates in the produced messages (this would likely already happen when a keep alive fails).

michaeljmarshall · 2022-06-02T17:55:52Z

My fix for #12846 was in 2.8.2, and this issue is opened against 2.8.1. Also note that the go client started using epoch in the producer connection logic in apache/pulsar-client-go#582. That went in 0.7.0, which is the version being used here apache/pulsar-client-go#676.

I think we need to reproduce this issue for the go client against a newer broker version before we dig into this further.

michaeljmarshall · 2022-06-24T15:42:46Z

Closing this issue and marking as fixed for now since there is a likely fix and the issue hasn't been reproduced against more fixed versions of Pulsar.

Martin-Narvar · 2025-01-13T14:54:01Z

We are experiencing this quite frequently in our production environment, and each time it happens, it is a production incident for the on-call engineer.
For example this morning this error was logged 720,000 times in a 10 minute window and the topic in question developed a large backlog (which is what triggered the production alert).

lhotari · 2025-01-13T20:25:19Z

We are experiencing this quite frequently in our production environment, and each time it happens, it is a production incident for the on-call engineer. For example this morning this error was logged 720,000 times in a 10 minute window and the topic in question developed a large backlog (which is what triggered the production alert).

@Martin-Narvar What Pulsar version are you using? There are more recent fixes such as #21155 and #23123 which are available.

wenbingshen added the type/bug The PR fixed a bug or issue reported a bug label Dec 15, 2021

wenbingshen changed the title ~~[BUG]~~ [BUG] Producer with name xxx is already connect to topic Dec 15, 2021

wenbingshen mentioned this issue Dec 15, 2021

c++ client Producer with name 'xxx-18-31667' is already connected to topic #13289

Closed

sijie mentioned this issue Dec 15, 2021

ISSUE-13342: [BUG] Producer with name xxx is already connect to topic streamnative/pulsar-archived#3438

Open

github-actions bot added the lifecycle/stale label Feb 26, 2022

mattisonchao assigned mattisonchao and Technoboy- May 28, 2022

Technoboy- mentioned this issue Jun 2, 2022

[Discuss] Pulsar client : Connect Command add keep_alive_interval #15885

Open

sijie mentioned this issue Jun 2, 2022

ISSUE-15885: [Discuss] Pulsar client : Connect Command add keep_alive_interval streamnative/pulsar-archived#4312

Open

michaeljmarshall closed this as completed Jun 24, 2022

CMoore-Darwinium mentioned this issue May 31, 2023

Fix to Producer Busy error on reconnect streamnative/pulsar-rs#280

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Producer with name xxx is already connect to topic #13342

[BUG] Producer with name xxx is already connect to topic #13342

wenbingshen commented Dec 15, 2021

wenbingshen commented Dec 15, 2021

BewareMyPower commented Dec 15, 2021

BewareMyPower commented Dec 22, 2021

wenbingshen commented Dec 23, 2021

github-actions bot commented Feb 26, 2022

mattisonchao commented May 29, 2022

eolivelli commented May 29, 2022

Technoboy- commented May 31, 2022

Technoboy- commented May 31, 2022

lhotari commented Jun 2, 2022

lhotari commented Jun 2, 2022

wenbingshen commented Jun 2, 2022

wenbingshen commented Jun 2, 2022 •

edited

Loading

michaeljmarshall commented Jun 2, 2022

michaeljmarshall commented Jun 2, 2022 •

edited

Loading

michaeljmarshall commented Jun 24, 2022

Martin-Narvar commented Jan 13, 2025

lhotari commented Jan 13, 2025

[BUG] Producer with name xxx is already connect to topic #13342

[BUG] Producer with name xxx is already connect to topic #13342

Comments

wenbingshen commented Dec 15, 2021

wenbingshen commented Dec 15, 2021

BewareMyPower commented Dec 15, 2021

BewareMyPower commented Dec 22, 2021

wenbingshen commented Dec 23, 2021

github-actions bot commented Feb 26, 2022

mattisonchao commented May 29, 2022

eolivelli commented May 29, 2022

Technoboy- commented May 31, 2022

Technoboy- commented May 31, 2022

lhotari commented Jun 2, 2022

lhotari commented Jun 2, 2022

wenbingshen commented Jun 2, 2022

wenbingshen commented Jun 2, 2022 • edited Loading

michaeljmarshall commented Jun 2, 2022

michaeljmarshall commented Jun 2, 2022 • edited Loading

michaeljmarshall commented Jun 24, 2022

Martin-Narvar commented Jan 13, 2025

lhotari commented Jan 13, 2025

wenbingshen commented Jun 2, 2022 •

edited

Loading

michaeljmarshall commented Jun 2, 2022 •

edited

Loading