Producers failed to open when leader broker shut down #7041

k2la · 2020-05-26T06:05:57Z

When a leader broker shut down, producers that were connected to that broker failed to open on a new broker.

According to the leader broker's log, the broker unloaded its bundles and closed the producers.
Also, according to the logs of the other brokers, another broker became the leader broker and the bundles were loaded.

However, when the producers that had been connected to the old leader broker reconnected to a new broker, some of them failed to open on the new broker.

After reconnecting to the new broker, those producers did not send CommandProducer messages and then stopped.

pulsar/pulsar-common/src/main/proto/PulsarApi.proto

Lines 406 to 430 in d55bc00

    
    /// Create a new Producer on a topic, assigning the given producer_id,
    /// all messages sent with this producer_id will be persisted on the topic
    message CommandProducer {
        required string topic         = 1;
        required uint64 producer_id   = 2;
        required uint64 request_id    = 3;

        /// If a producer name is specified, the name will be used,
        /// otherwise the broker will generate a unique name
        optional string producer_name = 4;

        optional bool encrypted       = 5 [default = false];

        /// Add optional metadata key=value to this producer
        repeated KeyValue metadata    = 6;

        optional Schema schema = 7;

        // If producer reconnect to broker, the epoch of this producer will +1
        optional uint64 epoch = 8 [default = 0];

        // Indicate the name of the producer is generated or user provided
        // Use default true here is in order to be forward compatible with the client
        optional bool user_provided_producer_name = 9 [default = true];
    }

Expected behavior

When producers reconnect to a new broker, they should open on that broker.

Actual behavior

Some producers failed to open on the new broker.

Steps to reproduce

We have tried, but have not been able to reproduce it yet.

System configuration

OS(Broker): CentOS 7.7
Pulsar Broker: 2.3.2
Pulsar Client Java: 2.3.2



massakam · 2020-06-09T05:59:08Z

This issue should be fixed by the following PRs:

Master Issue: apache#7041

### Motivation

When a leader broker is restarted, some producers for topics owned by that broker may not be reopened on the new broker. When this happens, message publishing will continue to fail until the client application is restarted.

As a result of the investigation, I found that lookup requests sent by the producers in question are redirected more than 10,000 times between multiple brokers. When a lookup request is redirected, `BinaryProtoLookupService#findBroker()` is called recursively. Therefore, tens of thousands of redirects will cause `StackOverflowError` and `BinaryProtoLookupService#findBroker()` will never complete.

### Modifications

Limit the number of times a lookup is redirected to 100. This maximum is user configurable. If the number of redirects exceeds 100, the lookup will fail. But `ConnectionHandler` retries lookup so that the producer can eventually reconnect to the new broker.
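To make the fix described above concrete, here is a minimal sketch of a redirect-limited lookup with a retrying caller. It is not the actual `BinaryProtoLookupService` or `ConnectionHandler` code: the real client is asynchronous, while this sketch is synchronous and uses a loop instead of recursion. Every name in it (`LookupClient`, `LookupResult`, `TooManyRedirectsException`, `findBrokerWithRetry`, the broker names) is hypothetical and exists only for illustration.

```java
// Sketch only: hypothetical types, not the Pulsar client API.
class BoundedLookupSketch {

    /** Answer from one broker: either "I own the topic" or "ask that broker instead". */
    static final class LookupResult {
        final String broker;
        final boolean redirect;

        LookupResult(String broker, boolean redirect) {
            this.broker = broker;
            this.redirect = redirect;
        }
    }

    /** Hypothetical stand-in for sending one lookup request to one broker. */
    interface LookupClient {
        LookupResult lookup(String broker, String topic);
    }

    static final class TooManyRedirectsException extends Exception {
        TooManyRedirectsException(String message) {
            super(message);
        }
    }

    /** Follows redirects in a loop and fails once more than maxRedirects were followed. */
    static String findBroker(LookupClient client, String firstBroker, String topic,
                             int maxRedirects) throws TooManyRedirectsException {
        String current = firstBroker;
        int redirects = 0;
        while (true) {
            LookupResult result = client.lookup(current, topic);
            if (!result.redirect) {
                return result.broker; // this broker owns the topic
            }
            redirects++;
            if (redirects > maxRedirects) {
                throw new TooManyRedirectsException(
                        "lookup for " + topic + " exceeded " + maxRedirects + " redirects");
            }
            current = result.broker; // follow the redirect
        }
    }

    /** Plays the role of the retrying caller: a failed lookup is simply tried again. */
    static String findBrokerWithRetry(LookupClient client, String firstBroker, String topic,
                                      int maxRedirects, int maxAttempts)
            throws TooManyRedirectsException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TooManyRedirectsException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return findBroker(client, firstBroker, topic, maxRedirects);
            } catch (TooManyRedirectsException e) {
                last = e; // a real client would back off before retrying
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated cluster: two brokers bounce the lookup between each other
        // for the first 250 requests, then ownership settles.
        int[] calls = {0};
        LookupClient flapping = (broker, topic) -> {
            calls[0]++;
            if (calls[0] <= 250) {
                String other = broker.equals("broker-a") ? "broker-b" : "broker-a";
                return new LookupResult(other, true);
            }
            return new LookupResult(broker, false);
        };
        String owner = findBrokerWithRetry(flapping, "broker-a", "my-topic", 100, 5);
        System.out.println("owner=" + owner + " after " + calls[0] + " lookups");
    }
}
```

In this sketch a single lookup gives up after the configured number of redirects instead of recursing without bound, and the outer retry lets the producer succeed once topic ownership has settled on one broker.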

sijie mentioned this issue May 26, 2020

ISSUE-7041: Producers failed to open when leader broker shut down streamnative/pulsar-archived#1021

Closed

sijie added the area/broker, triage/week-22, and type/bug labels May 27, 2020

massakam mentioned this issue May 29, 2020

[client] Limit the number of times lookup requests are redirected #7096

Merged

codelipenghui linked a pull request May 31, 2020 that will close this issue

[client] Limit the number of times lookup requests are redirected #7096

Merged

codelipenghui closed this as completed in #7096 May 31, 2020

massakam mentioned this issue Jun 1, 2020

[client] Change the default value of maxLookupRedirects of Java client #7126

Merged

massakam mentioned this issue Jun 8, 2020

[broker] Prevent redirection of lookup requests from looping #7200

Merged
