Producers failed to open when leader broker shut down #7041

Closed
k2la opened this issue May 26, 2020 · 1 comment · Fixed by #7096
Labels
area/broker, type/bug

k2la commented May 26, 2020

When the leader broker shut down, some producers that had been connected to it failed to open on the new broker.

According to the leader broker's log, the broker unloaded its bundles and closed its producers.
According to the logs of the other brokers, another broker then became the leader and the bundles were loaded.

However, when the producers that had been connected to the old leader broker reconnected to a new broker, some of them failed to open on it.

After reconnecting to the new broker, these producers did not send CommandProducer messages and stopped.

/// Create a new Producer on a topic, assigning the given producer_id,
/// all messages sent with this producer_id will be persisted on the topic
message CommandProducer {
    required string topic = 1;
    required uint64 producer_id = 2;
    required uint64 request_id = 3;
    /// If a producer name is specified, the name will be used,
    /// otherwise the broker will generate a unique name
    optional string producer_name = 4;
    optional bool encrypted = 5 [default = false];
    /// Add optional metadata key=value to this producer
    repeated KeyValue metadata = 6;
    optional Schema schema = 7;
    // If producer reconnect to broker, the epoch of this producer will +1
    optional uint64 epoch = 8 [default = 0];
    // Indicate the name of the producer is generated or user provided
    // Use default true here is in order to be forward compatible with the client
    optional bool user_provided_producer_name = 9 [default = true];
}

Expected behavior

When producers reconnect to a new broker, they should open on that broker.

Actual behavior

Some producers failed to open on the new broker.

Steps to reproduce

We tried but have not been able to reproduce the issue yet.

System configuration

OS(Broker): CentOS 7.7
Pulsar Broker: 2.3.2
Pulsar Client Java: 2.3.2

sijie added the area/broker, triage/week-22, and type/bug labels May 27, 2020
codelipenghui pushed a commit that referenced this issue May 31, 2020
Master Issue: #7041

### Motivation

When a leader broker is restarted, some producers for topics owned by that broker may not be reopened on the new broker. When this happens, message publishing will continue to fail until the client application is restarted.

As a result of the investigation, I found that lookup requests sent by the producers in question are redirected more than 10,000 times between multiple brokers.

When a lookup request is redirected, `BinaryProtoLookupService#findBroker()` is called recursively. Therefore, tens of thousands of redirects will cause `StackOverflowError` and `BinaryProtoLookupService#findBroker()` will never complete.

### Modifications

Limit the number of times a lookup is redirected to 100. This maximum is user configurable. If the number of redirects exceeds the limit, the lookup fails. However, `ConnectionHandler` retries the lookup, so the producer can eventually reconnect to the new broker.
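
The idea of capping redirects can be illustrated with a minimal sketch. This is not the code from the fix: `RedirectLimitSketch`, `TooManyRedirectsException`, the `findBroker` signature here, and the map-based broker model are all hypothetical names chosen for illustration. The sketch follows redirects in a loop rather than recursively and fails once a configurable limit is reached, so a redirect cycle between brokers ends in an error the caller can retry.

```java
import java.util.Map;

// Hypothetical illustration only; none of these names are taken from the Pulsar client.
class RedirectLimitSketch {

    /** Raised when a lookup is redirected more often than the configured limit allows. */
    static class TooManyRedirectsException extends RuntimeException {
        TooManyRedirectsException(String message) {
            super(message);
        }
    }

    /**
     * Follows redirects starting at {@code broker}. A key in {@code redirects}
     * maps a broker to the broker it redirects to; a broker that is absent
     * from the map is assumed to answer the lookup itself.
     */
    static String findBroker(String broker, Map<String, String> redirects, int maxRedirects) {
        int redirectCount = 0;
        String current = broker;
        while (redirects.containsKey(current)) {
            if (redirectCount >= maxRedirects) {
                // Fail instead of following redirects forever; the caller may retry later.
                throw new TooManyRedirectsException(
                        "lookup redirected more than " + maxRedirects + " times");
            }
            current = redirects.get(current);
            redirectCount++;
        }
        return current;
    }

    public static void main(String[] args) {
        // A finite chain resolves to the broker that finally answers.
        Map<String, String> chain = Map.of("broker-a", "broker-b", "broker-b", "broker-c");
        System.out.println(findBroker("broker-a", chain, 100));

        // Two brokers redirecting to each other hit the limit and fail.
        Map<String, String> cycle = Map.of("broker-a", "broker-b", "broker-b", "broker-a");
        try {
            findBroker("broker-a", cycle, 100);
        } catch (TooManyRedirectsException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

Because the failure surfaces as an exception rather than a lookup that never completes, a caller in the role that `ConnectionHandler` plays above could schedule another attempt once broker ownership has settled.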
Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this issue Jun 1, 2020
Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this issue Jun 12, 2020
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this issue Aug 24, 2020