[Bug][broker] Error reading entries due to LastConfirmedEntry is null until broker restart #23654
Comments
@BewareMyPower Is this related to the #23147 changes? Would you mind checking this issue?
I wonder if the ledger is running recovery when this is happening.
This is kind of surprising logic: pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 588 to 599 in 7e6fa55
Perhaps it's correct, but I'd assume that pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 430 to 433 in 7822dca
It looks like the iteration logic was added in #1550. Just wondering if it really makes sense in all cases.
Yes, #23147 adds comparison logic against the last confirmed entry so that it assumes the …
#23368 might fix this issue. I observed a similar issue before. From the heap dump I saw the managed ledger held by …
@lhotari In our case the issue lasted 10 days (15/11/2024 - 25/11/2024) and it recovered only after a broker restart. @BewareMyPower I see that the fix is part of 4.0.0, so would migrating to that version solve it?
Search before asking
Read release policy
Version
OS: docker image apachepulsar/pulsar:3.3.2
Java: Amazon Corretto 21
Pulsar: 3.3.2 (client + server)
Minimal reproduce step
It happens randomly during redeployment of a Pulsar cluster, that is, when restarting all cluster components (brokers, bookies, ZooKeepers).
What did you expect to see?
The error described below should not appear on the brokers after restarting them.
What did you see instead?
After the brokers restart, we start to see many errors like this on the brokers:
[persistent://datalake/ingress/...-partition-... / sub] Error reading entries at 3904681:27 : LastConfirmedEntry is null when reading ledger 3904681, Read Type Normal - Retrying to read in 56.202 seconds
The read retries keep happening until the next broker restart.
At the same time, we see a growing backlog on that topic because the consumer cannot read.
It happens on a multi-partition topic, with the consumer handled by this class: org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers
When we detect it and restart all brokers once again, the backlog goes down (as you can see in the screenshot) and the error goes away. So there is a workaround for this issue.
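Since the workaround depends on noticing the condition, here is a minimal sketch of how the error could be detected in broker logs. It assumes the log lines keep the exact format of the single line quoted above. The regex and the function names (`parse_stuck_read`, `count_stuck_ledgers`) are hypothetical helpers for illustration and are not part of Pulsar.

```python
import re
from collections import Counter

# Pattern derived only from the one log line quoted in this report.
# Assumption: other occurrences of the error follow the same format.
STUCK_READ = re.compile(
    r"Error reading entries at (?P<ledger>\d+):(?P<entry>\d+) : "
    r"LastConfirmedEntry is null when reading ledger \d+"
    r".*?Retrying to read in (?P<delay>[\d.]+) seconds"
)


def parse_stuck_read(line):
    """Return ledger id, entry id and retry delay if the line matches, else None."""
    match = STUCK_READ.search(line)
    if match is None:
        return None
    return {
        "ledger_id": int(match["ledger"]),
        "entry_id": int(match["entry"]),
        "retry_delay_s": float(match["delay"]),
    }


def count_stuck_ledgers(lines):
    """Count matching error lines per ledger id across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        hit = parse_stuck_read(line)
        if hit is not None:
            counts[hit["ledger_id"]] += 1
    return counts
```

A ledger id whose count keeps rising over time, without the reads ever succeeding, would match the behaviour described here and could be used to trigger an alert before the backlog grows large.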
Anything else?
No response
Are you willing to submit a PR?