
[Bug][broker] Error reading entries due to LastConfirmedEntry is null until broker restart #23654

Open
PatrykWitkowski opened this issue Nov 28, 2024 · 6 comments
Labels: type/bug

@PatrykWitkowski

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

OS: docker image apachepulsar/pulsar:3.3.2
Java: Amazon Corretto 21
Pulsar: 3.3.2 (client + server)

Minimal reproduce step

Happens randomly during redeployment of a Pulsar cluster, i.e. when restarting all cluster components (brokers, bookies, ZooKeeper nodes).

What did you expect to see?

The mentioned error should not appear on the brokers after restarting them.

What did you see instead?

After the brokers restart, we start to see many errors like this on the brokers:
[persistent://datalake/ingress/...-partition-... / sub] Error reading entries at 3904681:27 : LastConfirmedEntry is null when reading ledger 3904681, Read Type Normal - Retrying to read in 56.202 seconds
The read retries keep happening until the next broker restart.

At the same time we see a growing backlog on that topic because the consumer can't read:
[screenshot: consumer backlog growing on the affected topic]

It happens on a multi-partition topic whose subscription uses the org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers dispatcher.

When we detect the problem and restart all brokers once again, the backlog goes down (as you can see in the screenshot) and the mentioned error goes away, so there is a workaround for the issue.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@PatrykWitkowski added the type/bug label on Nov 28, 2024
@lhotari (Member) commented Nov 28, 2024

@BewareMyPower Is this related to the #23147 changes? Would you mind checking this issue?

@lhotari (Member) commented Nov 28, 2024

> Happens randomly during redeployment of a pulsar cluster - restarting all pulsar cluster components (brokers, bookies, zookeepers).

I wonder if the ledger is running recovery when this is happening.
@PatrykWitkowski Does it eventually recover?

@lhotari (Member) commented Nov 28, 2024

This is kind of surprising logic:

```java
lastConfirmedEntry = PositionFactory.create(lh.getId(), -1);
// bypass empty ledgers, find last ledger with Message if possible.
while (lastConfirmedEntry.getEntryId() == -1) {
    Map.Entry<Long, LedgerInfo> formerLedger = ledgers.lowerEntry(lastConfirmedEntry.getLedgerId());
    if (formerLedger != null) {
        LedgerInfo ledgerInfo = formerLedger.getValue();
        lastConfirmedEntry =
                PositionFactory.create(ledgerInfo.getLedgerId(), ledgerInfo.getEntries() - 1);
    } else {
        break;
    }
}
```

Perhaps it's correct, but I'd assume that lastConfirmedEntry would be set earlier and not modified after that.

```java
LedgerInfo info = LedgerInfo.newBuilder().setLedgerId(id)
        .setEntries(lh.getLastAddConfirmed() + 1).setSize(lh.getLength())
        .setTimestamp(clock.millis()).build();
ledgers.put(id, info);
```

It looks like the iteration logic was added in #1550. Just wondering if it really makes sense in all cases.
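For illustration, here is a minimal self-contained sketch of that backward scan (hypothetical Position and LedgerInfo types; not the actual ManagedLedgerImpl code). Starting from an empty, just-recovered current ledger, the loop walks the ledgers map backwards, skipping empty ledgers until it finds one that contains messages or runs out of ledgers:

```java
import java.util.Map;
import java.util.TreeMap;

public class LastConfirmedEntryScan {

    record Position(long ledgerId, long entryId) {}

    record LedgerInfo(long ledgerId, long entries) {}

    static Position findLastConfirmedEntry(TreeMap<Long, LedgerInfo> ledgers, long currentLedgerId) {
        // The recovered current ledger has no entries, so start with entryId -1.
        Position lastConfirmedEntry = new Position(currentLedgerId, -1);
        while (lastConfirmedEntry.entryId() == -1) {
            Map.Entry<Long, LedgerInfo> former = ledgers.lowerEntry(lastConfirmedEntry.ledgerId());
            if (former == null) {
                break; // no earlier ledger: every ledger is empty
            }
            LedgerInfo info = former.getValue();
            // An empty former ledger (entries == 0) yields entryId -1 again,
            // so the loop keeps walking backwards past it.
            lastConfirmedEntry = new Position(info.ledgerId(), info.entries() - 1);
        }
        return lastConfirmedEntry;
    }

    public static void main(String[] args) {
        TreeMap<Long, LedgerInfo> ledgers = new TreeMap<>();
        ledgers.put(10L, new LedgerInfo(10L, 5)); // 5 messages
        ledgers.put(11L, new LedgerInfo(11L, 0)); // empty
        ledgers.put(12L, new LedgerInfo(12L, 0)); // empty, current
        // Skips the two empty ledgers and lands on (10, 4).
        System.out.println(findLastConfirmedEntry(ledgers, 12L));
    }
}
```

Note that if every ledger is empty, the loop exits with entryId still -1, which is the corner of this logic being questioned above.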

@BewareMyPower (Contributor)

Yes, #23147 adds a comparison against the last confirmed entry, so it assumes lastConfirmedEntry is not null. However, this case should be impossible once the initializeBookKeeper method has been called, because the field is initialized there with a non-null value and no subsequent modification sets it back to null.
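As a rough illustration of that invariant (hypothetical names and simplified types; not Pulsar source), the pattern looks like this: the field becomes non-null when initialization completes and every later write also stores a non-null position, so a null observed by a reader points at an instance that never finished initialization (or a stale instance, per the next comment):

```java
import java.util.concurrent.atomic.AtomicReference;

class ManagedLedgerSketch {

    record Position(long ledgerId, long entryId) {}

    private final AtomicReference<Position> lastConfirmedEntry = new AtomicReference<>();

    void initialize(long recoveredLedgerId, long lastAddConfirmed) {
        // Non-null from the moment initialization completes,
        // mirroring what initializeBookKeeper is described as doing.
        lastConfirmedEntry.set(new Position(recoveredLedgerId, lastAddConfirmed));
    }

    void onEntryAdded(long ledgerId, long entryId) {
        // Every subsequent update also stores a non-null position.
        lastConfirmedEntry.set(new Position(ledgerId, entryId));
    }

    boolean hasMoreEntries(Position readPosition) {
        Position lce = lastConfirmedEntry.get();
        if (lce == null) {
            // Corresponds to the reported error: a reader comparing against
            // an instance whose initialization never completed.
            throw new IllegalStateException("LastConfirmedEntry is null");
        }
        return readPosition.ledgerId() < lce.ledgerId()
                || (readPosition.ledgerId() == lce.ledgerId()
                        && readPosition.entryId() <= lce.entryId());
    }
}
```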

@BewareMyPower (Contributor)

#23368 might fix this issue. I observed a similar issue before; from a heap dump I saw that the managed ledger held by RangeCacheImpl was outdated and invalid.
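To make that suspected failure mode concrete, here is a hypothetical sketch (not RangeCacheImpl itself) of how a cache can keep handing out a managed-ledger reference that was invalidated during redeployment, and how validating cached entries on lookup, in the spirit of #23368, avoids it:

```java
import java.util.concurrent.ConcurrentHashMap;

class StaleCacheSketch {

    static class ManagedLedgerHandle {
        volatile boolean closed;
        ManagedLedgerHandle open()  { closed = false; return this; }
        void close()                { closed = true; }
    }

    private final ConcurrentHashMap<String, ManagedLedgerHandle> cache = new ConcurrentHashMap<>();

    ManagedLedgerHandle get(String topic) {
        // Bug pattern: returns whatever is cached, even if the handle was
        // closed during a cluster redeployment. Readers keep hitting the
        // stale instance until something evicts it, e.g. a broker restart.
        return cache.computeIfAbsent(topic, t -> new ManagedLedgerHandle().open());
    }

    ManagedLedgerHandle getValidated(String topic) {
        // Fix pattern: validate the cached entry and replace it when it is
        // no longer usable, instead of trusting it forever.
        return cache.compute(topic, (t, existing) ->
                (existing == null || existing.closed)
                        ? new ManagedLedgerHandle().open()
                        : existing);
    }
}
```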

@PatrykWitkowski (Author)

@lhotari In our case the issue lasted 10 days (15/11/2024 to 25/11/2024), and it eventually recovered, but only after a broker restart.

@BewareMyPower I see that the fix is part of 4.0.0, so migrating to that version might solve it?
