
[Bug][broker] Error reading entries due to LastConfirmedEntry is null until broker restart #23654

Open
PatrykWitkowski opened this issue Nov 28, 2024 · 6 comments
Labels: type/bug

@PatrykWitkowski

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

OS: docker image apachepulsar/pulsar:3.3.2
Java: Amazon Corretto 21
Pulsar: 3.3.2 (client + server)

Minimal reproduce step

Happens randomly during redeployment of a Pulsar cluster, i.e. when restarting all cluster components (brokers, bookies, ZooKeeper nodes).

What did you expect to see?

The mentioned error should not appear on the brokers after restarting them.

What did you see instead?

After the brokers restart, we start to see many errors like this on the brokers:
[persistent://datalake/ingress/...-partition-... / sub] Error reading entries at 3904681:27 : LastConfirmedEntry is null when reading ledger 3904681, Read Type Normal - Retrying to read in 56.202 seconds
The read retries keep happening until the next broker restart.

At the same time we see a growing backlog on that topic because the consumer can't read:
[screenshot: consumer backlog growing on the affected topic]

It happens on a multi-partition topic whose subscription uses the org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers dispatcher.

When we detect the problem and restart all brokers once again, the backlog goes down (as you can see in the screenshot) and the mentioned error goes away, so there is a workaround for the issue.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@PatrykWitkowski added the type/bug label on Nov 28, 2024
@lhotari (Member) commented Nov 28, 2024

@BewareMyPower Is this related to the #23147 changes? Would you mind checking this issue?

@lhotari (Member) commented Nov 28, 2024

> Happens randomly during redeployment of a pulsar cluster - restarting all pulsar cluster components (brokers, bookies, zookeepers).

I wonder if the ledger is running recovery when this is happening.
@PatrykWitkowski Does it eventually recover?

@lhotari (Member) commented Nov 28, 2024

This is kind of surprising logic:

```java
lastConfirmedEntry = PositionFactory.create(lh.getId(), -1);
// bypass empty ledgers, find last ledger with Message if possible.
while (lastConfirmedEntry.getEntryId() == -1) {
    Map.Entry<Long, LedgerInfo> formerLedger = ledgers.lowerEntry(lastConfirmedEntry.getLedgerId());
    if (formerLedger != null) {
        LedgerInfo ledgerInfo = formerLedger.getValue();
        lastConfirmedEntry =
                PositionFactory.create(ledgerInfo.getLedgerId(), ledgerInfo.getEntries() - 1);
    } else {
        break;
    }
}
```

Perhaps it's correct, but I'd assume that lastConfirmedEntry would be set earlier and not modified after that.

```java
LedgerInfo info = LedgerInfo.newBuilder().setLedgerId(id)
        .setEntries(lh.getLastAddConfirmed() + 1).setSize(lh.getLength())
        .setTimestamp(clock.millis()).build();
ledgers.put(id, info);
```

It looks like the iteration logic was added in #1550. Just wondering if it really makes sense in all cases.
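For illustration, here is a minimal self-contained sketch of that backward scan (hypothetical Position and LedgerInfo types; not the actual ManagedLedgerImpl code). Starting from an empty, just-recovered current ledger, the loop walks the ledgers map backwards, skipping empty ledgers until it finds one that contains messages or runs out of ledgers:

```java
import java.util.Map;
import java.util.TreeMap;

public class LastConfirmedEntryScan {

    record Position(long ledgerId, long entryId) {}

    record LedgerInfo(long ledgerId, long entries) {}

    static Position findLastConfirmedEntry(TreeMap<Long, LedgerInfo> ledgers, long currentLedgerId) {
        // The recovered current ledger has no entries, so start with entryId -1.
        Position lastConfirmedEntry = new Position(currentLedgerId, -1);
        while (lastConfirmedEntry.entryId() == -1) {
            Map.Entry<Long, LedgerInfo> former = ledgers.lowerEntry(lastConfirmedEntry.ledgerId());
            if (former == null) {
                break; // no earlier ledger: every ledger is empty
            }
            LedgerInfo info = former.getValue();
            // An empty former ledger (entries == 0) yields entryId -1 again,
            // so the loop keeps walking backwards past it.
            lastConfirmedEntry = new Position(info.ledgerId(), info.entries() - 1);
        }
        return lastConfirmedEntry;
    }

    public static void main(String[] args) {
        TreeMap<Long, LedgerInfo> ledgers = new TreeMap<>();
        ledgers.put(10L, new LedgerInfo(10L, 5)); // 5 messages
        ledgers.put(11L, new LedgerInfo(11L, 0)); // empty
        ledgers.put(12L, new LedgerInfo(12L, 0)); // empty, current
        // Skips the two empty ledgers and lands on (10, 4).
        System.out.println(findLastConfirmedEntry(ledgers, 12L));
    }
}
```

Note that if every ledger is empty, the loop exits with entryId still -1, which is the corner of this logic being questioned above.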

@BewareMyPower (Contributor)

Yes, #23147 adds a comparison against the last confirmed entry, so it assumes lastConfirmedEntry is not null. However, this case should be impossible once the initializeBookKeeper method has been called, because the field is initialized there with a non-null value and no subsequent modification sets it back to null.
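As a rough illustration of that invariant (hypothetical names and simplified types; not Pulsar source), the pattern looks like this: the field becomes non-null when initialization completes and every later write also stores a non-null position, so a null observed by a reader points at an instance that never finished initialization (or a stale instance, per the next comment):

```java
import java.util.concurrent.atomic.AtomicReference;

class ManagedLedgerSketch {

    record Position(long ledgerId, long entryId) {}

    private final AtomicReference<Position> lastConfirmedEntry = new AtomicReference<>();

    void initialize(long recoveredLedgerId, long lastAddConfirmed) {
        // Non-null from the moment initialization completes,
        // mirroring what initializeBookKeeper is described as doing.
        lastConfirmedEntry.set(new Position(recoveredLedgerId, lastAddConfirmed));
    }

    void onEntryAdded(long ledgerId, long entryId) {
        // Every subsequent update also stores a non-null position.
        lastConfirmedEntry.set(new Position(ledgerId, entryId));
    }

    boolean hasMoreEntries(Position readPosition) {
        Position lce = lastConfirmedEntry.get();
        if (lce == null) {
            // Corresponds to the reported error: a reader comparing against
            // an instance whose initialization never completed.
            throw new IllegalStateException("LastConfirmedEntry is null");
        }
        return readPosition.ledgerId() < lce.ledgerId()
                || (readPosition.ledgerId() == lce.ledgerId()
                        && readPosition.entryId() <= lce.entryId());
    }
}
```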

@BewareMyPower (Contributor)

#23368 might fix this issue. I observed a similar issue before; from a heap dump I saw that the managed ledger held by RangeCacheImpl was outdated and invalid.
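To make that suspected failure mode concrete, here is a hypothetical sketch (not RangeCacheImpl itself) of how a cache can keep handing out a managed-ledger reference that was invalidated during redeployment, and how validating cached entries on lookup, in the spirit of #23368, avoids it:

```java
import java.util.concurrent.ConcurrentHashMap;

class StaleCacheSketch {

    static class ManagedLedgerHandle {
        volatile boolean closed;
        ManagedLedgerHandle open()  { closed = false; return this; }
        void close()                { closed = true; }
    }

    private final ConcurrentHashMap<String, ManagedLedgerHandle> cache = new ConcurrentHashMap<>();

    ManagedLedgerHandle get(String topic) {
        // Bug pattern: returns whatever is cached, even if the handle was
        // closed during a cluster redeployment. Readers keep hitting the
        // stale instance until something evicts it, e.g. a broker restart.
        return cache.computeIfAbsent(topic, t -> new ManagedLedgerHandle().open());
    }

    ManagedLedgerHandle getValidated(String topic) {
        // Fix pattern: validate the cached entry and replace it when it is
        // no longer usable, instead of trusting it forever.
        return cache.compute(topic, (t, existing) ->
                (existing == null || existing.closed)
                        ? new ManagedLedgerHandle().open()
                        : existing);
    }
}
```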

@PatrykWitkowski (Author)

@lhotari In our case the issue lasted 10 days (15/11/2024 to 25/11/2024), and it eventually recovered, but only after a broker restart.

@BewareMyPower I see that the fix is part of 4.0.0, so migrating to that version might solve it?
