
[Bug] Broker memory leak #22157

Closed
1 of 2 tasks
graysonzeng opened this issue Feb 29, 2024 · 13 comments · Fixed by #22191
Assignees: Technoboy-
Labels: area/broker, type/bug
Milestone: 3.3.0

Comments

@graysonzeng (Contributor)

Search before asking

  • I searched in the issues and found nothing similar.

Version

v3.1.1

Minimal reproduce step

After running for a period of time, the broker's memory usage gradually increases and eventually leads to a restart.
(screenshot: broker memory usage growing over time)

A heap dump shows that many ManagedLedgerImpl instances are retained in memory, and these instances occupy most of the heap.

(screenshot: heap dump with ManagedLedgerImpl instances retaining most of the memory)

Common path to the accumulation point:

(screenshots: Eclipse MAT "common path to the accumulation point" view)


What did you expect to see?

Memory is reclaimed normally by GC.

What did you see instead?

The broker restarts.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@dao-jun (Member) commented Feb 29, 2024

Could you please upload the heap dump file?

@graysonzeng (Contributor, Author)

The heap dump has reached 13GB and cannot be uploaded

@dao-jun (Member) commented Feb 29, 2024

> The heap dump has reached 13GB and cannot be uploaded

Could you compress it and upload it to a cloud drive, such as Baidu Cloud (百度云)? It is very important for locating the root cause.

@dao-jun added the type/bug and area/broker labels on Feb 29, 2024
@dao-jun added this to the 3.3.0 milestone on Feb 29, 2024
@Technoboy- self-assigned this on Feb 29, 2024
@lhotari (Member) commented Feb 29, 2024

> The heap dump has reached 13GB and cannot be uploaded
>
> Could you compress it and upload it to a cloud drive, such as Baidu Cloud (百度云)? It is very important for locating the root cause.

@graysonzeng @dao-jun Please note that a heap dump can contain sensitive data; because of this, it should never be shared without encryption. Encrypting it for a specific recipient with gpg is one possible solution.

@lhotari (Member) commented Feb 29, 2024

@graysonzeng from the screenshots, it looks like the problem is caused by the 6.5 million NonDurableCursorImpl instances.

@lhotari (Member) commented Feb 29, 2024

I'd recommend adding the https://github.com/vlsi/mat-calcite-plugin plugin to Eclipse MAT so that you can run SQL queries against the heap dump. Eclipse MAT has OQL support, but that's not as handy as SQL queries, where you can do anything that Calcite supports with SQL.

It's not useful for this case, but here is an example of a Calcite query using Eclipse MAT + the Calcite plugin:

select clientVersion, count(clientVersion) from org.apache.pulsar.broker.service.ServerCnx group by 1 order by 2 desc

@graysonzeng (Contributor, Author)

(screenshot: NonDurableCursorImpl instances referenced from waitingCursors in ManagedLedgerImpl)

@lhotari Yes. I found that these instances are referenced by waitingCursors in the ManagedLedgerImpl instances, and at the same time the cursors are in the isActive = false state. This indicates they should have been deleted and should not still be retained by waitingCursors.

We use StarRocks routine load tasks, which consume via readers and repeatedly create and delete consumers. Therefore, many NonDurableCursorImpl instances are generated.
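
For illustration, here is a minimal sketch of that create-and-close pattern with the Pulsar Java client (the service URL, topic name, and loop count are made-up placeholders, not taken from the report); each Reader is backed by a non-durable subscription, so the broker creates one NonDurableCursorImpl per iteration:

```java
import org.apache.pulsar.client.api.*;

public class ReaderChurn {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();
        // Repeatedly create and close readers, as a routine-load style job does.
        for (int i = 0; i < 1000; i++) {
            Reader<byte[]> reader = client.newReader()
                    .topic("persistent://prop/ns-test/routine-load-topic") // hypothetical topic
                    .startMessageId(MessageId.earliest)
                    .create();
            while (reader.hasMessageAvailable()) {
                reader.readNext();
            }
            reader.close(); // the broker-side non-durable cursor should be released here
        }
        client.close();
    }
}
```

If the broker fails to drop the cursor from waitingCursors when the reader or consumer closes, each iteration of such a loop leaks one cursor.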

@lhotari (Member) commented Feb 29, 2024

@graysonzeng related to #13939 ?

@graysonzeng (Contributor, Author)

The Pulsar version is 3.1.1. It does look related to #13939. It looks like removeWaitingCursor may not be properly removing the cursor after deactivateCursor() sets the cursor's isActive to false. @lhotari

deactivateCursor();
topic.getManagedLedger().removeWaitingCursor(cursor);

@lhotari (Member) commented Feb 29, 2024

Another possibility is that non-durable cursors and related subscriptions should be cleaned up when a connection dies in an unexpected way. I'm not sure how that is handled in the code base currently.

@lhotari (Member) commented Feb 29, 2024

Or maybe they should be cleaned up after an inactivity period?
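
Purely as a hypothetical sketch of that idea (this is not code from the Pulsar codebase; the helper, queue, accessor names, and threshold are assumptions for illustration), an inactivity sweep over the waiting cursors could look like this:

```java
import java.util.Queue;
import org.apache.bookkeeper.mledger.ManagedCursor;

final class InactiveCursorSweep {
    // Hypothetical helper: drop non-durable cursors that have been inactive longer
    // than a threshold, so an abandoned cursor cannot stay in the waiting list forever.
    static void evictInactiveWaitingCursors(Queue<ManagedCursor> waitingCursors, long maxInactiveMillis) {
        final long now = System.currentTimeMillis();
        waitingCursors.removeIf(cursor ->
                !cursor.isDurable()                                   // only non-durable cursors
                && now - cursor.getLastActive() > maxInactiveMillis); // idle past the threshold
    }
}
```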

@Technoboy- (Contributor)

There is a race condition between consumer.close and checkForNewEntries.
When the consumer is closed, the cursor is removed from waitingCursors:

public synchronized void removeConsumer(Consumer consumer, boolean isResetCursor) throws BrokerServiceException {
    cursor.updateLastActive();
    if (dispatcher != null) {
        dispatcher.removeConsumer(consumer);
    }
    // preserve accumulative stats form removed consumer
    ConsumerStatsImpl stats = consumer.getStats();
    bytesOutFromRemovedConsumers.add(stats.bytesOutCounter);
    msgOutFromRemovedConsumer.add(stats.msgOutCounter);
    if (dispatcher != null && dispatcher.getConsumers().isEmpty()) {
        deactivateCursor();
        topic.getManagedLedger().removeWaitingCursor(cursor);
        if (!cursor.isDurable()) {
            // ... (rest of the method omitted)
Then, after line 311 executes, the cursor is added back to waitingCursors.
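
As a rough sketch of one way to close that window (illustrative only, not necessarily what the fix in #22191 does; it assumes the cursor exposes isDurable()/isActive() and that the ledger keeps a waitingCursors queue, as described above), the re-add path could skip cursors whose consumers are already gone:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.bookkeeper.mledger.ManagedCursor;

final class WaitingCursorGuard {
    private final Queue<ManagedCursor> waitingCursors = new ConcurrentLinkedQueue<>();

    // Illustrative guard: if a delayed checkForNewEntries fires after removeConsumer()
    // has already deactivated the cursor, do not put the non-durable cursor back into
    // waitingCursors, otherwise it is retained until the topic is unloaded.
    void addWaitingCursor(ManagedCursor cursor) {
        if (!cursor.isDurable() && !cursor.isActive()) {
            return; // consumer already removed; dropping the cursor avoids the leak
        }
        waitingCursors.add(cursor);
    }
}
```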

Reproduction test

  1. Add a delay to checkForNewEntries, e.g. 10 seconds:

    private void checkForNewEntries(OpReadEntry op, ReadEntriesCallback callback, Object ctx) {
        try {
            if (log.isDebugEnabled()) {
                log.debug("[{}] [{}] Re-trying the read at position {}", ledger.getName(), name, op.readPosition);
            }
            if (!hasMoreEntries()) {
                // ... (rest of the method omitted)

    (screenshot)

  2. Then the test below reproduces the issue.

    @Test
    public void testWaitingCursors() throws Exception {
        final String ns = "prop/ns-test";
        admin.namespaces().createNamespace(ns, 2);
        final String topicName = "persistent://prop/ns-test/testWaitingCursors";
        admin.topics().createNonPartitionedTopic(topicName);
        // Non-durable subscription, so the broker backs it with a NonDurableCursorImpl.
        final Consumer<String> consumer = pulsarClient.newConsumer(Schema.STRING).topic(topicName)
                .subscriptionMode(SubscriptionMode.NonDurable)
                .subscriptionType(SubscriptionType.Exclusive)
                .subscriptionName("sub-2").subscribe();
        final Producer<String> producer = pulsarClient.newProducer(Schema.STRING).topic(topicName).create();
        producer.send("test");
        producer.close();
        final String broker = admin.lookups().lookupTopic(topicName);
        final Optional<Topic> topic = pulsar.getBrokerService().getTopic(topicName, false).join();
        assertNotNull(topic.get());
        PersistentTopic persistentTopic = (PersistentTopic) topic.get();
        final PersistentSubscription subscription = persistentTopic.getSubscription("sub-2");
        NonDurableCursorImpl cursor = (NonDurableCursorImpl) subscription.getCursor();
        final Message<String> receive = consumer.receive();
        assertEquals("test", receive.getValue());
        // Closing the consumer should deactivate the cursor and remove it from waitingCursors,
        // but the delayed checkForNewEntries adds it back.
        consumer.close();
        while (true) {
            ManagedLedgerImpl ledger = (ManagedLedgerImpl) cursor.getManagedLedger();
            log.info("waitingCursorsCount : {}", ledger.getWaitingCursorsCount());
            Thread.sleep(5 * 1000);
        }
    }
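
If the injected delay lets checkForNewEntries run after consumer.close(), the loop above should keep logging a non-zero waitingCursorsCount even though the subscription no longer has any consumers, which matches the accumulation of inactive NonDurableCursorImpl instances seen in the heap dump.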
