
fix: multiple patches during long running tests for LMQ over RocksDB #8915

Open · wants to merge 2 commits into base: develop

Conversation

lizhanhui (Contributor)

Which Issue(s) This PR Fixes

To #8829

Brief Description

How Did You Test This Change?

Comment on lines 89 to 91
} catch (RocksDBException e) {
log.error("Failed to build consume queue in RocksDB", e);
}
Contributor:

If a thrown RocksDBException leads to an endless retry loop without any pause, is that the expected behavior?

Contributor Author:

Overall, yes; it retries until 1) it is interrupted by SIGTERM; 2) the failure is recovered by an internal restart; or 3) it is blocked because the RocksDB thread goes into state D.

If RocksDB encounters an unrecoverable failure, for example due to a hardware fault, this thread will block, since the write options prefer a write stall over fast failure.
If RocksDB experiences recoverable failures, logging the error (with sampling) looks reasonable.

Or, if you are suggesting an alternative, go ahead; comments are welcome.
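
For illustration, here is a minimal sketch of what a paused retry could look like; the ConsumeQueueWriter interface and dispatchWithBackoff method are hypothetical stand-ins, not code from this PR:

import org.rocksdb.RocksDBException;

public class DispatchRetrySketch {
    // Hypothetical stand-in for the dispatch step quoted above.
    interface ConsumeQueueWriter {
        void buildConsumeQueue() throws RocksDBException;
    }

    // Retry with a short pause instead of spinning; stop when interrupted (the SIGTERM path).
    static void dispatchWithBackoff(ConsumeQueueWriter writer) {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                writer.buildConsumeQueue();
                return; // success, stop retrying
            } catch (RocksDBException e) {
                // The PR logs the error and retries immediately; a brief sleep
                // avoids a tight busy-loop when the failure persists.
                try {
                    Thread.sleep(100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}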

private void accountWriteOpsForWalFlush() throws RocksDBException {
int writeCount = writeOpsCounter.incrementAndGet();
if (writeCount >= messageStoreConfig.getRocksdbFlushWalFrequency()) {
this.db.flushWal(false);
Contributor:

Maybe flushWal(true) would be better?

Contributor Author:

The commit-offset frequency may be very high, so I prefer to keep this periodic flush asynchronous.
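
For context, a minimal sketch of the trade-off under discussion, assuming a counter-based trigger like the quoted snippet; the class, constructor, and counter reset are assumptions for illustration, not the PR's code:

import java.util.concurrent.atomic.AtomicInteger;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Sketch of the flushWal(false) vs. flushWal(true) trade-off; hypothetical class, not the PR's code.
class WalFlushAccountingSketch {
    private final RocksDB db;
    private final AtomicInteger writeOpsCounter = new AtomicInteger();
    private final int flushWalFrequency; // stands in for messageStoreConfig.getRocksdbFlushWalFrequency()

    WalFlushAccountingSketch(RocksDB db, int flushWalFrequency) {
        this.db = db;
        this.flushWalFrequency = flushWalFrequency;
    }

    void accountWriteOpsForWalFlush() throws RocksDBException {
        if (writeOpsCounter.incrementAndGet() >= flushWalFrequency) {
            // flushWal(false): hand buffered WAL bytes to the OS without fsync (cheap, async).
            // flushWal(true): additionally fsync, durable on return but slower on a hot path
            // such as frequent commit-offset writes.
            db.flushWal(false);
            writeOpsCounter.set(0); // assumed reset so the next window starts counting from zero
        }
    }
}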

}

protected void initAbleWalWriteOptions() {
this.ableWalWriteOptions = new WriteOptions();
this.ableWalWriteOptions.setSync(false);
this.ableWalWriteOptions.setDisableWAL(false);
this.ableWalWriteOptions.setNoSlowdown(true);
// https://github.com/facebook/rocksdb/wiki/Write-Stalls
this.ableWalWriteOptions.setNoSlowdown(false);
Contributor:

With no fast failure, this may block.

Contributor Author:

Check out RocksGroupCommitService: the group-commit thread is responsible for triggering and awaiting the batch write to RocksDB. If the group-commit buffer (bounded, 100k entries by default) is full, back-pressure is applied to the main dispatch thread, and the dispatch-lag metric then alarms.
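
For readers unfamiliar with the pattern, here is a minimal sketch of a bounded group-commit queue that back-pressures its producer when full; it only illustrates the mechanism described above and is not the actual RocksGroupCommitService:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of the bounded group-commit / back-pressure pattern described above;
// not the actual RocksGroupCommitService implementation.
class GroupCommitSketch<T> {
    // Bounded buffer: when it is full, put() blocks the dispatch thread (back-pressure),
    // which is what makes the dispatch-lag metric rise and alarm.
    private final BlockingQueue<T> buffer = new ArrayBlockingQueue<>(100_000);

    // Called by the main dispatch thread.
    void submit(T request) throws InterruptedException {
        buffer.put(request); // blocks while the 100k-slot buffer is full
    }

    // Run by the group-commit thread: drain a batch and write it to RocksDB in one shot.
    void commitLoop(BatchWriter<T> writer) throws Exception {
        List<T> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            batch.add(buffer.take());  // wait for at least one entry
            buffer.drainTo(batch);     // then grab whatever else is queued
            writer.write(batch);       // single batched write to RocksDB
            batch.clear();
        }
    }

    interface BatchWriter<T> {
        void write(List<T> batch) throws Exception;
    }
}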

…Store#findConsumeQueueMap override

Signed-off-by: Li Zhanhui <[email protected]>