
fix: multiple patches during long running tests for LMQ over RocksDB #8915

Open · wants to merge 2 commits into base: develop

Conversation

lizhanhui (Contributor)

Which Issue(s) This PR Fixes

To #8829

Brief Description

How Did You Test This Change?

Comment on lines 89 to 91
} catch (RocksDBException e) {
log.error("Failed to build consume queue in RocksDB", e);
}
Contributor:

If a thrown RocksDBException leads to an endless retry loop without any pause, is that the expected behavior?

Contributor Author:

Overall, yes; it retries until 1) it is interrupted by SIGTERM; 2) the failure is recovered by an internal restart; or 3) it is blocked because the RocksDB thread goes into state D.

If RocksDB encounters an unrecoverable failure, for example due to a hardware fault, this thread will block, since the write options prefer a write stall over fast failure.
If RocksDB experiences recoverable failures, logging the error (with sampling) looks reasonable.

Or, if you are suggesting an alternative, go ahead; comments are welcome.
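
For illustration, here is a minimal sketch of what a paused retry could look like; the ConsumeQueueWriter interface and dispatchWithBackoff method are hypothetical stand-ins, not code from this PR:

import org.rocksdb.RocksDBException;

public class DispatchRetrySketch {
    // Hypothetical stand-in for the dispatch step quoted above.
    interface ConsumeQueueWriter {
        void buildConsumeQueue() throws RocksDBException;
    }

    // Retry with a short pause instead of spinning; stop when interrupted (the SIGTERM path).
    static void dispatchWithBackoff(ConsumeQueueWriter writer) {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                writer.buildConsumeQueue();
                return; // success, stop retrying
            } catch (RocksDBException e) {
                // The PR logs the error and retries immediately; a brief sleep
                // avoids a tight busy-loop when the failure persists.
                try {
                    Thread.sleep(100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}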

private void accountWriteOpsForWalFlush() throws RocksDBException {
int writeCount = writeOpsCounter.incrementAndGet();
if (writeCount >= messageStoreConfig.getRocksdbFlushWalFrequency()) {
this.db.flushWal(false);
Contributor:

Maybe flushWal(true) would be better?

Contributor Author:

The commit-offset frequency may be very high, so I prefer to keep this periodic flush asynchronous.
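
For context, a minimal sketch of the trade-off under discussion, assuming a counter-based trigger like the quoted snippet; the class, constructor, and counter reset are assumptions for illustration, not the PR's code:

import java.util.concurrent.atomic.AtomicInteger;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Sketch of the flushWal(false) vs. flushWal(true) trade-off; hypothetical class, not the PR's code.
class WalFlushAccountingSketch {
    private final RocksDB db;
    private final AtomicInteger writeOpsCounter = new AtomicInteger();
    private final int flushWalFrequency; // stands in for messageStoreConfig.getRocksdbFlushWalFrequency()

    WalFlushAccountingSketch(RocksDB db, int flushWalFrequency) {
        this.db = db;
        this.flushWalFrequency = flushWalFrequency;
    }

    void accountWriteOpsForWalFlush() throws RocksDBException {
        if (writeOpsCounter.incrementAndGet() >= flushWalFrequency) {
            // flushWal(false): hand buffered WAL bytes to the OS without fsync (cheap, async).
            // flushWal(true): additionally fsync, durable on return but slower on a hot path
            // such as frequent commit-offset writes.
            db.flushWal(false);
            writeOpsCounter.set(0); // assumed reset so the next window starts counting from zero
        }
    }
}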

}

protected void initAbleWalWriteOptions() {
this.ableWalWriteOptions = new WriteOptions();
this.ableWalWriteOptions.setSync(false);
this.ableWalWriteOptions.setDisableWAL(false);
this.ableWalWriteOptions.setNoSlowdown(true);
// https://github.com/facebook/rocksdb/wiki/Write-Stalls
this.ableWalWriteOptions.setNoSlowdown(false);
Contributor:

With no fast failure, this may block.

Contributor Author:

Check out RocksGroupCommitService: the group-commit thread is responsible for triggering and awaiting the batch write to RocksDB. If the group-commit buffer (bounded, 100k entries by default) is full, back-pressure is applied to the main dispatch thread, and the dispatch-lag metric then alarms.
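
For readers unfamiliar with the pattern, here is a minimal sketch of a bounded group-commit queue that back-pressures its producer when full; it only illustrates the mechanism described above and is not the actual RocksGroupCommitService:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of the bounded group-commit / back-pressure pattern described above;
// not the actual RocksGroupCommitService implementation.
class GroupCommitSketch<T> {
    // Bounded buffer: when it is full, put() blocks the dispatch thread (back-pressure),
    // which is what makes the dispatch-lag metric rise and alarm.
    private final BlockingQueue<T> buffer = new ArrayBlockingQueue<>(100_000);

    // Called by the main dispatch thread.
    void submit(T request) throws InterruptedException {
        buffer.put(request); // blocks while the 100k-slot buffer is full
    }

    // Run by the group-commit thread: drain a batch and write it to RocksDB in one shot.
    void commitLoop(BatchWriter<T> writer) throws Exception {
        List<T> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            batch.add(buffer.take());  // wait for at least one entry
            buffer.drainTo(batch);     // then grab whatever else is queued
            writer.write(batch);       // single batched write to RocksDB
            batch.clear();
        }
    }

    interface BatchWriter<T> {
        void write(List<T> batch) throws Exception;
    }
}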

…Store#findConsumeQueueMap override

Signed-off-by: Li Zhanhui <[email protected]>