Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Print additional error handling instructions on RocksDB error. #27170

Closed
wants to merge 1 commit into from
Closed

Print additional error handling instructions on RocksDB error. #27170

wants to merge 1 commit into from

Conversation

yhchiang-sol
Copy link
Contributor

@yhchiang-sol yhchiang-sol commented Aug 16, 2022

Problem

When a validator stops due to certain rocksdb error, we would like to include
additional information or instructions about the error and how to handle it.

Summary of Changes

This PR introduces and uses rocks_try! macro that allows rocksdb call to
provide additional information in the validator.log when the rocksdb call
returns non-Ok status.

@yhchiang-sol yhchiang-sol marked this pull request as draft August 16, 2022 05:42
@yhchiang-sol yhchiang-sol requested a review from ryoqun August 16, 2022 17:37
@yhchiang-sol yhchiang-sol marked this pull request as ready for review August 16, 2022 17:37
@@ -254,6 +254,18 @@ struct Rocks {
write_batch_perf_status: PerfSamplingStatus,
}

macro_rules! rocks_try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why macro is needed? i think generic wrapper fn should suffice at quick glance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn will work as well, but since we are doing this only for rare errors, macro_rules! should have less performance impact on the normal case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i don't think using macro_rules! will bring better performance here... could you elaborate a bit? I'm currently thinking the fn version would surely be inlined, just like the macro one.


fn print_error_handling_instruction(rocks_error: &rocksdb::Error) {
let error_msg = format!("{:?}", rocks_error);
if error_msg.contains("Corruption:") {
Copy link
Contributor

@ryoqun ryoqun Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, oh, I didn't know this is available in rust-rocksdb. Thanks for pointing it out!

@@ -441,17 +453,17 @@ impl Rocks {
}

fn get_cf(&self, cf: &ColumnFamily, key: &[u8]) -> Result<Option<Vec<u8>>> {
let opt = self.db.get_cf(cf, key)?;
let opt = rocks_try!(self.db.get_cf(cf, key));
Copy link
Contributor

@ryoqun ryoqun Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why get_cf/put_cf/delete_cf/delete_file_in_range_cf/write are only chosen to look for corruption errors?

are we certain this is the only source of the error?

Copy link
Contributor

@ryoqun ryoqun Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, i think this is a bit maintenance annoyance, when adding more api surface into rocks.

instead, i'd propose remove this auto-derived From trait impl and write our own one.in that way, type safety may bless us. :) In that custom impl, the fn would take (err, atomicbool) and set true to the bool if ErrorKind::Corruption is detected.

RocksDb(#[from] rocksdb::Error),

Copy link
Contributor Author

@yhchiang-sol yhchiang-sol Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why get_cf/put_cf/delete_cf/delete_file_in_range_cf/write are only chosen to look for corruption errors?
are we certain this is the only source of the error?

These are just starters :p. Ideally, it should apply to all rocksdb calls. This PR only takes care of those rocksdb calls inside struct Rocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another one that we cannot catch is the corruption reported via rocksdb's background compaction job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead, i'd propose remove this auto-derived From trait impl and write our own one.in that way, type safety may bless us. :) In that custom impl, the fn would take (err, atomicbool) and set true to the bool if ErrorKind::Corruption is detected.

This sounds like a good idea. Let me think more about it as we want it to be general enough to cover all types of severe/irrecoverable errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another one that we cannot catch is the corruption reported via rocksdb's background compaction job.

oh, that's good point. btw, can we be notified these errors in some way like registering error listeners to rocks? if it's easy, let's do this. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that's good point. btw, can we be notified these errors in some way like registering error listeners to rocks? if it's easy, let's do this. :)

This could be done via the EventListener::OnBackgroundError() callback, but the EventListener API isn't available in RocksDB's C API (and thus not available in the rust-rocksdb API).

}
}};
}

impl Rocks {
fn open(path: &Path, options: BlockstoreOptions) -> Result<Rocks> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to past bug reports, wrapping open isn't enough, i guess? these corruption error needs a full scan of entire sst files. i doubt it will be done everytime when opening db. ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RocksDB reports corruption only when it actually read that block, which IMO is a good thing because this will allow us to open the db and obtain file metadata in our ledger tool. For instance, #26790 allows our ledger tool to print out slot range of a sst file, and #26905 allows our ledger tool to purge older slots to free up space. They can both operate on corrupted sst files.

fn print_error_handling_instruction(rocks_error: &rocksdb::Error) {
let error_msg = format!("{:?}", rocks_error);
if error_msg.contains("Corruption:") {
error!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I don't think error! logs is enough because people won't see logs that carefully. I think we should gracefully terminate the validator process. so, i know it's a bit tedious, but i'd rather like to plumb the exit atomicbool way down to here from the validator startup process.

@ryoqun
Copy link
Contributor

ryoqun commented Aug 17, 2022

@yhchiang-sol unfortunately, devils is in the details.. it looks like my small favor will need more work, than anticipated.... thanks in advance!

@yhchiang-sol
Copy link
Contributor Author

@yhchiang-sol unfortunately, devils is in the details.. it looks like my small favor will need more work, than anticipated.... thanks in advance!

No worries. Let me turn it back to draft and bake it a little bit!

@yhchiang-sol yhchiang-sol marked this pull request as draft August 18, 2022 01:32
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 29, 2022
@github-actions github-actions bot closed this Jan 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants