-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Print additional error handling instructions on RocksDB error. #27170
Conversation
@@ -254,6 +254,18 @@ struct Rocks { | |||
write_batch_perf_status: PerfSamplingStatus, | |||
} | |||
|
|||
macro_rules! rocks_try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why macro is needed? i think generic wrapper fn
should suffice at quick glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn
will work as well, but since we are doing this only for rare errors, macro_rules!
should have less performance impact on the normal case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i don't think using macro_rules!
will bring better performance here... could you elaborate a bit? I'm currently thinking the fn
version would surely be inlined, just like the macro one.
|
||
fn print_error_handling_instruction(rocks_error: &rocksdb::Error) { | ||
let error_msg = format!("{:?}", rocks_error); | ||
if error_msg.contains("Corruption:") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think rocksdb::Error::kind() can be used?
https://docs.rs/rocksdb/latest/rocksdb/struct.Error.html#method.kind
https://docs.rs/rocksdb/latest/rocksdb/enum.ErrorKind.html#variant.Corruption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, oh, I didn't know this is available in rust-rocksdb. Thanks for pointing it out!
@@ -441,17 +453,17 @@ impl Rocks { | |||
} | |||
|
|||
fn get_cf(&self, cf: &ColumnFamily, key: &[u8]) -> Result<Option<Vec<u8>>> { | |||
let opt = self.db.get_cf(cf, key)?; | |||
let opt = rocks_try!(self.db.get_cf(cf, key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why get_cf
/put_cf
/delete_cf
/delete_file_in_range_cf
/write
are only chosen to look for corruption errors?
are we certain this is the only source of the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, i think this is a bit maintenance annoyance, when adding more api surface into rocks.
instead, i'd propose remove this auto-derived From
trait impl and write our own one.in that way, type safety may bless us. :) In that custom impl, the fn would take (err, atomicbool)
and set true
to the bool if ErrorKind::Corruption
is detected.
solana/ledger/src/blockstore_db.rs
Line 104 in 1eba91a
RocksDb(#[from] rocksdb::Error), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why get_cf/put_cf/delete_cf/delete_file_in_range_cf/write are only chosen to look for corruption errors?
are we certain this is the only source of the error?
These are just starters :p. Ideally, it should apply to all rocksdb calls. This PR only takes care of those rocksdb calls inside struct Rocks
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another one that we cannot catch is the corruption reported via rocksdb's background compaction job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead, i'd propose remove this auto-derived From trait impl and write our own one.in that way, type safety may bless us. :) In that custom impl, the fn would take (err, atomicbool) and set true to the bool if ErrorKind::Corruption is detected.
This sounds like a good idea. Let me think more about it as we want it to be general enough to cover all types of severe/irrecoverable errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another one that we cannot catch is the corruption reported via rocksdb's background compaction job.
oh, that's good point. btw, can we be notified these errors in some way like registering error listeners to rocks? if it's easy, let's do this. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, that's good point. btw, can we be notified these errors in some way like registering error listeners to rocks? if it's easy, let's do this. :)
This could be done via the EventListener::OnBackgroundError()
callback, but the EventListener API isn't available in RocksDB's C API (and thus not available in the rust-rocksdb API).
} | ||
}}; | ||
} | ||
|
||
impl Rocks { | ||
fn open(path: &Path, options: BlockstoreOptions) -> Result<Rocks> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to past bug reports, wrapping open
isn't enough, i guess? these corruption error needs a full scan of entire sst files. i doubt it will be done everytime when opening db. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RocksDB reports corruption only when it actually read that block, which IMO is a good thing because this will allow us to open the db and obtain file metadata in our ledger tool. For instance, #26790 allows our ledger tool to print out slot range of a sst file, and #26905 allows our ledger tool to purge older slots to free up space. They can both operate on corrupted sst files.
fn print_error_handling_instruction(rocks_error: &rocksdb::Error) { | ||
let error_msg = format!("{:?}", rocks_error); | ||
if error_msg.contains("Corruption:") { | ||
error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, I don't think error!
logs is enough because people won't see logs that carefully. I think we should gracefully terminate the validator process. so, i know it's a bit tedious, but i'd rather like to plumb the exit
atomicbool way down to here from the validator startup process.
@yhchiang-sol unfortunately, devils is in the details.. it looks like my small favor will need more work, than anticipated.... thanks in advance! |
No worries. Let me turn it back to draft and bake it a little bit! |
Problem
When a validator stops due to certain rocksdb error, we would like to include
additional information or instructions about the error and how to handle it.
Summary of Changes
This PR introduces and uses rocks_try! macro that allows rocksdb call to
provide additional information in the validator.log when the rocksdb call
returns non-Ok status.