Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Rocksdb corruption issues on compaction #9009

Open
sakridge opened this issue Mar 22, 2020 · 21 comments
Open

Rocksdb corruption issues on compaction #9009

sakridge opened this issue Mar 22, 2020 · 21 comments
Assignees
Milestone

Comments

@sakridge
Copy link
Contributor

Problem

Seen messages like this on many validators from rocksdb.

[2020-03-22T10:22:48.766837593Z ERROR solana_core::window_service] thread 
Some("solana-window-insert") error BlockstoreError(RocksDb(Error { 
message: "Corruption: block checksum mismatch: expected 3583270445, got 3398136873  
in /mnt/vol1/ledger/rocksdb/165855.sst offset 25107936 size 3758" }))

Proposed Solution

Debug and fix.

@sakridge
Copy link
Contributor Author

Linked rocksdb issue:
facebook/rocksdb#6435

@mvines mvines added this to the v1.0.9 milestone Mar 22, 2020
@mvines mvines modified the milestones: v1.0.9, v1.0.10 Mar 24, 2020
@brianlong
Copy link
Contributor

brianlong commented Mar 24, 2020

The --no-rocksdb-compaction flag resulted in delinquent validators for Pete -- LunaNova and me. We are both running without it.

@mvines mvines modified the milestones: v1.0.10, v1.0.11 Mar 26, 2020
@sakridge sakridge self-assigned this Mar 26, 2020
@mvines mvines modified the milestones: v1.0.11, v1.0.12, v1.0.13, v1.1.1, v1.2.0 Mar 30, 2020
@mvines
Copy link
Contributor

mvines commented May 19, 2020

I've not heard any reports of this issue recently, closing for now

@mvines mvines closed this as completed May 19, 2020
@ryoqun ryoqun reopened this Aug 11, 2021
@ryoqun
Copy link
Contributor

ryoqun commented Aug 11, 2021

actually, there is sporadic reports of this (search our discord server).....

@ryoqun
Copy link
Contributor

ryoqun commented Aug 11, 2021

I've been waiting to exclude our unusual manual compaction codepath could possibly be causing this. But, after the write stall fix is well rolled out, there's still reports. So, I dug up a bit:

$ diff -U3 <(xxd slot-88733596-shreds-good) <(xxd slot-88733596-shreds-bad) 
--- /dev/fd/63  2021-08-11 22:53:03.943777438 +0900
+++ /dev/fd/62  2021-08-11 22:53:03.943777438 +0900
@@ -78745,7 +78745,7 @@
 00133980: 3033 4145 4532 3846 4630 4238 3838 4442  03AEE28FF0B888DB
 00133990: 3746 3146 4331 4534 3939 3030 4238 4331  7F1FC1E49900B8C1
 001339a0: 3633 3930 4637 3842 3234 4132 4134 3630  6390F78B24A2A460
-001339b0: 3841 4531 3644 3033 4331 4231 3743 3330  8AE16D03C1B17C30
+001339b0: 3941 4531 3644 3033 4331 4231 3743 3330  9AE16D03C1B17C30
 001339c0: 3641 3744 3531 3731 3932 4630 4141 4643  6A7D517192F0AAFC
 001339d0: 3646 3236 3545 3346 4237 3743 4337 4144  6F265E3FB77CC7AD
 001339e0: 4138 3243 3532 3944 3042 4533 4231 3336  A82C529D0BE3B136

making a fun debugging story short, this boils down rocksdb actually caught single bit rot from underlying hardware.

So, our next action here would be to properly handle the error and show message and abort solana-validator/solana-ledger-tool:

Unfortunately, RocksDb detected data corruption. This usually means your ledger integrity is irrevocably compromised due to underlying hardware problem like silent data corruption and can't continue to operate. If possible, bug report to #9009 with the mentioned SST file attached. After that, remove your current ledger to restore operation or restore using your backup system. Original RocksDB error: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1822493296, computed = 125871199 in /home/solana/validator-ledger/rocksdb/005714.sst offset 2076534 size 4631" })

@calypso-it
Copy link

@ryoqun @sakridge As requested, a crash report: We experienced it on 1.10.27:

Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.250867140Z INFO  solana_ledger::blockstore_processor] processing ledger from slot 138536385...
Jun 25 21:53:12 testnet testnet.sh[499257]: thread 'main' panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1  in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271" })', ledger/src/blockstore_processor.rs:677:14
Jun 25 21:53:12 testnet testnet.sh[499257]: stack backtrace:
Jun 25 21:53:12 testnet testnet.sh[499257]:    0: rust_begin_unwind
Jun 25 21:53:12 testnet testnet.sh[499257]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 25 21:53:12 testnet testnet.sh[499257]:    1: core::panicking::panic_fmt
Jun 25 21:53:12 testnet testnet.sh[499257]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 25 21:53:12 testnet testnet.sh[499257]:    2: core::result::unwrap_failed
Jun 25 21:53:12 testnet testnet.sh[499257]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:1690:5
Jun 25 21:53:12 testnet testnet.sh[499257]:    3: solana_ledger::blockstore_processor::process_blockstore_from_root
Jun 25 21:53:12 testnet testnet.sh[499257]:    4: solana_core::validator::Validator::new
Jun 25 21:53:12 testnet testnet.sh[499257]:    5: solana_validator::main
Jun 25 21:53:12 testnet testnet.sh[499257]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.258596408Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="main" one=1i message="panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: \"Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1  in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271\" })', ledger/src/blockstore_processor.rs:677:14" location="ledger/src/blockstore_processor.rs:677:14" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

File: https://github.com/calypso-it/sol-rocksdb-panic/blob/main/011080.sst

@sakridge
Copy link
Contributor Author

@yhchiang-sol could you take a look at that report? Is there a way we can recover from checksum mismatch maybe by removing that data which is corrupted and re-downloading it?

@yhchiang-sol
Copy link
Contributor

@yhchiang-sol could you take a look at that report? Is there a way we can recover from checksum mismatch maybe by removing that data which is corrupted and re-downloading it?

Should be something doable, but we need to write some code to make it happen.

Since the corruption happens in one or more blocks inside a sst file, it means we might still be able to read its sst file metadata to obtain its key range. If so, then we might be able to recover using the following steps:

  • obtain the key range of the file via the get_column_family_metadata() API
  • delete the file using the delete_file API (which may break the consistency in the deleted file has any deletion entries, but it's better than not able to read the file.)
  • based on the key range obtained in step 1, redownload those keys and reinsert them to rocksdb.

Note that there would be some subtle difference between the recovered db and the db before the corruption as the recovered entries will become newer than they used to be, but if we always read the db without snapshot, then we should be fine.

@yhchiang-sol
Copy link
Contributor

Note that the assumption here is we are still able to read the metadata of the sst file that has one or more block corruption.

Btw, in case the corrupted instance is still there, can I have a copy of that instance for debug purposes?

@calypso-it
Copy link

@yhchiang-sol What files do you mean by "corrupted instance"?
The issues re-appears frequently, so I can save them on the next crash

@yhchiang-sol
Copy link
Contributor

@yhchiang-sol What files do you mean by "corrupted instance"?

Ideally, the entire rocksdb database, but that will definitely be too big to keep.
And, thanks for attaching the corrupted sst file, I just saw that in your previous message.
I will do some debugging.

The issues re-appears frequently, so I can save them on the next crash

Thank you, can you save the sst file as well on the next crash.

Btw, can I know whether this happens frequently to all your validators or just one validator if you happen to have multiple validators?

@yhchiang-sol yhchiang-sol self-assigned this Jun 30, 2022
@ryoqun
Copy link
Contributor

ryoqun commented Jun 30, 2022

(i think this is another data corruption. let me revive the last investigation patch...)

@ryoqun
Copy link
Contributor

ryoqun commented Jun 30, 2022

@yhchiang-sol @sakridge

so, this time wan't that the root cause is a single bit flip. but there was a oddly zero-ed range in the reported sst file:

$ xxd /home/ryoqun/Downloads/011080.sst
...
0233be70: 0000 0000 0000 d953 4108 0000 0000 1f00  .......SA.......
0233be80: 0000 da53 4108 0000 0000 1e00 0000 db53  ...SA..........S
0233be90: 4108 0000 0000 1d00 0000 dc53 4108 0000  A..........SA...
0233bea0: 0000 1c00 0000 ec53 4108 0000 0000 1b00  .......SA.......
0233beb0: 0000 ed53 4108 0000 0000 1a00 0000 ee53  ...SA..........S
0233bec0: 4108 0000 0000 1900 0000 ef53 4108 0000  A..........SA...
0233bed0: 0000 1800 0000 f053 4108 0000 0000 1700  .......SA.......
0233bee0: 0000 f153 4108 0000 0000 1600 0000 f853  ...SA..........S
0233bef0: 4108 0000 0000 1500 0000 0854 4108 0000  A..........TA...
0233bf00: 0000 1400 0000 0954 4108 0000 0000 1300  .......TA.......
0233bf10: 0000 1054 4108 0000 0000 1200 0000 1154  ...TA..........T
0233bf20: 4108 0000 0000 1100 0000 1254 4108 0000  A..........TA...
0233bf30: 0000 1000 0000 1354 4108 0000 0000 0f00  .......TA.......
0233bf40: 0000 1454 4108 0000 0000 0e00 0000 1554  ...TA..........T
0233bf50: 4108 0000 0000 0d00 0000 1c54 4108 0000  A..........TA...
0233bf60: 0000 0c00 0000 1d54 4108 0000 0000 0b00  .......TA.......
0233bf70: 0000 2454 4108 0000 0000 0a00 0000 2554  ..$TA.........%T
0233bf80: 4108 0000 0000 0900 0000 2654 4108 0000  A.........&TA...
0233bf90: 0000 0800 0000 2754 4108 0000 0000 0700  ......'TA.......
0233bfa0: 0000 3d54 4108 0000 0000 0600 0000 4454  ..=TA.........DT
0233bfb0: 4108 0000 0000 0500 0000 5054 4108 0000  A.........PTA...
0233bfc0: 0000 0400 0000 5154 4108 0000 0000 0300  ......QTA.......
0233bfd0: 0000 5254 4108 0000 0000 0200 0000 5354  ..RTA.........ST
0233bfe0: 4108 0000 0000 010f 098b 06e3 019d 93db  A...............
0233bff0: 1700 0000 ff5f 9d41 8d71 076a 2217 cb9a  ....._.A.q.j"...
0233c000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0233c070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
...
0235bfc0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0235bfd0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0235bfe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0235bff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0235c000: 0104 27b7 6200 0000 0001 1686 15da 7295  ..'.b.........r.
[1] pry(main)> "%0x" % (0x0235c000 - 0x0233c000)
=> "20000"

as far as I understand, sst files serialization isn't aligned that way. Also, sst files won't waste file space with those zeros under its usual operation.

so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.

@ryoqun
Copy link
Contributor

ryoqun commented Jun 30, 2022

... so rocksdb is innocent for this case as well. :)

@yhchiang-sol
Copy link
Contributor

Thanks for adding some data points, @ryoqun.

Btw, in case the underlying filesystem/hardware is the root cause, we are still interested in providing the recovery tool to recover the sst file?

@ryoqun
Copy link
Contributor

ryoqun commented Jun 30, 2022

@calypso-it hi, did you experience this panic at testnet or mainnet-beta?

@calypso-it
Copy link

calypso-it commented Jun 30, 2022

@ryoqun

hi, did you experience this panic at testnet or mainnet-beta?

Testnet

so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.

That was my suspicion, too. I will change the NVMe and see, if the issue persists. Nonetheless auto-recovery / repair would be a nice feature, if repairable

@yhchiang-sol
Copy link
Contributor

Nonetheless auto-recovery / repair would be a nice feature, if repairable

Good to hear that. I will invest some time into this.

@yhchiang-sol
Copy link
Contributor

yhchiang-sol commented Jul 27, 2022

Aghh, a bit show stopper of my previously proposed solution based on my WIP PR (#26790) as we are not able to obtain the key range of the corrupted sst file even if the corruption only happens on the data block.

Output from a normal sst file:

[2022-07-27T10:33:05.218841891Z INFO  solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
cf_name: "transaction_status", start_slot: 143304188, end_slot: 143306321
[2022-07-27T10:33:05.286706397Z INFO  solana_ledger_tool] ledger tool took 4.8s

output from a corrupted sst file, where the tool is not able to obtain the sst file metadata. Probably need to change rocksdb code in order to bypass the check.

[2022-07-27T10:43:59.747644892Z INFO  solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
Failed to find the specified file "006024.sst"
[2022-07-27T10:43:59.814122652Z INFO  solana_ledger_tool] ledger tool took 4.8s

@yhchiang-sol
Copy link
Contributor

Aghh, a bit show stopper of my previously proposed solution based on my WIP PR (#26790) as we are not able to obtain the key range of the corrupted sst file even if the corruption only happens on the data block.

Please ignore my previous comment as it turns out that xxd somehow always padding one more byte that cause the file unreadable to rocksdb.

If the corruption only happens in the data block of an sst file without changing the file size, then we are able to obtain the key range of that corrupted file! This opens the possibility that we might be able to replace the corrupted file by copying uncorrupted data of that key range from a healthy validator instance.

[2022-07-27T12:19:46.258790292Z INFO  solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.4s
cf_name: "code_shred", start_key: 143342061, end_key: 143342160
[2022-07-27T12:19:46.302937014Z INFO  solana_ledger_tool] ledger tool took 4.5s

@yhchiang-sol
Copy link
Contributor

Created a separate issue for recovery tools: #26813

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants