Handle removing slots during account scans #17471
carllin merged 12 commits into solana-labs:master
Conversation
| #[error("KeyExcludedFromSecondaryIndex")] | ||
| KeyExcludedFromSecondaryIndex { index_key: String }, | ||
| #[error("ScanError")] | ||
| ScanError(#[from] ScanError), |
@CriesofCarrots Would like your opinion on this proposed way of propagating the error up to the calling client and the associated logic in rpc.rs
I guess the question is, do these details need to be exposed to the calling client? Or would a generic internal_error be sufficient (along with logging the details on the server side)?
@CriesofCarrots yeah I didn't see any downside in exposing the exact details to the client, it's just another error anyways. This is a pretty exceptional case, so if a client complains about this condition, we know what the details are exactly without having to dig through logs.
Okay, then. If you're sure it's worth the churn, your error/result implementation looks a-okay!
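For reference, a minimal sketch of the propagation shape being discussed. The `thiserror` derives mirror the diff above, but the field names, the error-code value, and the `to_rpc_error` helper are simplified stand-ins, not the actual rpc.rs code:

```rust
use thiserror::Error;

// Simplified stand-ins for the runtime-side scan error and the accounts-layer
// error that wraps it (mirrors the `#[from]` conversion in the diff above).
#[derive(Error, Debug, Clone, PartialEq)]
pub enum ScanError {
    #[error("slot {slot} with bank id {bank_id} was removed while the scan was running")]
    SlotRemoved { slot: u64, bank_id: u64 },
}

#[derive(Error, Debug)]
pub enum AccountsError {
    #[error("ScanError")]
    ScanError(#[from] ScanError),
}

// Illustrative server-error code only; the real constant lives in the RPC crate.
const JSON_RPC_SCAN_ERROR: i64 = -32_012;

// The RPC layer maps the runtime error into a client-facing (code, message) pair,
// so the caller sees the exact failure details instead of a generic internal error.
fn to_rpc_error(err: AccountsError) -> (i64, String) {
    match err {
        AccountsError::ScanError(scan_err) => (JSON_RPC_SCAN_ERROR, scan_err.to_string()),
    }
}

fn main() {
    let err = AccountsError::from(ScanError::SlotRemoved { slot: 100, bank_id: 7 });
    println!("{:?}", to_rpc_error(err));
}
```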
Err(ScanError::SlotRemoved {
    slot: ancestors.max(),
    slot_id: scan_slot_id,
})
@CriesofCarrots this is where the new error is first created
// TODO: Figure out how to properly handle errors here
.unwrap_or_default();
@CriesofCarrots I'm not 100% sure of the best way to handle this yet, specifically where this method is called from process_rest() in impl RequestMiddleware for RpcRequestMiddleware. Is there a good way to plumb an error up from there?
Oh looks like that path in process_rest() always uses a root_bank, and only unrooted forks can be removed, so maybe it's safe to just unwrap there
Yeah, looks right to me.
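A tiny sketch of what the agreed-upon handling might look like at that call site; the function signature and message here are placeholders, not the actual rpc_service.rs code:

```rust
// Placeholder signature; the real call site is in `process_rest()`.
fn total_supply(scan_result: Result<u64, String>) -> u64 {
    // Only unrooted forks can be dumped by remove_unrooted_slots(), and this
    // path always scans a root bank, so a scan abort is unreachable here.
    scan_result.expect("scans on a root bank cannot be aborted by slot removal")
}
```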
if self.skip_drop.load(Relaxed) {
    return;
}
We remove this because now, even after a call to remove_unrooted_slot() on a bank, we still want to run purge_slot to remove the bank's entry from the newly introduced AccountsIndex::removed_bank_ids.
// Note: we cannot remove this slot from the slot cache until we've removed its
// entries from the accounts index first. This is because `scan_accounts()` relies on
// holding the index lock, finding the index entry, and then looking up the entry
// in the cache. If it fails to find that entry, it will panic in `get_loaded_account()`
if let Some(slot_cache) = self.accounts_cache.slot_cache(*remove_slot) {
Note: the issue described here was not a problem before, because a scan would never run concurrently with purge_slots_from_cache_and_store(); that function only ran via Bank::drop(), and a scan on a bank means the caller held a reference to that bank, so its drop() could not have run yet.
Now, however, because remove_unrooted_slots() can call into purge_slots_from_cache_and_store(), which can happen while a scan is running, we need to account for this case.
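An illustrative sketch (not the real AccountsDb code) of the ordering constraint being described: the purge removes the slot's index entries before touching the slot cache, so a concurrent scan that walks index entry → cache entry never finds a dangling index entry:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

type Slot = u64;
type Pubkey = [u8; 32];

// Simplified stand-ins for the accounts index and the write cache.
struct Db {
    index: Mutex<HashMap<Pubkey, HashSet<Slot>>>,
    cache: Mutex<HashMap<Slot, HashMap<Pubkey, Vec<u8>>>>,
}

impl Db {
    fn purge_slot_from_cache_and_index(&self, remove_slot: Slot) {
        // 1) Remove the slot's entries from the accounts index first...
        {
            let mut index = self.index.lock().unwrap();
            for slots in index.values_mut() {
                slots.remove(&remove_slot);
            }
        }
        // 2) ...and only then drop the slot from the cache. A scan does
        // "find index entry, then look up the cache entry"; if the cache
        // entry vanished first, the scan could find an index entry with no
        // backing data and panic (the `get_loaded_account()` case above).
        self.cache.lock().unwrap().remove(&remove_slot);
    }
}

fn main() {
    let db = Db {
        index: Mutex::new(HashMap::new()),
        cache: Mutex::new(HashMap::new()),
    };
    db.purge_slot_from_cache_and_index(5);
}
```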
{
    let mut locked_removed_bank_ids = self.accounts_index.removed_bank_ids.lock().unwrap();
    for (_slot, remove_bank_id) in remove_slots.iter() {
        locked_removed_bank_ids.insert(*remove_bank_id);
    }
}
Marking that these banks' account states have been removed.
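For clarity, a hedged sketch of the bookkeeping shown in the diff; the surrounding struct and method names are simplified, only `removed_bank_ids` mirrors the PR:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type Slot = u64;
type BankId = u64;

struct AccountsIndex {
    // Bank ids whose account state has been dumped by remove_unrooted_slots().
    removed_bank_ids: Mutex<HashSet<BankId>>,
}

impl AccountsIndex {
    // Called while dumping slots: record the bank ids being removed so that
    // in-flight and future scans against those banks can detect it and abort.
    fn mark_banks_removed(&self, remove_slots: &[(Slot, BankId)]) {
        let mut locked_removed_bank_ids = self.removed_bank_ids.lock().unwrap();
        for (_slot, remove_bank_id) in remove_slots.iter() {
            locked_removed_bank_ids.insert(*remove_bank_id);
        }
    }

    fn bank_was_removed(&self, bank_id: BankId) -> bool {
        self.removed_bank_ids.lock().unwrap().contains(&bank_id)
    }
}

fn main() {
    let index = AccountsIndex { removed_bank_ids: Mutex::new(HashSet::new()) };
    index.mark_banks_removed(&[(100, 7)]);
    assert!(index.bank_was_removed(7));
}
```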
Codecov Report
@@            Coverage Diff            @@
##           master   #17471    +/-   ##
=========================================
- Coverage    82.6%    82.6%    -0.1%
=========================================
  Files         431      431
  Lines      120996   121269    +273
=========================================
+ Hits        99995   100197    +202
- Misses      21001    21072     +71
    data: None,
},
RpcCustomError::ScanError(scan_err) => Self {
    code: ErrorCode::ServerError(JSON_RPC_REMOVED_SLOT),
How do we know in this context that the error is removed_slot? I noticed the enum has only SlotRemoved, so it makes sense for now. Maybe we should still do a further match on it to confirm it really is that error, in case the catalog of scan errors is expanded.
Renamed JSON_RPC_REMOVED_SLOT -> JSON_RPC_SCAN_ERROR to cover all possible scan errors.
}
}

// If the fork with tip at bank `scan_bank_id` was removed during our scan, then the scan
And how do we know it really happened "during" the scan, and not after it but before this check is made? If the latter case can also happen, I'm not sure what the value is of having multiple checks before and after the scan operation.
The check before the scan operation is to avoid incurring the cost of the scan if we already know that the slot has been dumped
The check after the scan operation is to account for slot dumping that occurred during the scan.
It's as you said, if the dump happens after the scan was finished and before this check is made, then we chalk this up to bad luck and still abort the results. I don't know of a great way to avoid this without blocking the remove during an ongoing scan. Also if a slot was aborted, its results are probably not usable to the caller anyways, so I think this should be ok.
If the latter case can happen, i.e. after the scan and before the check, then it's also possible the removal happens right after we did the check and got a negative answer, while we've already decided to return success. Would that have any correctness concern?
@lijunwangs correct that can happen as well. That should be ok because the results returned in that case are guaranteed to not have been corrupted by the slot dumping, which upholds our guarantee that scans should be consistent up to slot boundaries.
The client will get the results from a stale slot, but that's ok because if they are not using strong enough consistency queries in RPC, then rollback is to be expected.
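Putting the thread together, a hedged sketch (simplified types, not the exact accounts_db code) of the before/after check pattern: skip the scan if the bank was already dumped, abort the results if it was dumped while the scan was running, and accept that a dump landing after the second check cannot have corrupted results that were already fully produced:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type BankId = u64;

#[derive(Debug, PartialEq)]
enum ScanError {
    SlotRemoved { bank_id: BankId },
}

struct Index {
    removed_bank_ids: Mutex<HashSet<BankId>>,
}

impl Index {
    fn checked_scan(&self, scan_bank_id: BankId) -> Result<Vec<u64>, ScanError> {
        // Check 1 (before): don't pay for the scan at all if the fork tipped
        // at `scan_bank_id` has already been dumped.
        if self.removed_bank_ids.lock().unwrap().contains(&scan_bank_id) {
            return Err(ScanError::SlotRemoved { bank_id: scan_bank_id });
        }

        // Stand-in for walking the accounts index / cache.
        let results: Vec<u64> = (0..3).collect();

        // Check 2 (after): if the fork was dumped while the scan ran, the
        // results may be missing accounts or mixing state from another fork,
        // so throw them away and surface the error instead.
        if self.removed_bank_ids.lock().unwrap().contains(&scan_bank_id) {
            return Err(ScanError::SlotRemoved { bank_id: scan_bank_id });
        }
        Ok(results)
    }
}

fn main() {
    let index = Index { removed_bank_ids: Mutex::new(HashSet::new()) };
    assert!(index.checked_scan(7).is_ok());
    index.removed_bank_ids.lock().unwrap().insert(7);
    assert_eq!(index.checked_scan(7), Err(ScanError::SlotRemoved { bank_id: 7 }));
}
```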
(cherry picked from commit ccc013e) # Conflicts: # runtime/src/bank.rs
(cherry picked from commit ccc013e)
Problem
Built on top of #17269, addressing the TODO here: solana/runtime/src/accounts_db.rs, line 3411 (commit 3d0cacc).
While a scan on some fork at bank `B` is happening, the slots on that fork may be removed by remove_unrooted_slot(). This means the scan results will potentially be inconsistent or missing accounts.
Summary of Changes
If we detect a fork has been dumped during a scan, the scan result will be aborted. Important changes are summarized in comments below starting with #17471 (comment)
Fixes #