Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Handle removing slots during account scans#17471

Merged
carllin merged 12 commits intosolana-labs:masterfrom
carllin:SafeScan
Jun 15, 2021
Merged

Handle removing slots during account scans#17471
carllin merged 12 commits intosolana-labs:masterfrom
carllin:SafeScan

Conversation

@carllin
Copy link
Copy Markdown
Contributor

@carllin carllin commented May 25, 2021

Problem

Built on top of #17269, addressing the TODO: here:

// TODO: This is currently unsafe with scan because it can remove a slot in the middle

While a scan on some fork at bank B is happening, the slots on that fork may be removed by remove_unrooted_slot() if:

  1. There was a duplicate slot on that fork
  2. We detect another confirmed version of that slot

This means the scan results will be potentially inconsistent/missing accounts

Summary of Changes

If we detect a fork has been dumped during a scan, the scan result will be aborted. Important changes are summarized in comments below starting with #17471 (comment)

Fixes #

@carllin carllin changed the title Safe scan Handle removing slots during account scans May 25, 2021
Comment thread client/src/rpc_custom_error.rs Outdated
#[error("KeyExcludedFromSecondaryIndex")]
KeyExcludedFromSecondaryIndex { index_key: String },
#[error("ScanError")]
ScanError(#[from] ScanError),
Copy link
Copy Markdown
Contributor Author

@carllin carllin May 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CriesofCarrots Would like your opinion on this proposed way of propagating the error up to the calling client and the associated logic in rpc.rs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is, do these details need to be exposed to the calling client? Or would a generic internal_error be sufficient (along with logging the details on the server side)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CriesofCarrots yeah I didn't see any downside in exposing the exact details to the client, it's just another error anyways. This is a pretty exceptional case, so if a client complains about this condition, we know what the details are exactly without having to dig through logs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then. If you're sure it's worth the churn, your error/result implementation looks a-okay!

Comment on lines +793 to +876
Err(ScanError::SlotRemoved {
slot: ancestors.max(),
slot_id: scan_slot_id,
})
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CriesofCarrots this is where the new error is first created

Comment thread runtime/src/non_circulating_supply.rs Outdated
Comment on lines +45 to +46
// TODO: Figure out how to properly handle errors here
.unwrap_or_default();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CriesofCarrots not 100% sure the best way to handle this yet, specifically where this method is called from process_rest() in impl RequestMiddleware for RpcRequestMiddleware, is there a best way to plumb up an error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh looks like that path in process_rest() always uses a root_bank, and only unrooted forks can be removed, so maybe it's safe to just unwrap there

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks right to me.

@carllin carllin force-pushed the SafeScan branch 6 times, most recently from 949bcc2 to 4c00fa1 Compare May 26, 2021 09:49
@carllin carllin force-pushed the SafeScan branch 9 times, most recently from 6da92cf to 3072545 Compare June 4, 2021 03:35
@carllin carllin force-pushed the SafeScan branch 2 times, most recently from f162230 to a9ea752 Compare June 4, 2021 23:34
@carllin carllin marked this pull request as ready for review June 5, 2021 02:47
Comment thread runtime/src/bank.rs
Comment on lines -5244 to -5253
if self.skip_drop.load(Relaxed) {
return;
}

Copy link
Copy Markdown
Contributor Author

@carllin carllin Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We remove this because now even after a call to remove_unrooted_slot() on a bank, we still want to run purge_slot to remove the entry for the bank from the newly introduced AccountsIndex::removed_bank_ids

Comment on lines +3242 to +3369
// Note: we cannot remove this slot from the slot cache until we've removed its
// entries from the accounts index first. This is because `scan_accounts()` relies on
// holding the index lock, finding the index entry, and then looking up the entry
// in the cache. If it fails to find that entry, it will panic in `get_loaded_account()`
if let Some(slot_cache) = self.accounts_cache.slot_cache(*remove_slot) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the issue described here was not an issue before because scan would never run concurrently with purge_slots_from_cache_and_store() because purge_slots_from_cache_and_store only ran via Bank::drop(), and a scan on a bank means it had a reference to said bank, and thus the drop() must not have run yet.

Now, however, because remove_unrooted_slots() can call into purge_slots_from_cache_and_store(), which can happen while a scan is running, we need to account for this case.

Comment on lines +3539 to +3667
{
let mut locked_removed_bank_ids = self.accounts_index.removed_bank_ids.lock().unwrap();
for (_slot, remove_bank_id) in remove_slots.iter() {
locked_removed_bank_ids.insert(*remove_bank_id);
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking down these bank's account states have been removed

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2021

Codecov Report

Merging #17471 (5973ab4) into master (0e162ba) will decrease coverage by 0.0%.
The diff coverage is 88.4%.

@@            Coverage Diff            @@
##           master   #17471     +/-   ##
=========================================
- Coverage    82.6%    82.6%   -0.1%     
=========================================
  Files         431      431             
  Lines      120996   121269    +273     
=========================================
+ Hits        99995   100197    +202     
- Misses      21001    21072     +71     

Comment thread client/src/rpc_custom_error.rs Outdated
data: None,
},
RpcCustomError::ScanError(scan_err) => Self {
code: ErrorCode::ServerError(JSON_RPC_REMOVED_SLOT),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know in this context the error is removed_slot? I noticed the enum has only SlotRemoved -- it makes sense. Maybe we should still do a further match on it so that is really that error nature incase the catalog is expanded.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed JSON_RPC_REMOVED_SLOT -> JSON_RPC_SCAN_ERROR to cover all possible scan errors.

Comment thread runtime/src/accounts_index.rs Outdated
}
}

// If the fork with tip at bank `scan_bank_id` was removed durin our scan, then the scan
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits, durin --> during

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how do we know it is really happened "during" scan? And not after it and before this check is made? If latter case can also happen, I am not sure what is the value of this multiple checks before and after the scan operation.

Copy link
Copy Markdown
Contributor Author

@carllin carllin Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check before the scan operation is to avoid incurring the cost of the scan if we already know that the slot has been dumped

The check after the scan operation is to account for slot dumping that occurred during the scan.

It's as you said, if the dump happens after the scan was finished and before this check is made, then we chalk this up to bad luck and still abort the results. I don't know of a great way to avoid this without blocking the remove during an ongoing scan. Also if a slot was aborted, its results are probably not usable to the caller anyways, so I think this should be ok.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the latter case can happen, which is after the scan and before the check, then it is also possible it can happen after we did the check and receive the negative answer and then it happens right after that while we already made decision to return a success. Would that have any correctness concern?

Copy link
Copy Markdown
Contributor Author

@carllin carllin Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lijunwangs correct that can happen as well. That should be ok because the results returned in that case are guaranteed to not have been corrupted by the slot dumping, which upholds our guarantee that scans should be consistent up to slot boundaries.

The client will get the results from a stale slot, but that's ok because if they are not using strong enough consistency queries in RPC, then rollback is to be expected.

Copy link
Copy Markdown
Contributor

@lijunwangs lijunwangs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

lijunwangs
lijunwangs previously approved these changes Jun 9, 2021
@mergify mergify Bot dismissed lijunwangs’s stale review June 9, 2021 09:34

Pull request has been modified.

@carllin carllin added the v1.7 label Jun 15, 2021
@carllin carllin merged commit ccc013e into solana-labs:master Jun 15, 2021
mergify Bot pushed a commit that referenced this pull request Jun 15, 2021
(cherry picked from commit ccc013e)

# Conflicts:
#	runtime/src/bank.rs
carllin added a commit that referenced this pull request Jun 21, 2021
mergify Bot added a commit that referenced this pull request Jun 22, 2021
(cherry picked from commit ccc013e)

Co-authored-by: carllin <carl@solana.com>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants