Handle removing slots during account scans #17471
carllin merged 12 commits into solana-labs:master
Conversation
| #[error("KeyExcludedFromSecondaryIndex")] | ||
| KeyExcludedFromSecondaryIndex { index_key: String }, | ||
| #[error("ScanError")] | ||
| ScanError(#[from] ScanError), |
@CriesofCarrots Would like your opinion on this proposed way of propagating the error up to the calling client and the associated logic in rpc.rs
I guess the question is, do these details need to be exposed to the calling client? Or would a generic internal_error be sufficient (along with logging the details on the server side)?
@CriesofCarrots yeah I didn't see any downside in exposing the exact details to the client, it's just another error anyways. This is a pretty exceptional case, so if a client complains about this condition, we know what the details are exactly without having to dig through logs.
Okay, then. If you're sure it's worth the churn, your error/result implementation looks a-okay!
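For reference, a minimal sketch of the propagation shape being discussed. The `thiserror` derives mirror the diff above, but the field names, the error-code value, and the `to_rpc_error` helper are simplified stand-ins, not the actual rpc.rs code:

```rust
use thiserror::Error;

// Simplified stand-ins for the runtime-side scan error and the accounts-layer
// error that wraps it (mirrors the `#[from]` conversion in the diff above).
#[derive(Error, Debug, Clone, PartialEq)]
pub enum ScanError {
    #[error("slot {slot} with bank id {bank_id} was removed while the scan was running")]
    SlotRemoved { slot: u64, bank_id: u64 },
}

#[derive(Error, Debug)]
pub enum AccountsError {
    #[error("ScanError")]
    ScanError(#[from] ScanError),
}

// Illustrative server-error code only; the real constant lives in the RPC crate.
const JSON_RPC_SCAN_ERROR: i64 = -32_012;

// The RPC layer maps the runtime error into a client-facing (code, message) pair,
// so the caller sees the exact failure details instead of a generic internal error.
fn to_rpc_error(err: AccountsError) -> (i64, String) {
    match err {
        AccountsError::ScanError(scan_err) => (JSON_RPC_SCAN_ERROR, scan_err.to_string()),
    }
}

fn main() {
    let err = AccountsError::from(ScanError::SlotRemoved { slot: 100, bank_id: 7 });
    println!("{:?}", to_rpc_error(err));
}
```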
Err(ScanError::SlotRemoved {
    slot: ancestors.max(),
    slot_id: scan_slot_id,
})
@CriesofCarrots this is where the new error is first created
// TODO: Figure out how to properly handle errors here
.unwrap_or_default();
@CriesofCarrots I'm not 100% sure of the best way to handle this yet, specifically where this method is called from process_rest() in impl RequestMiddleware for RpcRequestMiddleware. Is there a good way to plumb an error up from there?
Oh looks like that path in process_rest() always uses a root_bank, and only unrooted forks can be removed, so maybe it's safe to just unwrap there
Yeah, looks right to me.
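A tiny sketch of what the agreed-upon handling might look like at that call site; the function signature and message here are placeholders, not the actual rpc_service.rs code:

```rust
// Placeholder signature; the real call site is in `process_rest()`.
fn total_supply(scan_result: Result<u64, String>) -> u64 {
    // Only unrooted forks can be dumped by remove_unrooted_slots(), and this
    // path always scans a root bank, so a scan abort is unreachable here.
    scan_result.expect("scans on a root bank cannot be aborted by slot removal")
}
```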
if self.skip_drop.load(Relaxed) {
    return;
}
We remove this because now, even after a call to remove_unrooted_slot() on a bank, we still want to run purge_slot to remove the bank's entry from the newly introduced AccountsIndex::removed_bank_ids.
// Note: we cannot remove this slot from the slot cache until we've removed its
// entries from the accounts index first. This is because `scan_accounts()` relies on
// holding the index lock, finding the index entry, and then looking up the entry
// in the cache. If it fails to find that entry, it will panic in `get_loaded_account()`
if let Some(slot_cache) = self.accounts_cache.slot_cache(*remove_slot) {
Note: the issue described here was not a problem before, because a scan would never run concurrently with purge_slots_from_cache_and_store(); that function only ran via Bank::drop(), and a scan on a bank means the caller held a reference to that bank, so its drop() could not have run yet.
Now, however, because remove_unrooted_slots() can call into purge_slots_from_cache_and_store(), which can happen while a scan is running, we need to account for this case.
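An illustrative sketch (not the real AccountsDb code) of the ordering constraint being described: the purge removes the slot's index entries before touching the slot cache, so a concurrent scan that walks index entry → cache entry never finds a dangling index entry:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

type Slot = u64;
type Pubkey = [u8; 32];

// Simplified stand-ins for the accounts index and the write cache.
struct Db {
    index: Mutex<HashMap<Pubkey, HashSet<Slot>>>,
    cache: Mutex<HashMap<Slot, HashMap<Pubkey, Vec<u8>>>>,
}

impl Db {
    fn purge_slot_from_cache_and_index(&self, remove_slot: Slot) {
        // 1) Remove the slot's entries from the accounts index first...
        {
            let mut index = self.index.lock().unwrap();
            for slots in index.values_mut() {
                slots.remove(&remove_slot);
            }
        }
        // 2) ...and only then drop the slot from the cache. A scan does
        // "find index entry, then look up the cache entry"; if the cache
        // entry vanished first, the scan could find an index entry with no
        // backing data and panic (the `get_loaded_account()` case above).
        self.cache.lock().unwrap().remove(&remove_slot);
    }
}

fn main() {
    let db = Db {
        index: Mutex::new(HashMap::new()),
        cache: Mutex::new(HashMap::new()),
    };
    db.purge_slot_from_cache_and_index(5);
}
```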
{
    let mut locked_removed_bank_ids = self.accounts_index.removed_bank_ids.lock().unwrap();
    for (_slot, remove_bank_id) in remove_slots.iter() {
        locked_removed_bank_ids.insert(*remove_bank_id);
    }
}
Marking that these banks' account states have been removed.
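For clarity, a hedged sketch of the bookkeeping shown in the diff; the surrounding struct and method names are simplified, only `removed_bank_ids` mirrors the PR:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type Slot = u64;
type BankId = u64;

struct AccountsIndex {
    // Bank ids whose account state has been dumped by remove_unrooted_slots().
    removed_bank_ids: Mutex<HashSet<BankId>>,
}

impl AccountsIndex {
    // Called while dumping slots: record the bank ids being removed so that
    // in-flight and future scans against those banks can detect it and abort.
    fn mark_banks_removed(&self, remove_slots: &[(Slot, BankId)]) {
        let mut locked_removed_bank_ids = self.removed_bank_ids.lock().unwrap();
        for (_slot, remove_bank_id) in remove_slots.iter() {
            locked_removed_bank_ids.insert(*remove_bank_id);
        }
    }

    fn bank_was_removed(&self, bank_id: BankId) -> bool {
        self.removed_bank_ids.lock().unwrap().contains(&bank_id)
    }
}

fn main() {
    let index = AccountsIndex { removed_bank_ids: Mutex::new(HashSet::new()) };
    index.mark_banks_removed(&[(100, 7)]);
    assert!(index.bank_was_removed(7));
}
```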
Codecov Report
@@            Coverage Diff            @@
##           master   #17471    +/-   ##
=========================================
- Coverage    82.6%    82.6%    -0.1%
=========================================
  Files         431      431
  Lines      120996   121269    +273
=========================================
+ Hits        99995   100197    +202
- Misses      21001    21072     +71
    data: None,
},
RpcCustomError::ScanError(scan_err) => Self {
    code: ErrorCode::ServerError(JSON_RPC_REMOVED_SLOT),
How do we know in this context that the error is removed_slot? I noticed the enum has only SlotRemoved, so it makes sense for now. Maybe we should still do a further match on it to confirm it really is that error, in case the catalog of scan errors is expanded.
Renamed JSON_RPC_REMOVED_SLOT -> JSON_RPC_SCAN_ERROR to cover all possible scan errors.
}
}

// If the fork with tip at bank `scan_bank_id` was removed during our scan, then the scan
And how do we know it really happened "during" the scan, and not after it but before this check is made? If the latter case can also happen, I'm not sure what the value is of having multiple checks before and after the scan operation.
The check before the scan operation is to avoid incurring the cost of the scan if we already know that the slot has been dumped
The check after the scan operation is to account for slot dumping that occurred during the scan.
It's as you said, if the dump happens after the scan was finished and before this check is made, then we chalk this up to bad luck and still abort the results. I don't know of a great way to avoid this without blocking the remove during an ongoing scan. Also if a slot was aborted, its results are probably not usable to the caller anyways, so I think this should be ok.
If the latter case can happen, i.e. after the scan and before the check, then it's also possible the removal happens right after we did the check and got a negative answer, while we've already decided to return success. Would that have any correctness concern?
@lijunwangs correct that can happen as well. That should be ok because the results returned in that case are guaranteed to not have been corrupted by the slot dumping, which upholds our guarantee that scans should be consistent up to slot boundaries.
The client will get the results from a stale slot, but that's ok because if they are not using strong enough consistency queries in RPC, then rollback is to be expected.
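Putting the thread together, a hedged sketch (simplified types, not the exact accounts_db code) of the before/after check pattern: skip the scan if the bank was already dumped, abort the results if it was dumped while the scan was running, and accept that a dump landing after the second check cannot have corrupted results that were already fully produced:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type BankId = u64;

#[derive(Debug, PartialEq)]
enum ScanError {
    SlotRemoved { bank_id: BankId },
}

struct Index {
    removed_bank_ids: Mutex<HashSet<BankId>>,
}

impl Index {
    fn checked_scan(&self, scan_bank_id: BankId) -> Result<Vec<u64>, ScanError> {
        // Check 1 (before): don't pay for the scan at all if the fork tipped
        // at `scan_bank_id` has already been dumped.
        if self.removed_bank_ids.lock().unwrap().contains(&scan_bank_id) {
            return Err(ScanError::SlotRemoved { bank_id: scan_bank_id });
        }

        // Stand-in for walking the accounts index / cache.
        let results: Vec<u64> = (0..3).collect();

        // Check 2 (after): if the fork was dumped while the scan ran, the
        // results may be missing accounts or mixing state from another fork,
        // so throw them away and surface the error instead.
        if self.removed_bank_ids.lock().unwrap().contains(&scan_bank_id) {
            return Err(ScanError::SlotRemoved { bank_id: scan_bank_id });
        }
        Ok(results)
    }
}

fn main() {
    let index = Index { removed_bank_ids: Mutex::new(HashSet::new()) };
    assert!(index.checked_scan(7).is_ok());
    index.removed_bank_ids.lock().unwrap().insert(7);
    assert_eq!(index.checked_scan(7), Err(ScanError::SlotRemoved { bank_id: 7 }));
}
```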
(cherry picked from commit ccc013e) # Conflicts: # runtime/src/bank.rs
(cherry picked from commit ccc013e)
Problem
Built on top of #17269, addressing the TODO here: solana/runtime/src/accounts_db.rs, line 3411 (commit 3d0cacc).
While a scan on some fork at bank `B` is happening, the slots on that fork may be removed by remove_unrooted_slot(). This means the scan results will potentially be inconsistent or missing accounts.
Summary of Changes
If we detect a fork has been dumped during a scan, the scan result will be aborted. Important changes are summarized in comments below starting with #17471 (comment)
Fixes #