Skip to content

Resolve intermittent test failure in test_store_scan_consistency_root#8486

Merged
roryharr merged 1 commit intoanza-xyz:masterfrom
roryharr:resolve_scan_consistency_intermittent_failures
Oct 17, 2025
Merged

Resolve intermittent test failure in test_store_scan_consistency_root#8486
roryharr merged 1 commit intoanza-xyz:masterfrom
roryharr:resolve_scan_consistency_intermittent_failures

Conversation

@roryharr
Copy link
Copy Markdown

@roryharr roryharr commented Oct 15, 2025

Problem

There's a race condition between bank drop and process_dead_slots.

        // Remove dead slots from the accounts index root tracker
        self.remove_dead_slots_metadata(dead_slots.iter());

        clean_dead_slots.stop();
        let mut purge_removed_slots = Measure::start("reclaims::purge_removed_slots");
        self.purge_dead_slots_from_storage(dead_slots.iter(), purge_stats);
        purge_removed_slots.stop();

With a sleep added to make the issue more likely, this test fails every time.

  1. remove_dead_slots_metadata removes the slot from the alive roots list,
  2. Bank Drop is called and attempts to purge its slot if its unrooted. In this case it sees that the account isn't on the alives root list and attempts to purge it
  3. During the purge, there is nothing to reclaim, since it was already reclaimed by whatever is removing dead slots
    purge_slot_storage fails because
        // After handling the reclaimed entries, this slot's
        // storage entries should be purged from self.storage
        assert!(
            self.storage.get_slot_storage_entry(remove_slot).is_none(),
            "slot {remove_slot} is not none"
        );

During normal operation this isn't an issue as the AccountsBackgroundService handles all Bank Drops, which is done in the same thread as remove_dead_slots_metadata.

Summary of Changes

  • Copy fix from test_store_scan_consistency_unrooted
  • Checked all other tests using test_store_scan_consistency_root with none, and did not see any failures.
  • It might make sense to always require a bank drop callback in Bank, but that is a separate refactor.

Fixes #

@roryharr roryharr force-pushed the resolve_scan_consistency_intermittent_failures branch from 6fb9065 to 8836e7f Compare October 15, 2025 00:40
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.2%. Comparing base (a9eee69) to head (8836e7f).
⚠️ Report is 63 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #8486    +/-   ##
========================================
  Coverage    83.2%    83.2%            
========================================
  Files         846      846            
  Lines      369251   369259     +8     
========================================
+ Hits       307229   307342   +113     
+ Misses      62022    61917   -105     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@roryharr roryharr marked this pull request as ready for review October 15, 2025 17:04
Copy link
Copy Markdown

@HaoranYi HaoranYi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

Copy link
Copy Markdown

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks correct. I think it'll be valuable to look back and figure out why this was None originally.

Comment thread runtime/src/bank/tests.rs
pruned_banks_request_handler.handle_request(&current_bank);
}
},
None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this None originally though?

Copy link
Copy Markdown
Author

@roryharr roryharr Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time on that. All the tests were originally None, but unrooted_slot was updated a long time ago during a refactor: solana-labs#17319

I checked out that changeset and got it to build. test_store_scan_consistency_root was passing at that time even with the added delay. Sometime between then and now something changed.

@roryharr roryharr requested a review from brooksprumo October 15, 2025 21:40
Copy link
Copy Markdown

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@roryharr roryharr added this pull request to the merge queue Oct 17, 2025
Merged via the queue into anza-xyz:master with commit 87c9c6c Oct 17, 2025
43 checks passed
@roryharr roryharr deleted the resolve_scan_consistency_intermittent_failures branch October 17, 2025 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants