
fix: re-enable test_scheduler_drop_idle by resolving race condition#9042

Merged
bw-solana merged 7 commits into anza-xyz:master from AvhiMaz:fix/re-enable-test-scheduler-drop-idle
Nov 19, 2025

Conversation


@AvhiMaz AvhiMaz commented Nov 12, 2025

Problem

The test test_scheduler_drop_idle was disabled in PR #8278 due to a race condition that caused intermittent CI failures. This left a gap in test coverage for the scheduler pool's idle scheduler cleanup logic.

Summary of Changes

  • Removed #[ignore] attribute to re-enable the test
  • Fixed the race condition by using an explicit 300ms idle threshold (instead of 100ms)
  • Increased sleep duration to 350ms to provide a safe timing margin
  • Improved comments explaining the timing guarantees

This eliminates the race condition by providing a 50ms+ safety margin between the idle threshold and actual scheduler ages.

Fixes #8279

  Remove #[ignore] attribute and fix timing to eliminate race condition
  in test_scheduler_drop_idle, which was disabled in PR anza-xyz#8278.

  The test verifies the scheduler pool's cleaner thread correctly removes
  idle schedulers while preserving recently-pooled ones.

  Root cause:
  The original test used a 100ms idle threshold with a 1000ms sleep, but
  timing was still unreliable under system load. The race occurred when the
  ages of old_scheduler and new_scheduler were not clearly separated
  relative to the threshold.

  Fix:
  - Use explicit 300ms idle threshold for this test (instead of 100ms)
  - Sleep 350ms before returning new_scheduler (provides 50ms+ safety margin)
  - This guarantees old_scheduler is idle while new_scheduler is not

  Result:
  - old_scheduler: 350ms old (> 300ms threshold) → definitely idle → removed
  - new_scheduler: ~0ms old (< 300ms threshold) → definitely not idle → kept
  - Test is now deterministic and matches expected checkpoint sequence

  Fixes anza-xyz#8279

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
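The timing guarantee described above can be sketched as a standalone age check. This is an illustrative model only: the `is_idle` helper and constant names are assumptions for the sketch, not the actual scheduler pool code.

```rust
use std::time::{Duration, Instant};

// Illustrative values from the description: 300ms idle threshold, 350ms sleep.
const IDLE_THRESHOLD: Duration = Duration::from_millis(300);
const SLEEP_BEFORE_NEW: Duration = Duration::from_millis(350);

// A scheduler counts as idle once it has sat in the pool longer than the threshold.
fn is_idle(pooled_at: Instant, now: Instant) -> bool {
    now.duration_since(pooled_at) > IDLE_THRESHOLD
}

fn main() {
    let t0 = Instant::now();
    // old_scheduler pooled at t0; new_scheduler pooled SLEEP_BEFORE_NEW later.
    let old_pooled = t0;
    let new_pooled = t0 + SLEEP_BEFORE_NEW;
    // The cleaner runs right after the new scheduler is pooled.
    let now = new_pooled;

    assert!(is_idle(old_pooled, now)); // 350ms old > 300ms threshold -> removed
    assert!(!is_idle(new_pooled, now)); // ~0ms old < 300ms threshold -> kept
    println!("old removed, new kept");
}
```

The 50ms gap between sleep and threshold is what the fix relies on: any clock jitter smaller than that gap cannot flip either decision.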
@mergify mergify Bot requested a review from a team November 12, 2025 11:25

@steviez steviez left a comment


I didn't really dig into the test fully previously, but I'm a little dubious that tweaking sleep durations will not leave this open to still being flaky. Gonna tag @bw-solana in for this one tho

@steviez steviez requested a review from bw-solana November 12, 2025 15:17

AvhiMaz commented Nov 12, 2025

I didn't really dig into the test fully previously, but I'm a little dubious that tweaking sleep durations will not leave this open to still being flaky. Gonna tag @bw-solana in for this one tho

what i thought is that the sleep here isn’t really for syncing with the cleaner. that part still happens deterministically through sleepless_testing checkpoints (BeforeIdleSchedulerCleaned / AfterIdleSchedulerCleaned). The 350 ms wait just creates a clear age gap between the old and new schedulers (300 ms idle threshold vs 350 ms sleep), so the cleaner reliably treats only the old one as idle. should hold up fine under CI, but happy to make adjustments if there’s a cleaner approach.

@ryoqun ryoqun self-requested a review November 16, 2025 16:16
Comment thread unified-scheduler-pool/src/lib.rs Outdated
@bw-solana

what i thought is that the sleep here isn’t really for syncing with the cleaner. that part still happens deterministically through sleepless_testing checkpoints (BeforeIdleSchedulerCleaned / AfterIdleSchedulerCleaned). The 350 ms wait just creates a clear age gap between the old and new schedulers (300 ms idle threshold vs 350 ms sleep), so the cleaner reliably treats only the old one as idle. should hold up fine under CI, but happy to make adjustments if there’s a cleaner approach.

To be clear, the key is how much time can pass between the following LOC relative to the pooling duration: https://github.com/anza-xyz/agave/blob/master/unified-scheduler-pool/src/lib.rs#L2910-L2925

The previous code only allowed up to 100ms to elapse there, while the new code allows up to 300ms.

AvhiMaz and others added 2 commits November 17, 2025 12:52
   - Move SHORTENED_MAX_POOLING_DURATION constant to test_scheduler_drop_stale where it's used
   - Convert test_max_pooling_duration from let to const TEST_MAX_POOLING_DURATION following conventions
   - Create TEST_WAIT_FOR_IDLE constant (500ms) with explicit safety margin tied to TEST_MAX_POOLING_DURATION
   - Increase safety margin from 50ms to 200ms for CI reliability
   - Add detailed comments explaining 300ms pooling duration tradeoff (speed vs race condition window)
   - Update assertion comment with explicit timing guarantees: old_scheduler ~500ms old (exceeds 300ms threshold),
   new_scheduler ~50ms old (below threshold)

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
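The reviewer-feedback changes above amount to deriving the wait duration from the pooling duration plus an explicit safety margin, so the two can never drift apart. A minimal sketch of that pattern (constant names taken from the bullet list; the derivation, not the pool code, is what's shown):

```rust
use std::time::Duration;

// Assumed values from the description: 300ms pooling duration, 200ms margin.
const TEST_MAX_POOLING_DURATION_MS: u64 = 300;
const SAFETY_MARGIN_MS: u64 = 200;

const TEST_MAX_POOLING_DURATION: Duration =
    Duration::from_millis(TEST_MAX_POOLING_DURATION_MS);
// Derived, not hardcoded: wait = pooling duration + safety margin = 500ms.
const TEST_WAIT_FOR_IDLE: Duration =
    Duration::from_millis(TEST_MAX_POOLING_DURATION_MS + SAFETY_MARGIN_MS);

fn main() {
    // The margin absorbs CI scheduling jitter: up to SAFETY_MARGIN_MS of delay
    // between pooling and the cleaner's age check cannot flip the idle decision.
    assert!(TEST_WAIT_FOR_IDLE > TEST_MAX_POOLING_DURATION);
    assert_eq!(
        TEST_WAIT_FOR_IDLE - TEST_MAX_POOLING_DURATION,
        Duration::from_millis(SAFETY_MARGIN_MS)
    );
    println!("wait = {TEST_WAIT_FOR_IDLE:?}, threshold = {TEST_MAX_POOLING_DURATION:?}");
}
```

Tying the constants together this way means a future change to the pooling duration automatically widens or narrows the wait, instead of silently reopening the race window.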
@AvhiMaz AvhiMaz requested a review from bw-solana November 17, 2025 07:23
Comment thread unified-scheduler-pool/src/lib.rs Outdated
@AvhiMaz AvhiMaz requested a review from bw-solana November 17, 2025 18:29
bw-solana previously approved these changes Nov 18, 2025

@bw-solana bw-solana left a comment


LGTM


AvhiMaz commented Nov 19, 2025

LGTM

🫡

@AvhiMaz AvhiMaz requested a review from steviez November 19, 2025 02:25
@bw-solana bw-solana added the CI Pull Request is ready to enter CI label Nov 19, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Nov 19, 2025
Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.6%. Comparing base (7d07dd3) to head (a037011).
⚠️ Report is 25 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #9042     +/-   ##
=========================================
- Coverage    82.6%    82.6%   -0.1%     
=========================================
  Files         889      890      +1     
  Lines      320898   320997     +99     
=========================================
+ Hits       265260   265301     +41     
- Misses      55638    55696     +58     


AvhiMaz commented Nov 19, 2025

@bw-solana there was a formatting issue, fixed it.

@bw-solana bw-solana added this pull request to the merge queue Nov 19, 2025
Merged via the queue into anza-xyz:master with commit 35f5333 Nov 19, 2025
47 checks passed
@AvhiMaz AvhiMaz deleted the fix/re-enable-test-scheduler-drop-idle branch November 19, 2025 21:53
rustopian pushed a commit to rustopian/agave that referenced this pull request Nov 20, 2025
AvhiMaz added a commit to AvhiMaz/agave that referenced this pull request Nov 28, 2025


Development

Successfully merging this pull request may close these issues.

Re-enable test_scheduler_drop_idle

5 participants