
fix: re-enable test_scheduler_drop_idle by resolving race condition#9042

Merged
bw-solana merged 7 commits into anza-xyz:master from AvhiMaz:fix/re-enable-test-scheduler-drop-idle
Nov 19, 2025

Conversation


@AvhiMaz AvhiMaz commented Nov 12, 2025

Problem

The test test_scheduler_drop_idle was disabled in PR #8278 due to a race condition that caused intermittent CI failures. This left a gap in test coverage for the scheduler pool's idle scheduler cleanup logic.

Summary of Changes

  • Removed #[ignore] attribute to re-enable the test
  • Fixed the race condition by using an explicit 300ms idle threshold (instead of 100ms)
  • Increased sleep duration to 350ms to provide a safe timing margin
  • Improved comments explaining the timing guarantees

This eliminates the race condition by providing a 50ms+ safety margin between the idle threshold and actual scheduler ages.

Fixes #8279

  Remove #[ignore] attribute and fix timing to eliminate race condition
  in test_scheduler_drop_idle, which was disabled in PR anza-xyz#8278.

  The test verifies the scheduler pool's cleaner thread correctly removes
  idle schedulers while preserving recently-pooled ones.

  Root cause:
  The original test used a 100ms idle threshold with a 1000ms sleep, but
  timing was still unreliable under system load. The race occurred when the
  ages of old_scheduler and new_scheduler were not clearly separated
  relative to the threshold.

  Fix:
  - Use explicit 300ms idle threshold for this test (instead of 100ms)
  - Sleep 350ms before returning new_scheduler (provides 50ms+ safety margin)
  - This guarantees old_scheduler is idle while new_scheduler is not

  Result:
  - old_scheduler: 350ms old (> 300ms threshold) → definitely idle → removed
  - new_scheduler: ~0ms old (< 300ms threshold) → definitely not idle → kept
  - Test is now deterministic and matches expected checkpoint sequence

  Fixes anza-xyz#8279

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
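The timing guarantee described above can be sketched as a standalone age check. This is an illustrative model only: the `is_idle` helper and constant names are assumptions for the sketch, not the actual scheduler pool code.

```rust
use std::time::{Duration, Instant};

// Illustrative values from the description: 300ms idle threshold, 350ms sleep.
const IDLE_THRESHOLD: Duration = Duration::from_millis(300);
const SLEEP_BEFORE_NEW: Duration = Duration::from_millis(350);

// A scheduler counts as idle once it has sat in the pool longer than the threshold.
fn is_idle(pooled_at: Instant, now: Instant) -> bool {
    now.duration_since(pooled_at) > IDLE_THRESHOLD
}

fn main() {
    let t0 = Instant::now();
    // old_scheduler pooled at t0; new_scheduler pooled SLEEP_BEFORE_NEW later.
    let old_pooled = t0;
    let new_pooled = t0 + SLEEP_BEFORE_NEW;
    // The cleaner runs right after the new scheduler is pooled.
    let now = new_pooled;

    assert!(is_idle(old_pooled, now)); // 350ms old > 300ms threshold -> removed
    assert!(!is_idle(new_pooled, now)); // ~0ms old < 300ms threshold -> kept
    println!("old removed, new kept");
}
```

The 50ms gap between sleep and threshold is what the fix relies on: any clock jitter smaller than that gap cannot flip either decision.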
@mergify mergify Bot requested a review from a team November 12, 2025 11:25

@steviez steviez left a comment


I didn't really dig into the test fully previously, but I'm a little dubious that tweaking sleep durations will not leave this open to still being flaky. Gonna tag @bw-solana in for this one tho

@steviez steviez requested a review from bw-solana November 12, 2025 15:17

AvhiMaz commented Nov 12, 2025

I didn't really dig into the test fully previously, but I'm a little dubious that tweaking sleep durations will not leave this open to still being flaky. Gonna tag @bw-solana in for this one tho

what i thought is that the sleep here isn’t really for syncing with the cleaner. that part still happens deterministically through sleepless_testing checkpoints (BeforeIdleSchedulerCleaned / AfterIdleSchedulerCleaned). The 350 ms wait just creates a clear age gap between the old and new schedulers (300 ms idle threshold vs 350 ms sleep), so the cleaner reliably treats only the old one as idle. should hold up fine under CI, but happy to make adjustments if there’s a cleaner approach.

@ryoqun ryoqun self-requested a review November 16, 2025 16:16
Comment thread unified-scheduler-pool/src/lib.rs Outdated
@bw-solana

what i thought is that the sleep here isn’t really for syncing with the cleaner. that part still happens deterministically through sleepless_testing checkpoints (BeforeIdleSchedulerCleaned / AfterIdleSchedulerCleaned). The 350 ms wait just creates a clear age gap between the old and new schedulers (300 ms idle threshold vs 350 ms sleep), so the cleaner reliably treats only the old one as idle. should hold up fine under CI, but happy to make adjustments if there’s a cleaner approach.

To be clear, the key is how much time can pass between the following LOC relative to the pooling duration: https://github.com/anza-xyz/agave/blob/master/unified-scheduler-pool/src/lib.rs#L2910-L2925

The previous code only allowed up to 100ms to elapse there, while the new code allows up to 300ms.

AvhiMaz and others added 2 commits November 17, 2025 12:52
   - Move SHORTENED_MAX_POOLING_DURATION constant to test_scheduler_drop_stale where it's used
   - Convert test_max_pooling_duration from let to const TEST_MAX_POOLING_DURATION following conventions
   - Create TEST_WAIT_FOR_IDLE constant (500ms) with explicit safety margin tied to TEST_MAX_POOLING_DURATION
   - Increase safety margin from 50ms to 200ms for CI reliability
   - Add detailed comments explaining 300ms pooling duration tradeoff (speed vs race condition window)
   - Update assertion comment with explicit timing guarantees: old_scheduler ~500ms old (exceeds 300ms threshold),
   new_scheduler ~50ms old (below threshold)

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
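The reviewer-feedback changes above amount to deriving the wait duration from the pooling duration plus an explicit safety margin, so the two can never drift apart. A minimal sketch of that pattern (constant names taken from the bullet list; the derivation, not the pool code, is what's shown):

```rust
use std::time::Duration;

// Assumed values from the description: 300ms pooling duration, 200ms margin.
const TEST_MAX_POOLING_DURATION_MS: u64 = 300;
const SAFETY_MARGIN_MS: u64 = 200;

const TEST_MAX_POOLING_DURATION: Duration =
    Duration::from_millis(TEST_MAX_POOLING_DURATION_MS);
// Derived, not hardcoded: wait = pooling duration + safety margin = 500ms.
const TEST_WAIT_FOR_IDLE: Duration =
    Duration::from_millis(TEST_MAX_POOLING_DURATION_MS + SAFETY_MARGIN_MS);

fn main() {
    // The margin absorbs CI scheduling jitter: up to SAFETY_MARGIN_MS of delay
    // between pooling and the cleaner's age check cannot flip the idle decision.
    assert!(TEST_WAIT_FOR_IDLE > TEST_MAX_POOLING_DURATION);
    assert_eq!(
        TEST_WAIT_FOR_IDLE - TEST_MAX_POOLING_DURATION,
        Duration::from_millis(SAFETY_MARGIN_MS)
    );
    println!("wait = {TEST_WAIT_FOR_IDLE:?}, threshold = {TEST_MAX_POOLING_DURATION:?}");
}
```

Tying the constants together this way means a future change to the pooling duration automatically widens or narrows the wait, instead of silently reopening the race window.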
@AvhiMaz AvhiMaz requested a review from bw-solana November 17, 2025 07:23
Comment thread unified-scheduler-pool/src/lib.rs Outdated
@AvhiMaz AvhiMaz requested a review from bw-solana November 17, 2025 18:29
bw-solana previously approved these changes Nov 18, 2025

@bw-solana bw-solana left a comment


LGTM


AvhiMaz commented Nov 19, 2025

LGTM

🫡

@AvhiMaz AvhiMaz requested a review from steviez November 19, 2025 02:25
@bw-solana bw-solana added the CI Pull Request is ready to enter CI label Nov 19, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Nov 19, 2025
Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.6%. Comparing base (7d07dd3) to head (a037011).
⚠️ Report is 25 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #9042     +/-   ##
=========================================
- Coverage    82.6%    82.6%   -0.1%     
=========================================
  Files         889      890      +1     
  Lines      320898   320997     +99     
=========================================
+ Hits       265260   265301     +41     
- Misses      55638    55696     +58     


AvhiMaz commented Nov 19, 2025

@bw-solana there was a formatting issue, fixed it.

@bw-solana bw-solana added this pull request to the merge queue Nov 19, 2025
Merged via the queue into anza-xyz:master with commit 35f5333 Nov 19, 2025
47 checks passed
@AvhiMaz AvhiMaz deleted the fix/re-enable-test-scheduler-drop-idle branch November 19, 2025 21:53
rustopian pushed a commit to rustopian/agave that referenced this pull request Nov 20, 2025
AvhiMaz added a commit to AvhiMaz/agave that referenced this pull request Nov 28, 2025


Development

Successfully merging this pull request may close these issues.

Re-enable test_scheduler_drop_idle

5 participants