fix: re-enable test_scheduler_drop_idle by resolving race condition #9042
Remove #[ignore] attribute and fix timing to eliminate race condition in test_scheduler_drop_idle, which was disabled in PR anza-xyz#8278. The test verifies the scheduler pool's cleaner thread correctly removes idle schedulers while preserving recently-pooled ones.

Root cause: The original test used a 100ms idle threshold with 1000ms sleep, but timing was still unreliable due to system variations. The race occurred when old_scheduler and new_scheduler had unclear age differences.

Fix:
- Use explicit 300ms idle threshold for this test (instead of 100ms)
- Sleep 350ms before returning new_scheduler (provides 50ms+ safety margin)
- This guarantees old_scheduler is idle while new_scheduler is not

Result:
- old_scheduler: 350ms old (> 300ms threshold) → definitely idle → removed
- new_scheduler: ~0ms old (< 300ms threshold) → definitely not idle → kept
- Test is now deterministic and matches expected checkpoint sequence

Fixes anza-xyz#8279

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
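The Result arithmetic above can be sketched as a small model of the cleaner's decision. This is only an illustration, not the code in unified-scheduler-pool: the `PooledScheduler` struct, the `retain_non_idle` function, and the assumption that "idle" means "pooled for strictly longer than the threshold" are all hypothetical.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a pooled scheduler: just a label and the time
/// it has spent sitting in the pool.
#[derive(Debug, PartialEq)]
struct PooledScheduler {
    name: &'static str,
    pooled_for: Duration,
}

/// Hypothetical model of one cleaner pass: keep only the schedulers that have
/// NOT exceeded the idle threshold. Assumes a strict "greater than" comparison.
fn retain_non_idle(
    pool: Vec<PooledScheduler>,
    idle_threshold: Duration,
) -> Vec<PooledScheduler> {
    pool.into_iter()
        .filter(|scheduler| scheduler.pooled_for <= idle_threshold)
        .collect()
}
```

With a 300ms threshold, a pool holding old_scheduler at 350ms and new_scheduler at 0ms is reduced to new_scheduler alone, which matches the expected outcome listed above.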
steviez
left a comment
I didn't really dig into the test fully previously, but I'm a little dubious that tweaking sleep durations will not leave this open to still being flaky. Gonna tag @bw-solana in for this one tho
what i thought is that the sleep here isn’t really for syncing with the cleaner. that part still happens deterministically through
To be clear, the key is how much time can pass between the following LOC relative to the pooling duration:
https://github.com/anza-xyz/agave/blob/master/unified-scheduler-pool/src/lib.rs#L2910-L2925
The previous code only allows for up to 100ms while the new code allows for up to 300ms
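A rough way to picture that window, assuming the only thing that matters is how long the test thread stalls between returning new_scheduler to the pool and the cleaner's check (the function name and the strict comparison are hypothetical, not taken from the crate):

```rust
use std::time::Duration;

/// Hypothetical model: new_scheduler is still considered fresh only if the
/// stall between pooling it and the cleaner's check stays below the pooling
/// duration.
fn new_scheduler_survives(stall: Duration, max_pooling_duration: Duration) -> bool {
    stall < max_pooling_duration
}
```

Under this model, a 150ms stall on a loaded CI machine would break the expectation with the old 100ms pooling duration but not with 300ms, so the change widens the tolerated window rather than removing the timing dependence.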
- Move SHORTENED_MAX_POOLING_DURATION constant to test_scheduler_drop_stale where it's used
- Convert test_max_pooling_duration from let to const TEST_MAX_POOLING_DURATION following conventions
- Create TEST_WAIT_FOR_IDLE constant (500ms) with explicit safety margin tied to TEST_MAX_POOLING_DURATION
- Increase safety margin from 50ms to 200ms for CI reliability
- Add detailed comments explaining 300ms pooling duration tradeoff (speed vs race condition window)
- Update assertion comment with explicit timing guarantees: old_scheduler ~500ms old (exceeds 300ms threshold), new_scheduler ~50ms old (below threshold)

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
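One way the two constants could be tied together is sketched below. TEST_MAX_POOLING_DURATION and TEST_WAIT_FOR_IDLE are the names from the commit message; the millisecond helper constants and the exact way the values are derived are assumptions for illustration, not necessarily what the PR does.

```rust
use std::time::Duration;

// Hypothetical helper constants, in milliseconds.
const TEST_MAX_POOLING_MS: u64 = 300;
const TEST_SAFETY_MARGIN_MS: u64 = 200;

// Idle threshold used by the test.
const TEST_MAX_POOLING_DURATION: Duration = Duration::from_millis(TEST_MAX_POOLING_MS);

// Wait time derived from the threshold plus the margin (300ms + 200ms = 500ms),
// so changing the threshold cannot silently shrink the margin.
const TEST_WAIT_FOR_IDLE: Duration =
    Duration::from_millis(TEST_MAX_POOLING_MS + TEST_SAFETY_MARGIN_MS);
```

Deriving the wait from the threshold keeps the invariant TEST_WAIT_FOR_IDLE > TEST_MAX_POOLING_DURATION visible in one place instead of relying on two independent literals.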
…curacy Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
🫡
Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #9042     +/-   ##
=========================================
- Coverage    82.6%    82.6%   -0.1%
=========================================
  Files         889      890      +1
  Lines      320898   320997     +99
=========================================
+ Hits       265260   265301     +41
- Misses      55638    55696     +58
@bw-solana there was a formatting issue, fixed it.
…nza-xyz#9042)

* fix: re-enable test_scheduler_drop_idle by resolving race condition

Remove #[ignore] attribute and fix timing to eliminate race condition in test_scheduler_drop_idle, which was disabled in PR anza-xyz#8278. The test verifies the scheduler pool's cleaner thread correctly removes idle schedulers while preserving recently-pooled ones.

Root cause: The original test used a 100ms idle threshold with 1000ms sleep, but timing was still unreliable due to system variations. The race occurred when old_scheduler and new_scheduler had unclear age differences.

Fix:
- Use explicit 300ms idle threshold for this test (instead of 100ms)
- Sleep 350ms before returning new_scheduler (provides 50ms+ safety margin)
- This guarantees old_scheduler is idle while new_scheduler is not

Result:
- old_scheduler: 350ms old (> 300ms threshold) → definitely idle → removed
- new_scheduler: ~0ms old (< 300ms threshold) → definitely not idle → kept
- Test is now deterministic and matches expected checkpoint sequence

Fixes anza-xyz#8279

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>

* Address all reviewer feedback on test_scheduler_drop_idle

- Move SHORTENED_MAX_POOLING_DURATION constant to test_scheduler_drop_stale where it's used
- Convert test_max_pooling_duration from let to const TEST_MAX_POOLING_DURATION following conventions
- Create TEST_WAIT_FOR_IDLE constant (500ms) with explicit safety margin tied to TEST_MAX_POOLING_DURATION
- Increase safety margin from 50ms to 200ms for CI reliability
- Add detailed comments explaining 300ms pooling duration tradeoff (speed vs race condition window)
- Update assertion comment with explicit timing guarantees: old_scheduler ~500ms old (exceeds 300ms threshold), new_scheduler ~50ms old (below threshold)

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>

* refactor: tie test duration constants together and improve comment accuracy

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>

* fix: cargo fmt

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>

---------

Signed-off-by: AvhiMaz <avhimazumder5@outlook.com>
Problem
The test test_scheduler_drop_idle was disabled in PR #8278 due to a race condition that caused intermittent CI failures. This left a gap in test coverage for the scheduler pool's idle scheduler cleanup logic.

Summary of Changes
- Remove the #[ignore] attribute to re-enable the test

The adjusted timing eliminates the race condition by providing a 50ms+ safety margin between the idle threshold and actual scheduler ages.

Fixes #8279