fix flaky ep tests #3683
📝 Walkthrough
This PR updates the Expert Parallel topology test suite to add explicit timeouts to distributed-process-group initialization, introduces a worker-result collection utility that polls with early-exit logic to prevent silent hangs, and consolidates error-spawning and result-gathering into reusable helpers used by mismatch validation tests.
Changes
Worker Initialization and Result Collection Refactor
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integrations/test_expert_parallel.py`:
- Around line 373-378: The loop that waits for worker results should break early
if any worker has crashed rather than only when all workers are dead: after
catching queue_mod.Empty, inspect procs for any process with a non-zero exitcode
(e.g., any(p.exitcode not in (None, 0) for p in procs)) and break if found;
otherwise keep the existing break when all(not p.is_alive() for p in procs).
Refer to the variables/results used in the diff (results, world_size, q,
queue_mod.Empty, procs, deadline) and update the timeout-handling branch to
check for non-zero exit codes first to implement the fail-fast behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 02279180-dc8f-4394-9a39-cfb1d639b3a3
📒 Files selected for processing (1)
tests/integrations/test_expert_parallel.py
while len(results) < world_size and time.monotonic() < deadline:
    try:
        results.append(q.get(timeout=5))
    except queue_mod.Empty:
        if all(not p.is_alive() for p in procs):
            break
Fail fast when any worker has already crashed.
This only breaks once all workers are dead. If one rank exits non-zero and another rank is stuck in init_process_group or teardown, the helper still waits until the full deadline, so the flake turns into a long hang again. Breaking on any observed non-zero exitcode preserves the intended early-exit behavior.
Suggested fix
 while len(results) < world_size and time.monotonic() < deadline:
     try:
         results.append(q.get(timeout=5))
     except queue_mod.Empty:
+        if any(p.exitcode not in (None, 0) for p in procs):
+            break
         if all(not p.is_alive() for p in procs):
             break
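For reference, the fail-fast polling loop from this review could be wrapped in a standalone helper along the following lines. This is only a sketch: the function name `collect_results` and the `timeout_s` and `poll_s` parameters are hypothetical and are not taken from the PR. It assumes `procs` are objects exposing `exitcode` and `is_alive()` (as `multiprocessing.Process` does) and that `q` raises `queue.Empty` when `get` times out.

```python
import queue as queue_mod
import time


def collect_results(q, procs, world_size, timeout_s=120.0, poll_s=5.0):
    """Gather up to world_size results from q (hypothetical helper name).

    Stops early when any worker has exited non-zero, or when every worker
    is dead, instead of waiting for the full deadline.
    """
    results = []
    deadline = time.monotonic() + timeout_s
    while len(results) < world_size and time.monotonic() < deadline:
        try:
            results.append(q.get(timeout=poll_s))
        except queue_mod.Empty:
            # Fail fast: a crashed rank can leave its peers blocked in
            # collective setup or teardown until the deadline expires.
            if any(p.exitcode not in (None, 0) for p in procs):
                break
            # Nothing left that could still produce a result.
            if all(not p.is_alive() for p in procs):
                break
    return results
```

A caller would then presumably compare `len(results)` against `world_size` and report the workers' exit codes when the list comes back short, so a crash surfaces as a clear failure rather than a timeout.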
Codecov Report
✅ All modified and coverable lines are covered by tests.
Description
Motivation and Context
How has this been tested?
AI Usage Disclaimer
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit