Skip to content

fix flaky ep tests#3683

Merged
winglian merged 1 commit into
mainfrom
flaky-parallel-tests
May 26, 2026
Merged

fix flaky ep tests#3683
winglian merged 1 commit into
mainfrom
flaky-parallel-tests

Conversation

@winglian

@winglian winglian commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • Tests
    • Enhanced test reliability for distributed process topology validation by improving error-handling mechanisms and preventing potential test hangs during worker result collection.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR updates the Expert Parallel topology test suite to add explicit timeouts to distributed-process-group initialization, introduces a worker-result collection utility that polls with early-exit logic to prevent silent hangs, and consolidates error-spawning and result-gathering into reusable helpers used by mismatch validation tests.

Changes

Worker Initialization and Result Collection Refactor

Layer / File(s) Summary
Timeout Configuration in Worker Initialization
tests/integrations/test_expert_parallel.py
Module imports add queue (aliased as queue_mod), time, and timedelta. Both _ep_topology_worker and _ep_topology_worker_expects_error functions are updated to pass an explicit timeout=timedelta(seconds=120) to dist.init_process_group.
Worker Result Collection Utility
tests/integrations/test_expert_parallel.py
_collect_worker_results(...) helper polls the queue per worker with short 5-second timeouts and a monotonic deadline, exits early when all worker processes terminate, joins processes, asserts collected result count matches worker count, and verifies all worker exit codes are zero.
Error Spawning Helper and Test Integration
tests/integrations/test_expert_parallel.py
_spawn_expects_error(...) centralizes spawning _ep_topology_worker_expects_error across ranks and integrates _collect_worker_results for result gathering. test_world4_ep2_dp1_invalid_product_raises and test_mesh_axis_product_mismatch_raises are refactored to use this helper instead of manually calling q.get(...) for all ranks.

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix flaky ep tests' directly addresses the main objective of the changeset, which is fixing flaky expert parallel (EP) tests by adding timeouts, refactoring result collection, and centralizing worker spawning logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch flaky-parallel-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integrations/test_expert_parallel.py`:
- Around line 373-378: The loop that waits for worker results should break early
if any worker has crashed rather than only when all workers are dead: after
catching queue_mod.Empty, inspect procs for any process with a non-zero exitcode
(e.g., any(p.exitcode not in (None, 0) for p in procs)) and break if found;
otherwise keep the existing break when all(not p.is_alive() for p in procs).
Refer to the variables/results used in the diff (results, world_size, q,
queue_mod.Empty, procs, deadline) and update the timeout-handling branch to
check for non-zero exit codes first to implement the fail-fast behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02279180-dc8f-4394-9a39-cfb1d639b3a3

📥 Commits

Reviewing files that changed from the base of the PR and between b05ab9a and f5c7baa.

📒 Files selected for processing (1)
  • tests/integrations/test_expert_parallel.py

Comment on lines +373 to +378
while len(results) < world_size and time.monotonic() < deadline:
try:
results.append(q.get(timeout=5))
except queue_mod.Empty:
if all(not p.is_alive() for p in procs):
break

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when any worker has already crashed.

This only breaks once all workers are dead. If one rank exits non-zero and another rank is stuck in init_process_group or teardown, the helper still waits until the full deadline, so the flake turns into a long hang again. Breaking on any observed non-zero exitcode preserves the intended early-exit behavior.

Suggested fix
     while len(results) < world_size and time.monotonic() < deadline:
         try:
             results.append(q.get(timeout=5))
         except queue_mod.Empty:
+            if any(p.exitcode not in (None, 0) for p in procs):
+                break
             if all(not p.is_alive() for p in procs):
                 break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integrations/test_expert_parallel.py` around lines 373 - 378, The loop
that waits for worker results should break early if any worker has crashed
rather than only when all workers are dead: after catching queue_mod.Empty,
inspect procs for any process with a non-zero exitcode (e.g., any(p.exitcode not
in (None, 0) for p in procs)) and break if found; otherwise keep the existing
break when all(not p.is_alive() for p in procs). Refer to the variables/results
used in the diff (results, world_size, q, queue_mod.Empty, procs, deadline) and
update the timeout-handling branch to check for non-zero exit codes first to
implement the fail-fast behavior.

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label May 26, 2026
@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@winglian winglian merged commit 3c4ff59 into main May 26, 2026
17 checks passed
@winglian winglian deleted the flaky-parallel-tests branch May 26, 2026 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scheduled_release This PR is slated for the upcoming release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant