fix flaky ep tests by winglian · Pull Request #3683 · axolotl-ai-cloud/axolotl

winglian · 2026-05-26T13:17:15Z

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Tests
- Enhanced test reliability for distributed process topology validation by improving error-handling mechanisms and preventing potential test hangs during worker result collection.

coderabbitai · 2026-05-26T13:17:30Z

📝 Walkthrough

Walkthrough

This PR updates the Expert Parallel topology test suite to add explicit timeouts to distributed-process-group initialization, introduces a worker-result collection utility that polls with early-exit logic to prevent silent hangs, and consolidates error-spawning and result-gathering into reusable helpers used by mismatch validation tests.

Changes

Worker Initialization and Result Collection Refactor

Layer / File(s)	Summary
Timeout Configuration in Worker Initialization `tests/integrations/test_expert_parallel.py`	Module imports add `queue` (aliased as `queue_mod`), `time`, and `timedelta`. Both `_ep_topology_worker` and `_ep_topology_worker_expects_error` functions are updated to pass an explicit `timeout=timedelta(seconds=120)` to `dist.init_process_group`.
Worker Result Collection Utility `tests/integrations/test_expert_parallel.py`	`_collect_worker_results(...)` helper polls the queue per worker with short 5-second timeouts and a monotonic deadline, exits early when all worker processes terminate, joins processes, asserts collected result count matches worker count, and verifies all worker exit codes are zero.
Error Spawning Helper and Test Integration `tests/integrations/test_expert_parallel.py`	`_spawn_expects_error(...)` centralizes spawning `_ep_topology_worker_expects_error` across ranks and integrates `_collect_worker_results` for result gathering. `test_world4_ep2_dp1_invalid_product_raises` and `test_mesh_axis_product_mismatch_raises` are refactored to use this helper instead of manually calling `q.get(...)` for all ranks.

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix flaky ep tests' directly addresses the main objective of the changeset, which is fixing flaky expert parallel (EP) tests by adding timeouts, refactoring result collection, and centralizing worker spawning logic.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch flaky-parallel-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integrations/test_expert_parallel.py`:
- Around line 373-378: The loop that waits for worker results should break early
if any worker has crashed rather than only when all workers are dead: after
catching queue_mod.Empty, inspect procs for any process with a non-zero exitcode
(e.g., any(p.exitcode not in (None, 0) for p in procs)) and break if found;
otherwise keep the existing break when all(not p.is_alive() for p in procs).
Refer to the variables/results used in the diff (results, world_size, q,
queue_mod.Empty, procs, deadline) and update the timeout-handling branch to
check for non-zero exit codes first to implement the fail-fast behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02279180-dc8f-4394-9a39-cfb1d639b3a3

📥 Commits

Reviewing files that changed from the base of the PR and between b05ab9a and f5c7baa.

📒 Files selected for processing (1)

tests/integrations/test_expert_parallel.py

coderabbitai · 2026-05-26T13:21:11Z

+    while len(results) < world_size and time.monotonic() < deadline:
+        try:
+            results.append(q.get(timeout=5))
+        except queue_mod.Empty:
+            if all(not p.is_alive() for p in procs):
+                break


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when any worker has already crashed.

This only breaks once all workers are dead. If one rank exits non-zero and another rank is stuck in init_process_group or teardown, the helper still waits until the full deadline, so the flake turns into a long hang again. Breaking on any observed non-zero exitcode preserves the intended early-exit behavior.

Suggested fix

while len(results) < world_size and time.monotonic() < deadline: try: results.append(q.get(timeout=5)) except queue_mod.Empty: + if any(p.exitcode not in (None, 0) for p in procs): + break if all(not p.is_alive() for p in procs): break

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integrations/test_expert_parallel.py` around lines 373 - 378, The loop that waits for worker results should break early if any worker has crashed rather than only when all workers are dead: after catching queue_mod.Empty, inspect procs for any process with a non-zero exitcode (e.g., any(p.exitcode not in (None, 0) for p in procs)) and break if found; otherwise keep the existing break when all(not p.is_alive() for p in procs). Refer to the variables/results used in the diff (results, world_size, q, queue_mod.Empty, procs, deadline) and update the timeout-handling branch to check for non-zero exit codes first to implement the fail-fast behavior.

codecov · 2026-05-26T13:29:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

fix flaky ep tests

f5c7baa

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

winglian added the scheduled_release This PR is slated for the upcoming release label May 26, 2026

winglian merged commit 3c4ff59 into main May 26, 2026
17 checks passed

winglian deleted the flaky-parallel-tests branch May 26, 2026 19:04

winglian mentioned this pull request May 29, 2026

feat: add dataset weight support for weighted data mixing (supersedes #3501) #3690

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix flaky ep tests#3683

fix flaky ep tests#3683
winglian merged 1 commit into
mainfrom
flaky-parallel-tests

winglian commented May 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

winglian commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 26, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

winglian commented May 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading