test(scattermoe-lora): skip on CUDA OOM under xdist contention by winglian · Pull Request #3689 · axolotl-ai-cloud/axolotl

winglian · 2026-05-29T03:22:44Z

When the suite runs under pytest-xdist, multiple workers race for the same physical GPU's memory budget. A test that fits comfortably in isolation can OOM purely because peer workers are already holding most of VRAM (observed: 8 workers each holding ~44 GiB on a 44 GiB card).

Add a conftest in tests/integrations/kernels/scattermoe_lora/ that hooks pytest_runtest_call and converts torch.OutOfMemoryError into a skip. Real correctness bugs still surface as failures since they raise asserts / typed exceptions, not OOM.

Uses a hookwrapper rather than an autouse fixture because pytest captures the test exception before re-entering the fixture's generator, so the fixture's try/except around yield never sees it.
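The fixture-versus-hookwrapper distinction can be modeled in plain Python. This is a simplified stand-in for the behavior described above, not pytest's actual runner: `FakeOOM`, `run_like_a_fixture`, and `run_like_a_wrapper` are hypothetical names. The point it shows is that a generator resumed normally after its `yield` never sees an exception that the runner already caught elsewhere.

```python
class FakeOOM(Exception):
    """Stand-in for torch.OutOfMemoryError (hypothetical, for illustration)."""


def oom_guard_fixture(log):
    # Shaped like a yield-style fixture: setup before the yield, teardown after it.
    try:
        yield
    except FakeOOM:
        log.append("fixture saw OOM")  # not reached in this model
    else:
        log.append("fixture teardown, no exception seen")


def failing_test():
    raise FakeOOM("CUDA out of memory")


def run_like_a_fixture(test, log):
    gen = oom_guard_fixture(log)
    next(gen)  # setup: run up to the yield
    captured = None
    try:
        test()
    except Exception as exc:  # the runner records the failure itself
        captured = exc
    try:
        next(gen)  # teardown: resumed normally, nothing is thrown into the generator
    except StopIteration:
        pass
    return captured  # the OOM is still reported as a failure


def run_like_a_wrapper(test, log):
    captured = None
    try:
        test()
    except Exception as exc:
        captured = exc
    # A wrapper around the call phase is handed the outcome, exception included.
    if isinstance(captured, FakeOOM):
        log.append("wrapper saw OOM, would skip")
        return None
    return captured
```

In this model the fixture's `except FakeOOM` branch is dead code, while the wrapper can inspect the captured exception and decide to skip.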

Summary by CodeRabbit

Tests
- Tests now automatically skip when GPU memory is insufficient instead of failing, improving reliability in memory-constrained environments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T03:22:57Z

📝 Walkthrough

This PR adds a pytest configuration hook to the ScatterMoE LoRA integration test suite that gracefully handles CUDA out-of-memory failures by converting them to test skips. The hook performs garbage collection and cache clearing before skipping, addressing transient GPU memory contention in distributed test execution.

Changes

CUDA OOM Skip Handling

| Layer / File(s) | Summary |
|---|---|
| CUDA OOM test skip hook `tests/integrations/kernels/scattermoe_lora/conftest.py` | `pytest_runtest_call` hook detects CUDA OOM exceptions via `_cuda_oom_types()` helper, runs `gc.collect()` and `torch.cuda.empty_cache()` when available, then converts exceptions to `pytest.skip.Exception` with location-aware skip reason. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'test(scattermoe-lora): skip on CUDA OOM under xdist contention' clearly and specifically summarizes the main change: adding test-level handling to skip CUDA out-of-memory failures during distributed test execution. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integrations/kernels/scattermoe_lora/conftest.py`:
- Line 34: The fallback in _OOM currently returns (RuntimeError,) which can mask
real runtime errors; change the function that returns "return tuple(types) or
(RuntimeError,)" to return an empty tuple instead (return tuple(types) or ()) so
_OOM only contains actual OOM exception types; then adjust any callers (e.g.,
the except _OOM: usage at the code path referencing _OOM around Line 47) to only
perform an except when _OOM is non-empty (or explicitly check before using
except) so unrelated RuntimeError instances are not swallowed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 792cd86e-12ec-4dde-9a2b-522d67b2b518

📥 Commits

Reviewing files that changed from the base of the PR and between ead6bc7 and 8aba07e.

📒 Files selected for processing (1)

tests/integrations/kernels/scattermoe_lora/conftest.py

coderabbitai · 2026-05-29T03:25:04Z

+    cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
+    if cuda_oom is not None and cuda_oom not in types:
+        types.append(cuda_oom)
+    return tuple(types) or (RuntimeError,)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid broad RuntimeError fallback that can hide real failures.

At Line 34, defaulting _OOM to (RuntimeError,) means Line 47 may skip unrelated runtime failures if torch OOM classes are unavailable, masking real regressions.

Proposed fix

 def _cuda_oom_types() -> tuple[type[BaseException], ...]:
     types: list[type[BaseException]] = []
     if hasattr(torch, "OutOfMemoryError"):
         types.append(torch.OutOfMemoryError)
     cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
     if cuda_oom is not None and cuda_oom not in types:
         types.append(cuda_oom)
-    return tuple(types) or (RuntimeError,)
+    return tuple(types)

 def pytest_runtest_call(item):
     outcome = yield
     excinfo = outcome.excinfo
     if excinfo is None:
         return
+    if not _OOM:
+        return
     exc_val = excinfo[1]
     if isinstance(exc_val, _OOM):
         gc.collect()

Also applies to: 47-47
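The concern about the fallback can be reproduced without torch. The sketch below uses a hypothetical `should_skip` helper and an invented error message. It compares the two fallbacks when no OOM class is available: with `(RuntimeError,)` an unrelated runtime error is classified as skippable, while an empty tuple matches nothing.

```python
def should_skip(exc, oom_types):
    """Return True when the exception would be converted into a skip."""
    return isinstance(exc, oom_types)


# An unrelated failure that must keep failing the test (message is invented).
unrelated = RuntimeError("shape mismatch in kernel launch")

broad_fallback = (RuntimeError,)  # original fallback when no OOM class is found
empty_fallback = ()               # proposed fallback

masked = should_skip(unrelated, broad_fallback)     # True: real failure hidden as a skip
preserved = should_skip(unrelated, empty_fallback)  # False: failure still surfaces
```

Since `isinstance(x, ())` is always `False`, the empty tuple is already safe in the `isinstance` check, so the extra `if not _OOM: return` guard in the proposed fix is a readability choice rather than a requirement.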

codecov · 2026-05-29T03:35:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai Bot reviewed May 29, 2026

winglian merged commit bf19bff into main May 29, 2026
13 of 17 checks passed

winglian deleted the fix/scattermoe-lora-skip-on-cuda-oom branch May 29, 2026 05:15
