test(scattermoe-lora): skip on CUDA OOM under xdist contention #3689

Merged
winglian merged 1 commit into main from fix/scattermoe-lora-skip-on-cuda-oom
May 29, 2026
Conversation

@winglian winglian commented May 29, 2026

Collaborator

When the suite runs under pytest-xdist, multiple workers race for the same physical GPU's memory budget. A test that fits comfortably in isolation can OOM purely because peer workers are already holding most of VRAM (observed: 8 workers each holding ~44 GiB on a 44 GiB card).

Add a conftest.py in tests/integrations/kernels/scattermoe_lora/ that hooks pytest_runtest_call and converts torch.OutOfMemoryError into a skip. Real correctness bugs still surface as failures because they raise assertion errors or other typed exceptions, not OOM.

This uses a hookwrapper rather than an autouse fixture: pytest captures the test exception before resuming the fixture's generator, so a try/except around the fixture's yield never sees it.
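The fixture limitation can be illustrated with a small pure-Python model. This is a simplified sketch, not pytest's actual runner: `run_like_a_fixture` and `guarded_fixture` are hypothetical names, and `MemoryError` stands in for `torch.OutOfMemoryError`. It assumes only that teardown resumes the generator normally instead of throwing the test's exception into it.

```python
def guarded_fixture(log):
    """Yield-style fixture body with a try/except around the yield."""
    try:
        yield
    except MemoryError:
        log.append("fixture saw the error")
    log.append("fixture teardown ran")


def run_like_a_fixture(gen_func, test, log):
    """Simplified model: set up, run the test separately, then tear down."""
    gen = gen_func(log)
    next(gen)  # setup: run the fixture body up to its yield
    try:
        test()
    except MemoryError:
        # The runner captures the test's exception itself.
        log.append("runner captured the error")
    # Teardown resumes the generator with next(), not gen.throw(),
    # so the except clause inside the fixture is never entered.
    next(gen, None)
    return log


def failing_test():
    # Stand-in for a test that hits torch.OutOfMemoryError.
    raise MemoryError("simulated OOM")


events = run_like_a_fixture(guarded_fixture, failing_test, [])
print(events)
```

In this model the fixture's teardown still runs, but its `except` branch never fires, which is why the conversion to a skip has to happen in a hook that receives the call outcome.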

Summary by CodeRabbit

  • Tests
    • Tests now automatically skip when GPU memory is insufficient instead of failing, improving reliability in memory-constrained environments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 29, 2026

Contributor
Walkthrough

This PR adds a pytest configuration hook to the ScatterMoE LoRA integration test suite that gracefully handles CUDA out-of-memory failures by converting them to test skips. The hook performs garbage collection and cache clearing before skipping, addressing transient GPU memory contention in distributed test execution.

Changes

CUDA OOM Skip Handling

| Layer / File(s) | Summary |
|---|---|
| CUDA OOM test skip hook: tests/integrations/kernels/scattermoe_lora/conftest.py | pytest_runtest_call hook detects CUDA OOM exceptions via _cuda_oom_types() helper, runs gc.collect() and torch.cuda.empty_cache() when available, then converts exceptions to pytest.skip.Exception with location-aware skip reason. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'test(scattermoe-lora): skip on CUDA OOM under xdist contention' clearly and specifically summarizes the main change: adding test-level handling to skip CUDA out-of-memory failures during distributed test execution. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integrations/kernels/scattermoe_lora/conftest.py`:
- Line 34: The fallback in _OOM currently returns (RuntimeError,) which can mask
real runtime errors; change the function that returns "return tuple(types) or
(RuntimeError,)" to return an empty tuple instead (return tuple(types) or ()) so
_OOM only contains actual OOM exception types; then adjust any callers (e.g.,
the except _OOM: usage at the code path referencing _OOM around Line 47) to only
perform an except when _OOM is non-empty (or explicitly check before using
except) so unrelated RuntimeError instances are not swallowed.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 792cd86e-12ec-4dde-9a2b-522d67b2b518

📥 Commits

Reviewing files that changed from the base of the PR and between ead6bc7 and 8aba07e.

📒 Files selected for processing (1)
  • tests/integrations/kernels/scattermoe_lora/conftest.py

    cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
    if cuda_oom is not None and cuda_oom not in types:
        types.append(cuda_oom)
    return tuple(types) or (RuntimeError,)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid broad RuntimeError fallback that can hide real failures.

At Line 34, defaulting _OOM to (RuntimeError,) means Line 47 may skip unrelated runtime failures if torch OOM classes are unavailable, masking real regressions.

Proposed fix
 def _cuda_oom_types() -> tuple[type[BaseException], ...]:
     types: list[type[BaseException]] = []
     if hasattr(torch, "OutOfMemoryError"):
         types.append(torch.OutOfMemoryError)
     cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
     if cuda_oom is not None and cuda_oom not in types:
         types.append(cuda_oom)
-    return tuple(types) or (RuntimeError,)
+    return tuple(types)

 def pytest_runtest_call(item):
     outcome = yield
     excinfo = outcome.excinfo
     if excinfo is None:
         return
+    if not _OOM:
+        return
     exc_val = excinfo[1]
     if isinstance(exc_val, _OOM):
         gc.collect()

Also applies to: 47-47
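The effect of the fallback can be checked without a GPU. The sketch below is illustrative only: `cuda_oom_types` takes a hypothetical `torch_like` argument so that stand-in namespaces can be passed in, and `FakeOOM` is a made-up class standing in for torch's OOM exception.

```python
from types import SimpleNamespace


def cuda_oom_types(torch_like):
    """Same lookup as the helper in the diff, but on an injected module."""
    types = []
    for owner in (torch_like, getattr(torch_like, "cuda", None)):
        oom = getattr(owner, "OutOfMemoryError", None)
        if oom is not None and oom not in types:
            types.append(oom)
    return tuple(types)  # no RuntimeError fallback


class FakeOOM(RuntimeError):
    """Stand-in for torch's OOM exception class."""


# A build that exposes no OOM class, and one that exposes it in both places.
bare_torch = SimpleNamespace(cuda=SimpleNamespace())
full_torch = SimpleNamespace(
    OutOfMemoryError=FakeOOM,
    cuda=SimpleNamespace(OutOfMemoryError=FakeOOM),
)

shape_bug = RuntimeError("shape mismatch")  # an unrelated runtime failure

# With the old fallback, the unrelated error would be treated as OOM.
masked = isinstance(shape_bug, cuda_oom_types(bare_torch) or (RuntimeError,))
# With an empty tuple, isinstance() is simply False for every exception.
unmasked = isinstance(shape_bug, cuda_oom_types(bare_torch))
# When the classes exist, duplicates collapse and only real OOMs match.
real_types = cuda_oom_types(full_torch)
print(masked, unmasked, real_types == (FakeOOM,))
```

Since `isinstance(x, ())` is always false, the explicit `if not _OOM: return` guard in the proposed fix mainly documents intent; the empty-tuple return alone already prevents unrelated errors from being skipped.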

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@winglian winglian merged commit bf19bff into main May 29, 2026
13 of 17 checks passed
@winglian winglian deleted the fix/scattermoe-lora-skip-on-cuda-oom branch May 29, 2026 05:15