test(scattermoe-lora): skip on CUDA OOM under xdist contention#3689
Conversation
When the suite runs under pytest-xdist, multiple workers race for the same physical GPU's memory budget. A test that fits comfortably in isolation can OOM purely because peer workers are already holding most of VRAM (observed: 8 workers each holding ~44 GiB on a 44 GiB card). Add a conftest in tests/integrations/kernels/scattermoe_lora/ that hooks pytest_runtest_call and converts torch.OutOfMemoryError into a skip. Real correctness bugs still surface as failures since they raise asserts / typed exceptions, not OOM. Uses a hookwrapper rather than an autouse fixture because pytest captures the test exception before re-entering the fixture's generator, so the fixture's try/except around yield never sees it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds a pytest configuration hook to the ScatterMoE LoRA integration test suite that gracefully handles CUDA out-of-memory failures by converting them to test skips. The hook performs garbage collection and cache clearing before skipping, addressing transient GPU memory contention in distributed test execution. ChangesCUDA OOM Skip Handling
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integrations/kernels/scattermoe_lora/conftest.py`:
- Line 34: The fallback in _OOM currently returns (RuntimeError,) which can mask
real runtime errors; change the function that returns "return tuple(types) or
(RuntimeError,)" to return an empty tuple instead (return tuple(types) or ()) so
_OOM only contains actual OOM exception types; then adjust any callers (e.g.,
the except _OOM: usage at the code path referencing _OOM around Line 47) to only
perform an except when _OOM is non-empty (or explicitly check before using
except) so unrelated RuntimeError instances are not swallowed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 792cd86e-12ec-4dde-9a2b-522d67b2b518
📒 Files selected for processing (1)
tests/integrations/kernels/scattermoe_lora/conftest.py
| cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None) | ||
| if cuda_oom is not None and cuda_oom not in types: | ||
| types.append(cuda_oom) | ||
| return tuple(types) or (RuntimeError,) |
There was a problem hiding this comment.
Avoid broad RuntimeError fallback that can hide real failures.
At Line 34, defaulting _OOM to (RuntimeError,) means Line 47 may skip unrelated runtime failures if torch OOM classes are unavailable, masking real regressions.
Proposed fix
def _cuda_oom_types() -> tuple[type[BaseException], ...]:
types: list[type[BaseException]] = []
if hasattr(torch, "OutOfMemoryError"):
types.append(torch.OutOfMemoryError)
cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
if cuda_oom is not None and cuda_oom not in types:
types.append(cuda_oom)
- return tuple(types) or (RuntimeError,)
+ return tuple(types) def pytest_runtest_call(item):
outcome = yield
excinfo = outcome.excinfo
if excinfo is None:
return
+ if not _OOM:
+ return
exc_val = excinfo[1]
if isinstance(exc_val, _OOM):
gc.collect()Also applies to: 47-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integrations/kernels/scattermoe_lora/conftest.py` at line 34, The
fallback in _OOM currently returns (RuntimeError,) which can mask real runtime
errors; change the function that returns "return tuple(types) or
(RuntimeError,)" to return an empty tuple instead (return tuple(types) or ()) so
_OOM only contains actual OOM exception types; then adjust any callers (e.g.,
the except _OOM: usage at the code path referencing _OOM around Line 47) to only
perform an except when _OOM is non-empty (or explicitly check before using
except) so unrelated RuntimeError instances are not swallowed.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
When the suite runs under pytest-xdist, multiple workers race for the same physical GPU's memory budget. A test that fits comfortably in isolation can OOM purely because peer workers are already holding most of VRAM (observed: 8 workers each holding ~44 GiB on a 44 GiB card).
Add a conftest in tests/integrations/kernels/scattermoe_lora/ that hooks pytest_runtest_call and converts torch.OutOfMemoryError into a skip. Real correctness bugs still surface as failures since they raise asserts / typed exceptions, not OOM.
Uses a hookwrapper rather than an autouse fixture because pytest captures the test exception before re-entering the fixture's generator, so the fixture's try/except around yield never sees it.
Description
Motivation and Context
How has this been tested?
AI Usage Disclaimer
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit