Merged
56 changes: 56 additions & 0 deletions tests/integrations/kernels/scattermoe_lora/conftest.py
@@ -0,0 +1,56 @@
# SPDX-License-Identifier: Apache-2.0
# Copyright (c) Axolotl AI
# Licensed under the Apache License, Version 2.0

"""Treat CUDA OOM as a skip for tests in this directory.

When the suite runs under ``pytest-xdist``, multiple workers contend for the
same physical GPU's memory budget. A test that fits comfortably in isolation
can OOM purely because peer workers are already holding most of VRAM. That's
an environmental race, not a code defect, so converting it to a skip keeps
mixed-GPU CI green without masking real regressions (a real correctness bug
surfaces as an assert/exception, not as ``torch.OutOfMemoryError``).

We hook ``pytest_runtest_call`` rather than using an autouse fixture because
pytest captures the test exception before re-entering the fixture's
generator — the fixture's ``try/except`` around ``yield`` never sees it.
"""

from __future__ import annotations

import gc

import pytest
import torch


def _cuda_oom_types() -> tuple[type[BaseException], ...]:
    types: list[type[BaseException]] = []
    if hasattr(torch, "OutOfMemoryError"):
        types.append(torch.OutOfMemoryError)
    cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
    if cuda_oom is not None and cuda_oom not in types:
        types.append(cuda_oom)
    return tuple(types) or (RuntimeError,)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid broad RuntimeError fallback that can hide real failures.

At Line 34, defaulting _OOM to (RuntimeError,) means Line 47 may skip unrelated runtime failures if torch OOM classes are unavailable, masking real regressions.

Proposed fix
 def _cuda_oom_types() -> tuple[type[BaseException], ...]:
     types: list[type[BaseException]] = []
     if hasattr(torch, "OutOfMemoryError"):
         types.append(torch.OutOfMemoryError)
     cuda_oom = getattr(torch.cuda, "OutOfMemoryError", None)
     if cuda_oom is not None and cuda_oom not in types:
         types.append(cuda_oom)
-    return tuple(types) or (RuntimeError,)
+    return tuple(types)

 def pytest_runtest_call(item):
     outcome = yield
     excinfo = outcome.excinfo
     if excinfo is None:
         return
+    if not _OOM:
+        return
     exc_val = excinfo[1]
     if isinstance(exc_val, _OOM):
         gc.collect()

Also applies to: 47-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integrations/kernels/scattermoe_lora/conftest.py` at line 34, the
fallback in `_cuda_oom_types()` currently returns `(RuntimeError,)`, which can
mask real runtime errors. Change `return tuple(types) or (RuntimeError,)` to
`return tuple(types)` so that `_OOM` only contains actual OOM exception types.
Then make the `isinstance(exc_val, _OOM)` check around line 47 run only when
`_OOM` is non-empty (for example, return early when it is empty) so unrelated
`RuntimeError` instances are not converted to skips.
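
The narrowed helper proposed in this comment can be exercised without a GPU or a PyTorch install. The following is a minimal sketch, not the PR's code: it assumes a stand-in namespace (`fake_torch`, hypothetical) in place of the real `torch` module and passes the module in as a parameter. It also illustrates that `isinstance` against an empty tuple is always `False`, so an empty `_OOM` cannot match unrelated exceptions; the explicit `if not _OOM: return` guard mainly makes that intent visible.

```python
from types import SimpleNamespace


def cuda_oom_types(torch_mod) -> tuple[type[BaseException], ...]:
    """Collect whichever OOM exception classes the given module exposes.

    `torch_mod` is a parameter (an assumption for illustration) so the logic
    can be run against a stand-in instead of the real `torch` module.
    """
    types: list[type[BaseException]] = []
    top_level = getattr(torch_mod, "OutOfMemoryError", None)
    if top_level is not None:
        types.append(top_level)
    cuda_ns = getattr(torch_mod, "cuda", None)
    cuda_oom = getattr(cuda_ns, "OutOfMemoryError", None)
    if cuda_oom is not None and cuda_oom not in types:
        types.append(cuda_oom)
    # No RuntimeError fallback: an empty tuple means "nothing to convert".
    return tuple(types)


class FakeOOM(RuntimeError):
    """Hypothetical stand-in for an OOM exception class."""


# One stand-in exposes the same class under both names; the other exposes none.
fake_torch = SimpleNamespace(
    OutOfMemoryError=FakeOOM,
    cuda=SimpleNamespace(OutOfMemoryError=FakeOOM),
)
bare_torch = SimpleNamespace(cuda=SimpleNamespace())

with_oom = cuda_oom_types(fake_torch)
without_oom = cuda_oom_types(bare_torch)

# An unrelated RuntimeError matches neither tuple; only the OOM class matches.
unrelated_vs_empty = isinstance(RuntimeError("shape mismatch"), without_oom)
unrelated_vs_oom = isinstance(RuntimeError("shape mismatch"), with_oom)
oom_vs_oom = isinstance(FakeOOM("out of memory"), with_oom)
```

With the fallback removed, a build that exposes no OOM classes simply never converts failures to skips, which is the safer default for a test suite.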



_OOM = _cuda_oom_types()


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    outcome = yield
    excinfo = outcome.excinfo
    if excinfo is None:
        return
    exc_val = excinfo[1]
    if isinstance(exc_val, _OOM):
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        outcome.force_exception(
            pytest.skip.Exception(
                f"skipping on CUDA OOM (likely xdist worker contention): {exc_val}",
                _use_item_location=True,
            )
        )
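
The module docstring's reason for using a hook wrapper instead of a fixture can be illustrated in plain Python. The sketch below is a simulation under stated assumptions, not pytest itself: `fixture_like`, `run_like_fixture`, `convert_like_hookwrapper`, and `SkipSignal` are hypothetical names, and `MemoryError` stands in for a CUDA OOM. It shows that a generator resumed with a plain `next()` never sees an exception the runner already captured, whereas a wrapper that receives the captured outcome can replace it.

```python
class SkipSignal(Exception):
    """Hypothetical stand-in for pytest.skip.Exception."""


def fixture_like(seen):
    """Generator shaped like a yield-style fixture with try/except around yield."""
    try:
        yield
    except Exception as exc:  # only reached if the driver throws into the generator
        seen.append(type(exc).__name__)


def failing_test():
    raise MemoryError("simulated OOM")


def run_like_fixture(test_fn):
    """Capture the test exception first, then resume the generator normally."""
    seen = []
    gen = fixture_like(seen)
    next(gen)                  # setup: run up to the yield
    captured = None
    try:
        test_fn()
    except Exception as exc:
        captured = exc         # the runner keeps the exception for its report
    try:
        next(gen)              # teardown: plain resume, nothing is thrown in
    except StopIteration:
        pass
    return seen, captured


def convert_like_hookwrapper(test_fn, oom_types):
    """Inspect the captured outcome and swap an OOM for a skip signal."""
    try:
        test_fn()
    except Exception as exc:
        if isinstance(exc, oom_types):
            return SkipSignal(f"skipping on OOM: {exc}")
        return exc
    return None


seen_by_fixture, reported = run_like_fixture(failing_test)
converted = convert_like_hookwrapper(failing_test, (MemoryError,))
```

In this simulation `seen_by_fixture` stays empty while `reported` still holds the failure, which mirrors why the conversion to a skip has to happen where the outcome is available.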