fix: fail early for CI if meet CUDA error #3737

Merged
NanoCode012 merged 2 commits into main from fix/ci-cuda-error
Jun 16, 2026

@NanoCode012 NanoCode012 commented Jun 16, 2026

Collaborator

Description

I'm seeing a few failures on main and in unrelated PRs (for example, https://github.com/axolotl-ai-cloud/axolotl/actions/runs/27540583170/job/81407934320?pr=3736) where the Modal GPU Docker e2e job fails due to device flakiness. This change makes those runs fail fast, so we can restart them when convenient and save cost instead of leaving them stuck in an error loop.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Claude

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • Tests
    • Improved GPU test execution: automatic detection of fatal CUDA errors with early test termination to prevent cascading failures.

coderabbitai Bot commented Jun 16, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e9024c7-db16-4f96-894d-2d13c69c6442

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds 29 lines to tests/conftest.py implementing a CUDA fatal-error guard: two module-level variables track detected fatal markers, a pytest_runtest_makereport hookwrapper scans failing-test exception messages for CUDA device-side assert/illegal-access strings and sets a poisoned flag, and a pytest_runtest_setup hook calls pytest.exit when that flag is set.

Changes

CUDA Fatal Error Detection

| Layer / File(s) | Summary |
|---|---|
| CUDA context poisoning state and pytest hooks (tests/conftest.py) | Defines _CUDA_FATAL_MARKERS tuple and _cuda_context_poisoned bool; implements pytest_runtest_makereport (hookwrapper) to detect CUDA fatal substrings in exception reprs and set the flag; implements pytest_runtest_setup to abort the session via pytest.exit when the flag is set. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: fail early for CI if meet CUDA error' clearly and specifically describes the main change: implementing early failure behavior when CUDA errors occur during CI tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov Bot commented Jun 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@NanoCode012 NanoCode012 merged commit bc7e265 into main Jun 16, 2026
16 of 17 checks passed
@NanoCode012 NanoCode012 deleted the fix/ci-cuda-error branch June 16, 2026 09:42