fix: fail early for CI if meet CUDA error by NanoCode012 · Pull Request #3737 · axolotl-ai-cloud/axolotl

NanoCode012 · 2026-06-16T02:31:01Z

Description

I'm seeing a few failures on main and in unrelated PRs (for example, https://github.com/axolotl-ai-cloud/axolotl/actions/runs/27540583170/job/81407934320?pr=3736), where the Modal GPU Docker e2e job fails due to some device flakiness. This change makes those runs fail fast, so we can restart them on our own schedule and save cost instead of leaving them stuck in an error loop.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Claude

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Tests
- Improved GPU test execution: automatic detection of fatal CUDA errors with early test termination to prevent cascading failures.

coderabbitai · 2026-06-16T02:31:14Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e9024c7-db16-4f96-894d-2d13c69c6442


Walkthrough

Adds 29 lines to tests/conftest.py implementing a CUDA fatal-error guard: two module-level variables track detected fatal markers, a pytest_runtest_makereport hookwrapper scans failing-test exception messages for CUDA device-side assert/illegal-access strings and sets a poisoned flag, and a pytest_runtest_setup hook calls pytest.exit when that flag is set.

Changes

CUDA Fatal Error Detection

| Layer / File(s) | Summary |
|---|---|
| CUDA context poisoning state and pytest hooks `tests/conftest.py` | Defines `_CUDA_FATAL_MARKERS` tuple and `_cuda_context_poisoned` bool; implements `pytest_runtest_makereport` (hookwrapper) to detect CUDA fatal substrings in exception reprs and set the flag; implements `pytest_runtest_setup` to abort the session via `pytest.exit` when the flag is set. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: fail early for CI if meet CUDA error' clearly and specifically describes the main change: implementing early failure behavior when CUDA errors occur during CI tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov · 2026-06-16T02:57:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


fix: fail early for CI if meet CUDA error

b062a04

NanoCode012 added the review requested label Jun 16, 2026

fix: switch to clean abort

eac3488

NanoCode012 merged commit bc7e265 into main Jun 16, 2026
16 of 17 checks passed

NanoCode012 deleted the fix/ci-cuda-error branch June 16, 2026 09:42

NanoCode012 merged 2 commits into main from fix/ci-cuda-error