prefer latest pytorch as gated e2e tests by winglian · Pull Request #3698 · axolotl-ai-cloud/axolotl

winglian · 2026-06-01T17:06:07Z

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Tests
- Updated GPU testing configurations to verify compatibility with Python 3.12 and PyTorch 2.11.0–2.12.0
Chores
- Refreshed CI/CD test matrix dependencies for enhanced compatibility assurance

coderabbitai · 2026-06-01T17:06:22Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e44db1c-1bb1-4ca4-9998-97b70b77e9b4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR updates GitHub Actions workflow matrices across two GPU e2e test workflows to upgrade Python and PyTorch versions for CUDA configurations, removing legacy version combinations and standardizing on newer runtime versions.

Changes

GPU Test Matrix Version Upgrades

Layer / File(s)	Summary
Multi-GPU e2e matrix configuration `.github/workflows/multi-gpu-e2e.yml`	CUDA 130 test configuration updated from Python 3.11 / PyTorch 2.9.1 to Python 3.12 / PyTorch 2.12.0.
Initial e2e gate test matrix `.github/workflows/tests.yml`	docker-e2e-tests-1st job updated to use PyTorch 2.12.0 for the CUDA 13.0 / Python 3.12 matrix include.
Main e2e test matrix reconfiguration `.github/workflows/tests.yml`	CUDA 12.8 / Python 3.11 / PyTorch 2.9.1 configuration removed from docker-e2e-tests matrix, and CUDA 13.0 entry updated to Python 3.12 / PyTorch 2.11.0.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

axolotl-ai-cloud/axolotl#3367: Updates CUDA 13.0 and Python 3.12 entries in GPU test workflow matrices with PyTorch version changes similar to this PR.
axolotl-ai-cloud/axolotl#3034: Related CI/workflow matrix updates for GPU e2e and base image build configurations with CUDA/PyTorch/Python version bumps.
axolotl-ai-cloud/axolotl#3697: Overlapping multi-GPU workflow matrix updates for CUDA 13.0 with Python 3.12 and PyTorch 2.12.0.

Suggested reviewers

djsaunde
salmanmohammadi

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'prefer latest pytorch as gated e2e tests' clearly reflects the main changes: updating PyTorch versions to latest versions (2.12.0, 2.11.0) in the e2e test workflows across multiple files.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pytorch-2120-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

.github/workflows/tests.yml (1)
373-378: 💤 Low value

Verify cleanup job version alignment.

The docker-e2e-cleanup job uses PyTorch 2.9.1 for CUDA 128, but the docker-e2e-tests job now uses PyTorch 2.10.0 for the same CUDA 128 configuration (line 326).

If the cleanup job is intended to remove cached artifacts from previous test runs, consider whether it should match the current test matrix versions or if the older version is intentionally retained for cleaning legacy caches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/tests.yml around lines 373 - 378, The docker-e2e-cleanup
job currently pins pytorch: 2.9.1 for cuda: 128 while docker-e2e-tests uses
pytorch: 2.10.0 for the same CUDA 128 config (docker-e2e-cleanup vs
docker-e2e-tests); update the docker-e2e-cleanup matrix to use pytorch: 2.10.0
(or explicitly document/guard retention of 2.9.1) so the cleanup job targets the
same artifact cache versions as the tests, ensuring cached artifacts removed
match the test environment.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/multi-gpu-e2e.yml:
- Around line 43-44: The workflow currently specifies pytorch: 2.12.0 which is
built for cu118/cu121 but the runner uses CUDA 13.0.0; update the workflow so
the CUDA/runtime and PyTorch align by either changing the pytorch value to a
build compatible with CUDA 13 (if available) or, more simply, change the
CUDA/runtime image/version to a supported one (e.g., cu118 or cu121) so that the
pytorch: 2.12.0 wheel matches; locate the keys python_version and pytorch in the
YAML and update the corresponding CUDA/runtime setting or the pytorch entry
accordingly.

In @.github/workflows/tests.yml:
- Around line 331-332: The CI defines two e2e jobs with mismatched PyTorch
versions: docker-e2e-tests-1st uses pytorch 2.12.0 while docker-e2e-tests uses
pytorch 2.11.0; update one of these jobs so both use the same PyTorch version
for the CUDA 130 / Python 3.12 matrix entry (either set docker-e2e-tests-1st to
2.11.0 or docker-e2e-tests to 2.12.0) so the gate and main e2e suites run
against an identical PyTorch version; ensure you update the pytorch value in the
job matrix entries referencing python_version "3.12" to keep them consistent
and, if intentional, add a brief comment explaining the divergence.

---

Nitpick comments:
In @.github/workflows/tests.yml:
- Around line 373-378: The docker-e2e-cleanup job currently pins pytorch: 2.9.1
for cuda: 128 while docker-e2e-tests uses pytorch: 2.10.0 for the same CUDA 128
config (docker-e2e-cleanup vs docker-e2e-tests); update the docker-e2e-cleanup
matrix to use pytorch: 2.10.0 (or explicitly document/guard retention of 2.9.1)
so the cleanup job targets the same artifact cache versions as the tests,
ensuring cached artifacts removed match the test environment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f880187-2f7c-410f-a9f1-e69a1656f5bf

📥 Commits

Reviewing files that changed from the base of the PR and between 6da2f9e and 3ecddc4.

📒 Files selected for processing (2)

.github/workflows/multi-gpu-e2e.yml
.github/workflows/tests.yml

coderabbitai · 2026-06-01T17:11:36Z

+            python_version: "3.12"
+            pytorch: 2.12.0


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Is PyTorch 2.12.0 released and does it support CUDA 13.0 and Python 3.12?

💡 Result:

Yes, PyTorch 2.12.0 has been released [1][2][3]. Regarding your questions on support: * CUDA 13.0 Support: PyTorch 2.12.0 includes official support for CUDA 13.0 [4][5]. In fact, CUDA 13.0 is the stable default for PyTorch 2.12.0 [5][3]. Additionally, experimental builds for newer versions such as CUDA 13.2 are also available [5][6]. * Python 3.12 Support: PyTorch 2.12.0 supports Python 3.12 [4]. As of the 2.x release series, PyTorch generally maintains compatibility with recent Python versions (including experimental support for newer releases like Python 3.14t) [4][7]. You can verify specific compatibility for your environment by checking the official PyTorch installation documentation or the release notes for version 2.12.0 [1][5].

Citations:

1: https://github.com/pytorch/pytorch/releases/tag/v2.12.0

2: https://github.com/pytorch/pytorch/releases

3: https://newreleases.io/project/pypi/torch/release/2.12.0

4: https://github.com/pytorch/pytorch/blob/HEAD/RELEASE.md

5: https://pytorch.org/blog/pytorch-2-12-release-blog/

6: https://www.reddit.com/r/StableDiffusion/comments/1tcd718/pytorch_2120cu132_cuda_132_sa2sa3_attention/

7: https://pytorch.org/blog/pytorch-2-10-release-blog/

Fix PyTorch/CUDA version mismatch in .github/workflows/multi-gpu-e2e.yml (lines 43-44): torch==2.12.0 is built for CUDA 11.8/12.1 wheels (cu118/cu121), not CUDA 13.0.0—update the workflow to a supported CUDA version or bump PyTorch accordingly. Python 3.12 is supported by PyTorch 2.12.0.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/multi-gpu-e2e.yml around lines 43 - 44, The workflow currently specifies pytorch: 2.12.0 which is built for cu118/cu121 but the runner uses CUDA 13.0.0; update the workflow so the CUDA/runtime and PyTorch align by either changing the pytorch value to a build compatible with CUDA 13 (if available) or, more simply, change the CUDA/runtime image/version to a supported one (e.g., cu118 or cu121) so that the pytorch: 2.12.0 wheel matches; locate the keys python_version and pytorch in the YAML and update the corresponding CUDA/runtime setting or the pytorch entry accordingly.

coderabbitai · 2026-06-01T17:11:36Z

+            python_version: "3.12"
+            pytorch: 2.11.0


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Version mismatch between gate and main e2e tests.

The docker-e2e-tests-1st gate job (line 277) uses PyTorch 2.12.0 for CUDA 130 / Python 3.12, but the docker-e2e-tests main job uses PyTorch 2.11.0 for the same configuration. This inconsistency means the gate test validates against a different PyTorch version than the full test suite runs with, potentially missing compatibility issues specific to PyTorch 2.11.0.

Consider aligning both jobs to use the same PyTorch version for the CUDA 130 configuration, or document if this difference is intentional.

🔧 Suggested fix to align versions

If the intent is to use PyTorch 2.12.0 consistently (matching the PR title "prefer latest pytorch as gated e2e tests"):

- cuda: 130 cuda_version: 13.0.0 python_version: "3.12" - pytorch: 2.11.0 + pytorch: 2.12.0 num_gpus: 1 axolotl_extras:

Alternatively, if 2.11.0 is preferred for the main tests, update the gate to match:

- cuda: 130 cuda_version: 13.0.0 python_version: "3.12" - pytorch: 2.12.0 + pytorch: 2.11.0 num_gpus: 1 axolotl_extras:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

python_version: "3.12"

pytorch: 2.11.0

python_version: "3.12"

pytorch: 2.12.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/tests.yml around lines 331 - 332, The CI defines two e2e jobs with mismatched PyTorch versions: docker-e2e-tests-1st uses pytorch 2.12.0 while docker-e2e-tests uses pytorch 2.11.0; update one of these jobs so both use the same PyTorch version for the CUDA 130 / Python 3.12 matrix entry (either set docker-e2e-tests-1st to 2.11.0 or docker-e2e-tests to 2.12.0) so the gate and main e2e suites run against an identical PyTorch version; ensure you update the pytorch value in the job matrix entries referencing python_version "3.12" to keep them consistent and, if intentional, add a brief comment explaining the divergence.

codecov · 2026-06-01T17:17:25Z

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/utils/optimizers/adopt.py	87.50%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

…allback to 2.11 torch 2.12.0 rewrote the sharded-param construction in FSDPParam._init_sharded_param from a two-line form self.sharded_param = nn.Parameter(self.to_sharded_dtensor(sharded_param)) self.sharded_param.requires_grad_(param.requires_grad) to a single multi-line Parameter() call with requires_grad= as a kwarg self.sharded_param = nn.Parameter( self.to_sharded_dtensor(sharded_param), requires_grad=param.requires_grad, ) Functionally identical, but the axolotl monkey-patch is source-level text replacement: the 2.11 anchor no longer matches the 2.12 source, so the substitution silently falls through to the warning branch and the method stays unpatched — bnb Params4bit / Int8Params lose their quantization metadata through the FSDP2 shard cycle. Try the 2.12 anchor first; fall back to the 2.11 anchor so the patch keeps working against both torch versions in our test matrix. init_unsharded_param uses the same kwarg-style call in both 2.11 and 2.12, so its anchor is untouched.

torch 2.12 hoisted the unsharded-param construction out of the first-all-gather `else:` branch up to method-body level, so the 2.11 anchor (8-space, inside else) no longer matched and the patch silently no-op'd. This left bitsandbytes Params4bit unreconstructed under FSDP2, surfacing as `mat1 and mat2 shapes cannot be multiplied (... 1x36864)` in QLoRA training. Add the 2.12 method-body-level anchor with its own replacement indentation, falling back to the 2.11 form.

test_lora_ddp ran only 2 steps with no seed, so train_loss was a random draw (observed 1.95-3.23 across runs) and the 2.8 threshold tripped intermittently — the torch 2.12 bump just happened to surface it. Run 20 steps with seed=42 to make the loss deterministic (2.189-2.191 spread), and tighten the threshold to 2.5.

torch 2.11 renamed Optimizer._cuda_graph_capture_health_check to _accelerator_graph_capture_health_check (2.12 re-added the old name as an alias). ADOPT called the old name, so it raised AttributeError under torch 2.11 — surfaced by bumping the docker-e2e row from 2.9.1 to 2.11.0. Resolve whichever name exists, preferring the new one. Also swap the deprecated torch._utils.is_compiling() for torch.compiler.is_compiling().

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

winglian added 2 commits June 2, 2026 01:56

prefer latest pytorch as gated e2e tests

ca49bcc

winglian force-pushed the pytorch-2120-tests branch from 8ee5068 to e6d0431 Compare June 2, 2026 01:56

winglian added 2 commits June 2, 2026 01:57

winglian force-pushed the pytorch-2120-tests branch from e6d0431 to ff991d6 Compare June 2, 2026 01:57

winglian merged commit 406aee4 into main Jun 2, 2026
43 of 46 checks passed

winglian deleted the pytorch-2120-tests branch June 2, 2026 20:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

prefer latest pytorch as gated e2e tests#3698

prefer latest pytorch as gated e2e tests#3698
winglian merged 5 commits into
mainfrom
pytorch-2120-tests

winglian commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Uh oh!

coderabbitai Bot Jun 1, 2026

Uh oh!

codecov Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

winglian commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

winglian commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

codecov Bot commented Jun 1, 2026 •

edited

Loading