Skip to content

prefer latest pytorch as gated e2e tests#3698

Merged
winglian merged 5 commits into
mainfrom
pytorch-2120-tests
Jun 2, 2026
Merged

prefer latest pytorch as gated e2e tests#3698
winglian merged 5 commits into
mainfrom
pytorch-2120-tests

Conversation

@winglian
Copy link
Copy Markdown
Collaborator

@winglian winglian commented Jun 1, 2026

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • Tests

    • Updated GPU testing configurations to verify compatibility with Python 3.12 and PyTorch 2.11.0–2.12.0
  • Chores

    • Refreshed CI/CD test matrix dependencies for enhanced compatibility assurance

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e44db1c-1bb1-4ca4-9998-97b70b77e9b4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR updates GitHub Actions workflow matrices across two GPU e2e test workflows to upgrade Python and PyTorch versions for CUDA configurations, removing legacy version combinations and standardizing on newer runtime versions.

Changes

GPU Test Matrix Version Upgrades

Layer / File(s) Summary
Multi-GPU e2e matrix configuration
.github/workflows/multi-gpu-e2e.yml
CUDA 130 test configuration updated from Python 3.11 / PyTorch 2.9.1 to Python 3.12 / PyTorch 2.12.0.
Initial e2e gate test matrix
.github/workflows/tests.yml
docker-e2e-tests-1st job updated to use PyTorch 2.12.0 for the CUDA 13.0 / Python 3.12 matrix include.
Main e2e test matrix reconfiguration
.github/workflows/tests.yml
CUDA 12.8 / Python 3.11 / PyTorch 2.9.1 configuration removed from docker-e2e-tests matrix, and CUDA 13.0 entry updated to Python 3.12 / PyTorch 2.11.0.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • axolotl-ai-cloud/axolotl#3367: Updates CUDA 13.0 and Python 3.12 entries in GPU test workflow matrices with PyTorch version changes similar to this PR.
  • axolotl-ai-cloud/axolotl#3034: Related CI/workflow matrix updates for GPU e2e and base image build configurations with CUDA/PyTorch/Python version bumps.
  • axolotl-ai-cloud/axolotl#3697: Overlapping multi-GPU workflow matrix updates for CUDA 13.0 with Python 3.12 and PyTorch 2.12.0.

Suggested reviewers

  • djsaunde
  • salmanmohammadi
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'prefer latest pytorch as gated e2e tests' clearly reflects the main changes: updating PyTorch versions to latest versions (2.12.0, 2.11.0) in the e2e test workflows across multiple files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pytorch-2120-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/workflows/tests.yml (1)

373-378: 💤 Low value

Verify cleanup job version alignment.

The docker-e2e-cleanup job uses PyTorch 2.9.1 for CUDA 128, but the docker-e2e-tests job now uses PyTorch 2.10.0 for the same CUDA 128 configuration (line 326).

If the cleanup job is intended to remove cached artifacts from previous test runs, consider whether it should match the current test matrix versions or if the older version is intentionally retained for cleaning legacy caches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/tests.yml around lines 373 - 378, The docker-e2e-cleanup
job currently pins pytorch: 2.9.1 for cuda: 128 while docker-e2e-tests uses
pytorch: 2.10.0 for the same CUDA 128 config (docker-e2e-cleanup vs
docker-e2e-tests); update the docker-e2e-cleanup matrix to use pytorch: 2.10.0
(or explicitly document/guard retention of 2.9.1) so the cleanup job targets the
same artifact cache versions as the tests, ensuring cached artifacts removed
match the test environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/multi-gpu-e2e.yml:
- Around line 43-44: The workflow currently specifies pytorch: 2.12.0 which is
built for cu118/cu121 but the runner uses CUDA 13.0.0; update the workflow so
the CUDA/runtime and PyTorch align by either changing the pytorch value to a
build compatible with CUDA 13 (if available) or, more simply, change the
CUDA/runtime image/version to a supported one (e.g., cu118 or cu121) so that the
pytorch: 2.12.0 wheel matches; locate the keys python_version and pytorch in the
YAML and update the corresponding CUDA/runtime setting or the pytorch entry
accordingly.

In @.github/workflows/tests.yml:
- Around line 331-332: The CI defines two e2e jobs with mismatched PyTorch
versions: docker-e2e-tests-1st uses pytorch 2.12.0 while docker-e2e-tests uses
pytorch 2.11.0; update one of these jobs so both use the same PyTorch version
for the CUDA 130 / Python 3.12 matrix entry (either set docker-e2e-tests-1st to
2.11.0 or docker-e2e-tests to 2.12.0) so the gate and main e2e suites run
against an identical PyTorch version; ensure you update the pytorch value in the
job matrix entries referencing python_version "3.12" to keep them consistent
and, if intentional, add a brief comment explaining the divergence.

---

Nitpick comments:
In @.github/workflows/tests.yml:
- Around line 373-378: The docker-e2e-cleanup job currently pins pytorch: 2.9.1
for cuda: 128 while docker-e2e-tests uses pytorch: 2.10.0 for the same CUDA 128
config (docker-e2e-cleanup vs docker-e2e-tests); update the docker-e2e-cleanup
matrix to use pytorch: 2.10.0 (or explicitly document/guard retention of 2.9.1)
so the cleanup job targets the same artifact cache versions as the tests,
ensuring cached artifacts removed match the test environment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f880187-2f7c-410f-a9f1-e69a1656f5bf

📥 Commits

Reviewing files that changed from the base of the PR and between 6da2f9e and 3ecddc4.

📒 Files selected for processing (2)
  • .github/workflows/multi-gpu-e2e.yml
  • .github/workflows/tests.yml

Comment on lines +43 to +44
python_version: "3.12"
pytorch: 2.12.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Is PyTorch 2.12.0 released and does it support CUDA 13.0 and Python 3.12?

💡 Result:

Yes, PyTorch 2.12.0 has been released [1][2][3]. Regarding your questions on support: * CUDA 13.0 Support: PyTorch 2.12.0 includes official support for CUDA 13.0 [4][5]. In fact, CUDA 13.0 is the stable default for PyTorch 2.12.0 [5][3]. Additionally, experimental builds for newer versions such as CUDA 13.2 are also available [5][6]. * Python 3.12 Support: PyTorch 2.12.0 supports Python 3.12 [4]. As of the 2.x release series, PyTorch generally maintains compatibility with recent Python versions (including experimental support for newer releases like Python 3.14t) [4][7]. You can verify specific compatibility for your environment by checking the official PyTorch installation documentation or the release notes for version 2.12.0 [1][5].

Citations:


Fix PyTorch/CUDA version mismatch in .github/workflows/multi-gpu-e2e.yml (lines 43-44): torch==2.12.0 is built for CUDA 11.8/12.1 wheels (cu118/cu121), not CUDA 13.0.0—update the workflow to a supported CUDA version or bump PyTorch accordingly. Python 3.12 is supported by PyTorch 2.12.0.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/multi-gpu-e2e.yml around lines 43 - 44, The workflow
currently specifies pytorch: 2.12.0 which is built for cu118/cu121 but the
runner uses CUDA 13.0.0; update the workflow so the CUDA/runtime and PyTorch
align by either changing the pytorch value to a build compatible with CUDA 13
(if available) or, more simply, change the CUDA/runtime image/version to a
supported one (e.g., cu118 or cu121) so that the pytorch: 2.12.0 wheel matches;
locate the keys python_version and pytorch in the YAML and update the
corresponding CUDA/runtime setting or the pytorch entry accordingly.

Comment on lines +331 to +332
python_version: "3.12"
pytorch: 2.11.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Version mismatch between gate and main e2e tests.

The docker-e2e-tests-1st gate job (line 277) uses PyTorch 2.12.0 for CUDA 130 / Python 3.12, but the docker-e2e-tests main job uses PyTorch 2.11.0 for the same configuration. This inconsistency means the gate test validates against a different PyTorch version than the full test suite runs with, potentially missing compatibility issues specific to PyTorch 2.11.0.

Consider aligning both jobs to use the same PyTorch version for the CUDA 130 configuration, or document if this difference is intentional.

🔧 Suggested fix to align versions

If the intent is to use PyTorch 2.12.0 consistently (matching the PR title "prefer latest pytorch as gated e2e tests"):

           - cuda: 130
             cuda_version: 13.0.0
             python_version: "3.12"
-            pytorch: 2.11.0
+            pytorch: 2.12.0
             num_gpus: 1
             axolotl_extras:

Alternatively, if 2.11.0 is preferred for the main tests, update the gate to match:

           - cuda: 130
             cuda_version: 13.0.0
             python_version: "3.12"
-            pytorch: 2.12.0
+            pytorch: 2.11.0
             num_gpus: 1
             axolotl_extras:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
python_version: "3.12"
pytorch: 2.11.0
python_version: "3.12"
pytorch: 2.12.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/tests.yml around lines 331 - 332, The CI defines two e2e
jobs with mismatched PyTorch versions: docker-e2e-tests-1st uses pytorch 2.12.0
while docker-e2e-tests uses pytorch 2.11.0; update one of these jobs so both use
the same PyTorch version for the CUDA 130 / Python 3.12 matrix entry (either set
docker-e2e-tests-1st to 2.11.0 or docker-e2e-tests to 2.12.0) so the gate and
main e2e suites run against an identical PyTorch version; ensure you update the
pytorch value in the job matrix entries referencing python_version "3.12" to
keep them consistent and, if intentional, add a brief comment explaining the
divergence.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/utils/optimizers/adopt.py 87.50% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

winglian added 2 commits June 2, 2026 01:56
…allback to 2.11

torch 2.12.0 rewrote the sharded-param construction in
FSDPParam._init_sharded_param from a two-line form

    self.sharded_param = nn.Parameter(self.to_sharded_dtensor(sharded_param))
    self.sharded_param.requires_grad_(param.requires_grad)

to a single multi-line Parameter() call with requires_grad= as a kwarg

    self.sharded_param = nn.Parameter(
        self.to_sharded_dtensor(sharded_param),
        requires_grad=param.requires_grad,
    )

Functionally identical, but the axolotl monkey-patch is source-level
text replacement: the 2.11 anchor no longer matches the 2.12 source, so
the substitution silently falls through to the warning branch and the
method stays unpatched — bnb Params4bit / Int8Params lose their
quantization metadata through the FSDP2 shard cycle.

Try the 2.12 anchor first; fall back to the 2.11 anchor so the patch
keeps working against both torch versions in our test matrix.

init_unsharded_param uses the same kwarg-style call in both 2.11 and
2.12, so its anchor is untouched.
@winglian winglian force-pushed the pytorch-2120-tests branch from 8ee5068 to e6d0431 Compare June 2, 2026 01:56
winglian added 2 commits June 2, 2026 01:57
torch 2.12 hoisted the unsharded-param construction out of the
first-all-gather `else:` branch up to method-body level, so the 2.11
anchor (8-space, inside else) no longer matched and the patch silently
no-op'd. This left bitsandbytes Params4bit unreconstructed under FSDP2,
surfacing as `mat1 and mat2 shapes cannot be multiplied (... 1x36864)`
in QLoRA training. Add the 2.12 method-body-level anchor with its own
replacement indentation, falling back to the 2.11 form.
test_lora_ddp ran only 2 steps with no seed, so train_loss was a random
draw (observed 1.95-3.23 across runs) and the 2.8 threshold tripped
intermittently — the torch 2.12 bump just happened to surface it. Run 20
steps with seed=42 to make the loss deterministic (2.189-2.191 spread),
and tighten the threshold to 2.5.
@winglian winglian force-pushed the pytorch-2120-tests branch from e6d0431 to ff991d6 Compare June 2, 2026 01:57
torch 2.11 renamed Optimizer._cuda_graph_capture_health_check to
_accelerator_graph_capture_health_check (2.12 re-added the old name as an
alias). ADOPT called the old name, so it raised AttributeError under torch
2.11 — surfaced by bumping the docker-e2e row from 2.9.1 to 2.11.0. Resolve
whichever name exists, preferring the new one. Also swap the deprecated
torch._utils.is_compiling() for torch.compiler.is_compiling().
@winglian winglian merged commit 406aee4 into main Jun 2, 2026
43 of 46 checks passed
@winglian winglian deleted the pytorch-2120-tests branch June 2, 2026 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant