prefer latest pytorch as gated e2e tests#3698
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR updates GitHub Actions workflow matrices across two GPU e2e test workflows to upgrade Python and PyTorch versions for CUDA configurations, removing legacy version combinations and standardizing on newer runtime versions. ChangesGPU Test Matrix Version Upgrades
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/tests.yml (1)
373-378: 💤 Low valueVerify cleanup job version alignment.
The
docker-e2e-cleanupjob uses PyTorch 2.9.1 for CUDA 128, but thedocker-e2e-testsjob now uses PyTorch 2.10.0 for the same CUDA 128 configuration (line 326).If the cleanup job is intended to remove cached artifacts from previous test runs, consider whether it should match the current test matrix versions or if the older version is intentionally retained for cleaning legacy caches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/tests.yml around lines 373 - 378, The docker-e2e-cleanup job currently pins pytorch: 2.9.1 for cuda: 128 while docker-e2e-tests uses pytorch: 2.10.0 for the same CUDA 128 config (docker-e2e-cleanup vs docker-e2e-tests); update the docker-e2e-cleanup matrix to use pytorch: 2.10.0 (or explicitly document/guard retention of 2.9.1) so the cleanup job targets the same artifact cache versions as the tests, ensuring cached artifacts removed match the test environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/multi-gpu-e2e.yml:
- Around line 43-44: The workflow currently specifies pytorch: 2.12.0 which is
built for cu118/cu121 but the runner uses CUDA 13.0.0; update the workflow so
the CUDA/runtime and PyTorch align by either changing the pytorch value to a
build compatible with CUDA 13 (if available) or, more simply, change the
CUDA/runtime image/version to a supported one (e.g., cu118 or cu121) so that the
pytorch: 2.12.0 wheel matches; locate the keys python_version and pytorch in the
YAML and update the corresponding CUDA/runtime setting or the pytorch entry
accordingly.
In @.github/workflows/tests.yml:
- Around line 331-332: The CI defines two e2e jobs with mismatched PyTorch
versions: docker-e2e-tests-1st uses pytorch 2.12.0 while docker-e2e-tests uses
pytorch 2.11.0; update one of these jobs so both use the same PyTorch version
for the CUDA 130 / Python 3.12 matrix entry (either set docker-e2e-tests-1st to
2.11.0 or docker-e2e-tests to 2.12.0) so the gate and main e2e suites run
against an identical PyTorch version; ensure you update the pytorch value in the
job matrix entries referencing python_version "3.12" to keep them consistent
and, if intentional, add a brief comment explaining the divergence.
---
Nitpick comments:
In @.github/workflows/tests.yml:
- Around line 373-378: The docker-e2e-cleanup job currently pins pytorch: 2.9.1
for cuda: 128 while docker-e2e-tests uses pytorch: 2.10.0 for the same CUDA 128
config (docker-e2e-cleanup vs docker-e2e-tests); update the docker-e2e-cleanup
matrix to use pytorch: 2.10.0 (or explicitly document/guard retention of 2.9.1)
so the cleanup job targets the same artifact cache versions as the tests,
ensuring cached artifacts removed match the test environment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6f880187-2f7c-410f-a9f1-e69a1656f5bf
📒 Files selected for processing (2)
.github/workflows/multi-gpu-e2e.yml.github/workflows/tests.yml
| python_version: "3.12" | ||
| pytorch: 2.12.0 |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Is PyTorch 2.12.0 released and does it support CUDA 13.0 and Python 3.12?
💡 Result:
Yes, PyTorch 2.12.0 has been released [1][2][3]. Regarding your questions on support: * CUDA 13.0 Support: PyTorch 2.12.0 includes official support for CUDA 13.0 [4][5]. In fact, CUDA 13.0 is the stable default for PyTorch 2.12.0 [5][3]. Additionally, experimental builds for newer versions such as CUDA 13.2 are also available [5][6]. * Python 3.12 Support: PyTorch 2.12.0 supports Python 3.12 [4]. As of the 2.x release series, PyTorch generally maintains compatibility with recent Python versions (including experimental support for newer releases like Python 3.14t) [4][7]. You can verify specific compatibility for your environment by checking the official PyTorch installation documentation or the release notes for version 2.12.0 [1][5].
Citations:
- 1: https://github.com/pytorch/pytorch/releases/tag/v2.12.0
- 2: https://github.com/pytorch/pytorch/releases
- 3: https://newreleases.io/project/pypi/torch/release/2.12.0
- 4: https://github.com/pytorch/pytorch/blob/HEAD/RELEASE.md
- 5: https://pytorch.org/blog/pytorch-2-12-release-blog/
- 6: https://www.reddit.com/r/StableDiffusion/comments/1tcd718/pytorch_2120cu132_cuda_132_sa2sa3_attention/
- 7: https://pytorch.org/blog/pytorch-2-10-release-blog/
Fix PyTorch/CUDA version mismatch in .github/workflows/multi-gpu-e2e.yml (lines 43-44): torch==2.12.0 is built for CUDA 11.8/12.1 wheels (cu118/cu121), not CUDA 13.0.0—update the workflow to a supported CUDA version or bump PyTorch accordingly. Python 3.12 is supported by PyTorch 2.12.0.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/multi-gpu-e2e.yml around lines 43 - 44, The workflow
currently specifies pytorch: 2.12.0 which is built for cu118/cu121 but the
runner uses CUDA 13.0.0; update the workflow so the CUDA/runtime and PyTorch
align by either changing the pytorch value to a build compatible with CUDA 13
(if available) or, more simply, change the CUDA/runtime image/version to a
supported one (e.g., cu118 or cu121) so that the pytorch: 2.12.0 wheel matches;
locate the keys python_version and pytorch in the YAML and update the
corresponding CUDA/runtime setting or the pytorch entry accordingly.
| python_version: "3.12" | ||
| pytorch: 2.11.0 |
There was a problem hiding this comment.
Version mismatch between gate and main e2e tests.
The docker-e2e-tests-1st gate job (line 277) uses PyTorch 2.12.0 for CUDA 130 / Python 3.12, but the docker-e2e-tests main job uses PyTorch 2.11.0 for the same configuration. This inconsistency means the gate test validates against a different PyTorch version than the full test suite runs with, potentially missing compatibility issues specific to PyTorch 2.11.0.
Consider aligning both jobs to use the same PyTorch version for the CUDA 130 configuration, or document if this difference is intentional.
🔧 Suggested fix to align versions
If the intent is to use PyTorch 2.12.0 consistently (matching the PR title "prefer latest pytorch as gated e2e tests"):
- cuda: 130
cuda_version: 13.0.0
python_version: "3.12"
- pytorch: 2.11.0
+ pytorch: 2.12.0
num_gpus: 1
axolotl_extras:Alternatively, if 2.11.0 is preferred for the main tests, update the gate to match:
- cuda: 130
cuda_version: 13.0.0
python_version: "3.12"
- pytorch: 2.12.0
+ pytorch: 2.11.0
num_gpus: 1
axolotl_extras:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| python_version: "3.12" | |
| pytorch: 2.11.0 | |
| python_version: "3.12" | |
| pytorch: 2.12.0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/tests.yml around lines 331 - 332, The CI defines two e2e
jobs with mismatched PyTorch versions: docker-e2e-tests-1st uses pytorch 2.12.0
while docker-e2e-tests uses pytorch 2.11.0; update one of these jobs so both use
the same PyTorch version for the CUDA 130 / Python 3.12 matrix entry (either set
docker-e2e-tests-1st to 2.11.0 or docker-e2e-tests to 2.12.0) so the gate and
main e2e suites run against an identical PyTorch version; ensure you update the
pytorch value in the job matrix entries referencing python_version "3.12" to
keep them consistent and, if intentional, add a brief comment explaining the
divergence.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…allback to 2.11
torch 2.12.0 rewrote the sharded-param construction in
FSDPParam._init_sharded_param from a two-line form
self.sharded_param = nn.Parameter(self.to_sharded_dtensor(sharded_param))
self.sharded_param.requires_grad_(param.requires_grad)
to a single multi-line Parameter() call with requires_grad= as a kwarg
self.sharded_param = nn.Parameter(
self.to_sharded_dtensor(sharded_param),
requires_grad=param.requires_grad,
)
Functionally identical, but the axolotl monkey-patch is source-level
text replacement: the 2.11 anchor no longer matches the 2.12 source, so
the substitution silently falls through to the warning branch and the
method stays unpatched — bnb Params4bit / Int8Params lose their
quantization metadata through the FSDP2 shard cycle.
Try the 2.12 anchor first; fall back to the 2.11 anchor so the patch
keeps working against both torch versions in our test matrix.
init_unsharded_param uses the same kwarg-style call in both 2.11 and
2.12, so its anchor is untouched.
8ee5068 to
e6d0431
Compare
torch 2.12 hoisted the unsharded-param construction out of the first-all-gather `else:` branch up to method-body level, so the 2.11 anchor (8-space, inside else) no longer matched and the patch silently no-op'd. This left bitsandbytes Params4bit unreconstructed under FSDP2, surfacing as `mat1 and mat2 shapes cannot be multiplied (... 1x36864)` in QLoRA training. Add the 2.12 method-body-level anchor with its own replacement indentation, falling back to the 2.11 form.
test_lora_ddp ran only 2 steps with no seed, so train_loss was a random draw (observed 1.95-3.23 across runs) and the 2.8 threshold tripped intermittently — the torch 2.12 bump just happened to surface it. Run 20 steps with seed=42 to make the loss deterministic (2.189-2.191 spread), and tighten the threshold to 2.5.
e6d0431 to
ff991d6
Compare
torch 2.11 renamed Optimizer._cuda_graph_capture_health_check to _accelerator_graph_capture_health_check (2.12 re-added the old name as an alias). ADOPT called the old name, so it raised AttributeError under torch 2.11 — surfaced by bumping the docker-e2e row from 2.9.1 to 2.11.0. Resolve whichever name exists, preferring the new one. Also swap the deprecated torch._utils.is_compiling() for torch.compiler.is_compiling().
Description
Motivation and Context
How has this been tested?
AI Usage Disclaimer
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
Tests
Chores