deprecate torch 2.7.1 by winglian · Pull Request #3339 · axolotl-ai-cloud/axolotl

winglian · 2025-12-30T14:07:29Z

Summary by CodeRabbit

Chores
- Updated CI/CD workflows to use CUDA 12.8.1 and PyTorch 2.8.0–2.9.1, removing support for older CUDA 12.6.3 configurations.
- Updated Docker image tags to reflect newer versions across cloud training integrations.
Documentation
- Updated minimum PyTorch requirement from 2.7.1 to 2.8.0 in Quick Start guide.
- Updated installation examples and Blackwell GPU guidance with latest PyTorch/CUDA versions.

coderabbitai · 2025-12-30T14:07:36Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Multiple GitHub Actions workflow files (.github/workflows/*) are updated to remove legacy CUDA 12.6.3 and PyTorch 2.7.1 matrix entries, standardizing on CUDA 12.8.1 and PyTorch 2.8.0–2.9.1. Documentation and base image references are updated correspondingly with new Docker tag versions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Workflow Matrix Consolidation: `.github/workflows/base.yml`, `main.yml`, `multi-gpu-e2e.yml`, `nightlies.yml`, `tests-nightly.yml`, `tests.yml` | Removed legacy CUDA 126 (12.6.3) and PyTorch 2.7.1 matrix entries; updated PyTorch versions from 2.7.x→2.8.0 and 2.8.0→2.9.1 across include blocks; streamlined matrix configurations to focus on CUDA 128 (12.8.1) with newer PyTorch versions. |
| Documentation Version Updates: `README.md`, `docs/docker.qmd`, `docs/installation.qmd` | Updated PyTorch version requirement from ≥2.7.1 to ≥2.8.0 in README; refreshed Docker image tag examples (removed 2.7.x entries, added 2.8.0 and 2.9.1); updated Blackwell GPU guidance to reference PyTorch 2.9.1 with CUDA 12.8. |
| Base Image Configuration: `src/axolotl/cli/cloud/baseten/template/train_sft.py`, `src/axolotl/cli/cloud/modal_.py` | Updated Docker base image tag from `main-py3.11-cu126-2.7.1` to `main-py3.11-cu128-2.9.1` in both training template and ModalCloud configuration. |
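
The tags in the table appear to follow a `<prefix>-py<python>-cu<cuda>-<pytorch>` pattern. A minimal sketch of that naming scheme, assuming the pattern holds for all images; `image_tag` is a hypothetical helper for illustration and is not taken from the axolotl codebase:

```python
def image_tag(python_version: str, cuda: str, pytorch: str, prefix: str = "main") -> str:
    """Compose a Docker tag such as main-py3.11-cu128-2.9.1.

    Hypothetical helper: the field order is inferred from the tags quoted
    in this PR, not read from the workflow files.
    """
    return f"{prefix}-py{python_version}-cu{cuda}-{pytorch}"


# The old and new default tags named in the Base Image Configuration row.
old_tag = image_tag("3.11", "126", "2.7.1")
new_tag = image_tag("3.11", "128", "2.9.1")
print(f"{old_tag} -> {new_tag}")
```

The same helper with `prefix="main-base"` yields the base image names referenced later in the review.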

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

Possibly related PRs

- Torch 2.9.1 base images #3268: Performs concurrent CI/base image updates raising PyTorch to 2.9.1 and aligning CUDA 12.8.1 across workflow matrices and base image selections.
- add torch 2.9.0 to ci #3223: Modifies CI workflow matrices in .github/workflows to update supported PyTorch/CUDA combinations with same version targeting.
- feat(modal): update docker tag to use torch2.6 from torch2.5 #2749: Updates the ModalCloud.get_image default Docker tag to reflect newer PyTorch/CUDA base image versions.

Suggested labels

ready to merge

Suggested reviewers

SalmanMohammadi
NanoCode012

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'deprecate torch 2.7.1' directly matches the main objective of the PR, which systematically removes PyTorch 2.7.1 from CI/CD workflows and documentation across multiple files. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

docs/docker.qmd (1)

12-12: Update Blackwell GPU guidance to reflect PyTorch 2.9.1.

This callout still references PyTorch 2.7.1, which is being deprecated in this PR. Based on the changes in docs/installation.qmd (lines 29 and 114), Blackwell GPUs should now use PyTorch 2.9.1 with CUDA 12.8.

🔎 Proposed fix

-For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.
+For Blackwell GPUs, please use the tags with PyTorch 2.9.1 and CUDA 12.8.

.github/workflows/main.yml (1)

60-60: Add missing base image configuration for pytorch 2.9.0.

The main.yml workflow requires base images for cuda 128 with pytorch 2.8.0, 2.9.0, and 2.9.1. However, base.yml only builds images for 2.8.0 and 2.9.1. The BASE_TAG reference for pytorch 2.9.0 (line 27 in main.yml) will fail because the corresponding base image main-base-py3.11-cu128-2.9.0 does not exist.

Add a matrix entry to base.yml for cuda 128 with pytorch 2.9.0, or remove the 2.9.0 configuration from main.yml.
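
The mismatch described above can be checked mechanically. A minimal sketch, assuming the matrices contain exactly the entries this comment describes; the lists are copied by hand rather than parsed from the YAML files, and `base_tag` and `missing_base_images` are hypothetical helpers:

```python
# Matrix entries as described in the review comment (not parsed from YAML).
base_matrix = [
    {"cuda": "128", "python_version": "3.11", "pytorch": "2.8.0"},
    {"cuda": "128", "python_version": "3.11", "pytorch": "2.9.1"},
]
main_matrix = [
    {"cuda": "128", "python_version": "3.11", "pytorch": "2.8.0"},
    {"cuda": "128", "python_version": "3.11", "pytorch": "2.9.0"},
    {"cuda": "128", "python_version": "3.11", "pytorch": "2.9.1"},
]


def base_tag(entry: dict) -> str:
    """Name of the base image a matrix entry depends on (pattern assumed)."""
    return f"main-base-py{entry['python_version']}-cu{entry['cuda']}-{entry['pytorch']}"


def missing_base_images(main: list, base: list) -> list:
    """Return base tags required by `main` that `base` never builds."""
    built = {base_tag(entry) for entry in base}
    return [base_tag(entry) for entry in main if base_tag(entry) not in built]


missing = missing_base_images(main_matrix, base_matrix)
print(missing)  # ['main-base-py3.11-cu128-2.9.0']
```

An empty result would mean every image built by main.yml has a base image to start from.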

🧹 Nitpick comments (1)

.github/workflows/tests.yml (1)

306-318: Consider adding PyTorch 2.9.0 to the e2e test matrix.

The e2e tests only cover PyTorch 2.8.0 and 2.9.1, omitting 2.9.0 that is tested in unit tests and built in the main workflow. While this may be intentional to reduce CI costs, it creates a gap where 2.9.0 Docker images are built and unit-tested but not e2e-tested.

If comprehensive coverage is desired, add a 2.9.0 entry:
🔎 Suggested addition for complete coverage
           - cuda: 128
             cuda_version: 12.8.1
             python_version: "3.11"
             pytorch: 2.8.0
             num_gpus: 1
             gpu_type: "B200"
             axolotl_extras: fbgemm-gpu
+          - cuda: 128
+            cuda_version: 12.8.1
+            python_version: "3.11"
+            pytorch: 2.9.0
+            num_gpus: 1
+            axolotl_extras:
           - cuda: 128
             cuda_version: 12.8.1
             python_version: "3.11"
             pytorch: 2.9.1
             num_gpus: 1
             axolotl_extras:

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e73dab6 and 97c93e4.

📒 Files selected for processing (11)

.github/workflows/base.yml
.github/workflows/main.yml
.github/workflows/multi-gpu-e2e.yml
.github/workflows/nightlies.yml
.github/workflows/tests-nightly.yml
.github/workflows/tests.yml
README.md
docs/docker.qmd
docs/installation.qmd
src/axolotl/cli/cloud/baseten/template/train_sft.py
src/axolotl/cli/cloud/modal_.py

💤 Files with no reviewable changes (1)

.github/workflows/base.yml

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-07-02T02:56:20.788Z

Learnt from: NanoCode012
Repo: axolotl-ai-cloud/axolotl PR: 2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

Applied to files:

docs/installation.qmd

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: PyTest (3.11, 2.9.1)
GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
GitHub Check: PyTest from Source Dist (3.11, 2.9.1)
GitHub Check: preview
GitHub Check: PyTest (3.11, 2.9.0)
GitHub Check: PyTest from Source Dist (3.11, 2.9.0)
GitHub Check: PyTest (3.11, 2.8.0)

🔇 Additional comments (15)

README.md (1)

80-80: LGTM! PyTorch minimum version updated correctly.

The requirement update to PyTorch ≥2.8.0 aligns with the PR's objective to deprecate PyTorch 2.7.1 and standardize on newer versions.
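
To make the new floor concrete, here is a minimal sketch of a ≥2.8.0 check using plain tuple comparison. `MIN_TORCH`, `parse_version`, and `meets_minimum` are hypothetical names for illustration, and the parser only handles plain release strings, optionally with a local suffix such as `+cu128`:

```python
MIN_TORCH = (2, 8, 0)  # minimum PyTorch version after this PR


def parse_version(version: str) -> tuple:
    """Turn '2.9.1+cu128' into (2, 9, 1).

    Only plain releases are supported; pre-release strings such as
    '2.9.0a0' would raise ValueError in this sketch.
    """
    release = version.split("+", 1)[0]  # drop the local build suffix
    return tuple(int(part) for part in release.split(".")[:3])


def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """True when `version` is at least `minimum`."""
    return parse_version(version) >= minimum


for candidate in ("2.7.1", "2.8.0", "2.9.1+cu128"):
    print(candidate, meets_minimum(candidate))
```

In a real environment the string would typically come from `torch.__version__`.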

.github/workflows/multi-gpu-e2e.yml (1)

32-45: LGTM! Multi-GPU test matrix updated to latest PyTorch versions.

The matrix now tests PyTorch 2.8.0 and 2.9.1 with CUDA 12.8.1, removing the deprecated 2.7.1 configuration. This provides good coverage of supported PyTorch versions.

docs/docker.qmd (2)

35-36: LGTM! Base image tag examples updated correctly.

The tag examples now reflect PyTorch 2.8.0 and 2.9.1 with CUDA 12.8, consistent with the deprecation of PyTorch 2.7.1.

74-79: LGTM! Main image tag examples updated correctly.

The tag examples have been updated to PyTorch 2.8.0 and 2.9.1. The version example update from 0.10.1 to 0.12.0 appears to be bringing the documentation in line with a more recent release.

docs/installation.qmd (2)

29-29: LGTM! Blackwell GPU guidance updated correctly.

The PyTorch version requirement for Blackwell GPUs has been updated to 2.9.1 with CUDA 12.8, consistent with the deprecation of PyTorch 2.7.1.

114-114: LGTM! Blackwell Docker image recommendation updated correctly.

The Docker image tags for Blackwell GPUs now reference PyTorch 2.9.1 with CUDA 12.8, maintaining consistency with the installation guidance on line 29.

.github/workflows/nightlies.yml (2)

15-24: LGTM! Nightly build matrix updated to current PyTorch versions.

The build-axolotl matrix has been updated to build images for PyTorch 2.8.0 and 2.9.1 with CUDA 12.8.1, removing the deprecated PyTorch 2.7.1 configuration.

67-76: LGTM! Cloud image matrix updated consistently.

The build-axolotl-cloud matrix mirrors the build-axolotl updates, ensuring consistent PyTorch 2.8.0 and 2.9.1 images are built for both standard and cloud variants.

src/axolotl/cli/cloud/modal_.py (1)

85-85: LGTM! Modal cloud default image updated to latest version.

The default Docker tag has been updated to use PyTorch 2.9.1 with CUDA 12.8, aligning with the deprecation of PyTorch 2.7.1. Users can still override this via config.docker_tag if needed.
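
The default-with-override behaviour mentioned here can be sketched as follows. This is an illustration only, not the actual `ModalCloud.get_image` code: `CloudConfig` and `resolve_docker_tag` are hypothetical stand-ins, and only the `docker_tag` field name and the default tag come from this PR:

```python
from dataclasses import dataclass
from typing import Optional

# Default tag after this PR, as quoted in the walkthrough.
DEFAULT_DOCKER_TAG = "main-py3.11-cu128-2.9.1"


@dataclass
class CloudConfig:
    """Hypothetical stand-in for the cloud config object."""

    docker_tag: Optional[str] = None


def resolve_docker_tag(config: CloudConfig) -> str:
    """Prefer the user's tag; fall back to the default when unset or empty."""
    return config.docker_tag or DEFAULT_DOCKER_TAG


print(resolve_docker_tag(CloudConfig()))
print(resolve_docker_tag(CloudConfig(docker_tag="main-py3.11-cu128-2.8.0")))
```

A user who still needs a different image can pin it through the config rather than editing the default.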

src/axolotl/cli/cloud/baseten/template/train_sft.py (1)

27-27: LGTM! Baseten training template updated to latest image.

The BASE_IMAGE has been updated to use PyTorch 2.9.1 with CUDA 12.8, consistent with the Modal cloud changes and the deprecation of PyTorch 2.7.1.

.github/workflows/tests-nightly.yml (3)

29-29: LGTM! PyTest matrix expanded to cover current PyTorch versions.

The matrix now tests PyTorch 2.8.0, 2.9.0, and 2.9.1, removing the deprecated 2.7.1 and providing comprehensive coverage of the 2.8.x and 2.9.x series.

102-115: LGTM! Docker e2e test matrix updated consistently.

The matrix has been migrated to CUDA 12.8.1 with PyTorch 2.8.0 and 2.9.1, removing the deprecated CUDA 12.6.3/PyTorch 2.7.1 configuration.

151-157: LGTM! Multi-GPU e2e tests updated to latest version.

The multi-GPU test configuration now uses CUDA 12.8.1 with PyTorch 2.9.1, ensuring multi-GPU scenarios are tested with the latest supported version.

.github/workflows/main.yml (1)

137-148: Verify the intentional omission of PyTorch 2.9.0 in this job.

The build-axolotl-cloud-no-tmux job matrix only includes PyTorch 2.8.0 and 2.9.1, while build-axolotl and build-axolotl-cloud jobs include 2.8.0, 2.9.0, and 2.9.1. This creates an asymmetry where the base image for PyTorch 2.9.0 will be built, but the cloud-no-tmux variant will not be available for that version.

If this is intentional for cost/time savings, consider documenting it. Otherwise, add the 2.9.0 entry for consistency:
🔎 Suggested addition for consistency
           - cuda: 128
             cuda_version: 12.8.1
             python_version: "3.11"
             pytorch: 2.8.0
             axolotl_extras:
             is_latest:
+          - cuda: 128
+            cuda_version: 12.8.1
+            python_version: "3.11"
+            pytorch: 2.9.0
+            axolotl_extras:
+            is_latest:
           - cuda: 128
             cuda_version: 12.8.1
             python_version: "3.11"
             pytorch: 2.9.1
             axolotl_extras:
             is_latest:

.github/workflows/tests.yml (1)

58-58: LGTM!

The pytest matrix correctly includes all three new PyTorch versions (2.8.0, 2.9.0, 2.9.1), ensuring comprehensive test coverage across the supported versions.

github-actions · 2025-12-31T12:03:02Z

📖 Documentation Preview: https://6955e242c69db45c99a375e4--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 8f714f1

codecov · 2025-12-31T12:05:41Z

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/cli/cloud/modal_.py	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

deprecate torch 2.7.1

97c93e4

winglian force-pushed the deprecate-torch27x branch from 78ffef5 to 97c93e4 Compare December 31, 2025 03:13

winglian marked this pull request as ready for review December 31, 2025 11:55

coderabbitai Bot reviewed Dec 31, 2025

Comment thread .github/workflows/tests.yml

parity for 2.9.0/2.9.1 support and cuda 128/130

8f714f1

winglian merged commit afe18ac into main Jan 1, 2026
25 of 27 checks passed

winglian deleted the deprecate-torch27x branch January 1, 2026 11:52

coderabbitai Bot mentioned this pull request Mar 25, 2026

Revert "feat: move to uv first" #3544

Merged

winglian merged 2 commits into main from deprecate-torch27x
