Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughAdds CUDA 12.8.1 / PyTorch 2.8.0 / Python 3.11 matrix variants to CI build and test workflows; relaxes the Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (5)
.github/workflows/main.yml (3)
118-122: Cloud image: add 2.8.0 entry — consistent with other jobs. Consider latest-tag decision.Parity with the build job looks good. As above, decide whether 2.8.0 should carry
is_latest: truefor the cloud image to avoid ambiguity around which CUDA 128 tag is considered latest.Optional tweak:
pytorch: 2.8.0 axolotl_extras: + is_latest: true
182-187: Avoid bareis_latest:with no value (can be ambiguous in GH expressions).
is_latest:without a value renders as empty string/null. While it will evaluate falsy in the${{ ... }}expression, being explicit improves readability and prevents surprises if the expression ever changes.Use
falseor omit the key:pytorch: 2.8.0 axolotl_extras: - is_latest: + is_latest: false
39-43: Add 2.8.0 CUDA 12.8.1 variant – consider marking as latest
The new matrix entry is correct and will build the image for Python 3.11 + CUDA 128 + PyTorch 2.8.0.
To have this build published under the
-latesttag, addis_latest: trueto the variant:pytorch: 2.8.0 axolotl_extras: + is_latest: trueBefore merging, ensure that the corresponding base image tag has been published. GitHub Actions will override
ARG CUDAandARG PYTORCH_VERSIONvia:build-args: BASE_TAG=${{ github.ref_type == 'tag' && 'main' || github.ref_name }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }} CUDA=${{ matrix.cuda }} PYTORCH_VERSION=${{ matrix.pytorch }}Verify the base image exists, for example:
# Replace <branch> with your branch or tag name (e.g. main) curl -f -I https://hub.docker.com/v2/repositories/axolotlai/axolotl-base/tags/<branch>-base-py3.11-cu128-2.8.0 # or via Docker CLI docker pull axolotlai/axolotl-base:<branch>-base-py3.11-cu128-2.8.0This confirms that
axolotlai/axolotl-base:…-cu128-2.8.0is available before the build runs..github/workflows/tests.yml (2)
133-133: SDist matrix: mirrored 2.8.0 — consistent with wheels path.Looks good and mirrors the wheel-based job. Same note about torchvision compatibility applies.
If you observe resolver downgrades in this job, keep the single-transaction install but add a constraint file with a torch↔torchvision pin for 2.8.0 only. I can draft one if needed.
301-306: E2E matrix: Adds CUDA 128 + Torch 2.8.0 — consider promoting a 2.8.0 case into the “gate” job.Great to have full E2E for 2.8.0. To fail fast on core breakages for the new stack, consider adding one 2.8.0 variant to
docker-e2e-tests-1st(your gate job) alongside existing entries. This can save GPU minutes when 2.8.0-specific regressions occur.Proposed YAML to add under
docker-e2e-tests-1st.strategy.matrix.include(outside the changed hunk):- cuda: 128 cuda_version: 12.8.1 python_version: "3.11" pytorch: 2.8.0 num_gpus: 1 axolotl_extras:Please also confirm Modal base images for CUDA 12.8.1 are available on your runners and that your Dockerfiles reference a cu128-compatible base.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
.github/workflows/main.yml(3 hunks).github/workflows/tests.yml(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.8.0)
- GitHub Check: PyTest (3.11, 2.6.0)
🔇 Additional comments (1)
.github/workflows/tests.yml (1)
58-58: Verify torchvision compatibility for Torch 2.8.0
- File: .github/workflows/tests.yml (line 58)
Ensure thepip install torch==${{ matrix.pytorch_version }} torchvisionstep picks a matching torchvision for PyTorch 2.8.0 on the ubuntu-latest runner.To confirm the resolver behavior and exact torchvision version selected, run one of the following locally:
# List available torchvision releases and ensure there’s a 2.8.0-compatible build pip index versions torchvision# Install torch 2.8.0 alongside torchvision and check which torchvision was chosen pip install torch==2.8.0 torchvision && pip freeze | grep torchvisionVerify that the resolved torchvision version aligns with PyTorch’s official compatibility matrix for 2.8.0.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
requirements.txt(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.8.0)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.6.0, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.1, vllm, 2, true)
| # triton 3.4.0 is not compatible with CCE | ||
| triton>=3.0.0,<3.4.0 | ||
| triton>=3.0.0 |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Removing Triton upper bound contradicts the CCE warning and may break CCE users.
Line 5 states “triton 3.4.0 is not compatible with CCE”, but Line 6 removes the <3.4.0 cap. This allows pip to resolve to 3.4.x+ globally, which could reintroduce the CCE incompatibility. If PyTorch 2.8.0 truly requires Triton ≥3.4, we should scope that to the PyTorch 2.8/CUDA 12.8 CI variants via constraints rather than relaxing it for all users.
Two safe options; pick one:
- Option A (conservative): keep the cap by default and override in CI for PT 2.8 only via constraints.
- Option B (if CCE is now known-good with ≥3.4): keep the relaxed spec but fix the comment and add a mitigation note.
Proposed diffs:
Option A – restore cap and restrict mac installs with markers (safer default):
-# triton 3.4.0 is not compatible with CCE
-triton>=3.0.0
+# triton 3.4.0 is not compatible with CCE; keep <3.4.0 by default and override via constraints for PT 2.8 builds
+triton>=3.0.0,<3.4.0; sys_platform != "darwin" and platform_machine == "x86_64"
+# For PT 2.8 / CUDA 12.8 jobs, allow newer Triton via a constraints file in CI.Option B – keep relaxed bound but correct the comment and document the risk:
-# triton 3.4.0 is not compatible with CCE
-triton>=3.0.0
+# NOTE: Triton ≥3.4 previously had CCE issues. If you are on CCE, pin triton<3.4.0 via a constraints file until verified.
+triton>=3.0.0Follow-ups:
- If you choose Option A, add a constraints file (e.g., constraints-pt28.txt) used only in the PT 2.8 / CUDA 12.8 CI jobs that loosens Triton to the vetted range.
- If you choose Option B, please link evidence that CCE incompat is resolved for ≥3.4 so we can safely remove the warning.
Run this repo-scoped check to see where Triton is referenced and whether CI already uses constraints/overrides:
🏁 Script executed:
#!/bin/bash
# Find Triton refs and any constraints usage
rg -n -C2 -i '\btriton\b|constraints.*\.txt|pip install .* -c ' .github scripts requirements* setup* docker* || true
# Surface PT 2.8 entries in workflows to target constraints there
rg -n -C2 -i 'pytorch[^0-9]*2\.8\.0|cuda[^0-9]*12\.8|128\b' .github/workflows || true
# Search for any CCE references to validate the warning context
rg -n -C2 -i '\bCCE\b' || trueLength of output: 8488
Require Triton <3.4.0 by default and scope any bump to CI-only
No existing constraints file (e.g. constraints-pt28.txt) or pip install -c … override is configured in the PyTorch 2.8 / CUDA 12.8 workflows, so the removal of the <3.4.0 cap contradicts the CCE warning and risks breaking CCE users. Please apply Option A:
• In requirements.txt, restore the Triton upper bound and include an OS marker:
- # triton 3.4.0 is not compatible with CCE
- triton>=3.0.0
+ # triton 3.4.0 is not compatible with CCE; keep <3.4.0 by default and override via constraints for PT 2.8 CI
+ triton>=3.0.0,<3.4.0; sys_platform != "darwin" and platform_machine == "x86_64"• Add a new constraints-pt28.txt at the repo root to allow Triton ≥3.4.0 in CI:
triton>=3.0.0
• In each PyTorch 2.8 / CUDA 12.8 job in .github/workflows/*.yml, install with that constraints file:
run: pip install -r requirements.txt -c constraints-pt28.txtThis preserves CCE compatibility for end users while permitting Triton ≥3.4.0 in controlled CI builds.
🤖 Prompt for AI Agents
In requirements.txt around lines 5-6, the Triton upper bound was removed which
breaks CCE users; restore the Triton constraint to require triton>=3.0.0,<3.4.0
and add the appropriate OS marker (e.g. ; platform_system != "Windows" or the
project-specific marker) so end users keep CCE-compatible versions. Add a new
file constraints-pt28.txt at the repo root containing a relaxed constraint
(triton>=3.0.0) to allow CI to test newer Triton, and update each PyTorch 2.8 /
CUDA 12.8 job in .github/workflows/*.yml to install with pip install -r
requirements.txt -c constraints-pt28.txt so CI can opt into Triton ≥3.4.0 while
keeping the default user install capped.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (6)
setup.py (6)
67-69: Torch>=2.8 branch is a no-op; explicitly handle xformers/vLLM to avoid stale/incompatible pinsLeaving this branch as
passpreserves whateverxformersandvllmpins come from requirements/extras. That can easily drift out of compatibility with Torch 2.8 and CUDA 12.8. At minimum, mirror the 2.7 logic for vLLM (relax pin) and proactively avoid stalexformerspins for 2.8.Apply this diff to make the 2.8 branch explicit and safer:
- if (major, minor) >= (2, 8): - pass + if (major, minor) >= (2, 8): + # Torch 2.8+: avoid stale xformers pins; relax vLLM + try: + _install_requires.pop(_install_requires.index(xformers_version)) + except ValueError: + pass # xformers not pinned in requirements.txt + # Keep vLLM opt-in but avoid over-restricting to a specific patch + extras_require_map["vllm"] = ["vllm>=0.10.0"]Would you like me to open a follow-up PR to run the new 2.8 matrix against: (a) install from sdist and (b) extras "vllm" + "flash-attn" to ensure solver stability and import sanity?
45-47: Remove debugPrinting the entire
install_requiresduring setup is noisy and leaks environment details in user installs/CI logs.Apply:
- print( - _install_requires, [req in skip_packages for req in _install_requires] - )
54-56: Hard default to Torch 2.6.0 can surprise in fresh envs; allow an overrideWhen Torch isn’t preinstalled, you force
torch==2.6.0. On a PR explicitly adding 2.8 support, this default may be counterintuitive. Make it overrideable via env var so CI can set a newer default without code changes.Apply:
- torch_version = "2.6.0" # default to torch 2.6 + # default Torch fallback (override in CI with AXOLOTL_TORCH_FALLBACK) + torch_version = os.getenv("AXOLOTL_TORCH_FALLBACK", "2.6.0")Ensure your CI matrices set
AXOLOTL_TORCH_FALLBACK=2.8.0in the 2.8 jobs if you rely on this path.
28-39: Guard against missing packages when parsing requirements
[...][0]will raiseIndexErrorifxformers/autoawqare removed from requirements in the future. Given the 2.8 branch may dropxformers, make the detection resilient.Apply:
- xformers_version = [req for req in _install_requires if "xformers" in req][0] - autoawq_version = [req for req in _install_requires if "autoawq" in req][0] + def find_req(reqs, name): + for req in reqs: + if re.split(r"[>=<]", req)[0].strip() == name: + return req + return None + xformers_version = find_req(_install_requires, "xformers") + autoawq_version = find_req(_install_requires, "autoawq")Then guard subsequent pops:
- _install_requires.pop(_install_requires.index(xformers_version)) + if xformers_version: + _install_requires.pop(_install_requires.index(xformers_version)) ... - _install_requires.pop(_install_requires.index(autoawq_version)) + if autoawq_version: + _install_requires.pop(_install_requires.index(autoawq_version))
57-66: Preferpackaging.versionfor version parsing/compareManual regex + tuple comparison works but is brittle with local version segments (e.g.,
2.8.0+cu128).packaging.version.Versionhandles this robustly and self-documents the intent.Example:
- version_match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", torch_version) - if version_match: - major, minor, patch = version_match.groups() - major, minor = int(major), int(minor) - patch = ( - int(patch) if patch is not None else 0 - ) # Default patch to 0 if not present - else: - raise ValueError("Invalid version format") + from packaging.version import Version + v = Version(torch_version) + major, minor, patch = v.major, v.minor, v.micro
122-129: Confirm wheel availability for flash-attn 2.8.3 and DeepSpeed 0.17.2 under PyTorch 2.8/CUDA 12.8Our inspection of PyPI metadata shows that neither flash-attn 2.8.3 nor DeepSpeed 0.17.2 publish prebuilt wheels for Python 3.11 or CUDA 12.x. In a PT 2.8/CUDA 12.8 environment,
pip installwill fall back to source builds (or fail), leading to longer installs or potential errors.• File: setup.py, lines 122–129 (flash-attn extras)
• File: setup.py, lines 130–136 (DeepSpeed extras)Recommendations:
- Gate these extras by environment markers to skip them on unsupported platforms (e.g.
; platform_system=="Linux" and cuda_version<12.8)- Widen version pins to pull in future releases that include CUDA 12.8 wheels (e.g.
"flash-attn>=2.8.3,<2.9") once available- Document that unsupported envs require manual source compilation or a matching CUDA toolkit
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
setup.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.1, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.6.0, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.8.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
🔇 Additional comments (2)
setup.py (2)
159-161: Relax vLLM version pin to >=0.10.0The default extras currently lock vLLM to an exact version (
0.10.0), which can block users upgrading to PyTorch 2.8.0 with CUDA 12.8. Since the 2.7.x branch already loosened this pin, let’s align the default extras here as well.• File:
setup.py
• Lines: 159–161
• Apply:- "vllm": [ - "vllm==0.10.0", - ], + "vllm": [ + "vllm>=0.10.0", + ],Please verify against the official vLLM documentation or release notes that all vLLM versions ≥0.10.0 support PyTorch 2.8.0 and CUDA 12.8 before merging.
152-155: Ignore typo suggestion:torch-optimiis a valid package
The packagetorch-optimi(PyPI project “optimi”) exists with version 0.2.1 and provides the intended low-precision PyTorch optimizers (pypi.org). No change tosetup.pyis necessary.
| fail-fast: false | ||
| matrix: | ||
| python_version: ["3.11"] | ||
| pytorch_version: ["2.6.0", "2.7.0", "2.7.1"] |
There was a problem hiding this comment.
any reason not to drop 2.7.0 in favor of 2.7.1?
There was a problem hiding this comment.
I think we had originally kept 2.7.0 due to vllm. let me double check and you're most likely spot on that we can drop 2.7.0
| bitsandbytes==0.47.0 | ||
| # triton 3.4.0 is not compatible with CCE | ||
| triton>=3.0.0,<3.4.0 | ||
| triton>=3.0.0 |
There was a problem hiding this comment.
Doesn't torch 2.8.0 require triton 3.4.0, and this ends up with an incompatibility? I think we need to fix our CCE, right?
There was a problem hiding this comment.
I think this gets addressed with axolotl-ai-cloud/ml-cross-entropy#19
|
📖 Documentation Preview: https://68af42022daf05688d3f48f5--resonant-treacle-0fd729.netlify.app Deployed on Netlify from commit 22c3eb4 |
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Description
Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit