fix: pin torchvision per matrix entry to prevent ABI drift#3653
fix: pin torchvision per matrix entry to prevent ABI drift#3653ved1beta wants to merge 13 commits into
Conversation
📝 WalkthroughWalkthroughThis PR introduces explicit torchvision version pinning across CI workflows and Docker build matrices, plus full Q-GaLore optimizer support including schema definitions, validation, parameter group building, trainer integration, and comprehensive tests. ChangesTorchVision Version Pinning Infrastructure
Q-GaLore Optimizer Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/axolotl/utils/optimizers/qgalore.py (2)
30-30: 💤 Low valueConsider using tuple unpacking for clarity.
Static analysis suggests replacing tuple concatenation with unpacking syntax for improved readability:
♻️ Suggested refactor
- lambda *a, **kw: bw( - *(a[:7] + (0.0, 0.0) + a[7:] if len(a) == 15 else a), **kw - ) + lambda *a, **kw: bw( + *(*a[:7], 0.0, 0.0, *a[7:]) if len(a) == 15 else a, **kw + )Apply similar change to line 35.
Also applies to: 35-35
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/axolotl/utils/optimizers/qgalore.py` at line 30, The tuple is being built via concatenation (a[:7] + (0.0, 0.0) + a[7:]) which is less readable; replace that with explicit tuple unpacking so the call uses ( *a[:7], 0.0, 0.0, *a[7:] ) instead (do the same refactor at the similar occurrence on line 35) — update the expression inside the starred argument in qgalore.py where this concatenation appears to use unpacking for clarity.
64-64: 💤 Low valueVerify substring matching behavior for target modules.
The filter
any(t in name for t in target_modules)performs substring matching, sotarget_modules = ["attn", "mlp"]will match parameter names like"model.layers.0.self_attn.q_proj.weight"and"model.layers.0.mlp.gate_proj.weight". This is likely intentional for flexibility, but it could also match unintended parameters (e.g., if a parameter name happens to contain "mlp" as part of a longer word).Consider documenting this substring matching behavior in the docstring or adding an example showing what patterns are expected to match.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/axolotl/utils/optimizers/qgalore.py` at line 64, The current filter uses substring matching via any(t in name for t in target_modules) (seen in the p.dim() == 2 and any(...) check), which can unintentionally match parameters containing those substrings; either document this substring-matching behavior in the function/class docstring that builds parameter groups (mention target_modules, name, and p.dim() condition) or tighten the check to boundary-aware matching (e.g., match against module name tokens via name.split('.') or use regex with word boundaries) so only intended modules like "attn" or "mlp" match their module segments rather than any substring.cicd/Dockerfile-uv.jinja (1)
28-28: 💤 Low valueClarify the rationale for uninstalling causal_conv1d.
Line 28 adds
uv pip uninstall causal_conv1dbefore the main installation. This appears to prevent conflicts during the subsequent editable install, but the reason isn't documented. Consider adding a comment explaining why this pre-uninstall is necessary (e.g., version conflicts, rebuild requirements, or ABI compatibility).📝 Suggested comment
+# Uninstall causal_conv1d to avoid version conflicts during editable install RUN uv pip uninstall causal_conv1d🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cicd/Dockerfile-uv.jinja` at line 28, Add a brief explanatory comment above the RUN uv pip uninstall causal_conv1d line in Dockerfile-uv.jinja that states why causal_conv1d is being uninstalled (e.g., to avoid version/ABI conflicts with the subsequent editable install, force a rebuild, or remove preinstalled incompatible wheel), referencing the RUN uv pip uninstall causal_conv1d command so reviewers understand the rationale and can remove it safely if requirements change.docker/Dockerfile-uv-base (1)
11-12: 💤 Low valueConsider updating default versions to match workflow values.
The default
PYTORCH_VERSION="2.6.0"andTORCHVISION_VERSION="0.21.0"don't match the most common versions used in the workflow matrices (e.g., PyTorch 2.9.1/2.10.0 with torchvision 0.24.1/0.25.0). While these defaults are overridden by build-args in CI, aligning them with the primary workflow versions would make local builds and debugging more consistent.📝 Suggested default version alignment
-ARG PYTORCH_VERSION="2.6.0" -ARG TORCHVISION_VERSION="0.21.0" +ARG PYTORCH_VERSION="2.9.1" +ARG TORCHVISION_VERSION="0.24.1"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docker/Dockerfile-uv-base` around lines 11 - 12, Update the default ARG values in the Dockerfile-uv-base so local builds match the workflow matrix: change ARG PYTORCH_VERSION and ARG TORCHVISION_VERSION from "2.6.0"/"0.21.0" to the primary workflow versions (e.g., "2.10.0" for PYTORCH_VERSION and "0.25.0" for TORCHVISION_VERSION) so local debugging uses the same defaults as CI; ensure you only update the values for the ARGs named PYTORCH_VERSION and TORCHVISION_VERSION.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/base.yml:
- Around line 41-42: Update the inline comment above the "platforms:
\"linux/amd64\"" setting to state that the cu128/cu130 wheels for torchvision
0.24.1 are unavailable for any platform rather than only aarch64; replace the
current comment `# arm64 disabled: torchvision 0.24.1+cu128 has no aarch64
wheel` with something like `# torchvision 0.24.1 does not have cu128/cu130
wheels available` so the reason for restricting platforms is accurate and
unambiguous.
In `@cicd/single_gpu.py`:
- Line 26: Update the CI default PyTorch/TorchVision versions to match
pyproject.toml constraints: change the default environment values used in cicd
single/multi-gpu configs so that TORCH_VERSION is "2.9.1" and
TORCHVISION_VERSION is "0.24.1"; locate the variables named TORCH_VERSION and
TORCHVISION_VERSION (e.g., the assignment line with "TORCHVISION_VERSION":
os.environ.get("TORCHVISION_VERSION", "0.21.0") in single_gpu.py and the
analogous line in multigpu.py) and update their fallback strings to "2.9.1" and
"0.24.1" (also ensure Dockerfile-uv-base uses the same versions).
In `@src/axolotl/utils/schemas/validation.py`:
- Around line 934-945: The validator for q_galore_adamw8bit currently only
rejects fsdp_config.use_orig_params when it is explicitly False; update the
check in the block that calls cls._resolve_fsdp_version and reads fsdp_config so
that it requires use_orig_params to be explicitly True (i.e., treat None/absent
the same as False) by replacing the current boolean check with a strict is not
True condition and raise the same ValueError referencing
fsdp_config.use_orig_params and q_galore_adamw8bit.
---
Nitpick comments:
In `@cicd/Dockerfile-uv.jinja`:
- Line 28: Add a brief explanatory comment above the RUN uv pip uninstall
causal_conv1d line in Dockerfile-uv.jinja that states why causal_conv1d is being
uninstalled (e.g., to avoid version/ABI conflicts with the subsequent editable
install, force a rebuild, or remove preinstalled incompatible wheel),
referencing the RUN uv pip uninstall causal_conv1d command so reviewers
understand the rationale and can remove it safely if requirements change.
In `@docker/Dockerfile-uv-base`:
- Around line 11-12: Update the default ARG values in the Dockerfile-uv-base so
local builds match the workflow matrix: change ARG PYTORCH_VERSION and ARG
TORCHVISION_VERSION from "2.6.0"/"0.21.0" to the primary workflow versions
(e.g., "2.10.0" for PYTORCH_VERSION and "0.25.0" for TORCHVISION_VERSION) so
local debugging uses the same defaults as CI; ensure you only update the values
for the ARGs named PYTORCH_VERSION and TORCHVISION_VERSION.
In `@src/axolotl/utils/optimizers/qgalore.py`:
- Line 30: The tuple is being built via concatenation (a[:7] + (0.0, 0.0) +
a[7:]) which is less readable; replace that with explicit tuple unpacking so the
call uses ( *a[:7], 0.0, 0.0, *a[7:] ) instead (do the same refactor at the
similar occurrence on line 35) — update the expression inside the starred
argument in qgalore.py where this concatenation appears to use unpacking for
clarity.
- Line 64: The current filter uses substring matching via any(t in name for t in
target_modules) (seen in the p.dim() == 2 and any(...) check), which can
unintentionally match parameters containing those substrings; either document
this substring-matching behavior in the function/class docstring that builds
parameter groups (mention target_modules, name, and p.dim() condition) or
tighten the check to boundary-aware matching (e.g., match against module name
tokens via name.split('.') or use regex with word boundaries) so only intended
modules like "attn" or "mlp" match their module segments rather than any
substring.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f208ff9-2639-42d3-8d71-1fa5b33f315d
📒 Files selected for processing (19)
.github/workflows/base.yml.github/workflows/multi-gpu-e2e.yml.github/workflows/tests-nightly.yml.github/workflows/tests.ymlcicd/Dockerfile-uv.jinjacicd/multigpu.pycicd/single_gpu.pydocker/Dockerfile-basedocker/Dockerfile-uvdocker/Dockerfile-uv-basedocs/optimizers.qmdpyproject.tomlsrc/axolotl/core/builders/base.pysrc/axolotl/utils/optimizers/qgalore.pysrc/axolotl/utils/schemas/enums.pysrc/axolotl/utils/schemas/training.pysrc/axolotl/utils/schemas/validation.pytests/e2e/test_optimizers.pytests/utils/schemas/validation/test_qgalore.py
| # arm64 disabled: torchvision 0.24.1+cu128 has no aarch64 wheel | ||
| platforms: "linux/amd64" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check PyPI for torchvision 0.24.1 wheel availability by platform and CUDA version
echo "=== Checking torchvision 0.24.1 wheels ==="
curl -s https://pypi.org/pypi/torchvision/0.24.1/json | \
jq -r '.urls[] | select(.packagetype == "bdist_wheel") | .filename' | \
grep -E "(cu128|cu130)" | sort
echo ""
echo "=== Filtering for aarch64 wheels ==="
curl -s https://pypi.org/pypi/torchvision/0.24.1/json | \
jq -r '.urls[] | select(.packagetype == "bdist_wheel") | .filename' | \
grep -E "(cu128|cu130)" | grep aarch64 || echo "No aarch64 wheels found for cu128/cu130"Repository: axolotl-ai-cloud/axolotl
Length of output: 189
Clarify that cu128/cu130 wheels are unavailable entirely for torchvision 0.24.1, not just aarch64.
The verification confirms no cu128/cu130 wheels exist for torchvision 0.24.1 on PyPI. The comment is technically accurate (aarch64 wheels are indeed absent), but the wording suggests aarch64-specific unavailability when these CUDA versions aren't available for any platform. Update the comment to clarify: # torchvision 0.24.1 does not have cu128/cu130 wheels available rather than implying aarch64 is uniquely restricted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/base.yml around lines 41 - 42, Update the inline comment
above the "platforms: \"linux/amd64\"" setting to state that the cu128/cu130
wheels for torchvision 0.24.1 are unavailable for any platform rather than only
aarch64; replace the current comment `# arm64 disabled: torchvision 0.24.1+cu128
has no aarch64 wheel` with something like `# torchvision 0.24.1 does not have
cu128/cu130 wheels available` so the reason for restricting platforms is
accurate and unambiguous.
| "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""), | ||
| "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""), | ||
| "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.6.0"), | ||
| "TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.21.0"), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import re
from pathlib import Path
pp = Path("pyproject.toml").read_text()
torch_min = re.search(r'"torch>=([^"]+)"', pp).group(1)
tv_min = re.search(r'"torchvision>=([^"]+)"', pp).group(1)
print(f"pyproject minimums: torch>={torch_min}, torchvision>={tv_min}")
for path in ("cicd/single_gpu.py", "cicd/multigpu.py"):
txt = Path(path).read_text()
t = re.search(r'"PYTORCH_VERSION":\s*os\.environ\.get\("PYTORCH_VERSION",\s*"([^"]+)"\)', txt)
v = re.search(r'"TORCHVISION_VERSION":\s*os\.environ\.get\("TORCHVISION_VERSION",\s*"([^"]+)"\)', txt)
print(f"{path} defaults: torch={t.group(1)}, torchvision={v.group(1)}")
PY
rg -n 'TORCHVISION_VERSION' .github/workflows cicd docker || trueRepository: axolotl-ai-cloud/axolotl
Length of output: 1714
Align default PyTorch/TorchVision versions with pyproject.toml constraints across CI files.
The defaults in cicd/single_gpu.py (line 26), cicd/multigpu.py (line 27), and docker/Dockerfile-uv-base all specify torch=2.6.0 and torchvision=0.21.0, but pyproject.toml requires torch>=2.9.1 and torchvision>=0.24.1. Without environment variable overrides, this silently builds against an older, incompatible ABI. Update these defaults to 2.9.1 and 0.24.1 respectively.
Suggested changes
# cicd/single_gpu.py & cicd/multigpu.py
- "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.6.0"),
- "TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.21.0"),
+ "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.9.1"),
+ "TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.24.1"),
# docker/Dockerfile-uv-base
-ARG TORCHVISION_VERSION="0.21.0"
+ARG TORCHVISION_VERSION="0.24.1"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cicd/single_gpu.py` at line 26, Update the CI default PyTorch/TorchVision
versions to match pyproject.toml constraints: change the default environment
values used in cicd single/multi-gpu configs so that TORCH_VERSION is "2.9.1"
and TORCHVISION_VERSION is "0.24.1"; locate the variables named TORCH_VERSION
and TORCHVISION_VERSION (e.g., the assignment line with "TORCHVISION_VERSION":
os.environ.get("TORCHVISION_VERSION", "0.21.0") in single_gpu.py and the
analogous line in multigpu.py) and update their fallback strings to "2.9.1" and
"0.24.1" (also ensure Dockerfile-uv-base uses the same versions).
| if data.get("fsdp") or data.get("fsdp_config"): | ||
| fsdp_version = cls._resolve_fsdp_version(data) | ||
| if str(fsdp_version) != "2": | ||
| raise ValueError( | ||
| "q_galore_adamw8bit requires FSDP2. Set fsdp_version: 2." | ||
| ) | ||
| fsdp_config = data.get("fsdp_config") or {} | ||
| if fsdp_config.get("use_orig_params") is False: | ||
| raise ValueError( | ||
| "q_galore_adamw8bit requires fsdp_config.use_orig_params=True so " | ||
| "that per-parameter projection state survives FSDP sharding." | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for FSDP2 use_orig_params default behavior
rg -n -C3 'use_orig_params' --type=py -g '!test*'Repository: axolotl-ai-cloud/axolotl
Length of output: 4102
Fix use_orig_params validation to reject unset values.
The validation at line 941 checks if fsdp_config.get("use_orig_params") is False, which only rejects explicitly set False values. When use_orig_params is omitted (defaults to None), the check passes despite the requirement that it must be explicitly True. Since the schema allows None as the default and q_galore_adamw8bit requires the projection state to survive FSDP sharding, the validator should enforce explicit enablement:
Suggested fix
if fsdp_config.get("use_orig_params") is not True:
raise ValueError(
"q_galore_adamw8bit requires fsdp_config.use_orig_params=True so "
"that per-parameter projection state survives FSDP sharding."
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/axolotl/utils/schemas/validation.py` around lines 934 - 945, The
validator for q_galore_adamw8bit currently only rejects
fsdp_config.use_orig_params when it is explicitly False; update the check in the
block that calls cls._resolve_fsdp_version and reads fsdp_config so that it
requires use_orig_params to be explicitly True (i.e., treat None/absent the same
as False) by replacing the current boolean check with a strict is not True
condition and raise the same ValueError referencing fsdp_config.use_orig_params
and q_galore_adamw8bit.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
fix the torchvision version fail
Summary by CodeRabbit