Skip to content

fix: pin torchvision per matrix entry to prevent ABI drift#3653

Closed
ved1beta wants to merge 13 commits into
axolotl-ai-cloud:mainfrom
ved1beta:feat-qgalore
Closed

fix: pin torchvision per matrix entry to prevent ABI drift#3653
ved1beta wants to merge 13 commits into
axolotl-ai-cloud:mainfrom
ved1beta:feat-qgalore

Conversation

@ved1beta
Copy link
Copy Markdown
Member

@ved1beta ved1beta commented May 14, 2026

fix the torchvision version fail

Summary by CodeRabbit

  • Chores
    • Updated Docker build and CI/CD pipeline configurations to use explicit torchvision version pinning across different environments.
    • Introduced torchvision version specifications for various PyTorch configurations to ensure consistent dependency resolution during builds and testing.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

📝 Walkthrough

Walkthrough

This PR introduces explicit torchvision version pinning across CI workflows and Docker build matrices, plus full Q-GaLore optimizer support including schema definitions, validation, parameter group building, trainer integration, and comprehensive tests.

Changes

TorchVision Version Pinning Infrastructure

Layer / File(s) Summary
CI workflow matrix updates for torchvision
.github/workflows/base.yml, .github/workflows/multi-gpu-e2e.yml, .github/workflows/tests-nightly.yml, .github/workflows/tests.yml
All CI test and build workflows add torchvision version to their build matrices per CUDA/PyTorch combination and export TORCHVISION_VERSION to the environment for downstream Docker builds.
Dockerfile base images with torchvision pinning
docker/Dockerfile-base, docker/Dockerfile-uv, docker/Dockerfile-uv-base, cicd/Dockerfile-uv.jinja
Base Docker images now declare TORCHVISION_VERSION build arguments, pin both torch and torchvision during installation with CUDA-specific suffixes, add VIRTUAL_ENV setup, and verify torchvision.ops.nms post-install.
CI/CD build scripts pass torchvision template args
cicd/single_gpu.py, cicd/multigpu.py
Single-GPU and multi-GPU CI build scripts populate TORCHVISION_VERSION from environment variables and pass it through Dockerfile Jinja template rendering.
Project dependency declaration updated
pyproject.toml
Added explicit torchvision>=0.24.1 dependency to match base image pinning.

Q-GaLore Optimizer Integration

Layer / File(s) Summary
Q-GaLore schema definitions and enum
src/axolotl/utils/schemas/enums.py, src/axolotl/utils/schemas/training.py
Added q_galore_adamw8bit to CustomSupportedOptimizers enum; added ten optional Q-GaLore hyperparameter fields to HyperparametersConfig (rank, projection gap, scale, projection type/quantization settings, cosine threshold, gamma growth, queue size).
Q-GaLore configuration validation
src/axolotl/utils/schemas/validation.py
OptimizationValidationMixin.check_qgalore enforces Q-GaLore constraints: rejects adapter/deepspeed combinations, requires FSDP2 with use_orig_params, warns on missing mixed-precision, auto-populates optim_target_modules with attention/MLP linears.
Q-GaLore optimizer utility functions
src/axolotl/utils/optimizers/qgalore.py
New module provides patch_q_galore_for_modern_bnb() to adapt Q-GaLore's optimizer signature to newer bitsandbytes, and build_qgalore_param_groups() to split trainable parameters into projected vs. plain groups based on dimensionality and target module matching.
Trainer builder Q-GaLore optimizer wiring
src/axolotl/core/builders/base.py
_configure_optimizer() detects q_galore_adamw8bit, applies modern-bnb patching, instantiates QGaLoreAdamW8bit, and builds parameter groups via build_qgalore_param_groups() with configured hyperparameters. Also refactors get_callbacks() to pass gc_steps directly to GCCallback.
Q-GaLore validation unit tests
tests/utils/schemas/validation/test_qgalore.py
Test suite validates adapter rejection, FSDP version constraints, and default field auto-population for bf16/rank/proj_bits.
Q-GaLore end-to-end training test
tests/e2e/test_optimizers.py
End-to-end test trains with q_galore_adamw8bit optimizer (bf16, small ranks), verifies output artifacts, and asserts optimizer class contains "AdamW8bit".
Q-GaLore optimizer documentation
docs/optimizers.qmd
Describes Q-GaLore's INT4 projection matrix behavior, installation, configuration constraints (full fine-tuning, FSDP2 required), and includes complete YAML example with all hyperparameters.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Possibly related PRs

  • axolotl-ai-cloud/axolotl#3457: Both PRs extend CustomSupportedOptimizers and add optimizer-specific branches/validation inside src/axolotl/core/builders/base.py and src/axolotl/utils/schemas/validation.py, with corresponding end-to-end coverage in tests/e2e/test_optimizers.py (FlashOptim in #3457 vs Q-GaLore in the main PR).
  • axolotl-ai-cloud/axolotl#3017: Both PRs touch src/axolotl/core/builders/base.py in the custom optimizer configuration path—main PR adds Q-GaLore (q_galore_adamw8bit) handling while retrieved PR removes ao_adamw_4bit/8bit custom optimizer support—so the changes are related at the code-level.
  • axolotl-ai-cloud/axolotl#3268: Both PRs modify docker/Dockerfile-base/.github/workflows/base.yml around the PYTORCH_VERSION selection (pinning/updating to Torch 2.9.1), though the main PR also adds TORCHVISION_VERSION plumbing and torchvision install/verification.

Suggested reviewers

  • winglian
  • SalmanMohammadi
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: pin torchvision per matrix entry to prevent ABI drift' directly and specifically describes the main change across the changeset—adding explicit torchvision version pinning to CI workflows and Dockerfile templates to prevent ABI compatibility issues.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ved1beta ved1beta marked this pull request as ready for review May 14, 2026 14:25
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
src/axolotl/utils/optimizers/qgalore.py (2)

30-30: 💤 Low value

Consider using tuple unpacking for clarity.

Static analysis suggests replacing tuple concatenation with unpacking syntax for improved readability:

♻️ Suggested refactor
-            lambda *a, **kw: bw(
-                *(a[:7] + (0.0, 0.0) + a[7:] if len(a) == 15 else a), **kw
-            )
+            lambda *a, **kw: bw(
+                *(*a[:7], 0.0, 0.0, *a[7:]) if len(a) == 15 else a, **kw
+            )

Apply similar change to line 35.

Also applies to: 35-35

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/axolotl/utils/optimizers/qgalore.py` at line 30, The tuple is being built
via concatenation (a[:7] + (0.0, 0.0) + a[7:]) which is less readable; replace
that with explicit tuple unpacking so the call uses ( *a[:7], 0.0, 0.0, *a[7:] )
instead (do the same refactor at the similar occurrence on line 35) — update the
expression inside the starred argument in qgalore.py where this concatenation
appears to use unpacking for clarity.

64-64: 💤 Low value

Verify substring matching behavior for target modules.

The filter any(t in name for t in target_modules) performs substring matching, so target_modules = ["attn", "mlp"] will match parameter names like "model.layers.0.self_attn.q_proj.weight" and "model.layers.0.mlp.gate_proj.weight". This is likely intentional for flexibility, but it could also match unintended parameters (e.g., if a parameter name happens to contain "mlp" as part of a longer word).

Consider documenting this substring matching behavior in the docstring or adding an example showing what patterns are expected to match.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/axolotl/utils/optimizers/qgalore.py` at line 64, The current filter uses
substring matching via any(t in name for t in target_modules) (seen in the
p.dim() == 2 and any(...) check), which can unintentionally match parameters
containing those substrings; either document this substring-matching behavior in
the function/class docstring that builds parameter groups (mention
target_modules, name, and p.dim() condition) or tighten the check to
boundary-aware matching (e.g., match against module name tokens via
name.split('.') or use regex with word boundaries) so only intended modules like
"attn" or "mlp" match their module segments rather than any substring.
cicd/Dockerfile-uv.jinja (1)

28-28: 💤 Low value

Clarify the rationale for uninstalling causal_conv1d.

Line 28 adds uv pip uninstall causal_conv1d before the main installation. This appears to prevent conflicts during the subsequent editable install, but the reason isn't documented. Consider adding a comment explaining why this pre-uninstall is necessary (e.g., version conflicts, rebuild requirements, or ABI compatibility).

📝 Suggested comment
+# Uninstall causal_conv1d to avoid version conflicts during editable install
 RUN uv pip uninstall causal_conv1d
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cicd/Dockerfile-uv.jinja` at line 28, Add a brief explanatory comment above
the RUN uv pip uninstall causal_conv1d line in Dockerfile-uv.jinja that states
why causal_conv1d is being uninstalled (e.g., to avoid version/ABI conflicts
with the subsequent editable install, force a rebuild, or remove preinstalled
incompatible wheel), referencing the RUN uv pip uninstall causal_conv1d command
so reviewers understand the rationale and can remove it safely if requirements
change.
docker/Dockerfile-uv-base (1)

11-12: 💤 Low value

Consider updating default versions to match workflow values.

The default PYTORCH_VERSION="2.6.0" and TORCHVISION_VERSION="0.21.0" don't match the most common versions used in the workflow matrices (e.g., PyTorch 2.9.1/2.10.0 with torchvision 0.24.1/0.25.0). While these defaults are overridden by build-args in CI, aligning them with the primary workflow versions would make local builds and debugging more consistent.

📝 Suggested default version alignment
-ARG PYTORCH_VERSION="2.6.0"
-ARG TORCHVISION_VERSION="0.21.0"
+ARG PYTORCH_VERSION="2.9.1"
+ARG TORCHVISION_VERSION="0.24.1"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker/Dockerfile-uv-base` around lines 11 - 12, Update the default ARG
values in the Dockerfile-uv-base so local builds match the workflow matrix:
change ARG PYTORCH_VERSION and ARG TORCHVISION_VERSION from "2.6.0"/"0.21.0" to
the primary workflow versions (e.g., "2.10.0" for PYTORCH_VERSION and "0.25.0"
for TORCHVISION_VERSION) so local debugging uses the same defaults as CI; ensure
you only update the values for the ARGs named PYTORCH_VERSION and
TORCHVISION_VERSION.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/base.yml:
- Around line 41-42: Update the inline comment above the "platforms:
\"linux/amd64\"" setting to state that the cu128/cu130 wheels for torchvision
0.24.1 are unavailable for any platform rather than only aarch64; replace the
current comment `# arm64 disabled: torchvision 0.24.1+cu128 has no aarch64
wheel` with something like `# torchvision 0.24.1 does not have cu128/cu130
wheels available` so the reason for restricting platforms is accurate and
unambiguous.

In `@cicd/single_gpu.py`:
- Line 26: Update the CI default PyTorch/TorchVision versions to match
pyproject.toml constraints: change the default environment values used in cicd
single/multi-gpu configs so that TORCH_VERSION is "2.9.1" and
TORCHVISION_VERSION is "0.24.1"; locate the variables named TORCH_VERSION and
TORCHVISION_VERSION (e.g., the assignment line with "TORCHVISION_VERSION":
os.environ.get("TORCHVISION_VERSION", "0.21.0") in single_gpu.py and the
analogous line in multigpu.py) and update their fallback strings to "2.9.1" and
"0.24.1" (also ensure Dockerfile-uv-base uses the same versions).

In `@src/axolotl/utils/schemas/validation.py`:
- Around line 934-945: The validator for q_galore_adamw8bit currently only
rejects fsdp_config.use_orig_params when it is explicitly False; update the
check in the block that calls cls._resolve_fsdp_version and reads fsdp_config so
that it requires use_orig_params to be explicitly True (i.e., treat None/absent
the same as False) by replacing the current boolean check with a strict is not
True condition and raise the same ValueError referencing
fsdp_config.use_orig_params and q_galore_adamw8bit.

---

Nitpick comments:
In `@cicd/Dockerfile-uv.jinja`:
- Line 28: Add a brief explanatory comment above the RUN uv pip uninstall
causal_conv1d line in Dockerfile-uv.jinja that states why causal_conv1d is being
uninstalled (e.g., to avoid version/ABI conflicts with the subsequent editable
install, force a rebuild, or remove preinstalled incompatible wheel),
referencing the RUN uv pip uninstall causal_conv1d command so reviewers
understand the rationale and can remove it safely if requirements change.

In `@docker/Dockerfile-uv-base`:
- Around line 11-12: Update the default ARG values in the Dockerfile-uv-base so
local builds match the workflow matrix: change ARG PYTORCH_VERSION and ARG
TORCHVISION_VERSION from "2.6.0"/"0.21.0" to the primary workflow versions
(e.g., "2.10.0" for PYTORCH_VERSION and "0.25.0" for TORCHVISION_VERSION) so
local debugging uses the same defaults as CI; ensure you only update the values
for the ARGs named PYTORCH_VERSION and TORCHVISION_VERSION.

In `@src/axolotl/utils/optimizers/qgalore.py`:
- Line 30: The tuple is being built via concatenation (a[:7] + (0.0, 0.0) +
a[7:]) which is less readable; replace that with explicit tuple unpacking so the
call uses ( *a[:7], 0.0, 0.0, *a[7:] ) instead (do the same refactor at the
similar occurrence on line 35) — update the expression inside the starred
argument in qgalore.py where this concatenation appears to use unpacking for
clarity.
- Line 64: The current filter uses substring matching via any(t in name for t in
target_modules) (seen in the p.dim() == 2 and any(...) check), which can
unintentionally match parameters containing those substrings; either document
this substring-matching behavior in the function/class docstring that builds
parameter groups (mention target_modules, name, and p.dim() condition) or
tighten the check to boundary-aware matching (e.g., match against module name
tokens via name.split('.') or use regex with word boundaries) so only intended
modules like "attn" or "mlp" match their module segments rather than any
substring.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f208ff9-2639-42d3-8d71-1fa5b33f315d

📥 Commits

Reviewing files that changed from the base of the PR and between d7cb1c9 and 4407920.

📒 Files selected for processing (19)
  • .github/workflows/base.yml
  • .github/workflows/multi-gpu-e2e.yml
  • .github/workflows/tests-nightly.yml
  • .github/workflows/tests.yml
  • cicd/Dockerfile-uv.jinja
  • cicd/multigpu.py
  • cicd/single_gpu.py
  • docker/Dockerfile-base
  • docker/Dockerfile-uv
  • docker/Dockerfile-uv-base
  • docs/optimizers.qmd
  • pyproject.toml
  • src/axolotl/core/builders/base.py
  • src/axolotl/utils/optimizers/qgalore.py
  • src/axolotl/utils/schemas/enums.py
  • src/axolotl/utils/schemas/training.py
  • src/axolotl/utils/schemas/validation.py
  • tests/e2e/test_optimizers.py
  • tests/utils/schemas/validation/test_qgalore.py

Comment on lines +41 to +42
# arm64 disabled: torchvision 0.24.1+cu128 has no aarch64 wheel
platforms: "linux/amd64"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check PyPI for torchvision 0.24.1 wheel availability by platform and CUDA version

echo "=== Checking torchvision 0.24.1 wheels ==="
curl -s https://pypi.org/pypi/torchvision/0.24.1/json | \
  jq -r '.urls[] | select(.packagetype == "bdist_wheel") | .filename' | \
  grep -E "(cu128|cu130)" | sort

echo ""
echo "=== Filtering for aarch64 wheels ==="
curl -s https://pypi.org/pypi/torchvision/0.24.1/json | \
  jq -r '.urls[] | select(.packagetype == "bdist_wheel") | .filename' | \
  grep -E "(cu128|cu130)" | grep aarch64 || echo "No aarch64 wheels found for cu128/cu130"

Repository: axolotl-ai-cloud/axolotl

Length of output: 189


Clarify that cu128/cu130 wheels are unavailable entirely for torchvision 0.24.1, not just aarch64.

The verification confirms no cu128/cu130 wheels exist for torchvision 0.24.1 on PyPI. The comment is technically accurate (aarch64 wheels are indeed absent), but the wording suggests aarch64-specific unavailability when these CUDA versions aren't available for any platform. Update the comment to clarify: # torchvision 0.24.1 does not have cu128/cu130 wheels available rather than implying aarch64 is uniquely restricted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/base.yml around lines 41 - 42, Update the inline comment
above the "platforms: \"linux/amd64\"" setting to state that the cu128/cu130
wheels for torchvision 0.24.1 are unavailable for any platform rather than only
aarch64; replace the current comment `# arm64 disabled: torchvision 0.24.1+cu128
has no aarch64 wheel` with something like `# torchvision 0.24.1 does not have
cu128/cu130 wheels available` so the reason for restricting platforms is
accurate and unambiguous.

Comment thread cicd/single_gpu.py
"AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
"AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
"PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.6.0"),
"TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.21.0"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import re
from pathlib import Path

pp = Path("pyproject.toml").read_text()
torch_min = re.search(r'"torch>=([^"]+)"', pp).group(1)
tv_min = re.search(r'"torchvision>=([^"]+)"', pp).group(1)
print(f"pyproject minimums: torch>={torch_min}, torchvision>={tv_min}")

for path in ("cicd/single_gpu.py", "cicd/multigpu.py"):
    txt = Path(path).read_text()
    t = re.search(r'"PYTORCH_VERSION":\s*os\.environ\.get\("PYTORCH_VERSION",\s*"([^"]+)"\)', txt)
    v = re.search(r'"TORCHVISION_VERSION":\s*os\.environ\.get\("TORCHVISION_VERSION",\s*"([^"]+)"\)', txt)
    print(f"{path} defaults: torch={t.group(1)}, torchvision={v.group(1)}")
PY

rg -n 'TORCHVISION_VERSION' .github/workflows cicd docker || true

Repository: axolotl-ai-cloud/axolotl

Length of output: 1714


Align default PyTorch/TorchVision versions with pyproject.toml constraints across CI files.

The defaults in cicd/single_gpu.py (line 26), cicd/multigpu.py (line 27), and docker/Dockerfile-uv-base all specify torch=2.6.0 and torchvision=0.21.0, but pyproject.toml requires torch>=2.9.1 and torchvision>=0.24.1. Without environment variable overrides, this silently builds against an older, incompatible ABI. Update these defaults to 2.9.1 and 0.24.1 respectively.

Suggested changes
# cicd/single_gpu.py & cicd/multigpu.py
-    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.6.0"),
-    "TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.21.0"),
+    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.9.1"),
+    "TORCHVISION_VERSION": os.environ.get("TORCHVISION_VERSION", "0.24.1"),

# docker/Dockerfile-uv-base
-ARG TORCHVISION_VERSION="0.21.0"
+ARG TORCHVISION_VERSION="0.24.1"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cicd/single_gpu.py` at line 26, Update the CI default PyTorch/TorchVision
versions to match pyproject.toml constraints: change the default environment
values used in cicd single/multi-gpu configs so that TORCH_VERSION is "2.9.1"
and TORCHVISION_VERSION is "0.24.1"; locate the variables named TORCH_VERSION
and TORCHVISION_VERSION (e.g., the assignment line with "TORCHVISION_VERSION":
os.environ.get("TORCHVISION_VERSION", "0.21.0") in single_gpu.py and the
analogous line in multigpu.py) and update their fallback strings to "2.9.1" and
"0.24.1" (also ensure Dockerfile-uv-base uses the same versions).

Comment on lines +934 to +945
if data.get("fsdp") or data.get("fsdp_config"):
fsdp_version = cls._resolve_fsdp_version(data)
if str(fsdp_version) != "2":
raise ValueError(
"q_galore_adamw8bit requires FSDP2. Set fsdp_version: 2."
)
fsdp_config = data.get("fsdp_config") or {}
if fsdp_config.get("use_orig_params") is False:
raise ValueError(
"q_galore_adamw8bit requires fsdp_config.use_orig_params=True so "
"that per-parameter projection state survives FSDP sharding."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for FSDP2 use_orig_params default behavior
rg -n -C3 'use_orig_params' --type=py -g '!test*'

Repository: axolotl-ai-cloud/axolotl

Length of output: 4102


Fix use_orig_params validation to reject unset values.

The validation at line 941 checks if fsdp_config.get("use_orig_params") is False, which only rejects explicitly set False values. When use_orig_params is omitted (defaults to None), the check passes despite the requirement that it must be explicitly True. Since the schema allows None as the default and q_galore_adamw8bit requires the projection state to survive FSDP sharding, the validator should enforce explicit enablement:

Suggested fix
if fsdp_config.get("use_orig_params") is not True:
    raise ValueError(
        "q_galore_adamw8bit requires fsdp_config.use_orig_params=True so "
        "that per-parameter projection state survives FSDP sharding."
    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/axolotl/utils/schemas/validation.py` around lines 934 - 945, The
validator for q_galore_adamw8bit currently only rejects
fsdp_config.use_orig_params when it is explicitly False; update the check in the
block that calls cls._resolve_fsdp_version and reads fsdp_config so that it
requires use_orig_params to be explicitly True (i.e., treat None/absent the same
as False) by replacing the current boolean check with a strict is not True
condition and raise the same ValueError referencing fsdp_config.use_orig_params
and q_galore_adamw8bit.

@ved1beta ved1beta closed this May 14, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 43.28358% with 38 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/utils/optimizers/qgalore.py 0.00% 25 Missing ⚠️
src/axolotl/core/builders/base.py 11.11% 8 Missing ⚠️
src/axolotl/utils/schemas/validation.py 77.27% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant