fix: docker build failing #3622
📝 Walkthrough
Version bump to 0.16.2.dev0 with corresponding dependency constraint updates. PyTorch minimum requirement raised from 2.6.0 to 2.9.1, Axolotl example version constraints updated to ≥0.16.1, xformers bumped to ≥0.0.33.post2, and related documentation references updated.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~4 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/axolotl/utils/schemas/config.py (1)
1019-1019: Consider documenting `auto` behavior explicitly in the field description.
Line 1019 is correct, but adding one short note about how `auto` resolves would make the schema self-explanatory for users reading generated docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/utils/schemas/config.py` at line 1019, Update the field description that currently reads "Whether to use torch.compile and which backend to use." to explicitly explain what the "auto" option does; e.g., add a short sentence saying that "auto" resolves at runtime by selecting an appropriate backend when torch.compile is available (or falling back/disabled when not supported), so readers of the schema (the field with description "Whether to use torch.compile and which backend to use.") immediately understand how "auto" is handled.
docs/faq.qmd (1)
60-60: Add `--ipc=host` to the Docker guidance in this FAQ answer.
The updated image tag is good, but users can still hit shared-memory/DataLoader failures if they run without `--ipc=host`. Please add a short Docker run example (or note) including this flag here.
Based on learnings: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures with PyTorch DataLoaders and multiprocessing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/faq.qmd` at line 60, Update the FAQ answer that mentions the torch 2.10 Docker tag to include the recommended Docker runtime flag `--ipc=host` and a concise docker run example; specifically, modify the paragraph referencing the `main-py3.12-cu128-2.10.0` tag to note that users should run containers with `--ipc=host` (e.g., docker run --gpus all --ipc=host -it --rm <image>:main-py3.12-cu128-2.10.0) to avoid PyTorch DataLoader/shared-memory failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/colab-notebooks/colab-axolotl-example.ipynb`:
- Line 39: The pip version constraint for axolotl[flash-attn]>=0.16.1 is being
parsed by the shell as a redirection; update the notebook cell so the package
specifier is quoted (e.g., wrap axolotl[flash-attn]>=0.16.1 in quotes) so the
>=0.16.1 constraint is passed to pip instead of the shell—modify the cell
containing the !pip install --no-build-isolation axolotl[flash-attn]>=0.16.1
command accordingly (the second install for cut-cross-entropy can remain as-is).
In `@pyproject.toml`:
- Line 82: Add an early platform guard in
src/axolotl/utils/schemas/validation.py to reject configs that enable
xformers_attention on aarch64: implement a validator (e.g.,
validate_xformers_on_aarch64) that checks platform.machine() (or
platform.uname().machine) for "aarch64" and if so raises a clear ValueError when
the config key xformers_attention is true; wire this validator into the existing
configuration validation flow (the module's main validate_config / schema
validation entrypoint) so the error is raised at validation time rather than at
runtime import.
---
Nitpick comments:
In `@docs/faq.qmd`:
- Line 60: Update the FAQ answer that mentions the torch 2.10 Docker tag to
include the recommended Docker runtime flag `--ipc=host` and a concise docker
run example; specifically, modify the paragraph referencing the
`main-py3.12-cu128-2.10.0` tag to note that users should run containers with
`--ipc=host` (e.g., docker run --gpus all --ipc=host -it --rm
<image>:main-py3.12-cu128-2.10.0) to avoid PyTorch DataLoader/shared-memory
failures.
In `@src/axolotl/utils/schemas/config.py`:
- Line 1019: Update the field description that currently reads "Whether to use
torch.compile and which backend to use." to explicitly explain what the "auto"
option does; e.g., add a short sentence saying that "auto" resolves at runtime
by selecting an appropriate backend when torch.compile is available (or falling
back/disabled when not supported), so readers of the schema (the field with
description "Whether to use torch.compile and which backend to use.")
immediately understand how "auto" is handled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fb6e5df2-dfee-4d56-bcea-948e81531761
📒 Files selected for processing (18)
- VERSION
- docs/faq.qmd
- examples/LiquidAI/README.md
- examples/apertus/README.md
- examples/arcee/README.md
- examples/colab-notebooks/colab-axolotl-example.ipynb
- examples/devstral/README.md
- examples/gemma3n/README.md
- examples/gpt-oss/README.md
- examples/granite4/README.md
- examples/hunyuan/README.md
- examples/magistral/README.md
- examples/seed-oss/README.md
- examples/smolvlm2/README.md
- examples/voxtral/README.md
- pyproject.toml
- src/axolotl/integrations/kd/README.md
- src/axolotl/utils/schemas/config.py
- "!pip install --no-build-isolation axolotl[flash-attn]>=0.9.1\n",
- "!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@fec1a88\""
- ]
+ "source": "%%capture\n# This step can take ~5-10 minutes to install dependencies\n!pip install --no-build-isolation axolotl[flash-attn]>=0.16.1\n!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@fec1a88\""
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the notebook currently contains an unquoted requirement comparator on the pip line.
python - <<'PY'
import json
from pathlib import Path
nb = json.loads(Path("examples/colab-notebooks/colab-axolotl-example.ipynb").read_text())
for idx, cell in enumerate(nb.get("cells", [])):
    if cell.get("cell_type") != "code":
        continue
    src = cell.get("source", "")
    if isinstance(src, list):
        src = "".join(src)
    if "axolotl[flash-attn]>=0.16.1" in src:
        print(f"Found in code cell index {idx}:")
        for line in src.splitlines():
            if "axolotl[flash-attn]" in line:
                print(line)
PY
Repository: axolotl-ai-cloud/axolotl
Length of output: 158
🏁 Script executed:
python - <<'PY'
import json
from pathlib import Path
nb = json.loads(Path("examples/colab-notebooks/colab-axolotl-example.ipynb").read_text())
for idx, cell in enumerate(nb.get("cells", [])):
    if cell.get("cell_type") != "code":
        continue
    src = cell.get("source", "")
    if isinstance(src, list):
        src = "".join(src)
    # Look for pip install lines with unquoted version specifiers
    if "!pip install" in src:
        print(f"Cell {idx}:")
        for line in src.splitlines():
            if "!pip install" in line:
                print(f"  {line}")
PY
Repository: axolotl-ai-cloud/axolotl
Length of output: 260
Quote the pip version constraint in the shell command
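Line 39 uses `!` shell execution, so the shell treats the `>` in `>=0.16.1` as output redirection rather than as part of a pip version specifier. pip receives only `axolotl[flash-attn]`, its output is redirected to a file named `=0.16.1`, and the intended minimum version constraint is silently dropped. The install is therefore unpinned, which breaks reproducibility.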
Proposed fix
- "source": "%%capture\n# This step can take ~5-10 minutes to install dependencies\n!pip install --no-build-isolation axolotl[flash-attn]>=0.16.1\n!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@fec1a88\""
+ "source": "%%capture\n# This step can take ~5-10 minutes to install dependencies\n!pip install --no-build-isolation \"axolotl[flash-attn]>=0.16.1\"\n!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@fec1a88\""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/colab-notebooks/colab-axolotl-example.ipynb` at line 39, The pip
version constraint for axolotl[flash-attn]>=0.16.1 is being parsed by the shell
as a redirection; update the notebook cell so the package specifier is quoted
(e.g., wrap axolotl[flash-attn]>=0.16.1 in quotes) so the >=0.16.1 constraint is
passed to pip instead of the shell—modify the cell containing the !pip install
--no-build-isolation axolotl[flash-attn]>=0.16.1 command accordingly (the second
install for cut-cross-entropy can remain as-is).
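The redirection problem flagged in this comment can be approximated in Python. This is a minimal sketch, assuming that `shlex` with `punctuation_chars=True` is a close enough stand-in for how a POSIX shell splits the command line; the helper name `shell_tokens` is hypothetical and is not part of the notebook or of Axolotl.

```python
import shlex


def shell_tokens(command: str) -> list[str]:
    """Split a command roughly the way a POSIX shell would (an approximation)."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


unquoted = shell_tokens("pip install --no-build-isolation axolotl[flash-attn]>=0.16.1")
quoted = shell_tokens('pip install --no-build-isolation "axolotl[flash-attn]>=0.16.1"')

# Unquoted: ">" becomes its own redirection token, so pip never sees ">=0.16.1".
print(unquoted)
# Quoted: the whole requirement specifier reaches pip as a single argument.
print(quoted)
```

In the unquoted case the trailing `=0.16.1` token is what a real shell would use as the redirection target, which is why quoting the specifier is the suggested fix.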
  "bitsandbytes==0.49.1 ; sys_platform != 'darwin'",
  "triton>=3.4.0 ; sys_platform != 'darwin'",
- "xformers>=0.0.23.post1 ; sys_platform != 'darwin'",
+ "xformers>=0.0.33.post2 ; sys_platform != 'darwin' and platform_machine != 'aarch64'",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether xformers_attention is validated against platform/arch.
# Expected: either explicit aarch64 guard exists, or this confirms it is missing.
set -euo pipefail
echo "== xformers_attention references =="
rg -n -C3 '\bxformers_attention\b'
echo
echo "== xformers import/error paths =="
rg -n -C3 'import xformers|xformers not found|llama_attn_hijack_xformers'
echo
echo "== Platform/arch guards related to xformers =="
rg -n -C3 'aarch64|platform_machine|sys_platform|xformers'
Repository: axolotl-ai-cloud/axolotl
Length of output: 48435
🏁 Script executed:
# Check the complete validation.py file around the check_attention_fields method
cat -n src/axolotl/utils/schemas/validation.py | sed -n '180,230p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 2338
🏁 Script executed:
# Search for any platform_machine or aarch64 checks in validation files
rg -n "platform_machine|aarch64" src/axolotl/utils/schemas/
Repository: axolotl-ai-cloud/axolotl
Length of output: 50
🏁 Script executed:
# Check the patch_manager to see when xformers patches are applied and if there are guards
cat -n src/axolotl/loaders/patch_manager.py | sed -n '735,760p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 1313
🏁 Script executed:
# Look for any early validation of xformers_attention before patches
rg -n "xformers_attention" src/axolotl/utils/schemas/validation.py -A 5 -B 5
Repository: axolotl-ai-cloud/axolotl
Length of output: 976
Add a guard for xformers_attention on aarch64 to fail early.
Users can still enable xformers_attention in configuration on aarch64 systems even though xformers is excluded from installation on that platform. This causes a delayed runtime import error when patches are applied. Add upfront validation in src/axolotl/utils/schemas/validation.py to reject this configuration on aarch64 systems and provide a clear error message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 82, Add an early platform guard in
src/axolotl/utils/schemas/validation.py to reject configs that enable
xformers_attention on aarch64: implement a validator (e.g.,
validate_xformers_on_aarch64) that checks platform.machine() (or
platform.uname().machine) for "aarch64" and if so raises a clear ValueError when
the config key xformers_attention is true; wire this validator into the existing
configuration validation flow (the module's main validate_config / schema
validation entrypoint) so the error is raised at validation time rather than at
runtime import.
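The guard requested in this comment could look roughly like the following. This is only a sketch under stated assumptions: it treats the config as a plain mapping, uses the reviewer's suggested name `validate_xformers_on_aarch64`, and adds a hypothetical `machine` parameter so the check can be exercised without an aarch64 host. How it would be wired into Axolotl's actual schema validation is not shown and would need to follow the existing code.

```python
import platform
from typing import Any, Mapping, Optional


def validate_xformers_on_aarch64(
    cfg: Mapping[str, Any], machine: Optional[str] = None
) -> None:
    """Raise ValueError if xformers_attention is enabled on an aarch64 host.

    `machine` is injectable for testing (an assumption of this sketch);
    it defaults to platform.machine().
    """
    machine = (machine or platform.machine()).lower()
    if machine == "aarch64" and cfg.get("xformers_attention"):
        raise ValueError(
            "xformers_attention is enabled, but xformers is not installed on "
            "aarch64. Disable xformers_attention or use another attention backend."
        )


# Example: an x86_64 host passes, an aarch64 host is rejected at validation time.
validate_xformers_on_aarch64({"xformers_attention": True}, machine="x86_64")
try:
    validate_xformers_on_aarch64({"xformers_attention": True}, machine="aarch64")
except ValueError as err:
    print(f"rejected: {err}")
```

Failing here, rather than at the later xformers import, gives users an error that names the config key responsible.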
Codecov Report
✅ All modified and coverable lines are covered by tests.
Description
BREAKING CHANGE
Flash attention is no longer installed on the Docker images by default. On newer images, it is pulled at runtime from the HF kernels repo, which is a better source.
Ring flash attention is no longer installed by default either. Users who need it can install it via
`pip install axolotl[ring-flash-attn]`. This also reduces our Docker image size.
Motivation and Context