Conversation
|
""" WalkthroughThis update removes the conditional installation of Changes
Sequence Diagram(s)sequenceDiagram
participant GitHub Actions
participant Dockerfile
participant Python Env
GitHub Actions->>Dockerfile: Build with CUDA 12.6.3
Dockerfile->>Python Env: Install dependencies
Python Env->>Python Env: Install flash-attn==2.8.0.post2 (always, no conditional)
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (8)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
setup.py (1)
113-120: Deduplicate theflash-attnversion string to prevent future driftThe literal
"2.8.0.post2"is now repeated in two places. If a future upgrade is made but one location is forgotten, the extras will diverge and install conflicting wheels. Store the version in a single constant and interpolate it to keep the spec dry.- extras_require = { - "flash-attn": ["flash-attn==2.8.0.post2"], - "ring-flash-attn": [ - "flash-attn==2.8.0.post2", +FLASH_ATTN_VERSION = "2.8.0.post2" + +extras_require = { + "flash-attn": [f"flash-attn=={FLASH_ATTN_VERSION}"], + "ring-flash-attn": [ + f"flash-attn=={FLASH_ATTN_VERSION}",An alternative is to loosen the pin (
>=2.8.0.post2,<2.9) to pick up future micro-patches automatically while avoiding breaking changes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docker/Dockerfile-base(0 hunks)docker/Dockerfile-uv-base(0 hunks)setup.py(1 hunks)
💤 Files with no reviewable changes (2)
- docker/Dockerfile-uv-base
- docker/Dockerfile-base
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
Codecov ReportAll modified and coverable lines are covered by tests ✅ 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.github/workflows/tests.yml(2 hunks)docker/Dockerfile-base(0 hunks)docker/Dockerfile-uv-base(0 hunks)setup.py(1 hunks)src/axolotl/train.py(1 hunks)
💤 Files with no reviewable changes (2)
- docker/Dockerfile-uv-base
- docker/Dockerfile-base
🚧 Files skipped from review as they are similar to previous changes (2)
- setup.py
- .github/workflows/tests.yml
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.1, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: pre-commit
| # TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers | ||
| # if cfg.bf16: | ||
| # torch.set_default_dtype(torch.bfloat16) |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Add user notification and formal tracking for this behavior change.
This change disables automatic dtype setting when cfg.bf16=True, which could significantly impact training behavior and performance. Users may not be aware their training configuration has changed.
Consider:
- Adding a warning log message when
cfg.bf16=Trueto inform users the default dtype is not being set - Creating a formal issue to track re-enabling this functionality when FSDP2/torchao compatibility is resolved
Add a warning message to inform users about this behavior change:
LOG.info("Starting trainer...")
# TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers
+if cfg.bf16:
+ LOG.warning("bf16 is enabled but torch.set_default_dtype(torch.bfloat16) is disabled due to FSDP2/torchao compatibility issues")
# if cfg.bf16:
# torch.set_default_dtype(torch.bfloat16)
trainer.train(resume_from_checkpoint=resume_from_checkpoint)Verify this change doesn't break existing training workflows:
🏁 Script executed:
#!/bin/bash
# Search for tests or examples that rely on bf16 default dtype behavior
rg -A 5 -B 5 "bf16.*True|bf16.*=.*True" --type pyLength of output: 31026
🏁 Script executed:
#!/bin/bash
sed -n '200,260p' src/axolotl/train.pyLength of output: 2198
Add warning log and formally track bf16 dtype workaround
To ensure users are aware that cfg.bf16=True no longer sets the default dtype—and to make the TODO actionable—you should:
• In src/axolotl/train.py (at the LOG.info("Starting trainer...") block, lines 226–228):
LOG.info("Starting trainer...")
# TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers
+# Warn users when bf16 is requested but default dtype override is disabled
+if cfg.bf16:
+ LOG.warning(
+ "bf16 is enabled but torch.set_default_dtype(torch.bfloat16) "
+ "has been disabled due to FSDP2/torchao compatibility issues."
+ )
# if cfg.bf16:
# torch.set_default_dtype(torch.bfloat16)
trainer.train(resume_from_checkpoint=resume_from_checkpoint)• Create a tracking issue (e.g. “Re-enable torch.set_default_dtype(bfloat16) when FSDP2/torchao compatibility is resolved”) in your project’s issue tracker and reference it in the TODO comment.
• Manually verify that training runs with cfg.bf16=True still execute without silent dtype mismatches, especially for workflows depending on default bfloat16 tensors.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers | |
| # if cfg.bf16: | |
| # torch.set_default_dtype(torch.bfloat16) | |
| LOG.info("Starting trainer...") | |
| # TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers | |
| # Warn users when bf16 is requested but default dtype override is disabled | |
| if cfg.bf16: | |
| LOG.warning( | |
| "bf16 is enabled but torch.set_default_dtype(torch.bfloat16) " | |
| "has been disabled due to FSDP2/torchao compatibility issues." | |
| ) | |
| # if cfg.bf16: | |
| # torch.set_default_dtype(torch.bfloat16) | |
| trainer.train(resume_from_checkpoint=resume_from_checkpoint) |
🤖 Prompt for AI Agents
In src/axolotl/train.py around lines 226 to 228, add a warning log message near
the LOG.info("Starting trainer...") call to inform users that setting
cfg.bf16=True currently does not change the default dtype due to incompatibility
with FSDP2 and torchao low bit optimizers. Update the TODO comment to reference
a newly created tracking issue in the project’s issue tracker titled something
like “Re-enable torch.set_default_dtype(bfloat16) when FSDP2/torchao
compatibility is resolved.” Finally, manually test training runs with
cfg.bf16=True to confirm no silent dtype mismatches occur.
Summary by CodeRabbit
Chores
is_latest: truein workflow configuration.axolotl_extrasto an empty list.flash-attnpackage in Dockerfiles.New Features
flash-attnpackage version in optional dependencies to 2.8.0.post2.