use skip_move_to_device for all cases#3015
Conversation
📝 WalkthroughWalkthroughAn optional configuration field Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested labels
Suggested reviewers
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. 📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/loaders/model.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: pre-commit
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: pre-commit
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
| self.model._tp_size = self.cfg.tensor_parallel_size | ||
| self.model._device_mesh = self.model_kwargs["device_mesh"] | ||
|
|
||
| skip_move_to_device = True |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Verify the impact of unconditionally skipping device movement.
This change overrides all the existing conditional logic for device placement, forcing skip_move_to_device = True in all scenarios. While this may resolve specific device placement issues, it could potentially break configurations that rely on the existing conditional logic.
Please run the following script to understand the existing device placement logic and verify this change won't cause regressions:
Consider a more targeted approach that addresses the specific issue without bypassing all existing logic. Could you clarify:
- What specific device placement issue is this change intended to resolve?
- Have you tested this change across different configurations (FSDP, DeepSpeed, multi-GPU, etc.)?
- Would a conditional approach be more appropriate than universally skipping device movement?
🏁 Script executed:
#!/bin/bash
# Description: Analyze the existing conditional logic for skip_move_to_device and its usage
# Find all places where skip_move_to_device is set to True conditionally
echo "=== Existing conditional logic for skip_move_to_device ==="
rg -A 2 -B 2 "skip_move_to_device = True" src/axolotl/loaders/model.py
echo -e "\n=== Usage of skip_move_to_device parameter ==="
rg -A 5 -B 2 "skip_move_to_device" src/axolotl/loaders/model.py
echo -e "\n=== Method that uses skip_move_to_device ==="
ast-grep --pattern $'def $_($$$, skip_move_to_device: $_) {
$$$
}'Length of output: 4620
Unconditional skip_move_to_device assignment bypasses all device‐placement logic
At src/axolotl/loaders/model.py:848, you’ve added:
skip_move_to_device = Trueimmediately before returning, which makes _build_model() always return True—regardless of any of the earlier, more granular conditions (FSDP CPU-efficient loading, QLoRA+FSDP, GPTQ, DeepSpeed Zero3, tensor parallelism workarounds, etc.). As a result, _apply_post_lora_load_setup will never move the model to the target device in any configuration.
• Impact
- Overrides all existing logic that selectively skips device moves only when necessary
- Forces all models to stay on their initial device, potentially breaking single-GPU or default-placement workflows and degrading performance in non‐specialized setups
• Requested changes
- Remove or scope this unconditional assignment so that you only skip moving the model under the specific conditions you intend to address.
- Clarify which device‐placement issue this change is meant to solve, and add targeted tests (FSDP, DeepSpeed, multi‐GPU, single‐GPU) to guard against regressions.
Please adjust the implementation to preserve the original conditional logic and only bypass model movement in the exact scenarios you’ve identified.
🤖 Prompt for AI Agents
In src/axolotl/loaders/model.py at line 848, the unconditional assignment
skip_move_to_device = True overrides all prior conditional logic for device
placement, causing models to never move to the target device. To fix this,
remove the unconditional assignment and instead set skip_move_to_device to True
only under the specific conditions that require skipping device moves,
preserving the original conditional checks for FSDP, QLoRA, DeepSpeed, and
tensor parallelism. Also, clarify in comments which device-placement issues are
addressed and add targeted tests for different configurations to prevent
regressions.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
What's the reason behind this change? |
helps prevent oom when using FSDP+GRPO |
Description
WIP
multigpu tests https://github.com/axolotl-ai-cloud/axolotl/actions/runs/16738721812
Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
Summary by CodeRabbit