add 120b and deepspeed zero3 examples #3035
📝 Walkthrough
Three YAML configuration files for GPT-OSS model training are introduced or updated. Two new files provide training setups for the 120B and 20B model variants using FSDP v2 and DeepSpeed Zero3 strategies, respectively. The third file receives a comment update clarifying the compatibility of a specific loading option with model quantization. Additionally, the GPT-OSS README was updated to simplify installation instructions and adjust model config comments.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/gpt-oss/gpt-oss-20b-fft-fsdp2-offload.yaml (1)
66-69: Good clarification on cpu_ram_efficient_loading vs MXFP4
Leaving cpu_ram_efficient_loading commented out is correct here given MXFP4; the added note will prevent misconfigurations.
Consider adding a short link to the 120B dequantized example for users seeking a working reference.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml (1 hunks)
- examples/gpt-oss/gpt-oss-20b-fft-deepspeed-zero3.yaml (1 hunks)
- examples/gpt-oss/gpt-oss-20b-fft-fsdp2-offload.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.
🔇 Additional comments (8)
examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml (4)
60-67: FSDP2 config looks appropriate for 120B dequantized with CPU offload
offload_params + SHARDED_STATE_DICT + TRANSFORMER_BASED_WRAP with reshard_after_forward is a solid setup for memory pressure.
11-11: experimental_skip_move_to_device: true is appropriate for FSDP sharding
This prevents initial GPU OOM before sharding. Keep it enabled for this scale.
64-66: No changes needed: GptOssDecoderLayer is correctly defined
The transformer_layer_cls_to_wrap: GptOssDecoderLayer setting matches the entry in MOE_ARCH_BLOCK (src/axolotl/common/architectures.py), so the class name is resolvable and requires no updates.
4-5: Ensure use_kernels: false doesn’t disable your selected attention implementation
I didn’t find any code paths that automatically block a kernels-community/vllm-flash-attn3 backend when use_kernels is set to false. Please double-check that disabling “all fused kernels” globally does not inadvertently disable your flash-attention setup:
• File: examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
– Lines 4–5: use_kernels: false
– Lines 44–45:
    flash_attention: true
    attn_implementation: kernels-community/vllm-flash-attn3
If turning off use_kernels does gate these community kernels, either add an explicit guard in your config loader or update the docs to call this out.
examples/gpt-oss/gpt-oss-20b-fft-deepspeed-zero3.yaml (4)
35-35: Verify 8-bit optimizer compatibility with DeepSpeed ZeRO-3
The config currently uses an 8-bit optimizer, which may conflict with ZeRO-3’s parameter partitioning/offload. Please confirm that adamw_torch_8bit has been tested with ZeRO-3 in your environment; if not, switch to a supported optimizer.
• File: examples/gpt-oss/gpt-oss-20b-fft-deepspeed-zero3.yaml
  Line: 35
Suggested change if untested or incompatible:
    - optimizer: adamw_torch_8bit
    + optimizer: adamw_torch
2-2: Confirmed: use_kernels=false does not disable the Flash Attention backend
The grep output shows that in examples/gpt-oss/gpt-oss-20b-fft-deepspeed-zero3.yaml:
- Line 2: use_kernels: false
- Line 42: flash_attention: true
- Line 43: attn_implementation: kernels-community/vllm-flash-attn3
Disabling kernels via use_kernels: false does not override or disable the selected attention backend. No changes required.
10-10: Skip_move_to_device is correctly honored with DeepSpeed ZeRO-3
The loader forces skip_move_to_device=True when ZeRO-3 is enabled and then respects the experimental flag override, and the post-load setup only calls model.to() if skip_move_to_device is False.
Key locations:
- In _build_model (around line 803): if is_deepspeed_zero3_enabled(): skip_move_to_device = True
- Experimental override (lines 813–815): if self.cfg.experimental_skip_move_to_device is not None: skip_move_to_device = ...
- In _apply_post_lora_load_setup (line 393): skip moving model when skip_move_to_device is True
No changes needed.
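The precedence described in this comment can be sketched as a small standalone function. This is only an illustration of the ordering (ZeRO-3 default first, experimental flag second), not the loader's actual code: the function name `resolve_skip_move_to_device`, the `False` starting default, and the assumption that the override simply adopts the flag's value (the right-hand side is elided in the review) are all hypothetical.

```python
from typing import Optional


def resolve_skip_move_to_device(
    zero3_enabled: bool,
    experimental_skip_move_to_device: Optional[bool],
) -> bool:
    """Return True when the explicit model.to(device) call should be skipped.

    Hypothetical helper mirroring the order described in the review.
    """
    # Assumed starting default: move the model unless told otherwise.
    skip_move_to_device = False

    # ZeRO-3 partitions parameters itself, so the move is skipped by default.
    if zero3_enabled:
        skip_move_to_device = True

    # Assumption: an explicitly set experimental flag replaces the value above.
    if experimental_skip_move_to_device is not None:
        skip_move_to_device = experimental_skip_move_to_device

    return skip_move_to_device


if __name__ == "__main__":
    # ZeRO-3 enabled and the flag left unset: the move is skipped.
    print(resolve_skip_move_to_device(True, None))
    # The flag is set explicitly, so it takes precedence over the ZeRO-3 default.
    print(resolve_skip_move_to_device(True, False))
```

Under these assumptions, leaving the flag unset (`None`) keeps the ZeRO-3 default, which is why a three-state flag rather than a plain boolean is needed for the override to be optional.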
58-60: Double-check DeepSpeed Zero-3 BF16 config file
Please manually verify that the referenced config exists and includes the correct settings:
- deepspeed_configs/zero3_bf16.json is present at the specified path
- "zero_optimization" block sets stage: 3 (and any desired offload options)
- "bf16" precision is enabled
- Any CPU/GPU offload parameters are configured as expected
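The manual checks listed here could also be scripted. The sketch below is one possible way to do it, assuming the standard DeepSpeed JSON layout (`zero_optimization.stage`, `bf16.enabled`, and the optional `offload_optimizer` / `offload_param` blocks with a `device` key). The helper names are hypothetical, the sample config is made up for illustration and is not the contents of the referenced file, and treating `"auto"` as enabled assumes the config is consumed through the Hugging Face integration.

```python
import json
from typing import Any, Dict, List


def check_zero3_bf16(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []

    zero = config.get("zero_optimization", {})
    if zero.get("stage") != 3:
        problems.append("zero_optimization.stage is not 3")

    # Assumption: "auto" is accepted because the trainer fills it in at runtime.
    bf16 = config.get("bf16", {})
    if bf16.get("enabled") not in (True, "auto"):
        problems.append('bf16.enabled is neither true nor "auto"')

    return problems


def describe_offload(config: Dict[str, Any]) -> Dict[str, str]:
    """Report the offload device for optimizer state and parameters."""
    zero = config.get("zero_optimization", {})
    return {
        key: zero.get(key, {}).get("device", "none")
        for key in ("offload_optimizer", "offload_param")
    }


if __name__ == "__main__":
    # Illustrative sample only; load the real file with json.load() instead.
    sample = json.loads(
        '{"zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},'
        ' "bf16": {"enabled": "auto"}}'
    )
    print(check_zero3_bf16(sample))
    print(describe_offload(sample))
```

To check the file referenced in the config, replace the sample with `json.load(open(path))`, where `path` points at the JSON file named in the YAML.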
📖 Documentation Preview: https://6895ca83ff37d21757e33941--resonant-treacle-0fd729.netlify.app
Deployed on Netlify from commit edc43a5
NanoCode012 left a comment
Where does the deepspeed