cp: Updating Configs for LLAMA3 70B LoRa (2292) into r0.3.0#2311
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 81eddd8
📝 Walkthrough

This pull request updates sequence length, padding, and parallelism configurations for Llama3 70B models across GB300, GB200, and H100 hardware variants. Sequence length is increased to 4096, with padding support for some configs, while tensor and pipeline parallelism settings are reduced to 1 for improved efficiency.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/performance/configs/llama/llama3_llm_finetune.py (1)
234-246: ⚠️ Potential issue | 🟠 Major

Add `pad_cu_seqlens` and `pad_to_max_length` settings to the GB200 LoRA config for CUDA graph compatibility.

The GB200 LoRA config enables `cuda_graph_impl="transformer_engine"` with `cuda_graph_scope="mlp"` (identical to GB300), uses `packed_sequence=True`, and for FP8 variants uses `seq_length=4096` (same as GB300). However, it lacks the padding settings that GB300 explicitly includes, with a comment explaining they are "required for CUDA graphs and avoids NaN issues in attention kernels."

Add these lines before the return statement:

```python
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
```
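A minimal, self-contained sketch of the suggested change, using a stand-in config object (`SimpleNamespace`) since the real `cfg` is built by the NeMo performance scripts; the attribute layout mirrors the paths in the suggestion but is otherwise hypothetical:

```python
from types import SimpleNamespace

# Stand-in for the real finetune config object; only the attribute paths
# used in the suggestion (cfg.dataset.packed_sequence_specs and
# cfg.dataset.dataset_kwargs) are modeled here.
cfg = SimpleNamespace(
    dataset=SimpleNamespace(
        packed_sequence_specs=SimpleNamespace(pad_cu_seqlens=False),
        dataset_kwargs={},
    )
)

# The two settings the review asks to add before the return statement:
# per the comment quoted from the GB300 config, they are required for
# CUDA graphs and avoid NaN issues in attention kernels.
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
```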
🧹 Nitpick comments (1)
scripts/performance/configs/llama/llama3_llm_finetune.py (1)
234-235: Consider making the precision-dependent seq_length selection more explicit/extensible.

The `if precision.lower() == "bf16" else 4096` pattern is concise but would silently assign `4096` to any future precision string (e.g., `"nvfp4"`). This is fine for now, since the workload base configs only define BF16, FP8_CS, and FP8_MX variants for GB200 LoRA, but it is worth noting if more precisions are added later.
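One way to make the selection explicit is a lookup table that fails loudly on unknown precisions. The sketch below is illustrative: the precision names follow the review, the FP8 value (4096) is stated in the review, and the BF16 entry is a placeholder rather than the actual configured value.

```python
# Illustrative precision -> seq_length table; the BF16 value below is a
# hypothetical placeholder (the real value lives in the config file).
SEQ_LENGTH_BY_PRECISION = {
    "bf16": 2048,    # placeholder, not taken from the actual config
    "fp8_cs": 4096,
    "fp8_mx": 4096,
}

def seq_length_for(precision: str) -> int:
    """Return the seq_length for a precision, raising on unknown strings
    instead of silently falling through to a default."""
    key = precision.lower()
    if key not in SEQ_LENGTH_BY_PRECISION:
        raise ValueError(f"no seq_length defined for precision {precision!r}")
    return SEQ_LENGTH_BY_PRECISION[key]
```

With this shape, an unmapped string such as `"nvfp4"` raises a `ValueError` rather than silently receiving `4096`.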
Tested and updated golden values
beep boop [🤖]: Hi @rhmukundan 👋,