cp: Updating Configs for LLAMA3 70B LoRa (2292) into r0.3.0#2311

Merged
ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2292-r0.3.0 on Feb 10, 2026
Conversation

@ko3n1g
Contributor

@ko3n1g ko3n1g commented Feb 10, 2026

beep boop [🤖]: Hi @rhmukundan 👋,

we've cherry-picked #2292 into r0.3.0 for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • Chores
    • Revised Llama3 70B model configuration settings across multiple hardware platforms (GB300, GB200, H100). Updated sequence length handling with dynamic precision-based configuration, adjusted parallelism and batch size parameters, and introduced new hardware-specific configuration variants with CUDA graph support.

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g
Contributor Author

ko3n1g commented Feb 10, 2026

/ok to test 81eddd8

@coderabbitai
Contributor

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

This pull request updates sequence length, padding, and parallelism configurations for Llama3 70B models across GB300, GB200, and H100 hardware variants. Sequence length increased to 4096 with padding support for some configs, while tensor and pipeline parallelism settings are reduced to 1 for improved efficiency.

Changes

Cohort / File(s) | Summary

Sequence Length & Padding Config
scripts/performance/configs/llama/llama3_llm_finetune.py
  • GB300 LoRA config: seq_length increased from 2048 to 4096, with padding configuration added (pad_cu_seqlens, pad_to_max_length).
  • GB200 LoRA config: dynamic seq_length based on precision (2048 for bf16, 4096 for others).

Parallelism & Batch Size Tuning
scripts/performance/configs/llama/llama3_workload_base_configs.py
  • GB300/GB200 LoRA: tensor/pipeline/context parallelism reduced to 1; GB300 global batch size reduced to 32.
  • GB200 FP8 CS: new explicit replace() variant with adjusted parallelism and batch settings.
  • H100 LoRA: tensor parallelism reduced to 1; context parallelism added.
  • H100 BF16: recompute_num_layers reduced from 2 to 1.
  • H100 FP8 CS: now uses replace() with tensor_model_parallel_size=2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

r0.3.0, performance

Suggested reviewers

  • malay-nagda
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes ⚠️ Warning — The PR contains significant LLAMA3 70B LoRa configuration changes (sequence length 2048→4096, batch size 64→32, parallelism adjustments), but the PR description lacks test results, performance benchmarks, or regression analysis. Resolution: update the PR description with test results, before-and-after performance metrics, hardware/precision context, and validation evidence that the configuration changes do not cause training regressions.

✅ Passed checks (3 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly identifies the main change as updating configurations for LLAMA3 70B LoRa, directly aligning with the raw summary, which shows configuration updates to both llama3_llm_finetune.py and llama3_workload_base_configs.py.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/configs/llama/llama3_llm_finetune.py (1)

234-246: ⚠️ Potential issue | 🟠 Major

Add pad_cu_seqlens and pad_to_max_length settings to GB200 LoRA config for CUDA graph compatibility.

The GB200 LoRA config enables cuda_graph_impl="transformer_engine" with cuda_graph_scope="mlp" (identical to GB300), uses packed_sequence=True, and for FP8 variants uses seq_length=4096 (same as GB300). However, it lacks the padding settings that GB300 explicitly includes with a comment explaining they are "required for CUDA graphs and avoids NaN issues in attention kernels."

Add these lines before the return statement:

cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
🧹 Nitpick comments (1)
scripts/performance/configs/llama/llama3_llm_finetune.py (1)

234-235: Consider making the precision-dependent seq_length selection more explicit/extensible.

The `2048 if precision.lower() == "bf16" else 4096` pattern is concise but would silently assign 4096 to any future precision string (e.g., "nvfp4"). This is fine for now, since the workload base configs only define BF16, FP8_CS, and FP8_MX variants for GB200 LoRA, but worth noting if more precisions are added later.
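One way to address this nitpick is an explicit precision-to-length mapping that fails loudly on unknown precisions. A sketch under the same assumptions as the review comment; the names SEQ_LENGTH_BY_PRECISION and seq_length_for are hypothetical, not part of the actual config files.

```python
# Explicit mapping: adding a new precision (e.g. "nvfp4") without a deliberate
# seq_length choice becomes a hard error instead of silently getting 4096.
SEQ_LENGTH_BY_PRECISION = {
    "bf16": 2048,
    "fp8_cs": 4096,
    "fp8_mx": 4096,
}


def seq_length_for(precision: str) -> int:
    try:
        return SEQ_LENGTH_BY_PRECISION[precision.lower()]
    except KeyError:
        raise ValueError(
            f"No seq_length defined for precision {precision!r}; "
            "add it to SEQ_LENGTH_BY_PRECISION explicitly."
        )
```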

@ko3n1g
Contributor Author

ko3n1g commented Feb 10, 2026

Tested and updated golden values

