
Updating Configs for LLAMA3 70B LoRa#2292

Merged
ko3n1g merged 2 commits into main from rmukundan/update_llama3_lora_base_configs
Feb 10, 2026

Conversation

@rhmukundan
Contributor

@rhmukundan rhmukundan commented Feb 9, 2026

Summary by CodeRabbit

  • New Features

    • Added new H100 FP8 precision variant for Llama3 70B fine-tuning configuration.
  • Updates

    • Optimized fine-tuning configurations for Llama3 70B across GB300, GB200, and H100 hardware platforms with improved parameter settings.

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
@rhmukundan rhmukundan self-assigned this Feb 9, 2026

copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Configuration updates for Llama3 fine-tuning and LoRA training on GB300, GB200, and H100 hardware. Changes include dynamic sequence-length calculation based on precision, plus adjustments to model-parallelism parameters (tensor, pipeline, and context parallel sizes) and batch sizes across hardware variants.
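The precision-dependent sequence-length selection described above can be sketched as a small helper. This is an illustrative sketch only; the name `lora_seq_length` and its signature are assumptions, and the actual logic lives inline in llama3_llm_finetune.py:

```python
def lora_seq_length(precision: str) -> int:
    """Pick the GB200 LoRA fine-tuning sequence length from the precision recipe.

    BF16 keeps the shorter 2048-token sequence; every other precision
    (FP8 current-scaling, FP8 MX, ...) gets 4096, matching this PR's change.
    """
    return 2048 if precision.lower() == "bf16" else 4096
```

As the review comment below notes, the `else` branch is a catch-all, so any unrecognized precision string also lands on 4096.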

Changes

  • Sequence Length Configuration — scripts/performance/configs/llama/llama3_llm_finetune.py
    Updated seq_length handling: GB300 LoRA increased from 2048 to 4096; GB200 LoRA now uses a dynamic seq_length (2048 for bf16 precision, 4096 otherwise). Added a logic variable to compute precision-dependent values.
  • Parallelism Configuration Updates — scripts/performance/configs/llama/llama3_workload_base_configs.py
    Restructured LoRA 70B configs across all hardware. GB300 base reduced parallelism to 1 across tensor/pipeline/context; FP8_MX_V1 is now derived from the base with pipeline_parallel=2. GB200 reduced to 1/1/1 parallelism; FP8_CS_V1 is explicitly set with pipeline_parallel=2, micro_batch=1, and global_batch=32. H100 base adjusted tensor_parallel to 1 and BF16_V1 recompute_num_layers reduced to 1; a new FP8_CS_V1 variant was added with tensor_parallel=2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR contains significant configuration changes (seq_length, parallelism, batch sizes) without documented test results or performance validation. Resolution: add test results documenting performance, convergence validation, and before/after benchmarks for the configuration changes.

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'Updating Configs for LLAMA3 70B LoRa' directly corresponds to the main changes in the pull request, which update configuration values for LLAMA3 70B LoRA models across multiple hardware platforms (GB300, GB200, H100).
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
scripts/performance/configs/llama/llama3_llm_finetune.py (1)

234-241: Comment says "FP8 variants" but the else branch is a catch-all.

The comment on line 234 says "FP8 variants use seq_length=4096", but the code defaults any non-bf16 precision to 4096. If a non-FP8 precision (e.g., nvfp4) is ever passed to this function, it will silently receive seq_length=4096 as an unintended default that the developer may never notice.

Consider either making the condition explicit or broadening the comment to match actual behavior.

Option: make the precision check explicit
-    # BF16 uses seq_length=2048, FP8 variants use seq_length=4096
-    seq_length = 2048 if precision.lower() == "bf16" else 4096
+    # BF16 uses seq_length=2048, all other precisions (FP8 CS, FP8 MX, etc.) use seq_length=4096
+    seq_length = 2048 if precision.lower() == "bf16" else 4096

Or, for a stricter approach:

-    # BF16 uses seq_length=2048, FP8 variants use seq_length=4096
-    seq_length = 2048 if precision.lower() == "bf16" else 4096
+    if precision.lower() == "bf16":
+        seq_length = 2048
+    elif precision.lower() in ("fp8_cs", "fp8_mx"):
+        seq_length = 4096
+    else:
+        raise ValueError(f"Unsupported precision for GB200 LORA config: {precision}")
scripts/performance/configs/llama/llama3_workload_base_configs.py (1)

574-600: GB200 FP8_CS_V1 no longer derives from _LLAMA3_70B_LORA_CONFIG_GB200 — verify this is intentional.

LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 (lines 588-599) is now a fully standalone config built from BASE_LLAMA3_70B_CONFIG rather than a replace(...) of _LLAMA3_70B_LORA_CONFIG_GB200. While the key fields (cuda_graph_impl/scope, peft, etc.) are replicated, the global_batch_size diverges: the base GB200 config uses GBS=64 while FP8_CS_V1 uses GBS=32.

This also cascades to LLAMA3_70B_LORA_CONFIG_GB200_FP8_MX_V1 (line 600) which is aliased to FP8_CS_V1.

If this is intentional (FP8 on GB200 needs PP=2 and reduced GBS), the structure is fine. A minor improvement would be to derive from the private base to reduce duplication:

♻️ Reduce duplication by deriving from the base
-LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 = replace(
-    BASE_LLAMA3_70B_CONFIG,
-    num_gpus=8,
-    peft="lora",
-    tensor_model_parallel_size=1,
-    pipeline_model_parallel_size=2,
-    context_parallel_size=1,
-    micro_batch_size=1,
-    global_batch_size=32,
-    cuda_graph_impl="transformer_engine",
-    cuda_graph_scope="mlp",
-)
+LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 = replace(
+    _LLAMA3_70B_LORA_CONFIG_GB200,
+    pipeline_model_parallel_size=2,
+    global_batch_size=32,
+)


@rhmukundan rhmukundan requested a review from erhoo82 February 9, 2026 23:30
@malay-nagda malay-nagda added the performance, performance/release (Performance items related with NeMo release), performance/optimize (Performance optimization tracking), and r0.3.0 (Cherry-pick label for r0.3.0 release branch) labels Feb 10, 2026
@rhmukundan
Contributor Author

/ok to test a6ffa2b

@rhmukundan rhmukundan enabled auto-merge (squash) February 10, 2026 17:01
@ko3n1g ko3n1g disabled auto-merge February 10, 2026 17:34
@ko3n1g ko3n1g merged commit aa10ef7 into main Feb 10, 2026
21 checks passed
@ko3n1g ko3n1g deleted the rmukundan/update_llama3_lora_base_configs branch February 10, 2026 17:34
ko3n1g pushed a commit that referenced this pull request Feb 10, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
rhmukundan added a commit that referenced this pull request Feb 10, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>

Labels

  • performance
  • performance/optimize — Performance optimization tracking
  • performance/release — Performance items related with NeMo release
  • r0.3.0 — Cherry-pick label for r0.3.0 release branch


3 participants