
Updating Configs for LLAMA3 70B LoRa#2292

Merged
ko3n1g merged 2 commits into main from rmukundan/update_llama3_lora_base_configs
Feb 10, 2026

Conversation

@rhmukundan
Contributor

@rhmukundan rhmukundan commented Feb 9, 2026

Summary by CodeRabbit

  • New Features

    • Added new H100 FP8 precision variant for Llama3 70B fine-tuning configuration.
  • Updates

    • Optimized fine-tuning configurations for Llama3 70B across GB300, GB200, and H100 hardware platforms with improved parameter settings.

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
@rhmukundan rhmukundan self-assigned this Feb 9, 2026

copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Configuration updates for Llama3 fine-tuning and LoRA training on GB300, GB200, and H100 hardware. Changes include dynamic sequence-length calculation based on precision, plus adjustments to model-parallelism parameters (tensor, pipeline, and context parallel sizes) and batch sizes across hardware variants.
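The precision-dependent sequence-length selection described above can be sketched as a small helper. This is an illustrative sketch only; the name `lora_seq_length` and its signature are assumptions, and the actual logic lives inline in llama3_llm_finetune.py:

```python
def lora_seq_length(precision: str) -> int:
    """Pick the GB200 LoRA fine-tuning sequence length from the precision recipe.

    BF16 keeps the shorter 2048-token sequence; every other precision
    (FP8 current-scaling, FP8 MX, ...) gets 4096, matching this PR's change.
    """
    return 2048 if precision.lower() == "bf16" else 4096
```

As the review comment below notes, the `else` branch is a catch-all, so any unrecognized precision string also lands on 4096.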

Changes

  • Sequence Length Configuration — scripts/performance/configs/llama/llama3_llm_finetune.py
    Updated seq_length handling: GB300 LoRA increased from 2048 to 4096; GB200 LoRA now uses a dynamic seq_length (2048 for bf16 precision, 4096 otherwise). Added a logic variable to compute precision-dependent values.
  • Parallelism Configuration Updates — scripts/performance/configs/llama/llama3_workload_base_configs.py
    Restructured LoRA 70B configs across all hardware. GB300 base reduced parallelism to 1 across tensor/pipeline/context; FP8_MX_V1 is now derived from the base with pipeline_parallel=2. GB200 reduced to 1/1/1 parallelism; FP8_CS_V1 is explicitly set with pipeline_parallel=2, micro_batch=1, and global_batch=32. H100 base adjusted tensor_parallel to 1 and BF16_V1 recompute_num_layers reduced to 1; a new FP8_CS_V1 variant was added with tensor_parallel=2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR contains significant configuration changes (seq_length, parallelism, batch sizes) without documented test results or performance validation. Resolution: add test results documenting performance, convergence validation, and before/after benchmarks for the configuration changes.

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'Updating Configs for LLAMA3 70B LoRa' directly corresponds to the main changes in the pull request, which update configuration values for LLAMA3 70B LoRA models across multiple hardware platforms (GB300, GB200, H100).
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
scripts/performance/configs/llama/llama3_llm_finetune.py (1)

234-241: Comment says "FP8 variants" but the else branch is a catch-all.

The comment on line 234 says "FP8 variants use seq_length=4096", but the code defaults any non-bf16 precision to 4096. If a non-FP8 precision (e.g., nvfp4) is ever passed to this function, it will silently receive seq_length=4096 as an unintended default that the developer may never notice.

Consider either making the condition explicit or broadening the comment to match actual behavior.

Option: make the precision check explicit
-    # BF16 uses seq_length=2048, FP8 variants use seq_length=4096
-    seq_length = 2048 if precision.lower() == "bf16" else 4096
+    # BF16 uses seq_length=2048, all other precisions (FP8 CS, FP8 MX, etc.) use seq_length=4096
+    seq_length = 2048 if precision.lower() == "bf16" else 4096

Or, for a stricter approach:

-    # BF16 uses seq_length=2048, FP8 variants use seq_length=4096
-    seq_length = 2048 if precision.lower() == "bf16" else 4096
+    if precision.lower() == "bf16":
+        seq_length = 2048
+    elif precision.lower() in ("fp8_cs", "fp8_mx"):
+        seq_length = 4096
+    else:
+        raise ValueError(f"Unsupported precision for GB200 LORA config: {precision}")
scripts/performance/configs/llama/llama3_workload_base_configs.py (1)

574-600: GB200 FP8_CS_V1 no longer derives from _LLAMA3_70B_LORA_CONFIG_GB200 — verify this is intentional.

LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 (lines 588-599) is now a fully standalone config built from BASE_LLAMA3_70B_CONFIG rather than a replace(...) of _LLAMA3_70B_LORA_CONFIG_GB200. While the key fields (cuda_graph_impl/scope, peft, etc.) are replicated, the global_batch_size diverges: the base GB200 config uses GBS=64 while FP8_CS_V1 uses GBS=32.

This also cascades to LLAMA3_70B_LORA_CONFIG_GB200_FP8_MX_V1 (line 600) which is aliased to FP8_CS_V1.

If this is intentional (FP8 on GB200 needs PP=2 and reduced GBS), the structure is fine. A minor improvement would be to derive from the private base to reduce duplication:

♻️ Reduce duplication by deriving from the base
-LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 = replace(
-    BASE_LLAMA3_70B_CONFIG,
-    num_gpus=8,
-    peft="lora",
-    tensor_model_parallel_size=1,
-    pipeline_model_parallel_size=2,
-    context_parallel_size=1,
-    micro_batch_size=1,
-    global_batch_size=32,
-    cuda_graph_impl="transformer_engine",
-    cuda_graph_scope="mlp",
-)
+LLAMA3_70B_LORA_CONFIG_GB200_FP8_CS_V1 = replace(
+    _LLAMA3_70B_LORA_CONFIG_GB200,
+    pipeline_model_parallel_size=2,
+    global_batch_size=32,
+)


@rhmukundan rhmukundan requested a review from erhoo82 February 9, 2026 23:30
@malay-nagda malay-nagda added the performance, performance/release (Performance items related with NeMo release), performance/optimize (Performance optimization tracking), and r0.3.0 (Cherry-pick label for r0.3.0 release branch) labels Feb 10, 2026
@rhmukundan
Contributor Author

/ok to test a6ffa2b

@rhmukundan rhmukundan enabled auto-merge (squash) February 10, 2026 17:01
@ko3n1g ko3n1g disabled auto-merge February 10, 2026 17:34
@ko3n1g ko3n1g merged commit aa10ef7 into main Feb 10, 2026
21 checks passed
@ko3n1g ko3n1g deleted the rmukundan/update_llama3_lora_base_configs branch February 10, 2026 17:34
ko3n1g pushed a commit that referenced this pull request Feb 10, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
rhmukundan added a commit that referenced this pull request Feb 10, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>

Labels

  • performance
  • performance/optimize — Performance optimization tracking
  • performance/release — Performance items related with NeMo release
  • r0.3.0 — Cherry-pick label for r0.3.0 release branch


3 participants