
cp: Fix LLAMA3 LoRa TFLOPs Formula (2416) into r0.3.0 #2533

Closed

svcnvidia-nemo-ci wants to merge 1 commit into r0.3.0 from cherry-pick-2416-r0.3.0

Conversation

Contributor

@svcnvidia-nemo-ci commented Feb 25, 2026

beep boop [🤖]: Hi @rhmukundan 👋,

we've cherry-picked #2416 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • New Features
    • Added LoRA model support to FLOPs calculations with specialized computation methods that separately track metrics for frozen and unfrozen model components.
    • When LoRA is enabled, calculations now use optimized computation paths for improved accuracy.

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci
Contributor Author

/ok to test 276b77d

Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

This change adds LoRA-aware FLOPs calculation to the training utilities. When LoRA is detected, it bypasses the standard model-specific TFLOPS method and instead uses a specialized computation path that calculates frozen and unfrozen FLOPs separately using predefined statistics and model configuration dimensions.

Changes

Cohort / File(s) / Summary

  • LoRA FLOPs Detection and Calculation — src/megatron/bridge/training/utils/flop_utils.py: added the LoRA type import and conditional branching in the transformer_flops function. When LoRA is active, a _LORA_SEQ_STATS mapping plus model config dimensions (hidden size, layers, heads, FFN size, vocab size) is used to compute frozen FLOPs (weighted 2/3) and unfrozen FLOPs separately, instead of invoking the model-specific TFLOPS method. Existing non-LoRA logic remains intact.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Fix LLAMA3 LoRa TFLOPs Formula #2416 — Directly related modification to the same FLOPs utility file implementing LoRA-specific branching logic with _LORA_SEQ_STATS mapping and model config-derived dimensions.

Suggested labels

performance, r0.3.0

Suggested reviewers

  • rhmukundan
  • guyueh1
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR lacks testing documentation and validation for the LoRA FLOPs formula fix despite modifying numerically sensitive calculations. Resolution: add unit/functional test results, before-and-after FLOPs verification for LLAMA3 LoRA configs, and address the three outstanding review comments on correctness issues.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly identifies this as a cherry-pick of a specific fix (the LLAMA3 LoRA TFLOPs formula fix from #2416) into a release branch (r0.3.0), which aligns with the changeset adding LoRA-aware FLOPs calculation logic.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.



coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/utils/flop_utils.py`:
- Around line 204-215: The LoRA FLOPs calculation uses cfg.model.vocab_size
directly which undercounts when vocab padding is applied; compute the padded
vocab size with the existing helper (call calculate_padded_vocab_size with the
model vocab) and replace the plain vocab_size term used inside
model_flops_frozen's logits factor (the expression involving avg_tokens,
n_layers, hs, ... , 6 * vocab_size / (n_layers * hs)) so the logits TFLOPs use
the padded vocabulary size consistently with the non‑LoRA path.
- Around line 190-219: The file fails ruff formatting; run the formatter and
commit the changes: run ruff-format (or pre-commit run --all-files) on
src/megatron/bridge/training/utils/flop_utils.py, fix formatting diffs around
the is_lora block (including _LORA_SEQ_STATS, the seq_len lookup, and the
expressions computing model_flops_frozen, model_flops_unfrozen and the return
line), and re-commit the formatted file so CI no longer reports a formatting
delta.
- Around line 196-199: Replace the hard raise when seq_len is missing from
_LORA_SEQ_STATS with a graceful fallback: log a warning (use the module logger
or warnings.warn) indicating missing LoRA stats for seq_len, then fall back to
the standard transformer path by deriving reasonable defaults for avg_seqlen2
and avg_tokens (e.g., estimate avg_seqlen2 = seq_len * seq_len and avg_tokens =
seq_len, or call an existing transformer stats helper if available) instead of
raising; keep references to _LORA_SEQ_STATS, seq_len, avg_seqlen2, and
avg_tokens so the rest of the function can continue using those values.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c55b3dd and 276b77d.

📒 Files selected for processing (1)
  • src/megatron/bridge/training/utils/flop_utils.py

Comment on lines +190 to +219
        if is_lora:
            _LORA_SEQ_STATS = {
                4096: (842603, 4096),
                2048: (488991, 2030),
            }
            seq_len = cfg.model.seq_length
            if seq_len not in _LORA_SEQ_STATS:
                raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
            avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]

            hs = cfg.model.hidden_size
            n_layers = cfg.model.num_layers
            n_heads = cfg.model.num_attention_heads
            ffn_hs = cfg.model.ffn_hidden_size
            vocab_size = cfg.model.vocab_size

            model_flops_frozen = (
                avg_tokens
                * n_layers
                * hs**2
                * (
                    12
                    + 12 * num_query_groups / n_heads
                    + 18 * ffn_hs / hs
                    + 6 * vocab_size / (n_layers * hs)
                )
            )
            model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)

            return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)

⚠️ Potential issue | 🟡 Minor

Please run formatting before merge (ruff-format).

CI already reports this file was reformatted by the pre-commit hook; please run pre-commit run --all-files and commit the formatting delta.

As per coding guidelines: "Use ruff for linting and formatting Python code".

🧰 Tools
🪛 GitHub Actions: CICD NeMo

[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.

🪛 Ruff (0.15.2)

[warning] 197-197: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +196 to +199
            if seq_len not in _LORA_SEQ_STATS:
                raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
            avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]


⚠️ Potential issue | 🟠 Major

Avoid hard-failing on unsupported LoRA seq_length values.

Line 196–197 currently raises for any seq_length not present in _LORA_SEQ_STATS, which can stop otherwise valid training runs just for metrics computation. Prefer a graceful fallback to the standard transformer path when stats are missing.

Proposed fix
-        if is_lora:
+        if is_lora:
             _LORA_SEQ_STATS = {
                 4096: (842603, 4096),
                 2048: (488991, 2030),
             }
             seq_len = cfg.model.seq_length
-            if seq_len not in _LORA_SEQ_STATS:
-                raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
-            avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]
+            seq_stats = _LORA_SEQ_STATS.get(seq_len)
+            if seq_stats is None:
+                # Fallback to standard transformer FLOPs path below when stats are unavailable.
+                pass
+            else:
+                avg_seqlen2, avg_tokens = seq_stats
 
-            hs = cfg.model.hidden_size
-            n_layers = cfg.model.num_layers
-            n_heads = cfg.model.num_attention_heads
-            ffn_hs = cfg.model.ffn_hidden_size
-            vocab_size = cfg.model.vocab_size
+                hs = cfg.model.hidden_size
+                n_layers = cfg.model.num_layers
+                n_heads = cfg.model.num_attention_heads
+                ffn_hs = cfg.model.ffn_hidden_size
+                vocab_size = cfg.model.vocab_size
 
-            model_flops_frozen = (
-                avg_tokens
-                * n_layers
-                * hs**2
-                * (
-                    12
-                    + 12 * num_query_groups / n_heads
-                    + 18 * ffn_hs / hs
-                    + 6 * vocab_size / (n_layers * hs)
+                model_flops_frozen = (
+                    avg_tokens
+                    * n_layers
+                    * hs**2
+                    * (
+                        12
+                        + 12 * num_query_groups / n_heads
+                        + 18 * ffn_hs / hs
+                        + 6 * vocab_size / (n_layers * hs)
+                    )
                 )
-            )
-            model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
+                model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
 
-            return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
+                return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
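The graceful-fallback pattern proposed above can also be sketched as a small lookup helper. The seq_len**2 and seq_len defaults are the estimates suggested in the review for unpacked batches, not behavior taken from the PR itself:

```python
import warnings

# Stats from the PR: seq_length -> (avg_seqlen2, avg_tokens).
_LORA_SEQ_STATS = {
    4096: (842603, 4096),
    2048: (488991, 2030),
}

def lookup_lora_stats(seq_len):
    """Return (avg_seqlen2, avg_tokens), estimating when no stats exist."""
    stats = _LORA_SEQ_STATS.get(seq_len)
    if stats is None:
        # Warn instead of raising so a metrics lookup never aborts a valid run.
        warnings.warn(
            f"No LoRA stats for seq_length={seq_len}; "
            "falling back to seq_len-derived estimates."
        )
        stats = (seq_len * seq_len, seq_len)  # assumed: unpacked sequences
    return stats
```

With this shape the rest of the function can keep using avg_seqlen2 and avg_tokens unchanged, matching the intent of the review comment.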
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 197-197: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +204 to +215
            vocab_size = cfg.model.vocab_size

            model_flops_frozen = (
                avg_tokens
                * n_layers
                * hs**2
                * (
                    12
                    + 12 * num_query_groups / n_heads
                    + 18 * ffn_hs / hs
                    + 6 * vocab_size / (n_layers * hs)
                )

⚠️ Potential issue | 🟠 Major

Use padded vocab size in LoRA FLOPs math for consistency.

Line 204 uses cfg.model.vocab_size, while the non-LoRA path uses calculate_padded_vocab_size(...) for logits FLOPs. This can undercount TFLOPs when vocab padding is enabled.
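For context, here is a minimal sketch of what such padding typically computes, assuming the common Megatron convention of rounding the vocabulary up to a multiple of make_vocab_size_divisible_by × tensor-parallel size; the repository's actual calculate_padded_vocab_size may differ in details (for instance, it also takes a logging_enabled flag):

```python
def padded_vocab_size(vocab_size, make_vocab_size_divisible_by, tp_size):
    """Round vocab_size up to a multiple of divisible_by * tp_size (sketch)."""
    multiple = make_vocab_size_divisible_by * tp_size
    # Ceiling division, then scale back up to the nearest multiple.
    return ((vocab_size + multiple - 1) // multiple) * multiple
```

With a Llama-3-style vocabulary of 128256, make_vocab_size_divisible_by=128, and tp_size=8, this pads to 129024, so using the raw 128256 in the logits term would undercount those FLOPs.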

Proposed fix
-            vocab_size = cfg.model.vocab_size
+            vocab_size = calculate_padded_vocab_size(
+                cfg.model.vocab_size,
+                cfg.model.make_vocab_size_divisible_by,
+                cfg.model.tensor_model_parallel_size,
+                logging_enabled=False,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            vocab_size = cfg.model.vocab_size
+            vocab_size = calculate_padded_vocab_size(
+                cfg.model.vocab_size,
+                cfg.model.make_vocab_size_divisible_by,
+                cfg.model.tensor_model_parallel_size,
+                logging_enabled=False,
+            )
             model_flops_frozen = (
                 avg_tokens
                 * n_layers
                 * hs**2
                 * (
                     12
                     + 12 * num_query_groups / n_heads
                     + 18 * ffn_hs / hs
                     + 6 * vocab_size / (n_layers * hs)
                 )
🧰 Tools
🪛 GitHub Actions: CICD NeMo

[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.


Contributor

ko3n1g commented Feb 25, 2026

merged into #2509

@ko3n1g ko3n1g closed this Feb 25, 2026
