cp: Fix LLAMA3 LoRa TFLOPs Formula (2416) into r0.3.0 #2533

svcnvidia-nemo-ci wants to merge 1 commit into r0.3.0 from

Conversation
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 276b77d
📝 Walkthrough

This change adds a LoRA-aware FLOPs calculation to the training utilities. When LoRA is detected, it bypasses the standard model-specific TFLOPS method and instead uses a specialized computation path that calculates frozen and unfrozen FLOPs separately, using predefined statistics and model configuration dimensions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/training/utils/flop_utils.py`:
- Around line 204-215: The LoRA FLOPs calculation uses cfg.model.vocab_size
directly which undercounts when vocab padding is applied; compute the padded
vocab size with the existing helper (call calculate_padded_vocab_size with the
model vocab) and replace the plain vocab_size term used inside
model_flops_frozen's logits factor (the expression involving avg_tokens,
n_layers, hs, ... , 6 * vocab_size / (n_layers * hs)) so the logits TFLOPs use
the padded vocabulary size consistently with the non‑LoRA path.
- Around line 190-219: The file fails ruff formatting; run the formatter and
commit the changes: run ruff-format (or pre-commit run --all-files) on
src/megatron/bridge/training/utils/flop_utils.py, fix formatting diffs around
the is_lora block (including _LORA_SEQ_STATS, the seq_len lookup, and the
expressions computing model_flops_frozen, model_flops_unfrozen and the return
line), and re-commit the formatted file so CI no longer reports a formatting
delta.
- Around line 196-199: Replace the hard raise when seq_len is missing from
_LORA_SEQ_STATS with a graceful fallback: log a warning (use the module logger
or warnings.warn) indicating missing LoRA stats for seq_len, then fall back to
the standard transformer path by deriving reasonable defaults for avg_seqlen2
and avg_tokens (e.g., estimate avg_seqlen2 = seq_len * seq_len and avg_tokens =
seq_len, or call an existing transformer stats helper if available) instead of
raising; keep references to _LORA_SEQ_STATS, seq_len, avg_seqlen2, and
avg_tokens so the rest of the function can continue using those values.
```python
if is_lora:
    _LORA_SEQ_STATS = {
        4096: (842603, 4096),
        2048: (488991, 2030),
    }
    seq_len = cfg.model.seq_length
    if seq_len not in _LORA_SEQ_STATS:
        raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
    avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]

    hs = cfg.model.hidden_size
    n_layers = cfg.model.num_layers
    n_heads = cfg.model.num_attention_heads
    ffn_hs = cfg.model.ffn_hidden_size
    vocab_size = cfg.model.vocab_size

    model_flops_frozen = (
        avg_tokens
        * n_layers
        * hs**2
        * (
            12
            + 12 * num_query_groups / n_heads
            + 18 * ffn_hs / hs
            + 6 * vocab_size / (n_layers * hs)
        )
    )
    model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)

    return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
```
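To make the formula concrete, here is a small standalone sketch that plugs Llama3-8B-like dimensions into the same arithmetic. The dimension values below are illustrative assumptions for a demo run, not values taken from this PR:

```python
# Standalone sketch of the LoRA FLOPs formula above, with the config
# lookups replaced by plain parameters. Example dimensions are assumed
# (Llama3-8B-like) purely for illustration.
def lora_flops(batch_size, avg_tokens, avg_seqlen2, hs, n_layers,
               n_heads, num_query_groups, ffn_hs, vocab_size):
    # Frozen-weight FLOPs: attention (incl. GQA), MLP, and logits terms.
    model_flops_frozen = (
        avg_tokens
        * n_layers
        * hs**2
        * (
            12
            + 12 * num_query_groups / n_heads
            + 18 * ffn_hs / hs
            + 6 * vocab_size / (n_layers * hs)
        )
    )
    # Unfrozen (LoRA) FLOPs scale with the sum of squared sequence lengths.
    model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
    # Frozen weights skip the weight-gradient backward pass, hence 2/3.
    return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)

flops = lora_flops(
    batch_size=1, avg_tokens=4096, avg_seqlen2=842603,
    hs=4096, n_layers=32, n_heads=32, num_query_groups=8,
    ffn_hs=14336, vocab_size=128256,
)
print(f"{flops:.3e}")  # on the order of 1e14 model FLOPs per step
```

With these assumed dimensions the frozen term dominates the unfrozen LoRA term by roughly two orders of magnitude, which is why the 2/3 discount on the frozen path matters for the reported TFLOPS.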
Please run formatting before merge (ruff-format).
CI already reports this file was reformatted by the pre-commit hook; please run pre-commit run --all-files and commit the formatting delta.
As per coding guidelines: "Use ruff for linting and formatting Python code".
🧰 Tools
🪛 GitHub Actions: CICD NeMo
[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.
🪛 Ruff (0.15.2)
[warning] 197-197: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 190 - 219, The
file fails ruff formatting; run the formatter and commit the changes: run
ruff-format (or pre-commit run --all-files) on
src/megatron/bridge/training/utils/flop_utils.py, fix formatting diffs around
the is_lora block (including _LORA_SEQ_STATS, the seq_len lookup, and the
expressions computing model_flops_frozen, model_flops_unfrozen and the return
line), and re-commit the formatted file so CI no longer reports a formatting
delta.
```python
if seq_len not in _LORA_SEQ_STATS:
    raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]
```
Avoid hard-failing on unsupported LoRA seq_length values.
Line 196–197 currently raises for any seq_length not present in _LORA_SEQ_STATS, which can stop otherwise valid training runs just for metrics computation. Prefer a graceful fallback to the standard transformer path when stats are missing.
Proposed fix

```diff
 if is_lora:
     _LORA_SEQ_STATS = {
         ...
     }
     seq_len = cfg.model.seq_length
-    if seq_len not in _LORA_SEQ_STATS:
-        raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
-    avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]
+    seq_stats = _LORA_SEQ_STATS.get(seq_len)
+    if seq_stats is None:
+        # Fallback to standard transformer FLOPs path below when stats are unavailable.
+        pass
+    else:
+        avg_seqlen2, avg_tokens = seq_stats

-    hs = cfg.model.hidden_size
-    n_layers = cfg.model.num_layers
-    n_heads = cfg.model.num_attention_heads
-    ffn_hs = cfg.model.ffn_hidden_size
-    vocab_size = cfg.model.vocab_size
+        hs = cfg.model.hidden_size
+        n_layers = cfg.model.num_layers
+        n_heads = cfg.model.num_attention_heads
+        ffn_hs = cfg.model.ffn_hidden_size
+        vocab_size = cfg.model.vocab_size

-    model_flops_frozen = (
-        avg_tokens
-        * n_layers
-        * hs**2
-        * (
-            12
-            + 12 * num_query_groups / n_heads
-            + 18 * ffn_hs / hs
-            + 6 * vocab_size / (n_layers * hs)
-        )
-    )
-    model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
+        model_flops_frozen = (
+            avg_tokens
+            * n_layers
+            * hs**2
+            * (
+                12
+                + 12 * num_query_groups / n_heads
+                + 18 * ffn_hs / hs
+                + 6 * vocab_size / (n_layers * hs)
+            )
+        )
+        model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)

-    return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
+        return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
```

🧰 Tools
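One way to implement the suggested warn-and-fall-back behavior in isolation is a small lookup helper. The function name and the seq_len-derived defaults below are illustrative assumptions, not the repository's actual code:

```python
import warnings

# Measured (avg_seqlen2, avg_tokens) stats per supported seq_length,
# as in the reviewed snippet.
_LORA_SEQ_STATS = {
    4096: (842603, 4096),
    2048: (488991, 2030),
}

def lookup_lora_seq_stats(seq_len):
    """Return (avg_seqlen2, avg_tokens) for seq_len, warning and falling
    back to rough seq_len-derived estimates when no measured stats exist."""
    stats = _LORA_SEQ_STATS.get(seq_len)
    if stats is None:
        warnings.warn(
            f"No LoRA stats for seq_length={seq_len}; "
            "falling back to seq_len-derived estimates."
        )
        # Rough defaults: treat every sample as exactly seq_len tokens.
        return seq_len * seq_len, seq_len
    return stats

print(lookup_lora_seq_stats(2048))  # (488991, 2030)
print(lookup_lora_seq_stats(1024))  # warns, then (1048576, 1024)
```

This keeps metrics computation from aborting a training run while still surfacing that the reported TFLOPS are estimates rather than measured values.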
🪛 Ruff (0.15.2)
[warning] 197-197: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 196 - 199,
Replace the hard raise when seq_len is missing from _LORA_SEQ_STATS with a
graceful fallback: log a warning (use the module logger or warnings.warn)
indicating missing LoRA stats for seq_len, then fall back to the standard
transformer path by deriving reasonable defaults for avg_seqlen2 and avg_tokens
(e.g., estimate avg_seqlen2 = seq_len * seq_len and avg_tokens = seq_len, or
call an existing transformer stats helper if available) instead of raising; keep
references to _LORA_SEQ_STATS, seq_len, avg_seqlen2, and avg_tokens so the rest
of the function can continue using those values.
```python
vocab_size = cfg.model.vocab_size

model_flops_frozen = (
    avg_tokens
    * n_layers
    * hs**2
    * (
        12
        + 12 * num_query_groups / n_heads
        + 18 * ffn_hs / hs
        + 6 * vocab_size / (n_layers * hs)
    )
```
Use padded vocab size in LoRA FLOPs math for consistency.
Line 204 uses cfg.model.vocab_size, while the non-LoRA path uses calculate_padded_vocab_size(...) for logits FLOPs. This can undercount TFLOPs when vocab padding is enabled.
Proposed fix

```diff
-vocab_size = cfg.model.vocab_size
+vocab_size = calculate_padded_vocab_size(
+    cfg.model.vocab_size,
+    cfg.model.make_vocab_size_divisible_by,
+    cfg.model.tensor_model_parallel_size,
+    logging_enabled=False,
+)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
vocab_size = calculate_padded_vocab_size(
    cfg.model.vocab_size,
    cfg.model.make_vocab_size_divisible_by,
    cfg.model.tensor_model_parallel_size,
    logging_enabled=False,
)

model_flops_frozen = (
    avg_tokens
    * n_layers
    * hs**2
    * (
        12
        + 12 * num_query_groups / n_heads
        + 18 * ffn_hs / hs
        + 6 * vocab_size / (n_layers * hs)
    )
```
🧰 Tools
🪛 GitHub Actions: CICD NeMo
[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 204 - 215, The
LoRA FLOPs calculation uses cfg.model.vocab_size directly which undercounts when
vocab padding is applied; compute the padded vocab size with the existing helper
(call calculate_padded_vocab_size with the model vocab) and replace the plain
vocab_size term used inside model_flops_frozen's logits factor (the expression
involving avg_tokens, n_layers, hs, ... , 6 * vocab_size / (n_layers * hs)) so
the logits TFLOPs use the padded vocabulary size consistently with the non‑LoRA
path.
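For context on why padding changes the logits FLOPs: vocab padding typically rounds the vocabulary up to a multiple of `make_vocab_size_divisible_by` times the tensor-parallel size, so every rank holds an equally sized vocab shard. A minimal sketch of that rounding follows; the function name and behavior are an assumption about what the repository's `calculate_padded_vocab_size` helper does, not its actual implementation:

```python
def padded_vocab_size(vocab_size, divisible_by, tp_size):
    # Round vocab_size up to the nearest multiple of divisible_by * tp_size,
    # so each tensor-parallel rank gets an equally sized vocab shard.
    multiple = divisible_by * tp_size
    return ((vocab_size + multiple - 1) // multiple) * multiple

# Hypothetical example: a 128256-entry vocab, divisible-by 128, TP size 8.
print(padded_vocab_size(128256, 128, 8))  # 129024
```

Because the padded size is strictly larger whenever padding applies, using the raw `cfg.model.vocab_size` in the logits term undercounts FLOPs relative to the non-LoRA path.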
merged into #2509