38 changes: 36 additions & 2 deletions src/megatron/bridge/training/utils/flop_utils.py
@@ -14,14 +14,17 @@
 
 import torch.nn.functional as F
 
+from megatron.bridge.peft.lora import LoRA
 from megatron.bridge.training.config import ConfigContainer
 from megatron.bridge.utils.vocab_utils import calculate_padded_vocab_size
 
 
 def num_floating_point_operations(cfg: ConfigContainer, batch_size: int = 1):
     """Return the number of floating point operations"""
-    # If the model provider has a custom TFLOPS calculation method, use it.
-    if hasattr(cfg.model, "_get_num_floating_point_operations"):
+    peft = getattr(cfg, "peft", None)
+    is_lora = isinstance(peft, LoRA)
+    # If the model provider has a custom TFLOPS calculation method, use it (non-LoRA only).
+    if not is_lora and hasattr(cfg.model, "_get_num_floating_point_operations"):
         return cfg.model._get_num_floating_point_operations(batch_size)
 
     def calculate_layer_counts():
@@ -183,6 +186,37 @@ def transformer_flops():
         num_query_groups = (
             cfg.model.num_attention_heads if cfg.model.num_query_groups is None else cfg.model.num_query_groups
         )
 
+        if is_lora:
+            _LORA_SEQ_STATS = {
+                4096: (842603, 4096),
+                2048: (488991, 2030),
+            }
+            seq_len = cfg.model.seq_length
+            if seq_len not in _LORA_SEQ_STATS:
+                raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
+            avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]
Comment on lines +196 to +199
⚠️ Potential issue | 🟠 Major

Avoid hard-failing on unsupported LoRA seq_length values.

Lines 196–197 currently raise for any seq_length not present in _LORA_SEQ_STATS, which can abort otherwise valid training runs purely for metrics computation. Prefer a graceful fallback to the standard transformer path when stats are missing.

Proposed fix
         if is_lora:
             _LORA_SEQ_STATS = {
                 4096: (842603, 4096),
                 2048: (488991, 2030),
             }
             seq_len = cfg.model.seq_length
-            if seq_len not in _LORA_SEQ_STATS:
-                raise ValueError(f"No LoRA stats for seq_length={seq_len}. Add it to _LORA_SEQ_STATS.")
-            avg_seqlen2, avg_tokens = _LORA_SEQ_STATS[seq_len]
+            seq_stats = _LORA_SEQ_STATS.get(seq_len)
+            if seq_stats is None:
+                # Fallback to standard transformer FLOPs path below when stats are unavailable.
+                pass
+            else:
+                avg_seqlen2, avg_tokens = seq_stats
 
-            hs = cfg.model.hidden_size
-            n_layers = cfg.model.num_layers
-            n_heads = cfg.model.num_attention_heads
-            ffn_hs = cfg.model.ffn_hidden_size
-            vocab_size = cfg.model.vocab_size
+                hs = cfg.model.hidden_size
+                n_layers = cfg.model.num_layers
+                n_heads = cfg.model.num_attention_heads
+                ffn_hs = cfg.model.ffn_hidden_size
+                vocab_size = cfg.model.vocab_size
 
-            model_flops_frozen = (
-                avg_tokens
-                * n_layers
-                * hs**2
-                * (
-                    12
-                    + 12 * num_query_groups / n_heads
-                    + 18 * ffn_hs / hs
-                    + 6 * vocab_size / (n_layers * hs)
-                )
-            )
-            model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
+                model_flops_frozen = (
+                    avg_tokens
+                    * n_layers
+                    * hs**2
+                    * (
+                        12
+                        + 12 * num_query_groups / n_heads
+                        + 18 * ffn_hs / hs
+                        + 6 * vocab_size / (n_layers * hs)
+                    )
+                )
+                model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
 
-            return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
+                return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 197-197: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 196-199: replace the hard raise when seq_len is missing from _LORA_SEQ_STATS with a
graceful fallback: log a warning (use the module logger or warnings.warn)
indicating missing LoRA stats for seq_len, then fall back to the standard
transformer path by deriving reasonable defaults for avg_seqlen2 and avg_tokens
(e.g., estimate avg_seqlen2 = seq_len * seq_len and avg_tokens = seq_len, or
call an existing transformer stats helper if available) instead of raising; keep
references to _LORA_SEQ_STATS, seq_len, avg_seqlen2, and avg_tokens so the rest
of the function can continue using those values.

+            hs = cfg.model.hidden_size
+            n_layers = cfg.model.num_layers
+            n_heads = cfg.model.num_attention_heads
+            ffn_hs = cfg.model.ffn_hidden_size
+            vocab_size = cfg.model.vocab_size
+
+            model_flops_frozen = (
+                avg_tokens
+                * n_layers
+                * hs**2
+                * (
+                    12
+                    + 12 * num_query_groups / n_heads
+                    + 18 * ffn_hs / hs
+                    + 6 * vocab_size / (n_layers * hs)
+                )
Comment on lines +204 to +215

⚠️ Potential issue | 🟠 Major

Use padded vocab size in LoRA FLOPs math for consistency.

Line 204 uses cfg.model.vocab_size, while the non-LoRA path uses calculate_padded_vocab_size(...) for logits FLOPs. This can undercount TFLOPs when vocab padding is enabled.

Proposed fix
-            vocab_size = cfg.model.vocab_size
+            vocab_size = calculate_padded_vocab_size(
+                cfg.model.vocab_size,
+                cfg.model.make_vocab_size_divisible_by,
+                cfg.model.tensor_model_parallel_size,
+                logging_enabled=False,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            vocab_size = cfg.model.vocab_size
-            model_flops_frozen = (
-                avg_tokens
-                * n_layers
-                * hs**2
-                * (
-                    12
-                    + 12 * num_query_groups / n_heads
-                    + 18 * ffn_hs / hs
-                    + 6 * vocab_size / (n_layers * hs)
-                )
+            vocab_size = calculate_padded_vocab_size(
+                cfg.model.vocab_size,
+                cfg.model.make_vocab_size_divisible_by,
+                cfg.model.tensor_model_parallel_size,
+                logging_enabled=False,
+            )
+            model_flops_frozen = (
+                avg_tokens
+                * n_layers
+                * hs**2
+                * (
+                    12
+                    + 12 * num_query_groups / n_heads
+                    + 18 * ffn_hs / hs
+                    + 6 * vocab_size / (n_layers * hs)
+                )
🧰 Tools
🪛 GitHub Actions: CICD NeMo

[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 204-215: the LoRA FLOPs calculation uses cfg.model.vocab_size directly, which undercounts when
vocab padding is applied; compute the padded vocab size with the existing helper
(call calculate_padded_vocab_size with the model vocab) and replace the plain
vocab_size term used inside model_flops_frozen's logits factor (the expression
involving avg_tokens, n_layers, hs, ... , 6 * vocab_size / (n_layers * hs)) so
the logits TFLOPs use the padded vocabulary size consistently with the non‑LoRA
path.
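For reference, Megatron-style vocab padding conventionally rounds the vocabulary up to the next multiple of `make_vocab_size_divisible_by * tensor_model_parallel_size`; the sketch below assumes `calculate_padded_vocab_size` follows that convention (the helper here is illustrative, not the bridge's actual implementation):

```python
def padded_vocab_size_sketch(vocab_size, divisible_by, tp_size):
    """Round vocab_size up to a multiple of divisible_by * tp_size."""
    multiple = divisible_by * tp_size
    return ((vocab_size + multiple - 1) // multiple) * multiple


# GPT-2-sized vocab, divisible-by 128, tensor parallel size 8:
print(padded_vocab_size_sketch(50257, 128, 8))  # 51200
```

The gap between raw and padded size (943 extra rows here) is exactly what the logits FLOPs term misses when the plain vocab size is used.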

+            )
+            model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
+
+            return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
Comment on lines +190 to +219

⚠️ Potential issue | 🟡 Minor

Please run formatting before merge (ruff-format).

CI already reports this file was reformatted by the pre-commit hook; please run pre-commit run --all-files and commit the formatting delta.

As per coding guidelines: "Use ruff for linting and formatting Python code".

🧰 Tools
🪛 GitHub Actions: CICD NeMo

[error] 207-214: pre-commit ruff-format hook failed: 1 file reformatted. Run 'pre-commit run --all-files' or commit again to apply formatting changes.

🪛 Ruff (0.15.2)

[warning] 197-197: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/flop_utils.py` around lines 190-219: the file fails ruff formatting; run the formatter and commit the changes: run
ruff-format (or pre-commit run --all-files) on
src/megatron/bridge/training/utils/flop_utils.py, fix formatting diffs around
the is_lora block (including _LORA_SEQ_STATS, the seq_len lookup, and the
expressions computing model_flops_frozen, model_flops_unfrozen and the return
line), and re-commit the formatted file so CI no longer reports a formatting
delta.
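The frozen/unfrozen split in this hunk can be checked as a standalone function. The sketch below transcribes the diff's arithmetic (the assertion inputs are illustrative values, not real model configs); the 2/3 factor reflects frozen weights performing only the forward and activation-gradient passes, skipping the weight-gradient pass:

```python
def lora_flops_sketch(batch_size, avg_tokens, avg_seqlen2, n_layers, hs,
                      n_heads, num_query_groups, ffn_hs, vocab_size):
    # Frozen (base-model) term, as written in the diff.
    model_flops_frozen = (
        avg_tokens
        * n_layers
        * hs**2
        * (
            12
            + 12 * num_query_groups / n_heads
            + 18 * ffn_hs / hs
            + 6 * vocab_size / (n_layers * hs)
        )
    )
    # Trainable (LoRA) attention term.
    model_flops_unfrozen = n_layers * hs**2 * (12 * avg_seqlen2 / hs)
    # Frozen params do 2 of the 3 backprop-era passes, hence the 2/3 factor.
    return batch_size * (model_flops_frozen * (2.0 / 3.0) + model_flops_unfrozen)
```

With unit-sized toy inputs the frozen factor collapses to 12 + 12 + 18 + 6 = 48, which makes the arithmetic easy to verify by hand.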

         # MoE.
         if cfg.model.num_moe_experts is None:
             # Every Transformer MLP is dense.