Add refactored recipe files for pretrain configs of LLMs#2067
Force-pushed from 8449d49 to 6734170
/ok to test cfc6df8

/ok to test d7cfa85

/ok to test fb8938a

/ok to test ab9c99c
📝 Walkthrough

This PR introduces a centralized pretraining configuration helper (`_pretrain_common()`) and moves the recipes from kwargs-driven functions to explicit field assignments.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

The refactoring is systematic and pattern-consistent across 15+ files, but requires careful verification that default configurations are sensible and that no critical settings are lost in the transition from kwargs to explicit field assignments.

Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
/ok to test 1885a66
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)
19-23: Duplicate import of `NemotronNanoModelProvider9Bv2`.

The same class is imported twice:
- Line 19: `from megatron.bridge.models import NemotronNanoModelProvider9Bv2`
- Lines 20-22: `from megatron.bridge.models.nemotronh import (NemotronNanoModelProvider9Bv2, ...)`

Remove the redundant import on line 19.

Proposed fix

```diff
-from megatron.bridge.models import NemotronNanoModelProvider9Bv2
 from megatron.bridge.models.nemotronh import (
     NemotronNanoModelProvider9Bv2,
     NemotronNanoModelProvider12Bv2,
 )
```

src/megatron/bridge/recipes/nemotronh/nemotronh.py (1)
520-532: Incorrect TypedDict annotation for `recommended_kwargs`.

The `recommended_kwargs` dictionary is annotated as `NemotronHFinetuneKwargs`, which defines keys like `tensor_model_parallel_size`, `pipeline_model_parallel_size`, and `sequence_parallel` (inherited from `NemotronHCommonKwargs`). However, the dictionary uses the keys `tensor_parallelism`, `pipeline_parallelism`, and `sequence_parallelism` instead. While these keys match the function parameter names for `_nemotronh_finetune_common`, they violate the TypedDict contract and will cause type checking errors in mypy or pyright. Either rename the dictionary keys to match the TypedDict field names, or adjust the type annotation.
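To see why this mismatch is easy to miss, here is a minimal, self-contained sketch. The class and keys below are illustrative stand-ins, not the actual NemotronH definitions; the point is that a TypedDict only constrains static checkers, so the wrong keys pass silently at runtime:

```python
from typing import TypedDict

# Hypothetical mirror of the kwargs TypedDict; field names follow the
# pattern flagged above, not the real source.
class CommonKwargs(TypedDict, total=False):
    tensor_model_parallel_size: int
    pipeline_model_parallel_size: int
    sequence_parallel: bool

# The keys below use the *function parameter* names, so mypy/pyright
# flag every entry as not being a declared key of CommonKwargs:
recommended: CommonKwargs = {  # type: ignore
    "tensor_parallelism": 8,
    "pipeline_parallelism": 2,
    "sequence_parallelism": True,
}

# At runtime nothing is enforced, which is why the bug stays silent:
overlap = set(recommended) & set(CommonKwargs.__annotations__)
print(overlap)  # -> set(): no key matches the declared schema
```

Running it shows the dictionary shares no keys with the declared schema, which is exactly the mismatch the static checkers would report.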
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py`:
- Around line 152-157: The config currently leaves cfg.model.pipeline_dtype as
None while cfg.model.pipeline_model_parallel_size is >1; update both occurrences
of cfg.model.pipeline_dtype in the GPT-OSS recipe to be torch.bfloat16 when
cfg.model.pipeline_model_parallel_size > 1 (i.e., set cfg.model.pipeline_dtype =
torch.bfloat16), matching other PP>1 configs and ensuring this change is applied
to both places where pipeline_dtype is currently None.
In `@src/megatron/bridge/recipes/llama/llama3.py`:
- Around line 1013-1017: The CommOverlapConfig for cfg.comm_overlap currently
references userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 while the model
parallelism (TP) for this recipe is TP=8; update the comm overlap config to the
matching TP=8 variant (e.g. replace
userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 with the TP=8 equivalent such as
userbuffers_bf16_h100_h8192_tp8_mbs1_seqlen8192) so
CommOverlapConfig(tp_comm_overlap=True, tp_comm_overlap_cfg=...) matches the
actual TP setting.
- Around line 659-663: Replace the runtime-checking assert with an explicit
exception: instead of using assert mixed_precision_recipe in [...] in llama3.py,
raise a ValueError when mixed_precision_recipe is not in the allowed list
(include the same descriptive message, e.g. f"Invalid low precision recipe:
{mixed_precision_recipe}. This recipe has not been tested yet."). Update the
validation near the mixed_precision_recipe variable so it always runs (not
skipped by -O) by using if mixed_precision_recipe not in [...] : raise
ValueError(...).
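The assert-to-ValueError replacement described above can be sketched in isolation. The allowed recipe names here are illustrative stand-ins; the real list lives in llama3.py:

```python
# Hedged sketch: the allowed list below is illustrative; the real list
# of tested recipes is defined in llama3.py.
ALLOWED_RECIPES = ["bf16_mixed", "fp8_current_scaling_mixed"]

def validate_recipe(mixed_precision_recipe: str) -> str:
    # An `assert` would be stripped under `python -O`; raising ValueError
    # keeps the validation active in all interpreter modes.
    if mixed_precision_recipe not in ALLOWED_RECIPES:
        raise ValueError(
            f"Invalid low precision recipe: {mixed_precision_recipe}. "
            "This recipe has not been tested yet."
        )
    return mixed_precision_recipe
```

Usage: `validate_recipe("bf16_mixed")` returns the name unchanged, while an unknown recipe raises `ValueError` even when assertions are disabled.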
🧹 Nitpick comments (13)
src/megatron/bridge/recipes/qwen/qwen3_next.py (1)
389-391: Potential runtime error when no CUDA device is available.

The check `torch.cuda.get_device_properties(0).major == 10` will raise a `RuntimeError` if no CUDA device is present. While this is a training config where CUDA is expected, consider adding a guard:

```diff
 if disable_jit_fuser is None:
-    disable_jit_fuser = torch.cuda.get_device_properties(0).major == 10
+    disable_jit_fuser = torch.cuda.is_available() and torch.cuda.get_device_properties(0).major == 10
```

This would make the config generation more robust for dry-run or validation scenarios.
src/megatron/bridge/recipes/gemma/gemma2.py (2)
17-17: Consider using modern type annotation syntax.

Per coding guidelines, prefer:
- `list[str]` instead of `List[str]`
- `str | None` instead of `Optional[str]`
- `MixedPrecisionConfig | str` instead of `Union[MixedPrecisionConfig, str]`

The imports on line 17 use the older `typing` module equivalents.

Proposed fix

```diff
-from typing import List, Optional, Union
```

Then update type hints throughout the file (e.g., in the `Gemma2CommonKwargs` and `Gemma2FinetuneKwargs` classes) to use modern syntax like `list[str] | None`.
117-362: Significant code duplication across Gemma2 pretrain configs.

The three pretrain functions (`gemma2_2b_pretrain_config`, `gemma2_9b_pretrain_config`, `gemma2_27b_pretrain_config`) share ~95% identical code, differing only in:
- HuggingFace model path
- `tensor_model_parallel_size` (2, 8, 8)
- `pipeline_model_parallel_size` (1, 1, 2)
- `pipeline_dtype` (None, bfloat16, bfloat16)

Consider extracting a shared helper to reduce maintenance burden:

```python
def _gemma2_pretrain_base(
    hf_path: str,
    tensor_model_parallel_size: int,
    pipeline_model_parallel_size: int,
    pipeline_dtype: torch.dtype | None,
) -> ConfigContainer:
    cfg = _pretrain_common()
    cfg.model = AutoBridge.from_hf_pretrained(hf_path).to_megatron_provider(load_weights=False)
    cfg.tokenizer.tokenizer_model = hf_path
    # ... common settings ...
    cfg.model.tensor_model_parallel_size = tensor_model_parallel_size
    cfg.model.pipeline_model_parallel_size = pipeline_model_parallel_size
    cfg.model.pipeline_dtype = pipeline_dtype
    return cfg
```

This would make each variant a simple 5-line wrapper. However, the current explicit approach does have the benefit of making each recipe self-contained and easy to customize independently.
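The helper pattern suggested above can be exercised end to end with lightweight stand-ins. Everything in this sketch (the dataclass names, HF paths, and the string dtype) is illustrative rather than the real Megatron-Bridge API:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for the real ConfigContainer/model provider;
# none of these names are the actual Megatron-Bridge API.
@dataclass
class ModelCfg:
    hf_path: str = ""
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    pipeline_dtype: Optional[str] = None  # stand-in for torch.dtype

@dataclass
class ConfigContainer:
    model: ModelCfg = field(default_factory=ModelCfg)

def _gemma2_pretrain_base(
    hf_path: str, tp: int, pp: int, pipeline_dtype: Optional[str]
) -> ConfigContainer:
    """Shared base: each variant supplies only the values that differ."""
    cfg = ConfigContainer()
    cfg.model.hf_path = hf_path
    cfg.model.tensor_model_parallel_size = tp
    cfg.model.pipeline_model_parallel_size = pp
    cfg.model.pipeline_dtype = pipeline_dtype
    return cfg

def gemma2_2b_pretrain_config() -> ConfigContainer:
    return _gemma2_pretrain_base("google/gemma-2-2b", tp=2, pp=1, pipeline_dtype=None)

def gemma2_27b_pretrain_config() -> ConfigContainer:
    return _gemma2_pretrain_base("google/gemma-2-27b", tp=8, pp=2, pipeline_dtype="bfloat16")
```

Each variant collapses to a one-line wrapper while the values that actually differ (path, TP, PP, dtype) stay visible at the call site.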
src/megatron/bridge/recipes/qwen/qwen3_moe.py (1)
16-16: Same typing style suggestion as gemma2.py.

Consider updating to modern type annotation syntax (`list[str] | None` instead of `Optional[List[str]]`).

src/megatron/bridge/recipes/moonlight/moonlight_16b.py (1)
130-148: Simplify the list copy expression.

The expression `list([list(x) for x in layout])` is redundant: the outer `list()` call is unnecessary since the list comprehension already produces a list.

Suggested fix

```diff
 if layout is not None:
-    layout = list([list(x) for x in layout])
+    layout = [list(x) for x in layout]
```

src/megatron/bridge/recipes/glm/glm45.py (1)
267-382: Significant duplication with `glm45_355b_pretrain_config`.

The `glm45_air_106b_pretrain_config` shares ~90% of its code with `glm45_355b_pretrain_config`. Consider extracting the common configuration logic into a private helper to reduce maintenance burden. The key differences are:
- HF path: `"zai-org/GLM-4.5-Air"` vs `"zai-org/GLM-4.5"`
- Parallelism: TP=1, PP=4, EP=8 vs TP=2, PP=8, EP=16
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)
43-44: Simplify the list copy expression.

Same redundant pattern as noted in moonlight_16b.py.

Suggested fix

```diff
 if layout is not None:
-    layout = list([list(x) for x in layout])
+    layout = [list(x) for x in layout]
```

src/megatron/bridge/recipes/olmoe/olmoe_7b.py (1)
128-145: Pipeline layout helper is well-documented.

The comment noting "OLMoE has 16 layers" helps understand the layout mappings. Consider simplifying the list copy:

Suggested fix

```diff
 if layout is not None:
-    layout = list([list(x) for x in layout])
+    layout = [list(x) for x in layout]
```

src/megatron/bridge/recipes/qwen/qwen2.py (1)
516-927: Remaining Qwen2.5 pretrain configs follow a consistent pattern.

The configs for 1.5B, 7B, 14B, 32B, and 72B all follow the same structure with appropriate parallelism scaling. The 32B and 72B configs correctly set `pipeline_dtype=torch.bfloat16` for PP > 1.

Consider extracting common configuration logic: there is significant duplication across all 11 pretrain configs. A private helper like `_qwen2_pretrain_common(hf_path, tp, pp, ...)` could reduce ~800 lines to ~200 while maintaining explicit per-variant entry points.

src/megatron/bridge/recipes/llama/llama3.py (4)
43-88: `Llama3CommonKwargs` appears to be unused after the refactor.

The pretrain config functions are now parameterless and no longer accept `**kwargs`. This TypedDict is dead code that should be removed to avoid confusion.

♻️ Suggested fix

```diff
-class Llama3CommonKwargs(TypedDict, total=False):
-    """Typed options accepted by Llama3 family recipe helpers."""
-
-    # Core identifiers
-    hf_path: str
-    dir: str | None
-    name: str
-    # Dataset configuration
-    data_paths: list[str] | None
-    data_args_path: str | None
-    train_data_path: list[str] | None
-    valid_data_path: list[str] | None
-    test_data_path: list[str] | None
-    per_split_data_args_path: str | None
-    mock: bool
-    # Model configuration
-    tensor_model_parallel_size: int
-    pipeline_model_parallel_size: int
-    pipeline_dtype: torch.dtype | None
-    virtual_pipeline_model_parallel_size: int | None
-    context_parallel_size: int
-    sequence_parallel: bool
-    use_megatron_fsdp: bool
-    account_for_embedding_in_pipeline_split: bool
-    account_for_loss_in_pipeline_split: bool
-    # Training hyperparameters
-    train_iters: int
-    global_batch_size: int
-    micro_batch_size: int
-    seq_length: int
-    lr: float
-    min_lr: float
-    adam_eps: float
-    lr_warmup_iters: int
-    lr_decay_iters: int | None
-    eval_interval: int
-    save_interval: int
-    use_null_tokenizer: bool
-    # W&B logging
-    wandb_project: str | None
-    wandb_entity: str | None
-    wandb_exp_name: str | None
-    # Precision / overlap configs
-    precision_config: MixedPrecisionConfig | str | None
-    comm_overlap_config: CommOverlapConfig | None
```
140-234: Significant code duplication across all pretrain config functions.

All 14 pretrain config functions share ~50 identical lines (transformer_impl, cuda_graph settings, kernel selections, memory saving, optimizer precision, DDP config). Only the HF path, parallelism settings, and seq_length vary. Consider extracting common post-setup into a helper or using a builder pattern.

♻️ Example approach: extract common settings to a helper

```python
def _apply_common_pretrain_settings(cfg: ConfigContainer, seq_length: int = 8192) -> None:
    """Apply common settings shared by all Llama pretrain configs."""
    # Tokenizer - NullTokenizer by default
    cfg.tokenizer.tokenizer_type = "NullTokenizer"
    cfg.tokenizer.tokenizer_model = None
    cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE
    # Dataset
    cfg.dataset.blend = None
    cfg.dataset.num_workers = 8
    cfg.dataset.seq_length = seq_length
    # Training
    cfg.train.train_iters = 1168251
    cfg.train.global_batch_size = 512
    cfg.train.micro_batch_size = 1
    cfg.train.eval_interval = 2000
    cfg.train.manual_gc = True
    cfg.train.manual_gc_interval = 100
    cfg.scheduler.lr_warmup_iters = 2000
    cfg.logger.log_timers_to_tensorboard = True
    # TE & CUDA Graph
    cfg.model.transformer_impl = "transformer_engine"
    cfg.model.cuda_graph_impl = "none"
    cfg.model.cuda_graph_scope = "full"
    cfg.model.cuda_graph_warmup_steps = 3
    # Kernel selections
    cfg.model.attention_backend = None
    cfg.model.cross_entropy_loss_fusion = True
    cfg.model.cross_entropy_fusion_impl = "te"
    # Memory saving
    cfg.model.recompute_granularity = None
    cfg.model.recompute_modules = None
    cfg.model.fine_grained_activation_offloading = False
    cfg.model.offload_modules = None
    # Optimizer precision
    cfg.optimizer.use_precision_aware_optimizer = False
    cfg.optimizer.main_grads_dtype = torch.float32
    cfg.optimizer.main_params_dtype = torch.float32
    cfg.optimizer.exp_avg_dtype = torch.float32
    cfg.optimizer.exp_avg_sq_dtype = torch.float32
    cfg.checkpoint.save_interval = 500
    # DDP
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.ddp.check_for_nan_in_grad = True
    cfg.ddp.use_distributed_optimizer = True
    cfg.ddp.use_megatron_fsdp = False
    cfg.ddp.grad_reduce_in_fp32 = True
    cfg.ddp.average_in_collective = True
    cfg.ddp.data_parallel_sharding_strategy = "no_shard"


def llama32_1b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for Llama 3.2 1B."""
    cfg = _pretrain_common()
    cfg.model = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B").to_megatron_provider(load_weights=False)
    _apply_common_pretrain_settings(cfg, seq_length=8192)
    # Model-specific parallelism
    cfg.model.tensor_model_parallel_size = 1
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.context_parallel_size = 1
    cfg.model.sequence_parallel = False
    cfg.model.seq_length = 8192
    return cfg
```
158-168: `seq_length` is set in two places; consider a single source of truth.

Both `cfg.dataset.seq_length` and `cfg.model.seq_length` must be kept in sync. Consider defining a local constant to avoid divergence:

```diff
+    seq_length = 8192
+
     # Dataset config - mock data by default
     cfg.dataset.blend = None
     cfg.dataset.num_workers = 8
-    cfg.dataset.seq_length = 8192
+    cfg.dataset.seq_length = seq_length
     ...
-    cfg.model.seq_length = 8192
+    cfg.model.seq_length = seq_length
```
224-233: Several DDP settings redundantly override `_pretrain_common()` defaults.

Settings like `overlap_grad_reduce`, `overlap_param_gather`, `check_for_nan_in_grad`, `grad_reduce_in_fp32`, and `average_in_collective` are already set to the same values in `_pretrain_common()`. Only `data_parallel_sharding_strategy` and `use_megatron_fsdp` are actual overrides. Consider keeping only the intentional overrides to improve readability.

♻️ Keep only intentional overrides

```diff
     # DDP config
-    cfg.ddp.overlap_grad_reduce = True
-    cfg.ddp.overlap_param_gather = True
-    cfg.ddp.check_for_nan_in_grad = True
-    cfg.ddp.use_distributed_optimizer = True
+    # Override sharding strategy from _pretrain_common default
     cfg.ddp.use_megatron_fsdp = False
-    cfg.ddp.grad_reduce_in_fp32 = True
-    cfg.ddp.average_in_collective = True
     cfg.ddp.data_parallel_sharding_strategy = "no_shard"
```
```python
    # Parallelism settings
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = None
    cfg.model.virtual_pipeline_model_parallel_size = None
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's verify the file exists and check the exact lines mentioned
head -n 280 src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | tail -n 130 | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5970
🏁 Script executed:
```shell
# Check for all pipeline_dtype mentions in recipes and training code
echo "=== pipeline_dtype in recipes ==="
rg -n "pipeline_dtype" src/megatron/bridge/recipes -g '*.py' -A 2
echo -e "\n=== pipeline_dtype in training ==="
rg -n "pipeline_dtype" src/megatron/bridge/training -g '*.py' -A 2
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 42328
🏁 Script executed:
```shell
# Find all PP>1 configs to compare pipeline_dtype settings
echo "=== PP>1 configs ==="
rg -n "pipeline_model_parallel_size\s*=\s*[2-9]" src/megatron/bridge/recipes -g '*.py' -B 5 -A 5
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 29932
Set `pipeline_dtype` to `torch.bfloat16` for PP>1 in GPT-OSS configs.

Lines 156 and 271 set `cfg.model.pipeline_dtype = None` while `pipeline_model_parallel_size = 4`. Across the codebase, all other PP>1 configs (Qwen, Deepseek, Llama, Gemma) explicitly set `pipeline_dtype` to `torch.bfloat16`, with inline comments indicating it is required for PP > 1. This inconsistency can cause validation failures or unintended numerical changes.

Suggested fix

```diff
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
```

Apply to both lines 156 and 271.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before (current):

```python
    # Parallelism settings
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = None
    cfg.model.virtual_pipeline_model_parallel_size = None
```

After (suggested):

```python
    # Parallelism settings
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = torch.bfloat16
    cfg.model.virtual_pipeline_model_parallel_size = None
```
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 152 - 157, The
config currently leaves cfg.model.pipeline_dtype as None while
cfg.model.pipeline_model_parallel_size is >1; update both occurrences of
cfg.model.pipeline_dtype in the GPT-OSS recipe to be torch.bfloat16 when
cfg.model.pipeline_model_parallel_size > 1 (i.e., set cfg.model.pipeline_dtype =
torch.bfloat16), matching other PP>1 configs and ensuring this change is applied
to both places where pipeline_dtype is currently None.
GPT-OSS does have `cfg.model.pipeline_dtype` set to `None` explicitly in the current config, so I left it that way to match the old behavior.
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check the current main branch to see what the old config had at line 178
git fetch origin main 2>/dev/null || true
git show origin/main:src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | head -n 185 | tail -n 15
echo -e "\n=== Checking for pipeline_dtype usage in MoE models (PP>1) ==="
# Check other MoE models (OLMoE, DeepSeek, GLM, etc.) to see their pipeline_dtype patterns
rg -n "expert_model_parallel_size.*=" src/megatron/bridge/recipes -g '*.py' -A 10 -B 5 | rg -A 10 -B 5 "pipeline_dtype"
```
Length of output: 50024
@athitten, thank you for the clarification! You're right that the old `_gpt_oss_common` function had `pipeline_dtype` default to `None`, and preserving that behavior makes sense for backward compatibility.

I did notice that the codebase has mixed patterns for MoE models with PP>1:
- Models using `torch.bfloat16`: DeepSeek v2/v3, Qwen3_next, Qwen3_moe, Kimi K2, with explicit comments like "Required for PP > 1"
- Models using `None`: GLM45 (both 355B with PP=8 and Air 106B with PP=4) also use `None` with PP>1

Since GLM45 follows the same pattern as GPT-OSS, your approach is consistent with at least some of the MoE models in the codebase. If the old behavior worked correctly with `None`, then maintaining it is the right call for this refactor.

Just wanted to flag the inconsistency in case it's worth verifying whether `None` vs `torch.bfloat16` has any performance or correctness implications for PP>1, but I understand you're prioritizing backward compatibility here. 👍
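If the team ever decides to enforce one convention here, a tiny guard makes the constraint explicit regardless of which default wins. This is a hypothetical helper, not an existing Megatron-Bridge API:

```python
from typing import Any, Optional

# Hypothetical guard, not part of Megatron-Bridge: fail fast when a config
# leaves pipeline_dtype unset while pipeline parallelism is enabled.
def check_pipeline_dtype(pp_size: int, pipeline_dtype: Optional[Any]) -> None:
    if pp_size > 1 and pipeline_dtype is None:
        raise ValueError(
            f"pipeline_dtype must be set (e.g. torch.bfloat16) when "
            f"pipeline_model_parallel_size={pp_size} > 1"
        )
```

Running such a check at config-build time would surface the `None` vs `torch.bfloat16` question once, instead of leaving it to per-recipe convention.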
🧠 Learnings used
Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2040
File: src/megatron/bridge/models/mimo/llava_provider.py:109-113
Timestamp: 2026-01-28T17:39:44.596Z
Learning: In `src/megatron/bridge/models/mimo/llava_provider.py`, the `TransformerConfig` used for the `MultimodalProjector` includes `num_attention_heads=1` as a required placeholder. While the MLP projector doesn't use attention, `TransformerConfig` requires a positive value for `num_attention_heads` (default of 0 is not valid), so using 1 as a placeholder is the correct approach.
@athitten: can you help update the pipeline dtype to bf16? I think the old value is not good.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```python
    # Communication overlap for 70B
    cfg.comm_overlap = CommOverlapConfig(
        tp_comm_overlap=True,
        tp_comm_overlap_cfg=userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
    )
```
Same communication overlap mismatch as 70B 16K config.
This config also uses TP=8 (line 952) with the TP=4 comm overlap config.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/llama/llama3.py` around lines 1013 - 1017, The
CommOverlapConfig for cfg.comm_overlap currently references
userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 while the model parallelism (TP)
for this recipe is TP=8; update the comm overlap config to the matching TP=8
variant (e.g. replace userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 with the
TP=8 equivalent such as userbuffers_bf16_h100_h8192_tp8_mbs1_seqlen8192) so
CommOverlapConfig(tp_comm_overlap=True, tp_comm_overlap_cfg=...) matches the
actual TP setting.
All of the 70B configs (70B, 70B 16k, 70B 64k) use the same `userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192` comm overlap config. This matches the existing behavior.
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Search for all available userbuffers comm overlap configurations in the codebase
rg -n "^def userbuffers_" --type py -A 1
# Also check what's exported from comm_overlap module
rg -n "userbuffers_" src/megatron/bridge/training/comm_overlap.py -C 2
```
Length of output: 4930
@athitten, thank you for clarifying! I can see from the available comm overlap configurations that there is indeed no TP=8 variant for h8192 (70B hidden size) models on H100 - only the TP=4 variant userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 exists for BF16 on H100.
The available configurations show:
- h8192 (70B): TP=4 and TP=2 (LoRA) variants only
- h16384 (405B): TP=8 variant available
So it makes sense that all 70B configs (including the TP=8 variants for 16K and 64K) use the same TP=4 comm overlap config as the closest match. I'll note this as intentional behavior.
🧬 Code Graph Analysis Results
src/megatron/bridge/recipes/llama/llama3.py (summary of relevant snippet)

Summary:
- In the pretraining configurations for Llama 3 70B variants (e.g., llama3_70b_pretrain_config and llama3_70b_16k_pretrain_config), the code enables communication overlap by constructing a CommOverlapConfig with tensor-parallel overlap enabled and using a pre-defined TP overlap configuration.
- The specific overlap configuration used is the pre-defined function/variable named userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192, assigned to the tp_comm_overlap_cfg field of CommOverlapConfig.
- This pattern is referenced by the user comment (70B, 70B 16k, 70B 64k all using the same userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192 config) and is visible in the llama3.py pretrain config sections for 70B with CommOverlap enabled.
Details to look for (conceptually):
- Definition or import of CommOverlapConfig
- Import/definition of userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192
- Instances of:
- llama3_70b_pretrain_config() -> CommOverlapConfig(tp_comm_overlap=True, tp_comm_overlap_cfg=userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192)
- llama3_70b_16k_pretrain_config() -> similarly sets CommOverlapConfig with the same tp_comm_overlap_cfg
Parameters and behavior (from context in file and related module):
- tp_comm_overlap: boolean to enable tensor-parallel overlap
- tp_comm_overlap_cfg: a specific configuration object/function (here: userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192) describing overlap specifics (TP, MBS, sequence length)
- This setup is intended to optimize comms for 70B variants with SP enabled and specific parallelism patterns.
File/Location:
- Source file: src/megatron/bridge/recipes/llama/llama3.py
- Context: within pretrain config factory functions for 70B variants (e.g., llama3_70b_pretrain_config and llama3_70b_16k_pretrain_config)
- Note: The exact line numbers are not provided here, but this description corresponds to the “Communication overlap for 70B” sections where CommOverlapConfig is assigned with tp_comm_overlap and tp_comm_overlap_cfg set to userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192.
✏️ Learnings added
Learnt from: athitten
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2067
File: src/megatron/bridge/recipes/llama/llama3.py:1013-1017
Timestamp: 2026-01-28T18:47:54.691Z
Learning: In src/megatron/bridge/recipes/llama/llama3.py, all Llama 3 70B pretraining configs (llama3_70b_pretrain_config, llama3_70b_16k_pretrain_config, llama3_70b_64k_pretrain_config) intentionally use the same comm overlap config `userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192`, regardless of their TP setting (TP=4 or TP=8). This is because there is no TP=8 comm overlap variant available for h8192 models on H100 - the TP=4 config is the closest match.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
/ok to test 15677f0

/ok to test e47a886
Signed-off-by: Abhishree <abhishreetm@gmail.com>
/ok to test 7c9f47d
```python
# If --hf-path is provided, override the model's HuggingFace path
if args.hf_path:
    logger.info(f"Using custom HuggingFace path: {args.hf_path}")
    recipe_kwargs["hf_path"] = args.hf_path
    # Import AutoBridge to create a new model provider with the custom HF path
    from megatron.bridge.models import AutoBridge

cfg: ConfigContainer = pretrain_config(**recipe_kwargs)
logger.info("Loaded base configuration")
cfg.model = AutoBridge.from_hf_pretrained(args.hf_path).to_megatron_provider(load_weights=False)
```
@yaoyu-33 won't this be an issue since users now have to re-apply other model configs set as default in the recipe?
/ok to test ad67f81
Wait to merge after code freeze, release branch cut
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes