Merged
Changes from 4 commits
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -116,7 +116,7 @@ override-dependencies = [
]

[tool.uv.sources]
-megatron-core = { path = "3rdparty/Megatron-LM/" }
+megatron-core = { path = "3rdparty/Megatron-LM/", editable = true }
nvidia-modelopt = { git = "https://github.com/NVIDIA/TensorRT-Model-Optimizer.git", rev = "0a4f0a8b933121f7af080261a0a5a7717f2c5d49" }
nvidia-resiliency-ext = { git = "https://github.com/NVIDIA/nvidia-resiliency-ext.git", rev = "v0.4.1" } # Requires a source install to compile cupti for cuda13

15 changes: 15 additions & 0 deletions src/megatron/bridge/training/initialize.py
@@ -14,13 +14,15 @@

import datetime
import os
+import time
import warnings
from typing import Callable, Optional

import torch
import torch.distributed
import torch.nn.functional as F
from megatron.core import parallel_state, tensor_parallel
+from megatron.core.datasets.utils import compile_helpers
from megatron.core.fusions.fused_bias_dropout import bias_dropout_add_fused_train
from megatron.core.fusions.fused_bias_gelu import bias_gelu
from megatron.core.fusions.fused_bias_swiglu import bias_swiglu
@@ -111,6 +113,19 @@ def initialize_megatron(
train_config.decrease_batch_size_if_needed,
)

+    # =========================
+
+    if get_rank_safe() == 0:
+        start_time = time.time()
+        print("> compiling dataset index builder ...")
+
+        compile_helpers()
+        print(
+            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
+            flush=True,
+        )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l ./src/megatron/bridge/training/initialize.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 119


🏁 Script executed:

cat -n ./src/megatron/bridge/training/initialize.py | sed -n '100,150p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2211


🏁 Script executed:

cat -n ./src/megatron/bridge/training/initialize.py | sed -n '1,50p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2338


🏁 Script executed:

grep -n "def finish_mpu_init\|def _initialize_distributed\|torch.distributed\|compile_helpers" ./src/megatron/bridge/training/initialize.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1318


🏁 Script executed:

cat -n ./src/megatron/bridge/training/initialize.py | sed -n '146,250p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5230


🏁 Script executed:

cat -n ./src/megatron/bridge/training/initialize.py | sed -n '523,600p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3980


🏁 Script executed:

cat -n ./src/megatron/bridge/training/initialize.py | sed -n '179,225p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2306


🏁 Script executed:

grep -n "barrier\|compile_helpers" ./src/megatron/bridge/training/initialize.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 299


Move compilation after process-group init to avoid rank timeouts.

Non-zero ranks enter torch.distributed.init_process_group while rank 0 is still compiling. If compilation is slow, they block in the rendezvous waiting for rank 0 and can hit distributed_timeout_minutes. Move the compile step into finish_mpu_init(), right after _initialize_distributed(), and add a barrier so all ranks wait for rank 0 to finish.

🔧 Suggested relocation (remove here, add in finish_mpu_init)
-    if get_rank_safe() == 0:
-        start_time = time.time()
-        print("> compiling dataset index builder ...")
-
-        compile_helpers()
-        print(
-            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
-            flush=True,
-        )
     def finish_mpu_init() -> ProcessGroupCollection:
         # Pytorch distributed.
         pg_collection = _initialize_distributed(
             model_config=model_config,
             dist_config=dist_config,
             num_distributed_optimizer_instances=num_distributed_optimizer_instances,
             get_embedding_ranks=get_embedding_ranks,
             get_position_embedding_ranks=get_position_embedding_ranks,
             restart_store=restart_store,
             use_inprocess_restart=use_inprocess_restart,
         )
+
+        if get_rank_safe() == 0:
+            start_time = time.time()
+            print("> compiling dataset index builder ...")
+            compile_helpers()
+            print(
+                ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(
+                    time.time() - start_time
+                ),
+                flush=True,
+            )
+        torch.distributed.barrier()
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/initialize.py` around lines 116 - 126, The
compile step currently gated by get_rank_safe() in initialize.py should be
removed from this pre-init location and instead run inside finish_mpu_init()
immediately after calling _initialize_distributed(); move the compile_helpers()
invocation (and its timing/prints) into finish_mpu_init(), then call
torch.distributed.barrier() (or dist.barrier()) right after compilation so all
ranks wait for rank 0 to finish; update references to get_rank_safe(),
compile_helpers(), finish_mpu_init(), and _initialize_distributed() accordingly
and delete the original conditional block in initialize.py to avoid rank
timeouts.
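
For reference, a minimal sketch of the "rank 0 compiles, then every rank waits at a barrier" pattern described above. The rank, the compile step, and the barrier are passed in as plain callables so the sketch runs without a process group. `compile_on_rank_zero` and `simulate` are hypothetical names, not part of Megatron-Bridge; in the real code the callables would presumably be `compile_helpers` and `torch.distributed.barrier`, called only after `_initialize_distributed()` has created the process group.

```python
import threading
import time
from typing import Callable, List


def compile_on_rank_zero(
    rank: int,
    compile_fn: Callable[[], object],
    barrier_fn: Callable[[], object],
    log: Callable[[str], object] = print,
) -> float:
    """Run compile_fn on rank 0 only, then synchronize every rank.

    Returns the compile time in seconds on rank 0 and 0.0 on other ranks.
    """
    elapsed = 0.0
    if rank == 0:
        start = time.perf_counter()
        log("> compiling dataset index builder ...")
        compile_fn()
        elapsed = time.perf_counter() - start
        log(f">>> done with dataset index builder. Compilation time: {elapsed:.3f} seconds")
    # Every rank, including rank 0, must reach the barrier. Calling it inside
    # the `if` above would leave the other ranks waiting forever.
    barrier_fn()
    return elapsed


def simulate(world_size: int = 4) -> List[str]:
    """Stand-in for a distributed job: one thread per rank (assumption for the demo)."""
    events: List[str] = []
    lock = threading.Lock()
    barrier = threading.Barrier(world_size)

    def record(message: str) -> None:
        with lock:
            events.append(message)

    def worker(rank: int) -> None:
        compile_on_rank_zero(
            rank,
            compile_fn=lambda: record("compiled"),
            barrier_fn=barrier.wait,
            log=lambda _message: None,  # keep the demo quiet
        )
        record(f"rank {rank} past barrier")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events
```

In `simulate`, no rank can pass the barrier before rank 0 has finished the compile step, so "compiled" is always the first recorded event. The same ordering is what the relocation is meant to guarantee for the real helpers.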

+    torch.distributed.barrier()

# init rerun global state
init_rerun_state(rerun_state_machine_config)

4 changes: 2 additions & 2 deletions uv.lock

Some generated files are not rendered by default.