Conversation
Signed-off-by: oliver könig <okoenig@nvidia.com>
📝 Walkthrough

The change adds a rank-0 guarded initialization block to compile a dataset index builder during Megatron startup. The block records timing metrics, prints status messages before and after the compilation step, and ensures the operation runs only once on the primary rank to avoid redundant execution across distributed processes.
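In miniature, the guarded block described above follows a common pattern: do one-time work on rank 0 only, wrapped in timing prints. The sketch below is illustrative only — `timed_once` is a hypothetical helper, and in Megatron-Bridge the rank would come from `get_rank_safe()`:

```python
import time


def timed_once(rank: int, label: str, fn) -> None:
    """Run fn only on rank 0, printing a status line before and after.

    Hypothetical helper mirroring the guarded block added in this PR;
    in Megatron-Bridge `rank` would come from get_rank_safe().
    """
    if rank != 0:
        # Non-zero ranks skip the work entirely.
        return
    start_time = time.time()
    print(f"> {label} ...")
    fn()
    print(
        f">>> done with {label}. Compilation time: {time.time() - start_time:.3f} seconds",
        flush=True,
    )
```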
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/initialize.py`:
- Around line 116-126: The compile step currently gated by get_rank_safe() in
initialize.py should be removed from this pre-init location and instead run
inside finish_mpu_init() immediately after calling _initialize_distributed();
move the compile_helpers() invocation (and its timing/prints) into
finish_mpu_init(), then call torch.distributed.barrier() (or dist.barrier())
right after compilation so all ranks wait for rank 0 to finish; update
references to get_rank_safe(), compile_helpers(), finish_mpu_init(), and
_initialize_distributed() accordingly and delete the original conditional block
in initialize.py to avoid rank timeouts.
```python
# =========================
if get_rank_safe() == 0:
    start_time = time.time()
    print("> compiling dataset index builder ...")

    compile_helpers()
    print(
        ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
        flush=True,
    )
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge):

```shell
wc -l ./src/megatron/bridge/training/initialize.py
cat -n ./src/megatron/bridge/training/initialize.py | sed -n '100,150p'
cat -n ./src/megatron/bridge/training/initialize.py | sed -n '1,50p'
grep -n "def finish_mpu_init\|def _initialize_distributed\|torch.distributed\|compile_helpers" ./src/megatron/bridge/training/initialize.py | head -20
cat -n ./src/megatron/bridge/training/initialize.py | sed -n '146,250p'
cat -n ./src/megatron/bridge/training/initialize.py | sed -n '523,600p'
cat -n ./src/megatron/bridge/training/initialize.py | sed -n '179,225p'
grep -n "barrier\|compile_helpers" ./src/megatron/bridge/training/initialize.py
```
Move compilation after process-group init to avoid rank timeouts.
Non‑zero ranks will enter torch.distributed.init_process_group while rank 0 is compiling. If compilation is slow, this blocks the collective and can hit distributed_timeout_minutes. Move the compile step into finish_mpu_init() right after _initialize_distributed() and add a barrier so all ranks wait for rank 0 to finish.
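The suggested ordering — compile on rank 0 only, then synchronize all ranks — can be sketched with a generic barrier. This is an illustrative stand-in rather than the Megatron-Bridge implementation: it uses a `threading.Barrier` in place of `torch.distributed.barrier()`, and `rank_guarded_compile` is a hypothetical name.

```python
import threading
import time


def rank_guarded_compile(rank: int, barrier: threading.Barrier, compile_fn) -> None:
    """Sketch of the proposed fix: only rank 0 compiles, every rank waits.

    `barrier` stands in for torch.distributed.barrier(); in the real code
    this would run inside finish_mpu_init() after _initialize_distributed(),
    so the process group already exists when ranks block on the barrier.
    """
    if rank == 0:
        start_time = time.time()
        print("> compiling dataset index builder ...")
        compile_fn()
        print(
            ">>> done with dataset index builder. "
            "Compilation time: {:.3f} seconds".format(time.time() - start_time),
            flush=True,
        )
    # All ranks block here until rank 0 has finished compiling, so no
    # rank can race ahead into work that needs the compiled helpers.
    barrier.wait()


if __name__ == "__main__":
    # Simulate four "ranks" with threads; a slow compile on rank 0
    # simply delays everyone at the barrier instead of timing out.
    barrier = threading.Barrier(4)
    threads = [
        threading.Thread(
            target=rank_guarded_compile, args=(r, barrier, lambda: time.sleep(0.1))
        )
        for r in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The design point is the same as in the review comment: the expensive work runs exactly once, and the barrier sits after process-group initialization, so waiting ranks are parked on a collective rather than on `init_process_group` itself.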
🔧 Suggested relocation (remove here, add in finish_mpu_init)

```diff
-    if get_rank_safe() == 0:
-        start_time = time.time()
-        print("> compiling dataset index builder ...")
-
-        compile_helpers()
-        print(
-            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
-            flush=True,
-        )
```

```diff
 def finish_mpu_init() -> ProcessGroupCollection:
     # Pytorch distributed.
     pg_collection = _initialize_distributed(
         model_config=model_config,
         dist_config=dist_config,
         num_distributed_optimizer_instances=num_distributed_optimizer_instances,
         get_embedding_ranks=get_embedding_ranks,
         get_position_embedding_ranks=get_position_embedding_ranks,
         restart_store=restart_store,
         use_inprocess_restart=use_inprocess_restart,
     )
+
+    if get_rank_safe() == 0:
+        start_time = time.time()
+        print("> compiling dataset index builder ...")
+        compile_helpers()
+        print(
+            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(
+                time.time() - start_time
+            ),
+            flush=True,
+        )
+    torch.distributed.barrier()
```
Signed-off-by: oliver könig <okoenig@nvidia.com>
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit