
feat: Add dataset compile helper #2236

Merged
ko3n1g merged 6 commits into main from ko3n1g/feat/run-dataset-compile-helper on Feb 5, 2026

Conversation

ko3n1g (Contributor) commented Feb 5, 2026

What does this PR do ?

Add a one-line overview of what this PR aims to accomplish.

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Introduced a one-time dataset index builder compilation step during initialization. Users will now see a startup message indicating when this compilation begins and completes, along with elapsed time tracking. This compilation occurs only on the primary process to avoid redundant operations across distributed training instances.

Signed-off-by: oliver könig <okoenig@nvidia.com>
coderabbitai bot (Contributor) commented Feb 5, 2026

📝 Walkthrough

Walkthrough

The change adds a rank-0 guarded initialization block to compile a dataset index builder during Megatron startup. The block records timing metrics, prints status messages before and after the compilation step, and ensures the operation runs only once on the primary rank to avoid redundant execution across distributed processes.

Changes

Cohort / File(s): Dataset Index Builder Compilation — src/megatron/bridge/training/initialize.py
Summary: Added `time` and `compile_helpers` imports. Introduced a rank-0 guarded block in initialization that records the start time, prints a compilation message, calls compile_helpers(), and logs the elapsed time. Provides one-shot compilation of the dataset index builder at startup with timing instrumentation.
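The rank-0 guarded, one-shot compilation pattern described in the summary can be sketched in plain Python. This is a hypothetical stand-in, not the actual Megatron-Bridge code: here `get_rank_safe` reads the `RANK` environment variable (defaulting to 0), and `compile_helpers` is a no-op placeholder for the real dataset index builder compilation.

```python
# Hypothetical sketch of the rank-0 guarded compilation step. The real
# helpers live in Megatron-Bridge; these are simplified stand-ins.
import os
import time


def get_rank_safe() -> int:
    """Return the distributed rank, falling back to 0 outside a launcher."""
    return int(os.environ.get("RANK", "0"))


def compile_helpers() -> None:
    """Placeholder for the actual dataset index builder compilation."""
    time.sleep(0.01)  # simulate compilation work


def maybe_compile_dataset_helpers() -> bool:
    """Run compilation on rank 0 only; return True if compilation ran."""
    if get_rank_safe() != 0:
        return False
    start_time = time.time()
    print("> compiling dataset index builder ...")
    compile_helpers()
    print(
        ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
        flush=True,
    )
    return True
```

On a single process (or rank 0) the guard passes and compilation runs once; every other rank skips it entirely, which is what keeps the operation from being repeated across a distributed job.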

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR introduces a new feature (major change) but the PR description lacks the test results, testing information, or performance metrics required by the custom check. Resolution: add testing information to the PR description, including test results, distributed environment verification, convergence validation, and performance impact measurements.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'feat: Add dataset compile helper' is directly related to the main change: introducing a dataset index builder compilation step during initialization.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch ko3n1g/feat/run-dataset-compile-helper


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/initialize.py`:
- Around line 116-126: The compile step currently gated by get_rank_safe() in
initialize.py should be removed from this pre-init location and instead run
inside finish_mpu_init() immediately after calling _initialize_distributed();
move the compile_helpers() invocation (and its timing/prints) into
finish_mpu_init(), then call torch.distributed.barrier() (or dist.barrier())
right after compilation so all ranks wait for rank 0 to finish; update
references to get_rank_safe(), compile_helpers(), finish_mpu_init(), and
_initialize_distributed() accordingly and delete the original conditional block
in initialize.py to avoid rank timeouts.

Comment on lines +116 to +126
# =========================

if get_rank_safe() == 0:
    start_time = time.time()
    print("> compiling dataset index builder ...")

    compile_helpers()
    print(
        ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
        flush=True,
    )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge; outputs collapsed):

    wc -l ./src/megatron/bridge/training/initialize.py
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '100,150p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '1,50p'
    grep -n "def finish_mpu_init\|def _initialize_distributed\|torch.distributed\|compile_helpers" ./src/megatron/bridge/training/initialize.py | head -20
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '146,250p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '523,600p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '179,225p'
    grep -n "barrier\|compile_helpers" ./src/megatron/bridge/training/initialize.py


Move compilation after process-group init to avoid rank timeouts.

Non‑zero ranks will enter torch.distributed.init_process_group while rank 0 is compiling. If compilation is slow, this blocks the collective and can hit distributed_timeout_minutes. Move the compile step into finish_mpu_init() right after _initialize_distributed() and add a barrier so all ranks wait for rank 0 to finish.

🔧 Suggested relocation (remove here, add in finish_mpu_init)
-    if get_rank_safe() == 0:
-        start_time = time.time()
-        print("> compiling dataset index builder ...")
-
-        compile_helpers()
-        print(
-            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
-            flush=True,
-        )
     def finish_mpu_init() -> ProcessGroupCollection:
         # Pytorch distributed.
         pg_collection = _initialize_distributed(
             model_config=model_config,
             dist_config=dist_config,
             num_distributed_optimizer_instances=num_distributed_optimizer_instances,
             get_embedding_ranks=get_embedding_ranks,
             get_position_embedding_ranks=get_position_embedding_ranks,
             restart_store=restart_store,
             use_inprocess_restart=use_inprocess_restart,
         )
+
+        if get_rank_safe() == 0:
+            start_time = time.time()
+            print("> compiling dataset index builder ...")
+            compile_helpers()
+            print(
+                ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(
+                    time.time() - start_time
+                ),
+                flush=True,
+            )
+        torch.distributed.barrier()
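The compile-then-barrier sequencing suggested above can be illustrated with a self-contained stdlib sketch. This is an illustration only: `multiprocessing.Barrier` stands in for `torch.distributed.barrier()`, and `worker`/`run_demo` are hypothetical names, not Megatron-Bridge APIs. The point is that non-zero ranks block at the barrier until rank 0 finishes "compiling", so no rank can race ahead.

```python
# Sketch of the suggested fix: rank 0 does the one-time compilation, then
# every rank waits at a barrier so none proceeds before it is done. The
# stdlib multiprocessing.Barrier stands in for torch.distributed.barrier().
import multiprocessing as mp
import time


def worker(rank, barrier, results):
    if rank == 0:
        time.sleep(0.05)  # simulate compile_helpers() on rank 0 only
        results[rank] = "compiled"
    barrier.wait()  # all ranks block here until rank 0 has finished
    if rank != 0:
        results[rank] = "proceeded"


def run_demo(world_size: int = 4) -> dict:
    """Launch world_size processes and report what each rank did."""
    barrier = mp.Barrier(world_size)
    with mp.Manager() as manager:
        results = manager.dict()
        procs = [
            mp.Process(target=worker, args=(r, barrier, results))
            for r in range(world_size)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(results)


if __name__ == "__main__":
    print(run_demo())
```

In the real code the same ordering holds: `compile_helpers()` runs on rank 0 inside `finish_mpu_init()` after `_initialize_distributed()`, and the trailing `torch.distributed.barrier()` is what makes the other ranks wait instead of timing out inside a collective.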

yaoyu-33 previously approved these changes Feb 5, 2026
