
feat: Add dataset compile helper #2236

Merged
ko3n1g merged 6 commits into main from ko3n1g/feat/run-dataset-compile-helper on Feb 5, 2026

Conversation

ko3n1g (Contributor) commented Feb 5, 2026

What does this PR do ?

Add a one-line overview of what this PR aims to accomplish.

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Introduced a one-time dataset index builder compilation step during initialization. Users will now see a startup message indicating when this compilation begins and completes, along with elapsed time tracking. This compilation occurs only on the primary process to avoid redundant operations across distributed training instances.

Signed-off-by: oliver könig <okoenig@nvidia.com>
coderabbitai bot (Contributor) commented Feb 5, 2026

📝 Walkthrough

Walkthrough

The change adds a rank-0 guarded initialization block to compile a dataset index builder during Megatron startup. The block records timing metrics, prints status messages before and after the compilation step, and ensures the operation runs only once on the primary rank to avoid redundant execution across distributed processes.

Changes

Cohort / File(s): Dataset Index Builder Compilation — src/megatron/bridge/training/initialize.py
Summary: Added `time` and `compile_helpers` imports. Introduced a rank-0 guarded block in initialization that records the start time, prints a compilation message, calls compile_helpers(), and logs the elapsed time. Provides one-shot compilation of the dataset index builder at startup with timing instrumentation.
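The rank-0 guarded, one-shot compilation pattern described in the summary can be sketched in plain Python. This is a hypothetical stand-in, not the actual Megatron-Bridge code: here `get_rank_safe` reads the `RANK` environment variable (defaulting to 0), and `compile_helpers` is a no-op placeholder for the real dataset index builder compilation.

```python
# Hypothetical sketch of the rank-0 guarded compilation step. The real
# helpers live in Megatron-Bridge; these are simplified stand-ins.
import os
import time


def get_rank_safe() -> int:
    """Return the distributed rank, falling back to 0 outside a launcher."""
    return int(os.environ.get("RANK", "0"))


def compile_helpers() -> None:
    """Placeholder for the actual dataset index builder compilation."""
    time.sleep(0.01)  # simulate compilation work


def maybe_compile_dataset_helpers() -> bool:
    """Run compilation on rank 0 only; return True if compilation ran."""
    if get_rank_safe() != 0:
        return False
    start_time = time.time()
    print("> compiling dataset index builder ...")
    compile_helpers()
    print(
        ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
        flush=True,
    )
    return True
```

On a single process (or rank 0) the guard passes and compilation runs once; every other rank skips it entirely, which is what keeps the operation from being repeated across a distributed job.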

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR introduces a new feature (major change) but the PR description lacks the test results, testing information, or performance metrics required by the custom check. Resolution: add testing information to the PR description, including test results, distributed environment verification, convergence validation, and performance impact measurements.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'feat: Add dataset compile helper' is directly related to the main change: introducing a dataset index builder compilation step during initialization.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch ko3n1g/feat/run-dataset-compile-helper


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/initialize.py`:
- Around line 116-126: The compile step currently gated by get_rank_safe() in
initialize.py should be removed from this pre-init location and instead run
inside finish_mpu_init() immediately after calling _initialize_distributed();
move the compile_helpers() invocation (and its timing/prints) into
finish_mpu_init(), then call torch.distributed.barrier() (or dist.barrier())
right after compilation so all ranks wait for rank 0 to finish; update
references to get_rank_safe(), compile_helpers(), finish_mpu_init(), and
_initialize_distributed() accordingly and delete the original conditional block
in initialize.py to avoid rank timeouts.

Comment on lines +116 to +126
# =========================

if get_rank_safe() == 0:
    start_time = time.time()
    print("> compiling dataset index builder ...")

    compile_helpers()
    print(
        ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
        flush=True,
    )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge; outputs collapsed):

    wc -l ./src/megatron/bridge/training/initialize.py
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '100,150p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '1,50p'
    grep -n "def finish_mpu_init\|def _initialize_distributed\|torch.distributed\|compile_helpers" ./src/megatron/bridge/training/initialize.py | head -20
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '146,250p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '523,600p'
    cat -n ./src/megatron/bridge/training/initialize.py | sed -n '179,225p'
    grep -n "barrier\|compile_helpers" ./src/megatron/bridge/training/initialize.py


Move compilation after process-group init to avoid rank timeouts.

Non‑zero ranks will enter torch.distributed.init_process_group while rank 0 is compiling. If compilation is slow, this blocks the collective and can hit distributed_timeout_minutes. Move the compile step into finish_mpu_init() right after _initialize_distributed() and add a barrier so all ranks wait for rank 0 to finish.

🔧 Suggested relocation (remove here, add in finish_mpu_init)
-    if get_rank_safe() == 0:
-        start_time = time.time()
-        print("> compiling dataset index builder ...")
-
-        compile_helpers()
-        print(
-            ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(time.time() - start_time),
-            flush=True,
-        )
     def finish_mpu_init() -> ProcessGroupCollection:
         # Pytorch distributed.
         pg_collection = _initialize_distributed(
             model_config=model_config,
             dist_config=dist_config,
             num_distributed_optimizer_instances=num_distributed_optimizer_instances,
             get_embedding_ranks=get_embedding_ranks,
             get_position_embedding_ranks=get_position_embedding_ranks,
             restart_store=restart_store,
             use_inprocess_restart=use_inprocess_restart,
         )
+
+        if get_rank_safe() == 0:
+            start_time = time.time()
+            print("> compiling dataset index builder ...")
+            compile_helpers()
+            print(
+                ">>> done with dataset index builder. Compilation time: {:.3f} seconds".format(
+                    time.time() - start_time
+                ),
+                flush=True,
+            )
+        torch.distributed.barrier()
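The compile-then-barrier sequencing suggested above can be illustrated with a self-contained stdlib sketch. This is an illustration only: `multiprocessing.Barrier` stands in for `torch.distributed.barrier()`, and `worker`/`run_demo` are hypothetical names, not Megatron-Bridge APIs. The point is that non-zero ranks block at the barrier until rank 0 finishes "compiling", so no rank can race ahead.

```python
# Sketch of the suggested fix: rank 0 does the one-time compilation, then
# every rank waits at a barrier so none proceeds before it is done. The
# stdlib multiprocessing.Barrier stands in for torch.distributed.barrier().
import multiprocessing as mp
import time


def worker(rank, barrier, results):
    if rank == 0:
        time.sleep(0.05)  # simulate compile_helpers() on rank 0 only
        results[rank] = "compiled"
    barrier.wait()  # all ranks block here until rank 0 has finished
    if rank != 0:
        results[rank] = "proceeded"


def run_demo(world_size: int = 4) -> dict:
    """Launch world_size processes and report what each rank did."""
    barrier = mp.Barrier(world_size)
    with mp.Manager() as manager:
        results = manager.dict()
        procs = [
            mp.Process(target=worker, args=(r, barrier, results))
            for r in range(world_size)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(results)


if __name__ == "__main__":
    print(run_demo())
```

In the real code the same ordering holds: `compile_helpers()` runs on rank 0 inside `finish_mpu_init()` after `_initialize_distributed()`, and the trailing `torch.distributed.barrier()` is what makes the other ranks wait instead of timing out inside a collective.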

yaoyu-33 previously approved these changes Feb 5, 2026
