
Adding support for training with megatron-lm #873

Merged
Kipok merged 6 commits into main from igitman/megatron-sft2 on Oct 1, 2025

Conversation

Collaborator

@Kipok Kipok commented Sep 30, 2025

Support is very minimal right now. I'm not adding docs or tests, as we'd likely need to refine this as we get more use cases.

Users would typically need to mount their custom Megatron-LM folder as well as define a custom container.

containers:
  megatron: <path to custom .sqsh>

mounts:
  - <path to megatron-lm custom branch on cluster>:/opt/Megatron-LM

Example command:

from nemo_skills.pipeline.cli import train_megatron_lm, wrap_arguments


output_dir = '/workspace/test-megatron'

env_vars = [
    "export UB_TIMEOUT=720",
    "export CUDA_DEVICE_MAX_CONNECTIONS=1",
    "export NVTE_FWD_LAYERNORM_SM_MARGIN=16",
    "export NVTE_BWD_LAYERNORM_SM_MARGIN=16",
    "export NCCL_P2P_NET_CHUNKSIZE=2097152",
    "export NCCL_DEBUG=WARN",
    "export TORCHINDUCTOR_WORKER_START=fork",
    "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
    f"export TRITON_CACHE_DIR={output_dir}/triton-cache",
    f"export TRITON_HOME={output_dir}/triton-cache",
]
env_vars_cmd = " && ".join(env_vars)

BLEND_PATH = "<>"
PRETRAINED_CKPT = "<>"
TOKENIZER_MODEL = "<>"
PROMPT_FORMAT = "nemotron-nano-v2"


megatron_options = f" \
    --sft \
    --sft-tokenizer-prompt-format {PROMPT_FORMAT} \
    --tokenizer-type SFTTokenizer \
    \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 8 \
    --expert-tensor-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    ... tons of other options \
"

train_megatron_lm(
    ctx=wrap_arguments(megatron_options),
    cluster='dfw',
    expname='test-megatron-lm',
    output_dir=output_dir,
    entrypoint='pretrain_mamba.py',
    tokenizer_model=TOKENIZER_MODEL,
    megatron_model=PRETRAINED_CKPT,
    per_split_data_path=BLEND_PATH,
    num_gpus=8,
    num_nodes=2,
    num_training_jobs=1,
    partition="interactive",
    init_cmd=env_vars_cmd,
)

Summary by CodeRabbit

  • New Features
    • Added a Megatron-LM training subcommand to the pipeline CLI.
    • Configure cluster/partition, nodes/GPUs, timeouts, output/log directories, mount paths, and runtime extras.
    • Optional Weights & Biases integration with project/group settings.
    • Submit multiple training jobs with dependencies and exclusive scheduling options.
    • Support for experiment reuse and dry-run execution.
    • Validations for paths and environment checks to prevent misconfiguration.
    • Improved logging and experiment/task summaries for submitted runs.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

coderabbitai bot commented Sep 30, 2025

Walkthrough

Adds a Megatron-LM training subcommand and implementation: registers a Typer sub-app under "megatron_lm", wires the CLI to import the new train command, and implements helpers to build the training shell command and submit configurable experiment tasks to the cluster with optional WandB support.

Changes

  • CLI wiring (nemo_skills/pipeline/cli.py): Imports train_megatron_lm to expose the Megatron-LM training command via the CLI.
  • Megatron-LM sub-app registration (nemo_skills/pipeline/megatron_lm/__init__.py): Adds megatron_lm_app (Typer) and registers it under the main app as the megatron_lm subcommand group.
  • Megatron-LM training implementation (nemo_skills/pipeline/megatron_lm/train.py): Adds get_training_cmd to construct the shell training command (init, entrypoint, model/tokenizer/data args, optional WandB flags) and train_megatron_lm to configure cluster/mounts/paths, create experiments/tasks, set Slurm/container options, support dry-run/reuse, and submit jobs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (Typer)
  participant SubApp as megatron_lm_app
  participant Builder as get_training_cmd
  participant ExpMgr as Experiment Manager
  participant Slurm as Cluster/Slurm
  participant W&B as Weights & Biases

  User->>CLI: nemo-skills megatron_lm train [args]
  CLI->>SubApp: invoke train_megatron_lm
  SubApp->>SubApp: resolve cluster, mounts, paths
  SubApp->>Builder: build training shell command
  Builder-->>SubApp: train_cmd
  SubApp->>ExpMgr: get_exp(...)
  loop for each training job
    SubApp->>ExpMgr: add_task(train_cmd, resources, logs, deps)
  end
  alt dry-run
    ExpMgr-->>User: planned tasks output
  else execute
    ExpMgr->>Slurm: run_exp(...)
    opt WandB enabled
      SubApp->>W&B: include WandB args / init info
    end
    Slurm-->>ExpMgr: job statuses
    ExpMgr-->>User: experiment result / last task
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit taps the launch command, hop!
Megatron lines the training crop.
Typer guides the queueing art,
Jobs depart, each plays its part.
WandB winks — the run’s begun. 🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped since CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly describes the primary feature addition (support for training with Megatron-LM) and directly reflects the core changes in wiring the CLI and training entrypoint. It is concise and provides a clear understanding of the pull request's intent.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

77-77: Clarify necessity of unused parameter.

The comment indicates --data-cache-path is "unused for sft", yet it's still passed to the training script. If this parameter is truly unused, consider removing it to reduce confusion. If it's needed for compatibility or future use, update the comment to clarify.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a37231c and 2b68884.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/cli.py (1 hunks)
  • nemo_skills/pipeline/megatron_lm/__init__.py (1 hunks)
  • nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/pipeline/cli.py (1)
nemo_skills/pipeline/megatron_lm/train.py (1)
  • train_megatron_lm (96-243)
nemo_skills/pipeline/megatron_lm/train.py (5)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/pipeline/utils/exp.py (3)
  • add_task (333-611)
  • get_exp (648-664)
  • run_exp (614-645)
nemo_skills/pipeline/utils/mounts.py (4)
  • check_if_mounted (49-56)
  • check_mounts (59-145)
  • get_mounted_path (148-193)
  • resolve_mount_paths (308-350)
nemo_skills/pipeline/utils/cluster.py (2)
  • get_cluster_config (232-286)
  • get_timeout_str (101-105)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py

126-128: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


166-168: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/cli.py (1)

27-27: LGTM!

The import correctly wires the Megatron-LM training command into the CLI, following the established pattern for other training pipelines.

nemo_skills/pipeline/megatron_lm/__init__.py (1)

15-21: LGTM!

The Megatron-LM sub-application is correctly structured and registered following the Typer pattern used elsewhere in the codebase.

nemo_skills/pipeline/megatron_lm/train.py (8)

55-62: LGTM!

The timeout calculation correctly converts from DD:HH:MM:SS format to minutes, including proper round-up for leftover seconds.
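As a rough illustration of that conversion (a hypothetical sketch, not the actual helper in nemo_skills; the function name and signature here are made up):

```python
import math

def timeout_to_minutes(timeout: str) -> int:
    """Convert a Slurm-style DD:HH:MM:SS string to whole minutes, rounding up
    when there are leftover seconds."""
    days, hours, minutes, seconds = (int(part) for part in timeout.split(":"))
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return math.ceil(total_seconds / 60)

# "00:01:30:01" (1h 30m 1s) rounds up to 91 minutes
```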


65-79: Verify hardcoded Megatron-LM path.

The command assumes Megatron-LM is installed at /opt/Megatron-LM. Ensure this path matches the container configuration in cluster_config["containers"]["megatron"], or consider making it configurable for flexibility.


82-89: LGTM!

Weights & Biases configuration is correctly applied when enabled, and extra arguments are properly appended.


92-180: LGTM!

The function signature, decorator configuration, and initial setup correctly follow the established patterns used in other pipeline commands. The use of allow_extra_args appropriately enables passing arbitrary arguments to the underlying Megatron-LM script.


200-214: LGTM!

The training command is correctly constructed with all necessary parameters and remapped paths.


218-238: Sequential job chaining logic is correct.

The loop correctly chains multiple training jobs sequentially via task_dependencies, and the final run_exp call properly respects the dry_run flag.


241-248: LGTM!

The return logic correctly handles experiment reuse scenarios, and the main block follows the standard CLI entry point pattern.


216-239: Validate presence of “megatron” container key
Add an explicit check after loading cluster_config to ensure cluster_config["containers"] contains "megatron" and raise a clear, descriptive error if it’s missing. Update the cluster‐config docs to require a containers.megatron entry.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

191-192: Remap tokenizer_model via get_mounted_path for consistency.

The code only validates that tokenizer_model is mounted but does not remap it to the in-container path, unlike megatron_model and per_split_data_path (handled via check_mounts at lines 184-189). This inconsistency was flagged in a previous review.

Based on the prior review, apply this diff:

     if tokenizer_model.startswith("/"):
-        check_if_mounted(cluster_config, tokenizer_model)
+        tokenizer_model = get_mounted_path(cluster_config, tokenizer_model)

This ensures tokenizer_model is correctly remapped to the container path, matching the handling of other absolute paths.
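For intuition, the remapping the reviewer is asking for can be sketched as follows (a hypothetical simplification; the real logic lives in nemo_skills/pipeline/utils/mounts.py and handles more cases):

```python
def remap_mounted_path(mounts: list[str], path: str) -> str:
    """Translate a cluster-side absolute path into its in-container path,
    given mount specs of the form '<cluster source>:<container destination>'."""
    for mount in mounts:
        source, dest = mount.split(":")
        if path.startswith(source):
            # Replace the cluster-side prefix with the container-side one
            return dest + path[len(source):]
    raise ValueError(f"{path} is not under any mount source")

# remap_mounted_path(["/lustre/tokenizers:/opt/tokenizers"],
#                    "/lustre/tokenizers/tok.model")
# -> "/opt/tokenizers/tok.model"
```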

🧹 Nitpick comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

64-88: Minor formatting refinements.

Two small suggestions:

  1. Line 76's comment "unused for sft" may confuse readers—consider removing if the parameter is required by the script but ignored.
  2. Line 86 adds leading/trailing spaces around extra_arguments, which could result in double spaces if extra_arguments is empty or already padded.

Apply this diff to tighten the formatting:

-    f"    --data-cache-path {output_dir}/megatron-lm-data-cache "  # unused for sft
+    f"    --data-cache-path {output_dir}/megatron-lm-data-cache "
-    cmd += f" {extra_arguments} "
+    cmd += f" {extra_arguments}"
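An alternative to trimming individual f-strings (not in the PR, just a defensive sketch) is to normalize whitespace once before launching, which makes stray padding harmless:

```python
# Collapse any runs of spaces left over from f-string concatenation
cmd = "python  pretrain_mamba.py   --sft  --tokenizer-type SFTTokenizer "
normalized = " ".join(cmd.split())
```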
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b68884 and b53a3d8.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/pipeline/megatron_lm/train.py (5)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/pipeline/utils/exp.py (3)
  • add_task (333-611)
  • get_exp (648-664)
  • run_exp (614-645)
nemo_skills/pipeline/utils/mounts.py (3)
  • check_if_mounted (49-56)
  • check_mounts (59-145)
  • resolve_mount_paths (308-350)
nemo_skills/pipeline/utils/cluster.py (2)
  • get_cluster_config (232-286)
  • get_timeout_str (101-105)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py

125-127: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


165-167: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/pipeline/megatron_lm/train.py (6)

1-35: LGTM!

Imports and module setup are well-organized and follow the project's conventions.


54-61: LGTM!

Timeout calculation correctly rounds up when there are leftover seconds.


125-127: Static analysis false positive; typer_unpacker handles this pattern.

The ruff B008 warnings about typer.Option in argument defaults are false positives here. The @typer_unpacker decorator (line 94) resolves ParameterInfo defaults at runtime, as shown in the relevant code snippets.

Also applies to: 165-167


173-189: LGTM!

The cluster config resolution, mount path setup, and path validation logic are correct and follow the project's conventions.


194-208: LGTM!

The training command construction correctly passes all required parameters to get_training_cmd.


240-242: LGTM!

Standard typer CLI entry point with command name preservation.

Comment on lines +210 to +237
    with get_exp(expname, cluster_config, _reuse_exp) as exp:
        prev_task = _task_dependencies
        for job_id in range(num_training_jobs):
            prev_task = add_task(
                exp,
                cmd=train_cmd,
                task_name=f"{expname}-{job_id}",
                log_dir=f"{log_dir}/training-logs",
                container=cluster_config["containers"]["megatron"],
                num_tasks=num_gpus,
                num_gpus=num_gpus,
                num_nodes=num_nodes,
                cluster_config=cluster_config,
                partition=partition,
                time_min=time_min,
                run_after=run_after,
                reuse_code=reuse_code,
                reuse_code_exp=reuse_code_exp,
                task_dependencies=[prev_task] if prev_task is not None else None,
                slurm_kwargs={"exclusive": exclusive} if exclusive else None,
                installation_command=installation_command,
                skip_hf_home_check=skip_hf_home_check,
            )
        run_exp(exp, cluster_config, sequential=False, dry_run=dry_run)

    if _reuse_exp:
        return [prev_task]
    return exp

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify num_tasks=num_gpus matches Megatron-LM's MPI expectations.

The task submission loop (line 219) sets num_tasks=num_gpus, which assumes one MPI rank per GPU. Ensure this matches Megatron-LM's expected process topology, especially for multi-node training where tensor/pipeline parallelism may require different rank counts.

The rest of the task submission logic—dependency chaining, resource allocation, and return handling—is implemented correctly.

Run the following script to check how other training commands in the codebase handle num_tasks:


🏁 Script executed:

#!/bin/bash
# Description: Check how other training pipelines set num_tasks in add_task calls
rg -nP 'add_task\(' -A15 --type=py -g '!**/test/**' | rg -P 'num_tasks\s*='

Length of output: 148


🏁 Script executed:

#!/bin/bash
rg -nP 'add_task\(' -C3 --glob '*.py'

Length of output: 11121


🏁 Script executed:

#!/bin/bash
sed -n '350,380p' nemo_skills/pipeline/train.py

Length of output: 1173


🏁 Script executed:

#!/bin/bash
sed -n '330,450p' nemo_skills/pipeline/utils/exp.py

Length of output: 5140


🏁 Script executed:

sed -n '1,200p' nemo_skills/pipeline/megatron_lm/train.py

Length of output: 8192


🏁 Script executed:

sed -n '450,600p' nemo_skills/pipeline/utils/exp.py

Length of output: 7048


Compute num_tasks based on executor before calling add_task

In nemo_skills/pipeline/megatron_lm/train.py, you pass num_tasks=num_gpus unconditionally. In other pipelines (e.g. nemo_skills/pipeline/train.py), num_tasks is computed as:

num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1

…before passing it into add_task. Mirroring this pattern here prevents unintended local MPI launches on non-Slurm executors and ensures you launch exactly one MPI rank per GPU, matching Megatron-LM’s expected topology for multi-node tensor/pipeline parallelism. Double-check that one rank per GPU aligns with your Megatron-LM configuration.

🤖 Prompt for AI Agents
nemo_skills/pipeline/megatron_lm/train.py around lines 210 to 237: num_tasks is
being passed as num_gpus unconditionally which can launch multiple local MPI
ranks on non-Slurm executors; compute num_tasks first using the executor (e.g.,
num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1) and then
pass that num_tasks variable into add_task instead of num_gpus; keep the rest of
add_task arguments the same and ensure this one-rank-per-GPU choice matches your
Megatron-LM topology expectations.

@gwarmstrong gwarmstrong left a comment

LGTM with a minor question

),
output_dir: str = typer.Option(..., help="Where to put results"),
expname: str = typer.Option("megatron-lm-train", help="Experiment name"),
entrypoint: str = typer.Option(..., help="Entrypoint script name, e.g. pretrain_gpt.py or pretrain_mamba.py"),
Collaborator

where can we get the list of possible entrypoints?

Collaborator Author

It's kind of arbitrary as far as I understand, and people can have their own starting scripts, so I'm not sure we can check for correctness here. E.g. there is a whole bunch of pretrain_* scripts in https://github.com/NVIDIA/Megatron-LM, but when working on a custom branch there could be more.

@Kipok Kipok merged commit db689b1 into main Oct 1, 2025
6 checks passed
@Kipok Kipok deleted the igitman/megatron-sft2 branch October 1, 2025 00:38
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>