
Adding support for training with megatron-lm #873

Merged
Kipok merged 6 commits into main from igitman/megatron-sft2 on Oct 1, 2025

Conversation

Collaborator

@Kipok Kipok commented Sep 30, 2025

Support is very minimal right now. I'm not adding docs or tests, as we'd likely need to refine this as we get more use cases.

Users would typically need to mount their custom Megatron-LM folder as well as define a custom container.

containers:
  megatron: <path to custom .sqsh>

mounts:
  - <path to megatron-lm custom branch on cluster>:/opt/Megatron-LM

Example command:

from nemo_skills.pipeline.cli import train_megatron_lm, wrap_arguments


output_dir = '/workspace/test-megatron'

env_vars = [
    "export UB_TIMEOUT=720",
    "export CUDA_DEVICE_MAX_CONNECTIONS=1",
    "export NVTE_FWD_LAYERNORM_SM_MARGIN=16",
    "export NVTE_BWD_LAYERNORM_SM_MARGIN=16",
    "export NCCL_P2P_NET_CHUNKSIZE=2097152",
    "export NCCL_DEBUG=WARN",
    "export TORCHINDUCTOR_WORKER_START=fork",
    "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
    f"export TRITON_CACHE_DIR={output_dir}/triton-cache",
    f"export TRITON_HOME={output_dir}/triton-cache",
]
env_vars_cmd = " && ".join(env_vars)

BLEND_PATH = "<>"
PRETRAINED_CKPT = "<>"
TOKENIZER_MODEL = "<>"
PROMPT_FORMAT = "nemotron-nano-v2"


megatron_options = f" \
    --sft \
    --sft-tokenizer-prompt-format {PROMPT_FORMAT} \
    --tokenizer-type SFTTokenizer \
    \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 8 \
    --expert-tensor-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    ... tons of other options \
"

train_megatron_lm(
    ctx=wrap_arguments(megatron_options),
    cluster='dfw',
    expname='test-megatron-lm',
    output_dir=output_dir,
    entrypoint='pretrain_mamba.py',
    tokenizer_model=TOKENIZER_MODEL,
    megatron_model=PRETRAINED_CKPT,
    per_split_data_path=BLEND_PATH,
    num_gpus=8,
    num_nodes=2,
    num_training_jobs=1,
    partition="interactive",
    init_cmd=env_vars_cmd,
)

Summary by CodeRabbit

  • New Features
    • Added a Megatron-LM training subcommand to the pipeline CLI.
    • Configure cluster/partition, nodes/GPUs, timeouts, output/log directories, mount paths, and runtime extras.
    • Optional Weights & Biases integration with project/group settings.
    • Submit multiple training jobs with dependencies and exclusive scheduling options.
    • Support for experiment reuse and dry-run execution.
    • Validations for paths and environment checks to prevent misconfiguration.
    • Improved logging and experiment/task summaries for submitted runs.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

coderabbitai bot commented Sep 30, 2025

Walkthrough

Adds a Megatron-LM training subcommand and implementation: registers a Typer sub-app under "megatron_lm", wires the CLI to import the new train command, and implements helpers to build the training shell command and submit configurable experiment tasks to the cluster with optional WandB support.

Changes

  • CLI wiring (nemo_skills/pipeline/cli.py): Imports train_megatron_lm to expose the Megatron-LM training command via the CLI.
  • Megatron-LM sub-app registration (nemo_skills/pipeline/megatron_lm/__init__.py): Adds megatron_lm_app (Typer) and registers it under the main app as the megatron_lm subcommand group.
  • Megatron-LM training implementation (nemo_skills/pipeline/megatron_lm/train.py): Adds get_training_cmd to construct the shell training command (init, entrypoint, model/tokenizer/data args, optional WandB flags) and train_megatron_lm to configure cluster/mounts/paths, create experiments/tasks, set Slurm/container options, support dry-run/reuse, and submit jobs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (Typer)
  participant SubApp as megatron_lm_app
  participant Builder as get_training_cmd
  participant ExpMgr as Experiment Manager
  participant Slurm as Cluster/Slurm
  participant W&B as Weights & Biases

  User->>CLI: nemo-skills megatron_lm train [args]
  CLI->>SubApp: invoke train_megatron_lm
  SubApp->>SubApp: resolve cluster, mounts, paths
  SubApp->>Builder: build training shell command
  Builder-->>SubApp: train_cmd
  SubApp->>ExpMgr: get_exp(...)
  loop for each training job
    SubApp->>ExpMgr: add_task(train_cmd, resources, logs, deps)
  end
  alt dry-run
    ExpMgr-->>User: planned tasks output
  else execute
    ExpMgr->>Slurm: run_exp(...)
    opt WandB enabled
      SubApp->>W&B: include WandB args / init info
    end
    Slurm-->>ExpMgr: job statuses
    ExpMgr-->>User: experiment result / last task
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit taps the launch command, hop!
Megatron lines the training crop.
Typer guides the queueing art,
Jobs depart, each plays its part.
WandB winks — the run’s begun. 🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped since CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly describes the primary feature addition (support for training with Megatron-LM) and directly reflects the core changes in wiring the CLI and training entrypoint. It is concise and provides a clear understanding of the pull request's intent.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

77-77: Clarify necessity of unused parameter.

The comment indicates --data-cache-path is "unused for sft", yet it's still passed to the training script. If this parameter is truly unused, consider removing it to reduce confusion. If it's needed for compatibility or future use, update the comment to clarify.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a37231c and 2b68884.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/cli.py (1 hunks)
  • nemo_skills/pipeline/megatron_lm/__init__.py (1 hunks)
  • nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/pipeline/cli.py (1)
nemo_skills/pipeline/megatron_lm/train.py (1)
  • train_megatron_lm (96-243)
nemo_skills/pipeline/megatron_lm/train.py (5)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/pipeline/utils/exp.py (3)
  • add_task (333-611)
  • get_exp (648-664)
  • run_exp (614-645)
nemo_skills/pipeline/utils/mounts.py (4)
  • check_if_mounted (49-56)
  • check_mounts (59-145)
  • get_mounted_path (148-193)
  • resolve_mount_paths (308-350)
nemo_skills/pipeline/utils/cluster.py (2)
  • get_cluster_config (232-286)
  • get_timeout_str (101-105)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py

126-128: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


166-168: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/cli.py (1)

27-27: LGTM!

The import correctly wires the Megatron-LM training command into the CLI, following the established pattern for other training pipelines.

nemo_skills/pipeline/megatron_lm/__init__.py (1)

15-21: LGTM!

The Megatron-LM sub-application is correctly structured and registered following the Typer pattern used elsewhere in the codebase.

nemo_skills/pipeline/megatron_lm/train.py (8)

55-62: LGTM!

The timeout calculation correctly converts from DD:HH:MM:SS format to minutes, including proper round-up for leftover seconds.
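As a rough illustration of that conversion (a hypothetical sketch, not the actual helper in nemo_skills; the function name and signature here are made up):

```python
import math

def timeout_to_minutes(timeout: str) -> int:
    """Convert a Slurm-style DD:HH:MM:SS string to whole minutes, rounding up
    when there are leftover seconds."""
    days, hours, minutes, seconds = (int(part) for part in timeout.split(":"))
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return math.ceil(total_seconds / 60)

# "00:01:30:01" (1h 30m 1s) rounds up to 91 minutes
```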


65-79: Verify hardcoded Megatron-LM path.

The command assumes Megatron-LM is installed at /opt/Megatron-LM. Ensure this path matches the container configuration in cluster_config["containers"]["megatron"], or consider making it configurable for flexibility.


82-89: LGTM!

Weights & Biases configuration is correctly applied when enabled, and extra arguments are properly appended.


92-180: LGTM!

The function signature, decorator configuration, and initial setup correctly follow the established patterns used in other pipeline commands. The use of allow_extra_args appropriately enables passing arbitrary arguments to the underlying Megatron-LM script.


200-214: LGTM!

The training command is correctly constructed with all necessary parameters and remapped paths.


218-238: Sequential job chaining logic is correct.

The loop correctly chains multiple training jobs sequentially via task_dependencies, and the final run_exp call properly respects the dry_run flag.


241-248: LGTM!

The return logic correctly handles experiment reuse scenarios, and the main block follows the standard CLI entry point pattern.


216-239: Validate presence of “megatron” container key
Add an explicit check after loading cluster_config to ensure cluster_config["containers"] contains "megatron" and raise a clear, descriptive error if it’s missing. Update the cluster‐config docs to require a containers.megatron entry.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

191-192: Remap tokenizer_model via get_mounted_path for consistency.

The code only validates that tokenizer_model is mounted but does not remap it to the in-container path, unlike megatron_model and per_split_data_path (handled via check_mounts at lines 184-189). This inconsistency was flagged in a previous review.

Based on the prior review, apply this diff:

     if tokenizer_model.startswith("/"):
-        check_if_mounted(cluster_config, tokenizer_model)
+        tokenizer_model = get_mounted_path(cluster_config, tokenizer_model)

This ensures tokenizer_model is correctly remapped to the container path, matching the handling of other absolute paths.
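For intuition, the remapping the reviewer is asking for can be sketched as follows (a hypothetical simplification; the real logic lives in nemo_skills/pipeline/utils/mounts.py and handles more cases):

```python
def remap_mounted_path(mounts: list[str], path: str) -> str:
    """Translate a cluster-side absolute path into its in-container path,
    given mount specs of the form '<cluster source>:<container destination>'."""
    for mount in mounts:
        source, dest = mount.split(":")
        if path.startswith(source):
            # Replace the cluster-side prefix with the container-side one
            return dest + path[len(source):]
    raise ValueError(f"{path} is not under any mount source")

# remap_mounted_path(["/lustre/tokenizers:/opt/tokenizers"],
#                    "/lustre/tokenizers/tok.model")
# -> "/opt/tokenizers/tok.model"
```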

🧹 Nitpick comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

64-88: Minor formatting refinements.

Two small suggestions:

  1. Line 76's comment "unused for sft" may confuse readers—consider removing if the parameter is required by the script but ignored.
  2. Line 86 adds leading/trailing spaces around extra_arguments, which could result in double spaces if extra_arguments is empty or already padded.

Apply this diff to tighten the formatting:

-    f"    --data-cache-path {output_dir}/megatron-lm-data-cache "  # unused for sft
+    f"    --data-cache-path {output_dir}/megatron-lm-data-cache "
-    cmd += f" {extra_arguments} "
+    cmd += f" {extra_arguments}"
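An alternative to trimming individual f-strings (not in the PR, just a defensive sketch) is to normalize whitespace once before launching, which makes stray padding harmless:

```python
# Collapse any runs of spaces left over from f-string concatenation
cmd = "python  pretrain_mamba.py   --sft  --tokenizer-type SFTTokenizer "
normalized = " ".join(cmd.split())
```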
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b68884 and b53a3d8.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/pipeline/megatron_lm/train.py (5)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/pipeline/utils/exp.py (3)
  • add_task (333-611)
  • get_exp (648-664)
  • run_exp (614-645)
nemo_skills/pipeline/utils/mounts.py (3)
  • check_if_mounted (49-56)
  • check_mounts (59-145)
  • resolve_mount_paths (308-350)
nemo_skills/pipeline/utils/cluster.py (2)
  • get_cluster_config (232-286)
  • get_timeout_str (101-105)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py

125-127: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


165-167: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/pipeline/megatron_lm/train.py (6)

1-35: LGTM!

Imports and module setup are well-organized and follow the project's conventions.


54-61: LGTM!

Timeout calculation correctly rounds up when there are leftover seconds.


125-127: Static analysis false positive; typer_unpacker handles this pattern.

The ruff B008 warnings about typer.Option in argument defaults are false positives here. The @typer_unpacker decorator (line 94) resolves ParameterInfo defaults at runtime, as shown in the relevant code snippets.

Also applies to: 165-167


173-189: LGTM!

The cluster config resolution, mount path setup, and path validation logic are correct and follow the project's conventions.


194-208: LGTM!

The training command construction correctly passes all required parameters to get_training_cmd.


240-242: LGTM!

Standard typer CLI entry point with command name preservation.

Comment on lines +210 to +237
    with get_exp(expname, cluster_config, _reuse_exp) as exp:
        prev_task = _task_dependencies
        for job_id in range(num_training_jobs):
            prev_task = add_task(
                exp,
                cmd=train_cmd,
                task_name=f"{expname}-{job_id}",
                log_dir=f"{log_dir}/training-logs",
                container=cluster_config["containers"]["megatron"],
                num_tasks=num_gpus,
                num_gpus=num_gpus,
                num_nodes=num_nodes,
                cluster_config=cluster_config,
                partition=partition,
                time_min=time_min,
                run_after=run_after,
                reuse_code=reuse_code,
                reuse_code_exp=reuse_code_exp,
                task_dependencies=[prev_task] if prev_task is not None else None,
                slurm_kwargs={"exclusive": exclusive} if exclusive else None,
                installation_command=installation_command,
                skip_hf_home_check=skip_hf_home_check,
            )
        run_exp(exp, cluster_config, sequential=False, dry_run=dry_run)

    if _reuse_exp:
        return [prev_task]
    return exp

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify num_tasks=num_gpus matches Megatron-LM's MPI expectations.

The task submission loop (line 219) sets num_tasks=num_gpus, which assumes one MPI rank per GPU. Ensure this matches Megatron-LM's expected process topology, especially for multi-node training where tensor/pipeline parallelism may require different rank counts.

The rest of the task submission logic—dependency chaining, resource allocation, and return handling—is implemented correctly.

Run the following script to check how other training commands in the codebase handle num_tasks:


🏁 Script executed:

#!/bin/bash
# Description: Check how other training pipelines set num_tasks in add_task calls
rg -nP 'add_task\(' -A15 --type=py -g '!**/test/**' | rg -P 'num_tasks\s*='

Length of output: 148


🏁 Script executed:

#!/bin/bash
rg -nP 'add_task\(' -C3 --glob '*.py'

Length of output: 11121


🏁 Script executed:

#!/bin/bash
sed -n '350,380p' nemo_skills/pipeline/train.py

Length of output: 1173


🏁 Script executed:

#!/bin/bash
sed -n '330,450p' nemo_skills/pipeline/utils/exp.py

Length of output: 5140


🏁 Script executed:

sed -n '1,200p' nemo_skills/pipeline/megatron_lm/train.py

Length of output: 8192


🏁 Script executed:

sed -n '450,600p' nemo_skills/pipeline/utils/exp.py

Length of output: 7048


Compute num_tasks based on executor before calling add_task

In nemo_skills/pipeline/megatron_lm/train.py, you pass num_tasks=num_gpus unconditionally. In other pipelines (e.g. nemo_skills/pipeline/train.py), num_tasks is computed as:

num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1

…before passing it into add_task. Mirroring this pattern here prevents unintended local MPI launches on non-Slurm executors and ensures you launch exactly one MPI rank per GPU, matching Megatron-LM’s expected topology for multi-node tensor/pipeline parallelism. Double-check that one rank per GPU aligns with your Megatron-LM configuration.

🤖 Prompt for AI Agents
nemo_skills/pipeline/megatron_lm/train.py around lines 210 to 237: num_tasks is
being passed as num_gpus unconditionally which can launch multiple local MPI
ranks on non-Slurm executors; compute num_tasks first using the executor (e.g.,
num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1) and then
pass that num_tasks variable into add_task instead of num_gpus; keep the rest of
add_task arguments the same and ensure this one-rank-per-GPU choice matches your
Megatron-LM topology expectations.

@gwarmstrong gwarmstrong left a comment

LGTM with a minor question

),
output_dir: str = typer.Option(..., help="Where to put results"),
expname: str = typer.Option("megatron-lm-train", help="Experiment name"),
entrypoint: str = typer.Option(..., help="Entrypoint script name, e.g. pretrain_gpt.py or pretrain_mamba.py"),
Collaborator

where can we get the list of possible entrypoints?

Collaborator Author

It's kind of arbitrary as far as I understand, and people can have their own starting scripts, so I'm not sure we can check for correctness here. E.g. there is a whole bunch of pretrain_* scripts in https://github.com/NVIDIA/Megatron-LM, but when working on a custom branch there could be more.

@Kipok Kipok merged commit db689b1 into main Oct 1, 2025
6 checks passed
@Kipok Kipok deleted the igitman/megatron-sft2 branch October 1, 2025 00:38
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>