Adding support for training with megatron-lm #873
Conversation
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Walkthrough

Adds a Megatron-LM training subcommand and implementation: registers a Typer sub-app under "megatron_lm", wires the CLI to import the new train command, and implements helpers to build the training shell command and submit configurable experiment tasks to the cluster with optional WandB support.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as CLI (Typer)
    participant SubApp as megatron_lm_app
    participant Builder as get_training_cmd
    participant ExpMgr as Experiment Manager
    participant Slurm as Cluster/Slurm
    participant W&B as Weights & Biases
    User->>CLI: nemo-skills megatron_lm train [args]
    CLI->>SubApp: invoke train_megatron_lm
    SubApp->>SubApp: resolve cluster, mounts, paths
    SubApp->>Builder: build training shell command
    Builder-->>SubApp: train_cmd
    SubApp->>ExpMgr: get_exp(...)
    loop for each training job
        SubApp->>ExpMgr: add_task(train_cmd, resources, logs, deps)
    end
    alt dry-run
        ExpMgr-->>User: planned tasks output
    else execute
        ExpMgr->>Slurm: run_exp(...)
        opt WandB enabled
            SubApp->>W&B: include WandB args / init info
        end
        Slurm-->>ExpMgr: job statuses
        ExpMgr-->>User: experiment result / last task
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/pipeline/megatron_lm/train.py (1)

77-77: Clarify necessity of unused parameter.

The comment indicates `--data-cache-path` is "unused for sft", yet it's still passed to the training script. If this parameter is truly unused, consider removing it to reduce confusion. If it's needed for compatibility or future use, update the comment to clarify.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- nemo_skills/pipeline/cli.py (1 hunks)
- nemo_skills/pipeline/megatron_lm/__init__.py (1 hunks)
- nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/pipeline/cli.py (1)
- nemo_skills/pipeline/megatron_lm/train.py (1)
  - train_megatron_lm (96-243)

nemo_skills/pipeline/megatron_lm/train.py (5)
- nemo_skills/pipeline/app.py (1)
  - typer_unpacker (25-53)
- nemo_skills/pipeline/utils/exp.py (3)
  - add_task (333-611)
  - get_exp (648-664)
  - run_exp (614-645)
- nemo_skills/pipeline/utils/mounts.py (4)
  - check_if_mounted (49-56)
  - check_mounts (59-145)
  - get_mounted_path (148-193)
  - resolve_mount_paths (308-350)
- nemo_skills/pipeline/utils/cluster.py (2)
  - get_cluster_config (232-286)
  - get_timeout_str (101-105)
- nemo_skills/utils.py (2)
  - get_logger_name (130-131)
  - setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py
126-128: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
166-168: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/cli.py (1)
27-27: LGTM! The import correctly wires the Megatron-LM training command into the CLI, following the established pattern for other training pipelines.
nemo_skills/pipeline/megatron_lm/__init__.py (1)
15-21: LGTM! The Megatron-LM sub-application is correctly structured and registered following the Typer pattern used elsewhere in the codebase.
nemo_skills/pipeline/megatron_lm/train.py (8)
55-62: LGTM! The timeout calculation correctly converts from DD:HH:MM:SS format to minutes, including proper round-up for leftover seconds.
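The conversion described above can be sketched as follows (a minimal illustration with hypothetical names, not the project's actual helper):

```python
def timeout_to_minutes(timeout: str) -> int:
    """Convert a DD:HH:MM:SS timeout string to whole minutes, rounding up."""
    days, hours, minutes, seconds = (int(part) for part in timeout.split(":"))
    total = days * 24 * 60 + hours * 60 + minutes
    if seconds > 0:
        total += 1  # round leftover seconds up to a full minute
    return total
```

For example, "00:01:30:00" yields 90 minutes, and a trailing 30 seconds bumps the result by one.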
65-79: Verify hardcoded Megatron-LM path.

The command assumes Megatron-LM is installed at /opt/Megatron-LM. Ensure this path matches the container configuration in `cluster_config["containers"]["megatron"]`, or consider making it configurable for flexibility.
82-89: LGTM! Weights & Biases configuration is correctly applied when enabled, and extra arguments are properly appended.

92-180: LGTM! The function signature, decorator configuration, and initial setup correctly follow the established patterns used in other pipeline commands. The use of `allow_extra_args` appropriately enables passing arbitrary arguments to the underlying Megatron-LM script.

200-214: LGTM! The training command is correctly constructed with all necessary parameters and remapped paths.

218-238: Sequential job chaining logic is correct.

The loop correctly chains multiple training jobs sequentially via `task_dependencies`, and the final `run_exp` call properly respects the `dry_run` flag.

241-248: LGTM! The return logic correctly handles experiment reuse scenarios, and the main block follows the standard CLI entry point pattern.
216-239: Validate presence of "megatron" container key.

Add an explicit check after loading `cluster_config` to ensure `cluster_config["containers"]` contains `"megatron"`, and raise a clear, descriptive error if it's missing. Update the cluster-config docs to require a `containers.megatron` entry.
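A minimal sketch of such a check, assuming a dict-shaped cluster config (the helper name is hypothetical):

```python
def require_container(cluster_config: dict, key: str = "megatron") -> str:
    """Return the container image registered under `key`, or fail loudly."""
    containers = cluster_config.get("containers") or {}
    if key not in containers:
        raise ValueError(
            f"cluster config is missing a 'containers.{key}' entry; "
            f"available containers: {sorted(containers)}"
        )
    return containers[key]
```

Calling this once right after `get_cluster_config` would turn a later KeyError into an actionable message.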
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)

nemo_skills/pipeline/megatron_lm/train.py (1)

191-192: Remap `tokenizer_model` via `get_mounted_path` for consistency.

The code only validates that `tokenizer_model` is mounted but does not remap it to the in-container path, unlike `megatron_model` and `per_split_data_path` (handled via `check_mounts` at lines 184-189). This inconsistency was flagged in a previous review. Based on the prior review, apply this diff:

```diff
     if tokenizer_model.startswith("/"):
-        check_if_mounted(cluster_config, tokenizer_model)
+        tokenizer_model = get_mounted_path(cluster_config, tokenizer_model)
```

This ensures `tokenizer_model` is correctly remapped to the container path, matching the handling of other absolute paths.
🧹 Nitpick comments (1)

nemo_skills/pipeline/megatron_lm/train.py (1)

64-88: Minor formatting refinements.

Two small suggestions:

- Line 76's comment "unused for sft" may confuse readers; consider removing it if the parameter is required by the script but ignored.
- Line 86 adds leading/trailing spaces around `extra_arguments`, which could result in double spaces if `extra_arguments` is empty or already padded.

Apply this diff to tighten the formatting:

```diff
-        f" --data-cache-path {output_dir}/megatron-lm-data-cache "  # unused for sft
+        f" --data-cache-path {output_dir}/megatron-lm-data-cache "
-    cmd += f" {extra_arguments} "
+    cmd += f" {extra_arguments}"
```
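One way to avoid the double-space issue, sketched as a hypothetical helper rather than the file's actual code:

```python
def append_extra_arguments(cmd: str, extra_arguments: str) -> str:
    """Append user-supplied extra arguments with exactly one separating space."""
    extra = extra_arguments.strip()
    return f"{cmd} {extra}" if extra else cmd
```

This keeps the command unchanged when no extra arguments are given, instead of leaving trailing whitespace.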
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_skills/pipeline/megatron_lm/train.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/pipeline/megatron_lm/train.py (5)
- nemo_skills/pipeline/app.py (1)
  - typer_unpacker (25-53)
- nemo_skills/pipeline/utils/exp.py (3)
  - add_task (333-611)
  - get_exp (648-664)
  - run_exp (614-645)
- nemo_skills/pipeline/utils/mounts.py (3)
  - check_if_mounted (49-56)
  - check_mounts (59-145)
  - resolve_mount_paths (308-350)
- nemo_skills/pipeline/utils/cluster.py (2)
  - get_cluster_config (232-286)
  - get_timeout_str (101-105)
- nemo_skills/utils.py (2)
  - get_logger_name (130-131)
  - setup_logging (85-120)
🪛 Ruff (0.13.1)
nemo_skills/pipeline/megatron_lm/train.py
125-127: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
165-167: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/pipeline/megatron_lm/train.py (6)
1-35: LGTM! Imports and module setup are well-organized and follow the project's conventions.

54-61: LGTM! Timeout calculation correctly rounds up when there are leftover seconds.

125-127: Static analysis false positive; typer_unpacker handles this pattern.

The ruff B008 warnings about `typer.Option` in argument defaults are false positives here. The `@typer_unpacker` decorator (line 94) resolves `ParameterInfo` defaults at runtime, as shown in the relevant code snippets.

Also applies to: 165-167
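The hazard B008 normally guards against can be shown with a stdlib-only sketch; `typer.Option` defaults are a deliberate exception because the sentinel object is replaced at call time (here by `typer_unpacker`):

```python
import time

# A call expression in a default is evaluated once, when `def` runs,
# not on every call; this is what B008 warns about in general.
def stamp(when: float = time.time()):  # default frozen at definition time
    return when

first = stamp()
time.sleep(0.01)
second = stamp()  # returns the same cached default, not a fresh timestamp
```

Since `typer.Option(...)` is a one-time sentinel by design, the warning can be safely suppressed or the option moved to a module-level singleton.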
173-189: LGTM! The cluster config resolution, mount path setup, and path validation logic are correct and follow the project's conventions.

194-208: LGTM! The training command construction correctly passes all required parameters to `get_training_cmd`.

240-242: LGTM! Standard typer CLI entry point with command name preservation.
```python
with get_exp(expname, cluster_config, _reuse_exp) as exp:
    prev_task = _task_dependencies
    for job_id in range(num_training_jobs):
        prev_task = add_task(
            exp,
            cmd=train_cmd,
            task_name=f"{expname}-{job_id}",
            log_dir=f"{log_dir}/training-logs",
            container=cluster_config["containers"]["megatron"],
            num_tasks=num_gpus,
            num_gpus=num_gpus,
            num_nodes=num_nodes,
            cluster_config=cluster_config,
            partition=partition,
            time_min=time_min,
            run_after=run_after,
            reuse_code=reuse_code,
            reuse_code_exp=reuse_code_exp,
            task_dependencies=[prev_task] if prev_task is not None else None,
            slurm_kwargs={"exclusive": exclusive} if exclusive else None,
            installation_command=installation_command,
            skip_hf_home_check=skip_hf_home_check,
        )
    run_exp(exp, cluster_config, sequential=False, dry_run=dry_run)

if _reuse_exp:
    return [prev_task]
return exp
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify num_tasks=num_gpus matches Megatron-LM's MPI expectations.
The task submission loop (line 219) sets num_tasks=num_gpus, which assumes one MPI rank per GPU. Ensure this matches Megatron-LM's expected process topology, especially for multi-node training where tensor/pipeline parallelism may require different rank counts.
The rest of the task submission logic—dependency chaining, resource allocation, and return handling—is implemented correctly.
Run the following script to check how other training commands in the codebase handle num_tasks:
🏁 Script executed:

```shell
#!/bin/bash
# Description: Check how other training pipelines set num_tasks in add_task calls
rg -nP 'add_task\(' -A15 --type=py -g '!**/test/**' | rg -P 'num_tasks\s*='
```

Length of output: 148

🏁 Script executed:

```shell
rg -nP 'add_task\(' -C3 --glob '*.py'
```

Length of output: 11121

🏁 Script executed:

```shell
sed -n '350,380p' nemo_skills/pipeline/train.py
```

Length of output: 1173

🏁 Script executed:

```shell
sed -n '330,450p' nemo_skills/pipeline/utils/exp.py
```

Length of output: 5140

🏁 Script executed:

```shell
sed -n '1,200p' nemo_skills/pipeline/megatron_lm/train.py
```

Length of output: 8192

🏁 Script executed:

```shell
sed -n '450,600p' nemo_skills/pipeline/utils/exp.py
```

Length of output: 7048
Compute num_tasks based on executor before calling add_task

In nemo_skills/pipeline/megatron_lm/train.py, you pass `num_tasks=num_gpus` unconditionally. In other pipelines (e.g. nemo_skills/pipeline/train.py), num_tasks is computed as:

```python
num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1
```

…before passing it into `add_task`. Mirroring this pattern here prevents unintended local MPI launches on non-Slurm executors and ensures you launch exactly one MPI rank per GPU, matching Megatron-LM's expected topology for multi-node tensor/pipeline parallelism. Double-check that one rank per GPU aligns with your Megatron-LM configuration.
🤖 Prompt for AI Agents
nemo_skills/pipeline/megatron_lm/train.py around lines 210 to 237: num_tasks is
being passed as num_gpus unconditionally which can launch multiple local MPI
ranks on non-Slurm executors; compute num_tasks first using the executor (e.g.,
num_tasks = num_gpus if cluster_config["executor"] == "slurm" else 1) and then
pass that num_tasks variable into add_task instead of num_gpus; keep the rest of
add_task arguments the same and ensure this one-rank-per-GPU choice matches your
Megatron-LM topology expectations.
gwarmstrong left a comment
LGTM with a minor question
```python
    ),
    output_dir: str = typer.Option(..., help="Where to put results"),
    expname: str = typer.Option("megatron-lm-train", help="Experiment name"),
    entrypoint: str = typer.Option(..., help="Entrypoint script name, e.g. pretrain_gpt.py or pretrain_mamba.py"),
```
where can we get the list of possible entrypoints?
It's kind of arbitrary as far as I understand, and people can have their own starting scripts, so I'm not sure we can check for correctness here. E.g. there is a whole bunch of pretrain_* scripts in https://github.com/nvidia/megatron-lm, but when working on a custom branch there could be more.
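Since the set of entrypoints is open-ended, one option for discoverability (a sketch, assuming the default /opt/Megatron-LM checkout location) is to surface candidates by globbing rather than validating against a fixed list:

```python
from pathlib import Path

def list_entrypoints(repo_root: str = "/opt/Megatron-LM") -> list[str]:
    """List pretrain_*.py scripts at the top level of a Megatron-LM checkout."""
    return sorted(p.name for p in Path(repo_root).glob("pretrain_*.py"))
```

This would work on custom branches too, since it reads whatever checkout is mounted rather than a hardcoded list.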
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Support is very minimal right now. Not adding docs or tests, as we'd likely need to refine this as we get more use-cases.

Users would typically need to mount their custom megatron-lm folder as well as define a custom container.
example command