fix: Fixes to make Megatron backend match dtensor#1389

Merged
terrykong merged 19 commits into main from ashors/fix-mcore-scale
Oct 31, 2025

Conversation


@ashors1 ashors1 commented Oct 17, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added supervised fine-tuning recipe configuration for Qwen2.5-Math-7B with Megatron tensor parallelism support.
  • Tests

    • Added gradient norm consistency validation tests across different parallelism configurations.
  • Chores

    • Simplified configuration format by removing deprecated distributed data parallel settings from multiple configuration files.
    • Enhanced internal gradient handling for per-token loss computation in policy training.

Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 changed the title from "enable megatron calculate_per_token_loss" to "fix: enable megatron calculate_per_token_loss" Oct 17, 2025
@ashors1 ashors1 marked this pull request as ready for review October 28, 2025 17:10
@ashors1 ashors1 requested review from a team as code owners October 28, 2025 17:10
@terrykong terrykong added r0.4.0 and removed r0.4.0 labels Oct 28, 2025
@ashors1 ashors1 changed the title from "fix: enable megatron calculate_per_token_loss" to "fix: Fixes to make Megatron backend match dtensor" Oct 29, 2025
terrykong
terrykong previously approved these changes Oct 29, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 29, 2025

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

Walkthrough

The PR removes the average_in_collective configuration flag from MegatronDDPConfig across example configs and tests, adds conditional logic to override it when per-token loss calculation is enabled, and introduces a new SFT recipe for Qwen2.5-Math-7B with Megatron-based tensor model parallelism.

Changes

Cohort / File(s): Summary

  • Config flag removal (examples/configs/distillation_math.yaml, examples/configs/distillation_math_megatron.yaml, examples/configs/dpo.yaml, examples/configs/grpo_math_1B.yaml, examples/configs/grpo_math_1B_megatron.yaml, examples/configs/rm.yaml, examples/configs/sft.yaml, examples/configs/sft_openmathinstruct2_megatron.yaml, examples/configs/vlm_grpo_3B.yaml, examples/configs/vlm_grpo_3B_megatron.yaml): Removed average_in_collective: true from policy.megatron_cfg.distributed_data_parallel_config in all referenced example configurations.
  • New SFT recipe (examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml): New SFT recipe configuration for Qwen2.5-Math-7B with Megatron setup, including tensor model parallelism (TP size 4), context parallelism, sequence packing, optimizer settings, and WandB/TensorBoard logging for a 2-node, 8-GPU cluster.
  • TypedDict field removal (nemo_rl/models/policy/__init__.py): Removed the average_in_collective: bool field from the MegatronDDPConfig TypedDict definition.
  • Policy worker gradient handling (nemo_rl/models/policy/megatron_policy_worker.py): Added GPU/TP gradient handling: enable calculate_per_token_loss and perform_initialization in the model config, add an MoE aux-loss compatibility assertion, override average_in_collective to False when per-token loss is True, and introduce a new check_tensor_parallel_attributes() utility method to inspect model parameters' TP attributes.
  • Test config updates (tests/unit/models/generation/test_vllm_generation.py, tests/unit/models/policy/test_megatron_worker.py, tools/refit_verifier.py): Removed the average_in_collective entry from distributed data parallel configs in test/tool configurations.
  • New test infrastructure (tests/unit/models/policy/test_megatron_worker.py): Added imports (numpy, ray) and a new test function, test_megatron_gradient_norm_consistency_across_parallelism(), to validate gradient norms and losses across multiple DP/TP configurations.
  • New SFT test script (tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh): New Bash test script that runs SFT training with Qwen2.5-Math-7B-Megatron, converts TensorBoard logs to JSON, and conditionally validates metrics against configured thresholds.
  • Test suite registration (tests/test_suites/nightly.txt): Added the test entry tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh under the SFT section of the nightly test suite with the comment "validate TP/DP".
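The policy-worker override summarized above can be sketched in a few lines. This is an illustrative sketch, not the actual NeMo-RL code: the function name resolve_ddp_config and the dictionary layout are assumptions; only the rule itself (per-token loss forces average_in_collective to False, avoiding an MCore assertion) comes from this PR.

```python
# Illustrative sketch of the override described above. The function name and
# dict layout are assumptions; only the rule itself reflects this PR.

def resolve_ddp_config(megatron_cfg: dict) -> dict:
    """Return the DDP config with average_in_collective overridden if needed."""
    ddp_cfg = dict(megatron_cfg.get("distributed_data_parallel_config", {}))
    if megatron_cfg.get("calculate_per_token_loss", False):
        # MCore treats these options as incompatible: per-token loss needs a
        # plain gradient sum plus manual normalization, not averaging inside
        # the collective.
        ddp_cfg["average_in_collective"] = False
    return ddp_cfg


cfg = {
    "calculate_per_token_loss": True,
    "distributed_data_parallel_config": {"grad_reduce_in_fp32": True},
}
print(resolve_ddp_config(cfg))
# → {'grad_reduce_in_fp32': True, 'average_in_collective': False}
```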

Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Training Config
    participant PolicyWorker as MegatronPolicyWorker
    participant Model as Model Config

    Config->>PolicyWorker: Initialize with policy settings
    PolicyWorker->>Model: Enable calculate_per_token_loss
    PolicyWorker->>Model: Enable perform_initialization

    alt calculate_per_token_loss is True
        PolicyWorker->>Model: Override average_in_collective = False
        Note over PolicyWorker,Model: Avoid MCore assertion errors
    else calculate_per_token_loss is False
        Note over PolicyWorker,Model: Use default average_in_collective
    end

    PolicyWorker->>PolicyWorker: check_tensor_parallel_attributes()
    Note over PolicyWorker: Inspect TP status, partition dims,<br/>shapes across model parameters
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • nemo_rl/models/policy/megatron_policy_worker.py: Requires careful review of the conditional logic for overriding average_in_collective and validation that the new check_tensor_parallel_attributes() method correctly inspects model parameters.
  • tests/unit/models/policy/test_megatron_worker.py: New gradient norm consistency test function involves multi-configuration training and comparison logic that requires verification of test correctness and assertion thresholds.
  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml: Review hyperparameter choices, distributed settings, and data configuration for the new recipe.
  • Remaining config removals are repetitive and require minimal effort per file.

Possibly related PRs

Suggested reviewers

  • zpqiu
  • parthchadha
  • yfw

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Test Results For Major Changes: ⚠️ Warning
Explanation: The PR description in GitHub is incomplete—it contains only placeholder text from a template ("Add a one line overview of what this PR aims to accomplish" without actual content) and unchecked checklists. However, the CodeRabbit auto-generated summary does mention that "gradient norm consistency validation tests across different parallelism configurations" were added, and the raw summary confirms a new test function test_megatron_gradient_norm_consistency_across_parallelism and test script sft-qwen2.5-math-7b-megatron.sh are included. While these tests exist in the code, the PR description itself does not document the actual test results, convergence validation, or confirmation that these changes do not introduce training regressions—which is critical since the changes modify distributed gradient averaging behavior that directly affects training numerics.
Resolution: The PR description must be updated to include explicit test results and validation data. Specifically, add: (1) results from running the new gradient norm consistency tests showing they pass across different parallelism configurations, (2) convergence validation on at least one affected model (e.g., the Qwen2.5-Math-7B SFT recipe) comparing training metrics before and after these changes, and (3) confirmation that the nightly test suite validations pass. This documentation is essential since the changes affect distributed training gradient averaging—a numerically sensitive component—and the current template-only PR description does not demonstrate there are no regressions.

Title Check: ❓ Inconclusive
Explanation: The title "fix: Fixes to make Megatron backend match dtensor" is related to the pull request's stated objective of aligning Megatron with dtensor behavior. However, the title lacks specificity about the actual changes made. The primary modifications involve removing the average_in_collective configuration parameter across multiple files, removing it from the MegatronDDPConfig TypedDict definition, and adding new gradient handling logic. A teammate scanning the commit history would understand that the PR addresses some compatibility issue but would not clearly grasp what specific changes were implemented.
Resolution: Consider revising the title to be more specific about the key changes, such as "Remove average_in_collective config to align Megatron DDP with dtensor" or "Fix Megatron backend DDP configuration to match dtensor behavior." This would clearly communicate the primary technical change while maintaining the fix-focused intent.

✅ Passed checks (2 passed)

Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unit/models/policy/test_megatron_worker.py (1)

2436-2436: Consider adding explicit strict=True to zip() call.

While the test already asserts equal length on lines 2432-2434, adding strict=True to the zip() call would provide additional runtime enforcement and follows Python 3.10+ best practices. Since the coding guidelines target Python 3.12+, this parameter is available.

Apply this diff:

```diff
-            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm)):
+            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm, strict=True)):
```

Based on static analysis hint from Ruff.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3aac89 and f2cb2ba.

📒 Files selected for processing (18)
  • examples/configs/distillation_math.yaml (0 hunks)
  • examples/configs/distillation_math_megatron.yaml (0 hunks)
  • examples/configs/dpo.yaml (0 hunks)
  • examples/configs/grpo_math_1B.yaml (0 hunks)
  • examples/configs/grpo_math_1B_megatron.yaml (0 hunks)
  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml (1 hunks)
  • examples/configs/rm.yaml (0 hunks)
  • examples/configs/sft.yaml (0 hunks)
  • examples/configs/sft_openmathinstruct2_megatron.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B_megatron.yaml (0 hunks)
  • nemo_rl/models/policy/__init__.py (0 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh (1 hunks)
  • tests/test_suites/nightly.txt (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (0 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (2 hunks)
  • tools/refit_verifier.py (0 hunks)
💤 Files with no reviewable changes (13)
  • examples/configs/sft_openmathinstruct2_megatron.yaml
  • examples/configs/dpo.yaml
  • examples/configs/distillation_math.yaml
  • examples/configs/vlm_grpo_3B_megatron.yaml
  • examples/configs/vlm_grpo_3B.yaml
  • examples/configs/grpo_math_1B.yaml
  • tools/refit_verifier.py
  • examples/configs/grpo_math_1B_megatron.yaml
  • tests/unit/models/generation/test_vllm_generation.py
  • examples/configs/sft.yaml
  • examples/configs/distillation_math_megatron.yaml
  • examples/configs/rm.yaml
  • nemo_rl/models/policy/__init__.py
🧰 Additional context used
📓 Path-based instructions (10)
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
  • tests/unit/models/policy/test_megatron_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
🧠 Learnings (10)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
PR: NVIDIA-NeMo/RL#1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • nemo_rl/models/policy/megatron_policy_worker.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
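The NUM_RUNS expression quoted in the learning above is ceiling division. A Python equivalent, purely as an illustration (variable names mirror the shell script):

```python
import math


def num_runs(max_steps: int, steps_per_run: int) -> int:
    # Integer ceiling division, equivalent to the shell arithmetic
    # $(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
    return (max_steps + steps_per_run - 1) // steps_per_run


assert num_runs(100, 25) == 4
assert num_runs(101, 25) == 5                 # a partial final run still counts
assert num_runs(101, 25) == math.ceil(101 / 25)
```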
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
PR: NVIDIA-NeMo/RL#1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
🧬 Code graph analysis (1)
tests/unit/models/policy/test_megatron_worker.py (7)
tests/unit/conftest.py (1)
  • tiny_llama_model_path (500-524)
nemo_rl/distributed/batched_data_dict.py (2)
  • to (825-832)
  • BatchedDataDict (75-860)
nemo_rl/distributed/virtual_cluster.py (2)
  • RayVirtualCluster (186-505)
  • shutdown (477-496)
nemo_rl/models/policy/lm_policy.py (4)
  • Policy (59-809)
  • prepare_for_training (652-655)
  • train (465-558)
  • shutdown (749-756)
nemo_rl/algorithms/loss_functions.py (1)
  • NLLLoss (379-455)
nemo_rl/models/policy/megatron_policy_worker.py (3)
  • prepare_for_training (2003-2026)
  • train (899-1149)
  • shutdown (2229-2234)
nemo_rl/distributed/worker_groups.py (2)
  • run_all_workers_single_data (728-772)
  • shutdown (930-1004)
🪛 Ruff (0.14.2)
tests/unit/models/policy/test_megatron_worker.py

2436-2436: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh

[warning] 9-9: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 19-19: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 32-32: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (7)
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml (1)

36-42: Data paths verified and correct.

Both examples/prompts/math.txt and the openmathinstruct2 dataset are available in the repository. The prompt file exists at the referenced path, and the dataset is implemented as OpenMathInstruct2Dataset with full integration in the data loading framework. No changes needed.

tests/test_suites/nightly.txt (1)

70-71: Nightly test entry correctly appended.

The new test path is properly added to the SFT/Megatron section with a descriptive comment. Path format (relative to tests/test_suites/) and placement follow established patterns in the file.

tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh (1)

1-43: Verification complete—all dependencies satisfied.

All nine required variables and functions are properly defined in tests/test_suites/llm/common.env: PROJECT_ROOT, CONFIG_PATH, LOG_DIR, CKPT_DIR, EXP_NAME, RUN_LOG, JSON_METRICS, exit_if_max_steps_reached, and they are correctly initialized and exported. The script has no missing dependencies.

nemo_rl/models/policy/megatron_policy_worker.py (2)

707-709: Good addition of explanatory comment.

The comment explains why average_in_collective=False is required when calculate_per_token_loss=True, which addresses the past review feedback requesting clarification of this setting. The relationship between these two configuration options is now documented.

Based on past review comments requesting descriptive explanations for this configuration.


2250-2292: LGTM! Well-structured introspection utility.

The new check_tensor_parallel_attributes method provides a clear way to inspect tensor parallel attributes on model parameters. The implementation correctly:

  • Iterates through all model parameters
  • Captures relevant TP attributes (tensor_model_parallel, partition_dim, partition_stride)
  • Returns a well-structured dictionary with comprehensive information
  • Is properly tested in the test suite (see test_megatron_gradient_norm_consistency_across_parallelism)

This is a useful diagnostic tool for validating tensor parallel setup.
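As a rough sketch of what such an introspection utility looks like: the attribute names (tensor_model_parallel, partition_dim, partition_stride) follow Megatron-Core conventions, but the parameter stub and return shape here are invented for illustration and are not the actual implementation.

```python
class FakeParam:
    """Stand-in for a model parameter carrying Megatron-style TP attributes."""

    def __init__(self, shape, tp=False, dim=-1, stride=1):
        self.shape = shape
        self.tensor_model_parallel = tp
        self.partition_dim = dim
        self.partition_stride = stride


def check_tensor_parallel_attributes(named_params):
    """Collect TP attributes for every named parameter into one report dict."""
    report = {}
    for name, p in named_params:
        report[name] = {
            "tensor_model_parallel": getattr(p, "tensor_model_parallel", False),
            "partition_dim": getattr(p, "partition_dim", -1),
            "partition_stride": getattr(p, "partition_stride", 1),
            "shape": tuple(p.shape),
        }
    return report


params = [
    ("embedding.weight", FakeParam((1024, 512), tp=True, dim=0)),
    ("final_norm.weight", FakeParam((512,))),
]
report = check_tensor_parallel_attributes(params)
print(report["embedding.weight"]["partition_dim"])  # → 0
```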

tests/unit/models/policy/test_megatron_worker.py (2)

19-21: LGTM! Imports support new test functionality.

The numpy import is used for numerical comparisons in the gradient norm consistency test (line 2450), and the ray import is used to retrieve results from distributed workers (line 2378). Both are necessary for the new test function.


2271-2465: Excellent comprehensive test for gradient norm consistency!

This test thoroughly validates that:

  1. Gradient norms are consistent across different parallelization strategies (DP, TP)
  2. Losses are consistent across configurations
  3. Tensor parallel attributes are correctly set on model parameters when TP > 1

The test design is sound:

  • Uses reproducible test data (fixed seed)
  • Tests three meaningful configurations: DP1TP1 (baseline), DP2, and TP2
  • Uses float32 for stable gradient comparisons
  • Validates tensor parallel attributes via the new check_tensor_parallel_attributes method
  • Provides clear diagnostic output for debugging

The test directly validates the changes made in megatron_policy_worker.py related to per-token loss handling and tensor parallel setup.

Signed-off-by: ashors1 <ashors@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 30, 2025
terrykong
terrykong previously approved these changes Oct 30, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 31, 2025
@terrykong terrykong merged commit 90fb0a8 into main Oct 31, 2025
40 of 41 checks passed
@terrykong terrykong deleted the ashors/fix-mcore-scale branch October 31, 2025 21:23
ashors1 added a commit that referenced this pull request Oct 31, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
terrykong pushed a commit that referenced this pull request Nov 2, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Nov 13, 2025
4 tasks
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 18, 2026
4 tasks
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0
