Skip to content

perf: Update moe_token_dispatcher_type default to alltoall#2004

Merged
terrykong merged 3 commits intoNVIDIA-NeMo:mainfrom
parthmannan:pmannan/update_moe_default
Mar 6, 2026
Merged

perf: Update moe_token_dispatcher_type default to alltoall#2004
terrykong merged 3 commits intoNVIDIA-NeMo:mainfrom
parthmannan:pmannan/update_moe_default

Conversation

@parthmannan
Copy link
Copy Markdown
Contributor

@parthmannan parthmannan commented Feb 22, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated MOE token dispatcher configuration from "allgather" to "alltoall" across multiple example configurations.
    • Updated corresponding test configurations and documentation to reflect the new dispatcher default.

@parthmannan parthmannan requested review from a team as code owners February 22, 2026 21:19
@parthmannan parthmannan requested review from guyueh1 and removed request for a team February 22, 2026 21:19
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 22, 2026

📝 Walkthrough

Walkthrough

The PR updates Megatron MoE token dispatcher configuration from "allgather" to "alltoall" across example configuration files, test files, and documentation, changing the distributed token routing strategy.

Changes

Cohort / File(s) Summary
Example Configurations
examples/configs/distillation_math.yaml, examples/configs/distillation_math_megatron.yaml, examples/configs/dpo.yaml, examples/configs/grpo_math_1B.yaml, examples/configs/grpo_math_1B_megatron.yaml, examples/configs/sft.yaml, examples/configs/sft_openmathinstruct2_megatron.yaml, examples/configs/vlm_grpo_3B.yaml, examples/configs/vlm_grpo_3B_megatron.yaml, examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
Updated moe_token_dispatcher_type from "allgather" to "alltoall" in Megatron policy configurations across all example YAML files.
Documentation
nemo_rl/models/policy/__init__.py
Updated documentation comment for MegatronConfig.moe_token_dispatcher_type to reflect new default value of "alltoall".
Test Configurations
tests/unit/models/generation/test_vllm_generation.py, tests/unit/models/megatron/test_megatron_setup.py, tests/unit/models/policy/test_megatron_worker.py
Updated Megatron test configurations to expect moe_token_dispatcher_type value of "alltoall" instead of "allgather".

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

Suggested labels

Performance

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR changes moe_token_dispatcher_type default from 'allgather' to 'alltoall' across multiple configs, a significant distributed training change affecting performance and convergence, but PR description lacks test results or regression validation. Add test results and performance metrics demonstrating no regressions from the dispatcher type change, including convergence validation and before/after performance numbers.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: updating the default value of moe_token_dispatcher_type from 'allgather' to 'alltoall' across multiple configuration files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@parthmannan parthmannan changed the title Update moe_token_dispatcher_type default to alltoall perf: Update moe_token_dispatcher_type default to alltoall Feb 22, 2026
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/__init__.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

The file was modified but the header still shows 2025; update it to the current year.

🔧 Suggested fix
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests (files under tests/ or test-only scripts).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/__init__.py` at line 1, Replace the outdated copyright
year in the top-of-file header comment from 2025 to 2026; locate the header
comment at the beginning of the module (the copyright comment line in
nemo_rl/models/policy/__init__.py) and update the year to "2026" so the NVIDIA
copyright header is current.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@nemo_rl/models/policy/__init__.py`:
- Line 1: Replace the outdated copyright year in the top-of-file header comment
from 2025 to 2026; locate the header comment at the beginning of the module (the
copyright comment line in nemo_rl/models/policy/__init__.py) and update the year
to "2026" so the NVIDIA copyright header is current.

@parthmannan parthmannan force-pushed the pmannan/update_moe_default branch from 90a737d to 493ddef Compare February 22, 2026 21:49
@parthmannan parthmannan added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Feb 23, 2026
@guyueh1
Copy link
Copy Markdown
Contributor

guyueh1 commented Mar 5, 2026

@terrykong can we merge this? It's a default config value change

@guyueh1 guyueh1 added the Performance Related to improving performance label Mar 5, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@terrykong terrykong added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Mar 5, 2026
@terrykong terrykong enabled auto-merge (squash) March 5, 2026 23:38
@terrykong
Copy link
Copy Markdown
Collaborator

CI failed L1, but it's b/c it's on a fork.

https://github.com/NVIDIA-NeMo/RL/actions/runs/22741665979/job/65956222345?pr=2004

setting to docs CI since the other tests passed

@terrykong terrykong added CI:docs Run doctest and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels Mar 6, 2026
@terrykong terrykong merged commit 919e373 into NVIDIA-NeMo:main Mar 6, 2026
56 of 60 checks passed
@anwithk anwithk added this to the v0.6 Release milestone Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:docs Run doctest Performance Related to improving performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants