
DSv3 EP=8 for B200, PP8-VP2 for B300 BF16, Lm3.1 405B TP4-CP1 GB300 FP8-CS #2175

Merged
ko3n1g merged 3 commits into main from malay/b200_dsv3_ep8
Feb 3, 2026
Conversation


@malay-nagda malay-nagda commented Feb 2, 2026

What does this PR do ?

  • Change EP=16 to EP=8 for B200 (both BF16 and FP8)
  • Change PP16-VP1 to PP8-VP2 for B300 BF16
  • Change TP2-CP2 to TP4-CP1 for Llama3.1 405B GB300 FP8-CS
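Note that each change above reshuffles the parallelism layout without changing the overall product: TP2-CP2 and TP4-CP1 both occupy 4 GPUs per TP×CP group, and PP16-VP1 and PP8-VP2 both yield 16 virtual pipeline stages. A quick arithmetic sketch (the invariants follow from the numbers in this PR; the variable names are illustrative):

```python
# Llama3.1 405B GB300 FP8-CS: TP2-CP2 -> TP4-CP1
old_tp, old_cp = 2, 2
new_tp, new_cp = 4, 1
# Same number of GPUs per TPxCP group (4); only the split changes.
assert old_tp * old_cp == new_tp * new_cp == 4

# DSv3 B300 BF16: PP16-VP1 -> PP8-VP2
old_pp, old_vp = 16, 1
new_pp, new_vp = 8, 2
# Same total virtual pipeline stages (16); fewer physical stages, more
# virtual stages per pipeline rank.
assert old_pp * old_vp == new_pp * new_vp == 16
```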

Changelog

BASE_DEEPSEEK_V3_CONFIG,
    ...
    - expert_model_parallel_size=16,
    + expert_model_parallel_size=8,

DEEPSEEK_V3_PRETRAIN_CONFIG_B300_BF16_V2 = replace(
    DEEPSEEK_V3_PRETRAIN_CONFIG_B300_V2,
    pipeline_model_parallel_size=8,
    virtual_pipeline_model_parallel_size=2,
)

LLAMA31_405B_PRETRAIN_CONFIG_GB300_FP8_CS_V2 = replace(
    LLAMA31_405B_PRETRAIN_CONFIG_GB300_FP8_CS_V1,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=8,
    context_parallel_size=1,
)
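The V2 configs above are derived with `dataclasses.replace`, which returns a copy of a dataclass with selected fields overridden, leaving the base config untouched. A minimal sketch with a hypothetical config class (the field names follow the snippet above, but the class itself is illustrative, not the repo's actual definition):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PretrainConfig:
    # Illustrative subset of the parallelism fields used in this PR.
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    context_parallel_size: int = 1
    expert_model_parallel_size: int = 1

# Hypothetical base config with the EP=8 value this PR sets for B200.
base = PretrainConfig(expert_model_parallel_size=8)

# Derive a variant; `replace` copies the base rather than mutating it.
v2 = replace(base, pipeline_model_parallel_size=8)

assert base.pipeline_model_parallel_size == 1  # base unchanged
assert v2.pipeline_model_parallel_size == 8    # override applied
assert v2.expert_model_parallel_size == 8      # inherited from base
```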

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Adjusted performance workload configuration parameters for optimization.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough


The expert_model_parallel_size parameter in the DEEPSEEK_V3_PRETRAIN_CONFIG_B200_V1 configuration is reduced from 16 to 8. No other configuration values, control flow, or logic are affected by this change.

Changes

Cohort / File(s): Configuration Parameter Update — scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
Summary: Reduced expert_model_parallel_size from 16 to 8 in DEEPSEEK_V3_PRETRAIN_CONFIG_B200_V1.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)

❌ Failed checks:
  • Test Results For Major Changes — ⚠️ Warning: The PR changes DeepSeek V3 expert_model_parallel_size from 16 to 8 on B200 without the performance benchmarks, convergence data, or testing confirmation required by the CONTRIBUTING.md guidelines. Resolution: add performance metrics (throughput, GPU utilization, memory efficiency), convergence validation, an explanation of the EP=8 optimization for B200, and testing confirmation.
  • Title check — ❓ Inconclusive: The title is highly technical and bundles multiple unrelated configuration changes, making it difficult to identify the primary change. While it mentions 'DSv3 EP=8 for B200', which relates to the code change, it also includes items (PP8-VP2 for B300, Lm3.1 405B, etc.) that are not reflected in the actual changeset. Resolution: simplify the title to focus on the primary change, e.g. 'Change DeepSeek V3 expert parallelism to 8 for B200', and remove configuration items not present in this PR's changeset.

✅ Passed checks:
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


Signed-off-by: Malay Nagda <malayn@nvidia.com>

Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch


3 participants