
dsv3_gb300_revert- BF16 & FP8-MX scale #2277

Merged
ko3n1g merged 1 commit into main from malay/dsv3_gb300_revert_2602
Feb 9, 2026

Conversation

@malay-nagda
Contributor

@malay-nagda malay-nagda commented Feb 9, 2026

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Updated DeepSeek V3 pretraining configurations to optimize performance parameters across different hardware variants, including batch size adjustments and computational efficiency settings for large-scale model training workloads.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@malay-nagda malay-nagda added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 9, 2026
@malay-nagda malay-nagda marked this pull request as ready for review February 9, 2026 13:56
@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Updates DeepSeek V3 workload base configuration variants by replacing simple aliases with explicit replace() calls that customize micro-batch sizing, pipeline parallelism, expert distribution, MOE dispatcher backend, CUDA graph parameters, and recompute modules for GB300 GPU cluster configurations.

Changes

Cohort / File(s): DeepSeek Configuration Variants
scripts/performance/configs/deepseek/deepseek_workload_base_configs.py

Summary: Restructured DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_BF16_V1 from an alias into an explicit config with micro_batch_size, pipeline/virtual pipeline sizes, expert parallelism, MOE dispatcher backend, CUDA graph settings, and recompute modules. Updated BF16_V2 to derive from BF16_V1 with a global_batch_size override. Changed the FP8_MX_LARGE_SCALE base from GB300_FP8_MX_V1 to GB300_BF16_V1.
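The walkthrough above describes replacing simple aliases with explicit replace() calls so that each variant overrides only the fields that differ from its base. A minimal, self-contained sketch of that pattern follows; the WorkloadConfig dataclass and all field values here are illustrative stand-ins, not the repository's actual definitions:

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for the real workload config dataclass; field
# names mirror those mentioned in the walkthrough, values are made up.
@dataclass(frozen=True)
class WorkloadConfig:
    micro_batch_size: int = 2
    global_batch_size: int = 512
    pipeline_model_parallel_size: int = 2
    virtual_pipeline_model_parallel_size: int = 8

# Base GB300 config.
GB300_V1 = WorkloadConfig()

# Explicit replace() instead of a simple alias: override only the
# fields that differ, inherit everything else from the base.
GB300_BF16_V1 = replace(
    GB300_V1,
    micro_batch_size=1,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=4,
)

# A V2 variant derives from V1 with only a global_batch_size override.
GB300_BF16_V2 = replace(GB300_BF16_V1, global_batch_size=8192)
```

Because replace() copies every field not named in the override, the derived variant stays in sync with its base for all other settings, which is the point of deriving configs rather than duplicating them. The flip side, flagged in the review below, is that any inherited field that silently assumes the base's other values must be overridden too.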

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested labels

performance

Suggested reviewers

  • ko3n1g
  • dingqingy-nv
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The title mentions 'BF16 & FP8-MX scale' but lacks clarity about the actual change, which involves replacing configuration aliases with explicit replace() calls to customize multiple parameters. The term 'revert' is misleading, as this is not reverting changes but making forward modifications. Resolution: use a clearer, more specific title such as 'Update DeepSeek V3 GB300 configurations to customize batch size, parallelism, and optimization settings', removing the ambiguous 'revert' terminology.
  • Test Results For Major Changes ⚠️ Warning: The PR contains major DeepSeek V3 parallelism and configuration changes but lacks test results, performance metrics, convergence verification, or before/after comparisons in its description. Resolution: add test results, before/after performance metrics, and convergence verification, and address the missing pp_layout override justification.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/performance/configs/deepseek/deepseek_workload_base_configs.py`:
- Around line 55-66: DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_BF16_V1 currently
inherits a pp_layout from DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_V1 that assumes
(pipeline_model_parallel_size=2, virtual_pipeline_model_parallel_size=8); update
the replace call for DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_BF16_V1 to either set
pp_layout explicitly for the new (pipeline_model_parallel_size=4,
virtual_pipeline_model_parallel_size=4) configuration or remove pp_layout from
the replace override so the framework can auto-compute the correct layout based
on the updated pipeline_model_parallel_size and
virtual_pipeline_model_parallel_size values.

Comment on lines +55 to +66
DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_BF16_V1 = replace(
    DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_V1,
    micro_batch_size=1,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=4,
    expert_model_parallel_size=64,
    moe_flex_dispatcher_backend="hybridep",
    moe_a2a_overlap=False,
    cuda_graph_impl="transformer_engine",
    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
    recompute_modules=["moe_act"],
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, read the target file to understand the configuration structure
cd scripts/performance/configs/deepseek || exit 1
cat -n deepseek_workload_base_configs.py | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3435


🏁 Script executed:

# Search for pp_layout usage across the codebase to understand how it's validated/used
rg -n -C 3 'pp_layout' --type=py | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4348


🏁 Script executed:

# Look for validation logic that checks compatibility between pp_layout and parallelism params
rg -n -C 5 'pipeline_model_parallel_size|virtual_pipeline_model_parallel_size' --type=py | grep -C 3 'pp_layout' | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3548


🏁 Script executed:

# Read more of the file to see other related configurations and understand the pattern
sed -n '55,160p' scripts/performance/configs/deepseek/deepseek_workload_base_configs.py | cat -n

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5120


🏁 Script executed:

# Check the set_deepseek_v3_pipeline_model_parallel_layout function to understand what it does
rg -n -A 20 'def set_deepseek_v3_pipeline_model_parallel_layout' --type=py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2579


🏁 Script executed:

# Check if there are any tests or validation that would catch this mismatch
rg -n 'pp_layout.*Et\*4' --type=py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 179


🏁 Script executed:

# Verify that (4, 4) is indeed the correct size for BF16_V1
grep -n 'pipeline_model_parallel_size=4' scripts/performance/configs/deepseek/deepseek_workload_base_configs.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 290


Override pp_layout in DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_BF16_V1 for the new parallelism configuration.

BF16_V1 inherits pp_layout="Et*4|(t*4|)*14tmL" from GB300_V1, which is designed for pipeline_model_parallel_size=2 and virtual_pipeline_model_parallel_size=8. However, BF16_V1 changes both to 4 without overriding pp_layout. The configuration framework will use the inherited layout string instead of auto-computing the correct one for (4, 4), leading to incorrect layer distribution across pipeline stages at runtime.

Either explicitly set pp_layout to match the (4, 4) configuration or remove it entirely to trigger auto-computation based on the updated parallelism parameters.
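The second option from the review (clearing the inherited layout so it is recomputed) can be sketched as follows. This is a hypothetical illustration: the minimal dataclass stands in for the real workload config, and it assumes, as the review describes, that an unset pp_layout triggers the framework's auto-computation from the parallelism sizes:

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical minimal config; pp_layout and the parallelism fields
# mirror the names in the review comment, defaults match GB300_V1 as
# described there.
@dataclass(frozen=True)
class WorkloadConfig:
    pipeline_model_parallel_size: int = 2
    virtual_pipeline_model_parallel_size: int = 8
    pp_layout: Optional[str] = "Et*4|(t*4|)*14tmL"  # valid only for (pp=2, vpp=8)

GB300_V1 = WorkloadConfig()

# Clear pp_layout alongside the new (pp=4, vpp=4) split so the stale
# (pp=2, vpp=8) layout string is not inherited; the framework is then
# assumed to auto-compute a layout consistent with the new sizes.
GB300_BF16_V1 = replace(
    GB300_V1,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=4,
    pp_layout=None,
)
```

The alternative is to pass an explicit pp_layout string written for the (4, 4) configuration; either way, the invariant is that pp_layout must never be overridden independently of the pipeline sizes it encodes.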


@ko3n1g ko3n1g merged commit 941c0b2 into main Feb 9, 2026
53 of 54 checks passed
@ko3n1g ko3n1g deleted the malay/dsv3_gb300_revert_2602 branch February 9, 2026 19:49
ko3n1g pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Malay Nagda <malayn@nvidia.com>
ko3n1g added a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Co-authored-by: malay-nagda <malayn@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>
@ko3n1g ko3n1g mentioned this pull request Feb 24, 2026
5 tasks

Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch

3 participants