
perf: perf script change for qwen30b-a3b#1526

Merged
terrykong merged 3 commits into NVIDIA-NeMo:main from youngeunkwon0405:perf-script-change-qwen30b
Nov 20, 2025

Conversation

@youngeunkwon0405
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Nov 15, 2025

What does this PR do ?

On-policy

  • 889 → ~1170 tokens/sec/GPU

Async 1-off

  • 903 → ~1341 tokens/sec/GPU

This config change is not the only factor behind the throughput gains above; the numbers reflect the cumulative effect of this and several earlier changes.
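As a rough sanity check on the figures above, the relative speedups work out to about 32% for on-policy and 49% for async 1-off. A minimal sketch of the arithmetic (figures taken from this PR description):

```python
# Throughput figures from the PR description (tokens/sec/GPU).
before_on_policy, after_on_policy = 889, 1170
before_async, after_async = 903, 1341

def speedup_pct(before: float, after: float) -> float:
    """Relative throughput improvement, in percent."""
    return (after - before) / before * 100

print(f"on-policy:   {speedup_pct(before_on_policy, after_on_policy):.1f}%")
print(f"async 1-off: {speedup_pct(before_async, after_async):.1f}%")
```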

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 self-assigned this Nov 15, 2025
@youngeunkwon0405 youngeunkwon0405 requested a review from a team as a code owner November 15, 2025 01:11
@coderabbitai
Contributor

coderabbitai bot commented Nov 15, 2025

📝 Walkthrough

Parallelism configuration adjustments applied to two GRPO Qwen3-30B performance recipe files: tensor model parallelism reduced from 2 to 1, sequence parallelism disabled, and pipeline parallelism increased from 1 to 2 in the async variant with vLLM tensor parallelism halved from 4 to 2.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **GRPO Qwen3-30B performance recipes**<br>`examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml`, `examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml` | `tensor_model_parallel_size: 2 → 1`; `sequence_parallel: true → false`; `pipeline_model_parallel_size: 1 → 2` (async-1off only); `vllm_cfg tensor_parallel_size: 4 → 2` (async-1off only) |
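The summarized changes correspond roughly to a config fragment of the following shape. This is only a sketch: the key nesting and section names are assumptions based on typical recipe layouts and are not copied from the actual YAML files.

```yaml
# Sketch of the summarized changes (nesting assumed, not taken from the repo).
policy:
  megatron_cfg:
    tensor_model_parallel_size: 1    # was 2
    sequence_parallel: false         # was true
    pipeline_model_parallel_size: 2  # async-1off only; was 1
  generation:
    vllm_cfg:
      tensor_parallel_size: 2        # async-1off only; was 4
```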

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Review parallelism parameter consistency between files and ensure downstream references (e.g., make_sequence_length_divisible_by) remain valid with updated tensor_model_parallel_size
  • Verify vLLM tensor parallelism reduction aligns with overall model partitioning strategy in async configuration
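The first review bullet's divisibility concern can be illustrated with a quick sketch. The function and parameter names below are hypothetical, modeled on the `make_sequence_length_divisible_by` setting mentioned above; the actual constraint in the codebase may differ.

```python
def check_divisibility(seq_len: int, divisor: int, tp_size: int) -> bool:
    """Sketch: pad seq_len up to a multiple of `divisor`, then check the
    padded length still splits evenly across tensor-parallel ranks."""
    padded = ((seq_len + divisor - 1) // divisor) * divisor
    return padded % tp_size == 0

# With tensor_model_parallel_size reduced from 2 to 1, the constraint is
# trivially satisfied: every length is divisible by 1.
assert check_divisibility(seq_len=4097, divisor=8, tp_size=1)
```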

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR makes major model parallelism changes (reducing tensor_model_parallel_size from 2 to 1, disabling sequence_parallel) that significantly impact performance, but provides no performance metrics, testing results, or justification. | Add performance test results comparing old and new configurations, including throughput, memory utilization, and convergence metrics with hardware/batch size context. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'perf: perf script change for qwen30b-a3b' directly addresses the main changes: performance configuration updates to Qwen 30B model scripts. It is concise and specific to the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc917f and af8250e.

📒 Files selected for processing (2)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (2 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (1)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (1)

13-16: Incorrect characterization of configuration changes.

The review misidentifies the actual configuration modifications:

| Parameter | Base | Async | Review claimed |
|---|---|---|---|
| tensor_model_parallel_size | 1 | 1 | 2→1 ❌ |
| pipeline_model_parallel_size | 1 | 2 | 1→2 ✓ |
| sequence_parallel | false | false | true→false ❌ |
| vllm tensor_parallel_size | 4 | 2 | 4→2 ✓ |

Only 2 of 4 claimed changes are accurate. The actual modifications are:

  • pipeline_model_parallel_size: 1 → 2
  • vllm_cfg.tensor_parallel_size: 4 → 2
  • Added async_engine: true and gpu_memory_utilization: 0.8

Since the specific changes cited in the review do not match the actual code, the analysis is factually incorrect.

Likely an incorrect or invalid review comment.

@youngeunkwon0405
Contributor Author

Hi @guyueh1, I updated the perf recipe and also updated the numbers in the perf tracker reflecting this change.

@youngeunkwon0405
Contributor Author

Hi @terrykong, can we merge this recipe-update PR?

@terrykong terrykong merged commit 1c371a9 into NVIDIA-NeMo:main Nov 20, 2025
45 of 46 checks passed
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:docs Run doctest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants