
perf: perf script change for qwen30b-a3b#1526

Merged
terrykong merged 3 commits into NVIDIA-NeMo:main from youngeunkwon0405:perf-script-change-qwen30b
Nov 20, 2025

Conversation

@youngeunkwon0405
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Nov 15, 2025

What does this PR do ?

On-policy

  • 889 → ~1170 tokens/sec/GPU

Async 1-off

  • 903 → ~1341 tokens/sec/GPU

This config change is not the only factor behind the throughput gains above; the numbers reflect the cumulative effect of this and several earlier changes.
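As a rough sanity check on the figures above, the relative speedups work out to about 32% for on-policy and 49% for async 1-off. A minimal sketch of the arithmetic (figures taken from this PR description):

```python
# Throughput figures from the PR description (tokens/sec/GPU).
before_on_policy, after_on_policy = 889, 1170
before_async, after_async = 903, 1341

def speedup_pct(before: float, after: float) -> float:
    """Relative throughput improvement, in percent."""
    return (after - before) / before * 100

print(f"on-policy:   {speedup_pct(before_on_policy, after_on_policy):.1f}%")
print(f"async 1-off: {speedup_pct(before_async, after_async):.1f}%")
```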

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 self-assigned this Nov 15, 2025
@youngeunkwon0405 youngeunkwon0405 requested a review from a team as a code owner November 15, 2025 01:11
@coderabbitai
Contributor

coderabbitai bot commented Nov 15, 2025

📝 Walkthrough

Parallelism configuration adjustments applied to two GRPO Qwen3-30B performance recipe files: tensor model parallelism reduced from 2 to 1, sequence parallelism disabled, and pipeline parallelism increased from 1 to 2 in the async variant with vLLM tensor parallelism halved from 4 to 2.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **GRPO Qwen3-30B performance recipes**<br>`examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml`, `examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml` | `tensor_model_parallel_size: 2 → 1`; `sequence_parallel: true → false`; `pipeline_model_parallel_size: 1 → 2` (async-1off only); `vllm_cfg tensor_parallel_size: 4 → 2` (async-1off only) |
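The summarized changes correspond roughly to a config fragment of the following shape. This is only a sketch: the key nesting and section names are assumptions based on typical recipe layouts and are not copied from the actual YAML files.

```yaml
# Sketch of the summarized changes (nesting assumed, not taken from the repo).
policy:
  megatron_cfg:
    tensor_model_parallel_size: 1    # was 2
    sequence_parallel: false         # was true
    pipeline_model_parallel_size: 2  # async-1off only; was 1
  generation:
    vllm_cfg:
      tensor_parallel_size: 2        # async-1off only; was 4
```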

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Review parallelism parameter consistency between files and ensure downstream references (e.g., make_sequence_length_divisible_by) remain valid with updated tensor_model_parallel_size
  • Verify vLLM tensor parallelism reduction aligns with overall model partitioning strategy in async configuration
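The first review bullet's divisibility concern can be illustrated with a quick sketch. The function and parameter names below are hypothetical, modeled on the `make_sequence_length_divisible_by` setting mentioned above; the actual constraint in the codebase may differ.

```python
def check_divisibility(seq_len: int, divisor: int, tp_size: int) -> bool:
    """Sketch: pad seq_len up to a multiple of `divisor`, then check the
    padded length still splits evenly across tensor-parallel ranks."""
    padded = ((seq_len + divisor - 1) // divisor) * divisor
    return padded % tp_size == 0

# With tensor_model_parallel_size reduced from 2 to 1, the constraint is
# trivially satisfied: every length is divisible by 1.
assert check_divisibility(seq_len=4097, divisor=8, tp_size=1)
```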

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR makes major model parallelism changes (reducing tensor_model_parallel_size from 2 to 1, disabling sequence_parallel) that significantly impact performance, but provides no performance metrics, testing results, or justification. | Add performance test results comparing old and new configurations, including throughput, memory utilization, and convergence metrics with hardware/batch size context. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'perf: perf script change for qwen30b-a3b' directly addresses the main changes: performance configuration updates to Qwen 30B model scripts. It is concise and specific to the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc917f and af8250e.

📒 Files selected for processing (2)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (2 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (1)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (1)

13-16: Incorrect characterization of configuration changes.

The review misidentifies the actual configuration modifications:

| Parameter | Base | Async | Review claimed |
|---|---|---|---|
| tensor_model_parallel_size | 1 | 1 | 2→1 ❌ |
| pipeline_model_parallel_size | 1 | 2 | 1→2 ✓ |
| sequence_parallel | false | false | true→false ❌ |
| vllm tensor_parallel_size | 4 | 2 | 4→2 ✓ |

Only 2 of 4 claimed changes are accurate. The actual modifications are:

  • pipeline_model_parallel_size: 1 → 2
  • vllm_cfg.tensor_parallel_size: 4 → 2
  • Added async_engine: true and gpu_memory_utilization: 0.8

Since the specific changes cited in the review do not match the actual code, the analysis is factually incorrect.

Likely an incorrect or invalid review comment.

@youngeunkwon0405
Contributor Author

Hi @guyueh1, I updated the perf recipe and also updated the numbers in the perf tracker reflecting this change.

@youngeunkwon0405
Contributor Author

Hi @terrykong, can we merge this recipe-update PR?

@terrykong terrykong merged commit 1c371a9 into NVIDIA-NeMo:main Nov 20, 2025
45 of 46 checks passed
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:docs Run doctest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants