
perf: [Perf script] QWEN3 30B-A3B tensor_parallel_size from 4 to 2 #1558

Merged
terrykong merged 1 commit into main from youngeunkwon0405-patch-2
Nov 24, 2025
Conversation

@youngeunkwon0405 (Contributor) commented Nov 21, 2025

What does this PR do?

Reduces generation.vllm_cfg.tensor_parallel_size from 4 to 2 in the Qwen3 30B-A3B GRPO performance recipe.

Issues

None listed.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated example configuration parameters for GRPO performance optimization scenarios.


Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
coderabbitai bot (Contributor) commented Nov 21, 2025

📝 Walkthrough

A single configuration file in the GRPO Qwen model recipe is updated, reducing the tensor parallelism setting for the vLLM generation backend from 4 to 2. This is a parameter adjustment with no logic changes.

Changes

Configuration Adjustment — examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • Reduced generation.vllm_cfg.tensor_parallel_size from 4 to 2.
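The diff amounts to a single-key change in the recipe YAML. A sketch of the relevant fragment — only the `tensor_parallel_size` key and its new value are confirmed by this PR; the surrounding keys are assumptions about the recipe's layout:

```yaml
# examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
# (illustrative fragment; neighboring keys are assumed, not taken from the diff)
generation:
  vllm_cfg:
    tensor_parallel_size: 2  # was 4; each vLLM replica now spans 2 GPUs
```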

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

  • Minimal change scope: single parameter update in a configuration file
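Why halving tensor parallelism can help throughput: on the 4-node × 8-GPU cluster implied by the recipe name (`4n8g`), fewer GPUs per replica means more replicas generating in parallel. The helper below is a hypothetical illustration of that arithmetic, assuming all 32 GPUs are used for generation — NeMo-RL's actual GPU allocation may differ:

```python
# Hypothetical sketch: how tensor_parallel_size trades GPUs-per-replica
# against replica count on a fixed-size cluster. Not NeMo-RL code.

def num_replicas(total_gpus: int, tensor_parallel_size: int) -> int:
    """Number of vLLM generation replicas if every replica spans
    tensor_parallel_size GPUs and all GPUs serve generation."""
    assert total_gpus % tensor_parallel_size == 0, "TP size must divide GPU count"
    return total_gpus // tensor_parallel_size

print(num_replicas(32, 4))  # 8 replicas before this PR
print(num_replicas(32, 2))  # 16 replicas after: more model weight per GPU,
                            # but twice as many replicas decoding in parallel
```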

Possibly related PRs

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning: the PR modifies a performance configuration (tensor_parallel_size 4→2) but lacks the before-and-after metrics, test results, and workload context required for performance changes. Resolution: add concrete before-and-after performance numbers, hardware/workload configuration, and benchmark evidence demonstrating the optimization improves performance.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and specifically describes the main change: reducing tensor_parallel_size from 4 to 2 in the QWEN3 30B configuration file.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate; skipping docstring coverage check.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c371a9 and 44fba45.

📒 Files selected for processing (1)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
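The EP/DP rule recorded in the learning above — expert parallelism shards only the MoE expert layers while data parallelism/FSDP shards the non-expert layers, so the two sizes do not multiply the GPU requirement — can be sketched numerically. `fits_on_cluster` is a hypothetical helper for illustration, not part of the NeMo-RL codebase:

```python
# Hypothetical sketch of the MoE GPU-count rule: EP and DP each shard a
# different subset of layers over the SAME GPUs, so each size must merely
# divide the cluster size rather than multiply with the other.

def fits_on_cluster(num_gpus: int, expert_parallel_size: int,
                    data_parallel_size: int) -> bool:
    """Check whether an MoE DTensor layout is valid on num_gpus GPUs."""
    return (num_gpus % expert_parallel_size == 0
            and num_gpus % data_parallel_size == 0)

# EP=8 and DP=8 are both valid on an 8-GPU cluster for an MoE model,
# because they shard different layer types over the same GPUs.
print(fits_on_cluster(8, 8, 8))  # True
print(fits_on_cluster(8, 8, 3))  # False: DP size must divide the GPU count
```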
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@youngeunkwon0405 (Contributor, Author) commented:

Hi @terrykong, can I ask for your help to merge this PR?

@terrykong terrykong merged commit 5f6cfc7 into main Nov 24, 2025
54 of 56 checks passed
@terrykong terrykong deleted the youngeunkwon0405-patch-2 branch November 24, 2025 17:25
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…VIDIA-NeMo#1558)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
…1558)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
…1558)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
…1558)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

Labels

  • CI:docs — Run doctest
  • Performance — Related to improving performance

Projects

None yet


3 participants