perf: perf script change for qwen30b-a3b #1526
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
📝 Walkthrough

Parallelism configuration adjustments applied to two GRPO Qwen3-30B performance recipe files: tensor model parallelism reduced from 2 to 1 and sequence parallelism disabled in both; in the async variant, pipeline parallelism increased from 1 to 2 and vLLM tensor parallelism halved from 4 to 2.
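As a rough sketch, the base-recipe changes described above would correspond to YAML settings along these lines (the key paths are assumed from common Megatron-style recipe conventions, not copied from the actual files):

```yaml
# Hypothetical sketch of the base recipe adjustments; actual key paths
# in grpo-qwen3-30ba3b-4n8g.yaml may differ.
policy:
  megatron_cfg:
    tensor_model_parallel_size: 1   # reduced from 2
    sequence_parallel: false        # disabled (was true)
```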
Actionable comments posted: 1
📒 Files selected for processing (2)

- examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (2 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
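The learning above can be illustrated with a short config sketch (the `dtensor_cfg` key path and field names are assumed for illustration, based on the file referenced in the learning):

```yaml
# Hypothetical DTensor settings for an MoE model on one 8-GPU node.
# EP shards only the MoE expert layers while DP/FSDP shards the non-MoE
# layers, so the two sizes do not multiply the GPU requirement.
dtensor_cfg:
  expert_parallel_size: 8   # applies to MoE layers only
  data_parallel_size: 8     # applies to non-MoE layers
```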
Applied to files:

- examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
🔇 Additional comments (1)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (1)
13-16: Incorrect characterization of configuration changes.

The review misidentifies the actual configuration modifications:
| Parameter | Base | Async | Review claimed |
| --- | --- | --- | --- |
| tensor_model_parallel_size | 1 | 1 | 2 → 1 ❌ |
| pipeline_model_parallel_size | 1 | 2 | 1 → 2 ✓ |
| sequence_parallel | false | false | true → false ❌ |
| vllm tensor_parallel_size | 4 | 2 | 4 → 2 ✓ |

Only 2 of the 4 claimed changes are accurate. The actual modifications are:
- pipeline_model_parallel_size: 1 → 2
- vllm_cfg.tensor_parallel_size: 4 → 2
- Added async_engine: true and gpu_memory_utilization: 0.8

Since the specific changes cited in the review do not match the actual code, the analysis is factually incorrect.
Likely an incorrect or invalid review comment.
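For reference, the async-recipe modifications identified above could be sketched in YAML as follows (key paths are assumed for illustration and may not match the file's actual structure):

```yaml
# Hypothetical sketch of the async recipe changes in
# grpo-qwen3-30ba3b-4n8g-async-1off.yaml; actual key paths may differ.
policy:
  megatron_cfg:
    pipeline_model_parallel_size: 2   # increased from 1
  generation:
    vllm_cfg:
      async_engine: true              # newly added
      tensor_parallel_size: 2         # halved from 4
      gpu_memory_utilization: 0.8     # newly added
```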
Hi @guyueh1, I updated the perf recipe and also updated the numbers in the perf tracker reflecting this change.

Hi @terrykong, can we merge this recipe update PR?
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do?

- On-policy
- Async 1-off

This config change is not the only factor behind the perf number change above; the change reflects the cumulative effect of many other modifications made so far.
Issues
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information