test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite.#1623
test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite.#1623
Conversation
📝 WalkthroughWalkthroughIntroduces a new YAML configuration file for GRPO performance testing with Qwen3-30B-A3B model, along with a corresponding shell script test that executes the performance experiment and registers it in the test manifest. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml(1 hunks)tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh(1 hunks)tests/test_suites/performance.txt(1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yamltests/test_suites/performance.txttests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
Applied to files:
tests/test_suites/performance.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/performance.txttests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Applied to files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
tests/test_suites/performance.txt (1)
7-7: Manifest entry correctly registered.The new test script is properly added to the GRPO performance test suite manifest with correct path and placement.
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml (1)
1-44: Configuration structure and naming follow established patterns.The YAML configuration correctly implements a GRPO experiment for Qwen3-30B-A3B with:
- Proper naming convention (algo-model-nodes-gpus-modifier)
- Appropriate Megatron parallelism setup for MoE model (EMP=8)
- Sequence length (131072) matching the 128K modifier suffix
- Complete sections for policy, generation, logging, and cluster allocation
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh (1)
1-39: Test script structure and patterns align with repository conventions.The script correctly follows established test patterns:
- Standard script initialization with common.env sourcing
- Configuration variables (NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS, NUM_MINUTES) follow the repo's test infrastructure interface (per learnings)
- Uses
uv runfor training entrypoint invocation and auxiliary scripts- Script name matches YAML base name with .sh extension
- Consistent with repo patterns:
cd $PROJECT_ROOTwithout error handling and unquoted$@usage (per learnings)The shellcheck warnings (SC2034, SC2164, SC2068) are expected false positives based on documented repo conventions.
|
@sfawzy-nv let's change the context length to 40k (40960) and do local testing, if it passes we will merge the benchmark with 40k context |
dbe1d97 to
51be999
Compare
51be999 to
8fcd208
Compare
f794072 to
51be999
Compare
Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
51be999 to
a430745
Compare
|
@terrykong can we merge this which adds a perf test for long context? We've tested it locally. |
|
@terrykong this can be merged? |
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
NVIDIA-NeMo#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
NVIDIA-NeMo#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
NVIDIA-NeMo#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
What does this PR do ?
Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.