test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite. #1623

Merged
terrykong merged 3 commits into main from sfawzy_nemorl
Jan 21, 2026

Conversation

@sfawzy-nv sfawzy-nv commented Dec 11, 2025

What does this PR do?

Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added GRPO experiment configuration for performance testing with Qwen3-30B model, featuring Megatron-like parallelism and comprehensive logging.
    • Introduced new performance test suite with TensorBoard metrics conversion and automated loss validation.

@sfawzy-nv sfawzy-nv requested review from a team as code owners December 11, 2025 00:05
@sfawzy-nv sfawzy-nv requested review from guyueh1 and removed request for a team December 11, 2025 00:05

coderabbitai bot commented Dec 11, 2025

📝 Walkthrough

Walkthrough

Introduces a new YAML configuration file for GRPO performance testing with Qwen3-30B-A3B model, along with a corresponding shell script test that executes the performance experiment and registers it in the test manifest.

Changes

Cohort / File(s) | Summary
GRPO Performance Configuration: examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml | New YAML configuration file defining GRPO experiment parameters including Megatron-like parallelism settings (tensor/model parallelism), vLLM generation config, logging (WandB/TensorBoard), cluster GPU allocation, and model-specific training parameters.
GRPO Performance Test: tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh | New shell script for performance testing that defines experiment parameters (num nodes, steps, runs), executes the GRPO experiment via uv run, converts TensorBoard logs to JSON, and conditionally runs metrics checks.
Test Manifest: tests/test_suites/performance.txt | Single-line addition registering the new performance test script path.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • New configuration file: verify schema correctness and parameter alignment with existing GRPO configs
  • New test script: confirm proper environment variable usage, step calculation logic, and conditional metrics evaluation
  • No complex logic or structural modifications; primarily configuration and test orchestration

Possibly related PRs

Suggested labels

Performance, Run CICD

Suggested reviewers

  • guyueh1
  • terrykong
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
Check name | Status | Explanation | Resolution
Test Results For Major Changes | ⚠️ Warning | PR adds new GRPO performance test configuration for Qwen3-30B-A3B-128K without documenting test results, baselines, or convergence validation. | Add test results demonstrating successful execution, baseline performance metrics, convergence validation, and comparison with existing configuration to confirm no regressions.
Title check | ⚠️ Warning | The PR title mentions '40k' but the actual changes reference '128K', creating a discrepancy with the file contents. | Update the PR title to 'test: Add grpo-qwen3-30ba3b-4n8g-128K config to performance test suite.' to accurately reflect the actual configuration being added.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bc5eba and dbe1d97.

📒 Files selected for processing (3)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh (1 hunks)
  • tests/test_suites/performance.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
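The naming rule above (driver script shares the YAML base name, with a .sh extension) can be checked mechanically. A minimal sketch, assuming a hypothetical helper named driver_name that is not part of the repository:

```shell
#!/bin/sh
# Derive the expected driver script name from a recipe YAML path.
# driver_name is a hypothetical helper used only for illustration.
driver_name() {
  # basename strips the directory part and the trailing .yaml suffix
  echo "$(basename "$1" .yaml).sh"
}

driver_name examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
```

For the recipe path reviewed here, the derived name is the same file name that appears under tests/test_suites/llm/performance/ in this review.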
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml
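As a rough illustration of the naming pattern in this learning, the sketch below tests a file name against a deliberately loose regular expression. The regex and the helper name matches_recipe_name are assumptions made for this example; they only approximate the required `<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>` prefix and are not the repository's own validation.

```shell
#!/bin/sh
# Loose check of the recipe naming pattern:
#   <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>...yaml
# The regex is an approximation written for this example only.
matches_recipe_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]+-.+-[0-9]+n[0-9]+g-.+\.yaml$'
}

if matches_recipe_name "grpo-qwen3-30ba3b-4n8g-128K.yaml"; then
  echo "name follows the pattern"
else
  echo "name does not follow the pattern"
fi
```

A name without the `<nodes>n<gpus>g` segment, such as grpo-qwen3-30ba3b.yaml, would be rejected by this check.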
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply

Applied to files:

  • tests/test_suites/performance.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
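The NUM_RUNS expression quoted in this learning is integer ceiling division: adding STEPS_PER_RUN - 1 before dividing rounds a partial final run up to a whole run. A minimal sketch with made-up step counts (the values and the helper name ceil_div are illustrative and are not taken from this PR's script):

```shell
#!/bin/sh
# Ceiling division, as in NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )).
# ceil_div and the values below are illustrative assumptions.
ceil_div() {
  # $1 = MAX_STEPS, $2 = STEPS_PER_RUN
  echo $(( ($1 + $2 - 1) / $2 ))
}

MAX_STEPS=10
STEPS_PER_RUN=4
NUM_RUNS=$(ceil_div "$MAX_STEPS" "$STEPS_PER_RUN")
echo "NUM_RUNS=$NUM_RUNS"   # 10 steps at 4 per run need 3 runs
```

When MAX_STEPS is an exact multiple of STEPS_PER_RUN, the added term has no effect: 8 steps at 4 per run still gives 2 runs.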
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)
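For context on the SC2068 finding, the sketch below shows what the warning guards against: an unquoted `$@` is split again on whitespace, so an argument containing a space arrives as two arguments. The function names are hypothetical and the snippet is independent of the test script under review.

```shell
#!/bin/sh
# Demonstrates the re-splitting that SC2068 warns about.
# All function names here are illustrative.
count_args() { echo "$#"; }

forward_unquoted() { count_args $@; }    # unquoted: arguments are word-split again
forward_quoted()   { count_args "$@"; }  # quoted: each argument is passed through intact

forward_unquoted "a b" c   # prints 3
forward_quoted "a b" c     # prints 2
```

Whether the re-splitting matters depends on the arguments a script actually receives; arguments without whitespace or glob characters behave the same either way.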

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
tests/test_suites/performance.txt (1)

7-7: Manifest entry correctly registered.

The new test script is properly added to the GRPO performance test suite manifest with correct path and placement.
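A registration like this can be verified with an exact-line search of the manifest. The sketch below is an assumption about how one might check it, using a temporary stand-in file rather than the real tests/test_suites/performance.txt; the helper name is_registered and the manifest contents are hypothetical.

```shell
#!/bin/sh
# Check whether a script path appears as a whole line in a manifest file.
# is_registered and the sample manifest contents are illustrative only.
is_registered() {
  # $1 = manifest file, $2 = script path
  # -F: fixed string, -x: match the whole line, -q: quiet (exit status only)
  grep -Fxq -- "$2" "$1"
}

manifest=$(mktemp)
printf '%s\n' '# GRPO' \
  'tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh' > "$manifest"

if is_registered "$manifest" 'tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh'; then
  echo "registered"
else
  echo "missing"
fi
rm -f "$manifest"
```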

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.yaml (1)

1-44: Configuration structure and naming follow established patterns.

The YAML configuration correctly implements a GRPO experiment for Qwen3-30B-A3B with:

  • Proper naming convention (algo-model-nodes-gpus-modifier)
  • Appropriate Megatron parallelism setup for MoE model (EMP=8)
  • Sequence length (131072) matching the 128K modifier suffix
  • Complete sections for policy, generation, logging, and cluster allocation
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-128K.sh (1)

1-39: Test script structure and patterns align with repository conventions.

The script correctly follows established test patterns:

  • Standard script initialization with common.env sourcing
  • Configuration variables (NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS, NUM_MINUTES) follow the repo's test infrastructure interface (per learnings)
  • Uses uv run for training entrypoint invocation and auxiliary scripts
  • Script name matches YAML base name with .sh extension
  • Consistent with repo patterns: cd $PROJECT_ROOT without error handling and unquoted $@ usage (per learnings)

The shellcheck warnings (SC2034, SC2164, SC2068) are expected false positives based on documented repo conventions.

guyueh1 commented Jan 5, 2026

@sfawzy-nv let's change the context length to 40k (40960) and do local testing. If it passes, we will merge the benchmark with 40k context.
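The two context lengths discussed in this thread are consistent with "k" meaning 1024 tokens: 40k corresponds to 40960 and the 128K suffix to 131072. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/bin/sh
# Convert a "k" context-length suffix to a token count, assuming 1k = 1024 tokens.
# k_to_tokens is an illustrative helper, not part of the repository.
k_to_tokens() {
  echo $(( $1 * 1024 ))
}

k_to_tokens 40    # prints 40960
k_to_tokens 128   # prints 131072
```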

@sfawzy-nv sfawzy-nv changed the title from "Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." to "[feat] Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." Jan 14, 2026
@sfawzy-nv sfawzy-nv changed the title from "[feat] Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." to "[test] Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." Jan 14, 2026
@sfawzy-nv sfawzy-nv changed the title from "[test] Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." to "test: Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." Jan 14, 2026
@sfawzy-nv sfawzy-nv force-pushed the sfawzy_nemorl branch 3 times, most recently from f794072 to 51be999 January 14, 2026 15:28
Sherif Hosam Fouad Fawzy and others added 3 commits January 14, 2026 07:29
Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
@sfawzy-nv sfawzy-nv changed the title from "test: Add grpo-qwen3-30ba3b-4n8g-128k config to performance test suite." to "test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite." Jan 14, 2026
@guyueh1 guyueh1 added the Performance (Related to improving performance) and CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) labels Jan 15, 2026
guyueh1 commented Jan 20, 2026

@terrykong can we merge this which adds a perf test for long context? We've tested it locally.

guyueh1 commented Jan 21, 2026

@terrykong this can be merged?

@terrykong terrykong merged commit f0abdf6 into main Jan 21, 2026
56 of 58 checks passed
@terrykong terrykong deleted the sfawzy_nemorl branch January 21, 2026 04:45
yfw pushed a commit that referenced this pull request Feb 9, 2026
#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
xavier-owkin pushed a commit to owkin/Owkin-NeMo-RL that referenced this pull request Feb 10, 2026
NVIDIA-NeMo#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
NVIDIA-NeMo#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
NVIDIA-NeMo#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
Labels

  • CI:L2: Run doctests, unit tests, functional tests, and convergence tests
  • Performance: Related to improving performance

3 participants