
test: Perf recipe for v0.5 (#1667)

Merged
terrykong merged 10 commits into main from guyueh/perf_recipe_for_v0.5
Dec 20, 2025

Conversation


@guyueh1 guyueh1 commented Dec 19, 2025

What does this PR do ?

Add new performance tests for v0.5

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
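As a hedged sketch of usage (the PR description leaves the snippet as a placeholder): the PR adds driver scripts under tests/test_suites/llm/performance/, which in CI are launched by the test harness. A local invocation would look roughly like the following; the script name is one of the files added in this PR, but the exact launch tooling is not shown here.

```shell
# Hypothetical local invocation of one of the new performance drivers.
# In CI these scripts are consumed by external launch tooling, so this
# is an illustration, not the official entry point.
driver="tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh"
cmd="bash $driver"
echo "$cmd"
```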

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added performance test configurations for multiple LLM models (DeepSeek v3, LLaMA 3.1, Qwen3).
    • Introduced FP8 quantization support for select model configurations.
    • Added new performance test scripts for automated benchmarking.
  • Chores

    • Updated test suite inventory with new performance test entries.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners December 19, 2025 21:31
@guyueh1 guyueh1 mentioned this pull request Dec 19, 2025
@guyueh1 guyueh1 added the r0.5.0 label Dec 19, 2025

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

This PR adds new GRPO performance recipe configurations for multiple model architectures (DeepSeek v3, LLaMA 3.1, Qwen3) with various cluster sizes and FP8 quantization variants, along with corresponding test scripts. It also removes deprecated configuration options from an existing recipe and updates the test inventory.

Changes

  • Legacy Configuration Updates (examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml): removes the deprecated policy.sequence_packing (algorithm: modified_ffd) and generation.vllm_cfg.expert_parallel_size: 4 entries.
  • DeepSeek v3 Performance Recipes (examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml, grpo-deepseek-v3-64n4g-async-1off.yaml, grpo-deepseek-v3-64n8g-fp8-async-1off.yaml): adds new DeepSeek v3 configurations with pipeline/expert parallelism, vLLM tensor parallelism (16-32), and FP8 quantization settings. Variants cover 32/64 nodes with 4/8 GPUs per node.
  • LLaMA 3.1 Performance Recipes (examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml, grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml): adds LLaMA 3.1 8B configurations with Megatron pipeline parallelism settings, plus an FP8 variant with blockwise quantization and a vLLM FP8 generation config.
  • Qwen3 Performance Recipes (examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml, grpo-qwen3-235b-32n4g-async-1off.yaml): adds Qwen3-235b configurations with 4-way pipeline parallelism, 23 layers in the first/last stages, and vLLM tensor parallelism. Covers 16/32 nodes with 4 GPUs per node.
  • DeepSeek v3 Test Scripts (tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh, grpo-deepseek-v3-64n4g-async-1off.sh, grpo-deepseek-v3-64n8g-fp8-async-1off.sh): adds test harnesses for DeepSeek v3 GRPO runs with environment setup, model loading, TensorBoard conversion, and conditional metrics evaluation.
  • LLaMA 3.1 Test Scripts (tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh): adds a test harness for the LLaMA 3.1 FP8 async performance run with TensorBoard and metrics collection.
  • Qwen3 Test Scripts (tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh, grpo-qwen3-235b-32n4g-async-1off.sh): adds test harnesses for Qwen3-235b GRPO performance runs with logging, W&B integration, and conditional metrics checks.
  • Test Inventory (tests/test_suites/performance.txt): updates the test suite inventory; removes one legacy entry and adds new GRPO performance tests for DeepSeek v3, LLaMA 3.1, and Qwen3 variants, organized into H100 BF16 and GB200 BF16 sections with SYNC/ASYNC configurations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

  • Configuration consistency: Verify that all new YAML configs follow the same structure and parameter patterns across similar variants
  • Test script consistency: Check that shell scripts consistently implement environment setup, experiment execution, log conversion, and metrics evaluation patterns
  • DeepSeek v3 FP8 configs: Ensure FP8 settings (fp8_type: e4m3, blockwise recipe, NVTE_FP8_BLOCK_SCALING_FP32_SCALES) are correctly applied
  • Parallelism parameters: Verify pipeline/expert parallelism and layer distribution settings are appropriate for each model/node configuration
  • Test inventory alignment: Confirm that all new test scripts are properly registered in performance.txt

Possibly related PRs

Suggested labels

CI:L2, Run CICD

Suggested reviewers

  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR removes options (sequence_packing, expert_parallel_size) from an existing configuration file, but the PR description lacks documentation of the change, regression testing, or a performance impact analysis. Resolution: update the PR description to document why the configuration options were removed and include regression testing results or performance comparisons demonstrating no negative impact.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title "test: Perf recipe for v0.5" is directly related to the main changes, which add new performance test configurations and scripts for v0.5, though it could be more specific about which models/configurations are covered.
  • Docstring Coverage: ✅ Passed. No functions were found in the changed files, so the docstring coverage check was skipped.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1)

17-19: Consider aligning logger directory naming with checkpoint directory.

The checkpoint directory uses grpo-deepseek-v3-64n4g-async-1off but the logger directory and W&B run name use grpo-deepseek-v3-64n4g-async-32T32G-1off. If "32T32G" refers to trajectory/generation configuration rather than cluster topology, consider documenting this naming convention to avoid confusion.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91658c8 and fedf770.

📒 Files selected for processing (15)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh (1 hunks)
  • tests/test_suites/performance.txt (1 hunks)
💤 Files with no reviewable changes (1)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • tests/test_suites/performance.txt
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
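The driver-script convention above can be sketched in shell: the script path is derived from its recipe YAML's base name. The paths below are illustrative examples taken from files in this PR.

```shell
# Derive a driver script path from a recipe YAML, per the convention
# that driver scripts match the YAML base name with a .sh extension.
yaml="examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml"
base="$(basename "$yaml" .yaml)"   # strip the directory and .yaml suffix
script="tests/test_suites/llm/performance/${base}.sh"
echo "$script"
# tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
```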
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
🧠 Learnings (9)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
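The NUM_RUNS ceiling-division formula from this learning can be sketched as follows; the STEPS_PER_RUN and MAX_STEPS values are illustrative, not taken from any script in this PR.

```shell
# Standard test-script variable block with the ceiling-division formula
# described in the learning above.
STEPS_PER_RUN=40
MAX_STEPS=100
# Ceiling division: smallest number of runs covering MAX_STEPS total steps.
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"   # 3: two full 40-step runs plus one 20-step run
```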
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply

Applied to files:

  • tests/test_suites/performance.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain

Applied to files:

  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh

[warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 29-29: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh

[warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 29-29: Double quote array expansions to avoid re-splitting elements.

(SC2068)
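The SC2068 errors above flag unquoted `$@` expansions, which the repository intentionally keeps for consistency (per the learnings). As a minimal sketch of the behavior shellcheck warns about:

```shell
# SC2068 in miniature: unquoted $@ re-splits arguments on whitespace,
# while "$@" preserves each argument as one word. This only illustrates
# the warning; the repo's test scripts deliberately use unquoted $@.
count_after_resplit() { set -- $@; echo $#; }    # unquoted: re-splits
count_preserved()     { set -- "$@"; echo $#; }  # quoted: preserved

count_after_resplit "a b" c   # prints 3: "a b" is split into two words
count_preserved     "a b" c   # prints 2: arguments kept intact
```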

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (13)
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

12-13: LGTM!

The pipeline parallelism configuration is appropriate for an 8B model on a 2-node, 8-GPU-per-node cluster.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1)

1-20: LGTM!

The configuration is internally consistent with appropriate parallelism settings for a 235B model on a 16-node, 4-GPU-per-node cluster.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)

1-26: LGTM!

The FP8 quantization configuration follows best practices, including appropriate layer exclusions for MoE architectures.

tests/test_suites/performance.txt (1)

5-38: Well-organized test inventory structure.

The categorization by hardware platform (H100, GB200) and precision (BF16, FP8) with sync/async subsections makes the test suite structure clear and maintainable.

Note: The AI summary indicated that grpo-qwen3-30ba3b-24n8g-async-8off.sh was removed, but it appears at Line 22 under the "ASYNC many-off" section, suggesting it was reorganized rather than removed.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)

1-20: LGTM!

The configuration is internally consistent and appropriately scales the 16n4g configuration to 32 nodes.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1)

1-20: LGTM!

The FP8 quantization configuration is appropriate for the LLaMA 3.1 8B model with consistent naming throughout.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

1-21: LGTM!

The configuration is internally consistent with appropriate parallelism settings for DeepSeek v3 on a 32-node, 4-GPU-per-node cluster.

tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh (1)

1-40: LGTM! Script follows established test infrastructure patterns.

This test script correctly implements the standard performance test pattern for GRPO, including:

  • Proper use of uv run for Python invocations
  • Standard configuration variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) consumed by external launch tooling
  • Consistent patterns for directory navigation and argument forwarding
  • TensorBoard log conversion and conditional metrics evaluation

Based on learnings, the shellcheck warnings about unused variables and unquoted expansions are expected and can be safely ignored.

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh (1)

1-45: LGTM! DeepSeek v3 FP8 performance test correctly configured.

This script properly extends the standard test pattern with DeepSeek-specific configuration:

  • Allows custom HF checkpoint via NRL_DEEPSEEK_V3_HF_CKPT environment variable
  • Disables NVLS to prevent OOM issues
  • Correctly passes model name to both policy and tokenizer configuration
  • Enables TensorBoard logging for comprehensive observability
  • Uses uv run throughout as required

Based on learnings, the shellcheck warnings are expected for this test infrastructure pattern.
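The checkpoint-override mechanism described above can be sketched with the usual shell default-value expansion; the variable name comes from the review, while the default path is illustrative rather than taken from the actual script.

```shell
# Honor NRL_DEEPSEEK_V3_HF_CKPT when set, else fall back to a default.
# The default below is a placeholder, not the script's real value.
DEFAULT_CKPT="deepseek-ai/DeepSeek-V3"
MODEL_CKPT="${NRL_DEEPSEEK_V3_HF_CKPT:-$DEFAULT_CKPT}"
echo "$MODEL_CKPT"
```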

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh (1)

1-39: LGTM! LLaMA 3.1 FP8 performance test follows standard pattern.

This script correctly implements the performance test pattern for the smaller 2-node LLaMA configuration:

  • Standard configuration variables for test infrastructure
  • Proper uv run usage throughout
  • TensorBoard and WandB integration enabled
  • Conditional metrics evaluation based on step completion

The simpler configuration without custom model name override is appropriate for this test scenario. Based on learnings, shellcheck warnings are expected.

tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh (1)

1-45: LGTM! DeepSeek v3 64n4g async configuration properly implemented.

This script correctly implements the 64-node, 4 GPUs-per-node configuration:

  • Consistent with other DeepSeek v3 variants in this PR
  • Allows flexible checkpoint override via environment variable
  • Disables NVLS to prevent OOM issues
  • Proper TensorBoard and WandB integration
  • Uses uv run throughout as required

Based on learnings, the shellcheck warnings about unused variables and unquoted expansions are expected for this test infrastructure.

tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh (1)

1-40: LGTM! Qwen3-235b 16n4g performance test follows standard pattern.

This script correctly implements the standard performance test pattern:

  • Proper configuration for 16-node setup
  • Uses uv run throughout as required
  • Disables NVLS to prevent OOM
  • Includes TensorBoard log conversion and conditional metrics evaluation

Note: Unlike some other scripts in this PR (e.g., grpo-deepseek-v3-64n4g-async-1off.sh line 31), this script doesn't explicitly set logger.tensorboard_enabled=True. If TensorBoard logging is required for the log conversion at line 33 to work, verify that it's enabled by default in the configuration or common.env.

Based on learnings, the shellcheck warnings are expected.

tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh (1)

1-45: LGTM! DeepSeek v3 32n4g performance test correctly implemented.

This script properly implements the 32-node configuration:

  • Consistent with other DeepSeek v3 variants (64n4g, 64n8g-fp8)
  • Flexible checkpoint configuration via NRL_DEEPSEEK_V3_HF_CKPT
  • Disables NVLS to prevent OOM
  • Explicit TensorBoard and WandB integration
  • Proper uv run usage throughout

Based on learnings, the shellcheck warnings are expected as these variables are consumed by external launch tooling.

@guyueh1 guyueh1 added the CI:L0 Run doctests and unit tests label Dec 19, 2025
Guyue Huang and others added 3 commits December 19, 2025 15:41
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 20, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 added CI:L0 Run doctests and unit tests CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 20, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 20, 2025
@terrykong terrykong enabled auto-merge (squash) December 20, 2025 05:57
@terrykong terrykong merged commit fab6234 into main Dec 20, 2025
40 of 41 checks passed
@terrykong terrykong deleted the guyueh/perf_recipe_for_v0.5 branch December 20, 2025 07:48
chtruong814 pushed a commit that referenced this pull request Dec 20, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn added a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
seonjinn added a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
seonjinn added a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests r0.5.0

3 participants