
cp: test: Perf recipe for v0.5 (1667) into r0.5.0 #1671

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1667-r0.5.0
Dec 22, 2025
Conversation


@chtruong814 chtruong814 commented Dec 20, 2025

beep boop [🤖]: Hi @guyueh1 👋,

we've cherry-picked #1667 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added performance recipe configurations for multiple large language models (DeepSeek v3, Llama 3.1, Qwen3) across various hardware setups and optimization modes.
    • Introduced asynchronous training configurations and FP8 precision optimization recipes.
    • Expanded test coverage with new performance test suites for H100 and GB200 hardware platforms.
  • Chores

    • Reorganized test suite structure to categorize performance tests by hardware type.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Seonjin <sna@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai bot commented Dec 20, 2025

📝 Walkthrough

Walkthrough

This pull request adds new GRPO performance recipe configurations for multiple LLM models (DeepSeek v3, Llama 3.1, Qwen3), corresponding test execution scripts, and test suite definitions. It also removes sequence packing and expert parallelism configuration from an existing recipe. All changes are configuration and test infrastructure additions with no code logic modifications.

Changes

Cohort / File(s) — Summary

• Config Removal
  examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
  Removed the sequence_packing block and expert_parallel_size from the vllm generation configuration.

• DeepSeek v3 GRPO Recipes
  examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml, grpo-deepseek-v3-64n4g-async-1off.yaml, grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  New YAML configs for DeepSeek v3 GRPO runs with varying node counts (32/64), GPUs per node (4/8), and FP8 optimization settings. Includes megatron parallelism, vllm generation, logging, and cluster resource specifications.

• Llama 3.1 GRPO Recipes
  examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml, grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml, grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml, grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  New YAML configs for Llama 3.1 8B GRPO runs with 2 nodes, varying GPU counts (4/8 per node), async GRPO modes, and FP8 variants. Includes importance sampling correction, colocated generation, and async engine configuration.

• Qwen3 GRPO Recipes
  examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml, grpo-qwen3-235b-32n4g-async-1off.yaml, grpo-qwen3-30ba3b-4n4g.yaml, grpo-qwen3-30ba3b-8n4g-async-1off.yaml, grpo-qwen3-32b-4n4g.yaml, grpo-qwen3-32b-8n4g-async-1off.yaml
  New YAML configs for Qwen3 variants (235B, 30B-A3B, 32B) covering various node/GPU counts (4–32 nodes, 4 GPUs per node), async GRPO, importance sampling, and expert/tensor parallelism tuning.

• DeepSeek v3 Test Scripts
  tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh, grpo-deepseek-v3-64n4g-async-1off.sh, grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  New test execution scripts for DeepSeek v3 experiments. They execute GRPO runs via uv/Python, convert TensorBoard logs to JSON, and conditionally run metrics validation based on step thresholds.

• Llama 3.1 Test Scripts
  tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh, grpo-llama3.1-8b-instruct-2n4g-async-1off.sh, grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  New test execution scripts for Llama 3.1 experiments with standard and async configurations, including TensorBoard log conversion and conditional metrics validation.

• Qwen3 Test Scripts
  tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh, grpo-qwen3-235b-32n4g-async-1off.sh, grpo-qwen3-30ba3b-4n4g.sh, grpo-qwen3-30ba3b-8n4g-async-1off.sh, grpo-qwen3-32b-4n4g.sh, grpo-qwen3-32b-8n4g-async-1off.sh
  New test execution scripts for Qwen3 experiments, following the same pattern of configuration setup, experiment execution, log conversion, and conditional metrics validation.

• Test Suite Definitions
  tests/test_suites/performance_h100.txt, tests/test_suites/performance_gb200.txt
  New test suite definition files organizing GRPO performance test scripts by hardware (H100/GB200) and execution mode (SYNC, ASYNC 1-off, ASYNC many-off) with FP8 variants.

• Test Suite Unit Tests
  tests/unit/test_recipes_and_test_suites.py
  Updated to reference the new H100 and GB200 performance test suite files and adjusted test fixture logic to read from multiple suite definitions instead of a single performance.txt file.
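The "convert logs, then conditionally validate" tail shared by the new driver scripts can be sketched roughly as follows; all variable names and thresholds below are illustrative, not taken from the repository:

```shell
# Hedged sketch of the driver-script tail described above: after the GRPO run,
# TensorBoard logs are converted to JSON, and metrics validation only runs once
# the experiment has reached a step threshold.
MAX_STEPS=50
LAST_STEP=50   # in the real scripts this would be read from the converted JSON logs

if [ "$LAST_STEP" -ge "$MAX_STEPS" ]; then
  echo "validate"   # run reached the threshold: run metrics validation
else
  echo "skip"       # run did not progress far enough: skip validation
fi
```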

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas requiring extra attention:
    • Verify consistency of resource allocation (GPU counts, node counts, tensor/pipeline parallelism settings) across similar recipe variants
    • Confirm all new test scripts follow the established patterns for environment setup, experiment execution, log conversion, and metrics validation
    • Validate YAML syntax and hierarchical structure across all new configuration files
    • Ensure test suite definition files (performance_h100.txt, performance_gb200.txt) correctly reference all corresponding shell scripts and maintain proper organization

Possibly related PRs

Suggested labels

r0.5.0, CI:L2

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
• Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
• Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
• Title Check — ✅ Passed: The title clearly indicates a cherry-pick of PR #1667 into the r0.5.0 branch, accurately summarizing the main action.
• Test Results For Major Changes — ✅ Passed: The PR contains minor configuration recipes and test infrastructure additions (15+ YAML files, 14+ shell scripts) for existing GRPO functionality without modifying core training algorithms or introducing breaking changes.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d488064 and 9bd3d4e.

📒 Files selected for processing (29)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n4g-async-1off.yaml (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh (1 hunks)
  • tests/test_suites/performance_gb200.txt (1 hunks)
  • tests/test_suites/performance_h100.txt (1 hunks)
  • tests/unit/test_recipes_and_test_suites.py (3 hunks)
💤 Files with no reviewable changes (1)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (6)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
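The base-name rule above is mechanical: the driver script name is the recipe YAML's base name with a .sh extension. A quick sketch of the derivation, using one recipe path from this PR's file list:

```shell
# Derive the expected driver-script path from a recipe YAML path.
yaml="examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml"
base=$(basename "$yaml" .yaml)          # strip directories and the .yaml suffix
echo "tests/test_suites/llm/performance/${base}.sh"
# → tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
```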
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n4g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/performance_h100.txt
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n4g.yaml
  • tests/test_suites/performance_gb200.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/unit/test_recipes_and_test_suites.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
  • tests/unit/test_recipes_and_test_suites.py
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/test_recipes_and_test_suites.py
🧠 Learnings (11)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
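The NUM_RUNS formula in the learning above is a ceiling division of MAX_STEPS by STEPS_PER_RUN; the values below are made up for illustration:

```shell
# Ceiling division: how many sequential runs are needed to cover MAX_STEPS.
MAX_STEPS=50
STEPS_PER_RUN=20
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"   # → 3 (runs of 20 + 20 + 10 steps)
```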
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
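The <nodes>n<gpus>g token in the naming pattern above can be extracted mechanically; here is a small sketch using one of this PR's recipe names:

```shell
# Pull the <nodes>n<gpus>g topology token out of a recipe base name.
name="grpo-llama3.1-8b-instruct-2n4g-async-1off"
topo=$(printf '%s\n' "$name" | grep -oE '[0-9]+n[0-9]+g')
echo "$topo"   # → 2n4g (2 nodes, 4 GPUs per node)
```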
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/unit/test_recipes_and_test_suites.py
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)
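For reference, the recurring SC2164 and SC2068 findings would be resolved by the sketch below, though the repository's own learnings (recorded earlier in this review) prefer keeping the established test-script style over piecemeal hardening; paths and arguments here are made up:

```shell
#!/bin/bash
PROJECT_ROOT="/tmp"

# SC2164: guard the cd so the script stops if the directory is missing.
cd "$PROJECT_ROOT" || exit 1

# SC2068: quote the array expansion so elements containing spaces stay intact.
extra_args=("--config" "path with spaces.yaml")
printf '%s\n' "${extra_args[@]}"   # prints each element on its own line
```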

tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh

[warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 29-29: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh

[warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 29-29: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh

[warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 34-34: Double quote array expansions to avoid re-splitting elements.

(SC2068)
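For reference, the three recurring ShellCheck findings above (SC2034, SC2164, SC2068) all have standard remediations — a hedged sketch, with variable and function names that are illustrative only, not taken from the repository's scripts:

```shell
#!/usr/bin/env bash
# Hedged sketch of standard fixes for the recurring ShellCheck findings above.

# SC2034 (variable appears unused): export values consumed by external launch
# tooling, or add a `# shellcheck disable=SC2034` directive where the warning
# is a known false positive.
export NUM_RUNS=1 NUM_MINUTES=90

# SC2164 (cd may fail): guard the directory change so the script aborts
# instead of continuing in the wrong directory.
cd "${PROJECT_ROOT:-.}" || exit 1

# SC2068 (unquoted array expansion): quote "$@" so arguments containing
# spaces are not re-split into multiple words.
run_experiment() { printf '%s\n' "$@"; }
run_experiment "arg with spaces" "second arg"   # prints one line per argument
```

Note that, per the review comments below, the repository deliberately keeps some of these patterns for consistency across test scripts, so these fixes are remediations in the general case rather than required changes here.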

🔇 Additional comments (39)
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh (5)

1-3: LGTM!

The shebang and setup follow the standard pattern for test driver scripts.


5-11: LGTM!

The configuration section follows the standard test infrastructure pattern. The variables (NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS, NUM_MINUTES) are part of the standard interface consumed by external launch tooling or common.env.

Note: Shellcheck warnings about unused variables are false positives in this context.

Based on learnings, these variables are part of the test infrastructure's standard interface.


13-13: LGTM!

The early exit check follows the standard pattern and prevents redundant test execution when max steps are already reached.


15-29: LGTM!

The experiment execution correctly uses uv run and follows the established pattern for test scripts. The configuration parameters are appropriately passed via command-line overrides with WandB logging, GPU monitoring, TensorBoard, and checkpointing enabled.

Note: The cd $PROJECT_ROOT (line 16) without quotes and $@ (line 28) without quotes are intentional patterns maintained for consistency across test scripts.

Based on learnings, test scripts follow a consistent pattern with unquoted cd $PROJECT_ROOT and $@.


31-39: LGTM!

The post-processing steps correctly use uv run for TensorBoard log conversion and metrics evaluation. The conditional check with jq ensures metrics are only evaluated when the target training step is reached, which is an appropriate safeguard.
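The jq-guarded evaluation described here can be sketched as follows (the file path and `.step` key are assumptions for illustration, not the repository's actual metrics schema):

```shell
#!/usr/bin/env bash
# Hedged sketch of a step-count guard before metrics evaluation.
# /tmp/metrics.json and the .step key are illustrative stand-ins for the
# converted TensorBoard logs.
echo '{"step": 12}' > /tmp/metrics.json

target_step=10
latest_step=$(jq -r '.step' /tmp/metrics.json)

# Only evaluate metrics once training has reached the target step.
if [ "$latest_step" -ge "$target_step" ]; then
  echo "target step reached: evaluating metrics"
else
  echo "skipping metrics: step $latest_step < $target_step"
fi
```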

examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (2)

2-21: LGTM! Configuration is internally consistent.

The configuration values are well-structured and internally consistent:

  • Naming across checkpoint_dir, log_dir, and wandb.name matches the filename pattern (32n4g-async-1off)
  • Cluster topology (32 nodes × 4 GPUs = 128 GPUs) aligns with the filename
  • Parallelism settings (pipeline: 4, tensor: 8) are appropriate for a 235B model
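The topology claim above can be checked with quick arithmetic — a sketch using the numbers from this review (the actual launcher may derive data parallelism differently):

```shell
#!/usr/bin/env bash
# Hedged sanity check that the parallelism layout fits the cluster.
# Figures come from the review above: 32 nodes x 4 GPUs, TP=8, PP=4.
nodes=32; gpus_per_node=4
tp=8; pp=4

total_gpus=$((nodes * gpus_per_node))   # 128, matching the "32n4g" filename
model_parallel=$((tp * pp))             # one model replica spans 32 GPUs
dp=$((total_gpus / model_parallel))     # remaining factor is data parallelism

echo "data-parallel replicas: $dp"
```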

1-1: The defaults file grpo-qwen3-235b-32n8g-async-1off.yaml exists in the repository and is correctly referenced.

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh (2)

6-11: LGTM! Standard test configuration pattern.

The configuration variables follow the established test infrastructure pattern. Static analysis warnings about unused variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) are false positives—these are consumed by external launch tooling.

Based on learnings, this is the expected interface for test scripts in this directory.


16-29: LGTM! Follows coding guidelines and established patterns.

The experiment execution correctly uses uv run as required by the coding guidelines. The patterns for cd $PROJECT_ROOT (without quotes) and $@ (unquoted) are consistent with established conventions in this repository's test scripts.

Based on learnings and coding guidelines.

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.yaml (1)

1-33: LGTM! Configuration is well-structured and consistent.

The async GRPO configuration properly extends the base 4n4g config for 8-node deployment. Parallelism settings (expert_model_parallel_size: 16 with TP=1, PP=1) align well with the 32-GPU cluster topology. The async parameters and generation settings are appropriate for this performance recipe.

tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh (1)

1-39: LGTM! Test script follows project conventions correctly.

The script properly implements the standard test pattern with configuration variables consumed by external tooling. Uses uv run for all Python invocations and follows established conventions for argument passing and directory navigation. Static analysis warnings are expected false positives based on project conventions.

Based on learnings, test scripts in this repository follow these conventions that override typical shell best practices.

examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n4g-async-1off.yaml (1)

1-33: LGTM! Well-configured async GRPO setup.

The configuration appropriately extends the base config for 8-node async deployment. Parallelism settings (TP=2, PP=1, SP=true) and async GRPO parameters are well-suited for the 32-GPU cluster. The separate generation configuration with vllm async engine is properly structured.

examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n4g.yaml (1)

1-42: LGTM! Clean base configuration for Qwen3-32B.

This base recipe is well-structured with appropriate parallelism settings (TP=2, PP=1, SP=true) for the 16-GPU cluster. Learning rate, warmup parameters, and logging configuration are reasonable for performance testing. Properly serves as the foundation for derived async variants.

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n4g.yaml (1)

1-45: LGTM! Appropriate MoE configuration for Qwen3-30B-A3B.

The configuration correctly uses expert parallelism (EP=16) as the primary parallelism strategy for this Mixture-of-Experts model, distributing experts across all 16 GPUs. The CUDA allocator configuration and higher warmup iterations (50) are reasonable choices for MoE model characteristics. Overall structure is consistent with other GRPO performance recipes.

tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh (1)

1-40: LGTM!

The script follows the established test suite patterns correctly: uses uv run for execution, defines standard configuration variables (NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS, NUM_MINUTES), and matches the consistent style for test scripts in this repository.

tests/test_suites/performance_h100.txt (1)

5-28: LGTM!

The test suite organization clearly categorizes performance tests by hardware configuration (H100 BF16/FP8) and synchronization strategy (SYNC, ASYNC 1-off, ASYNC many-off), improving discoverability and maintainability.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1)

1-20: LGTM!

The FP8 configuration is properly structured with blockwise quantization recipe (e4m3 format), appropriate environment variable for FP32 scaling, and consistent vLLM precision settings. The async 1-off GRPO setup aligns with the corresponding base configuration.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

1-21: LGTM!

The configuration appropriately adapts the DeepSeek v3 model for 32 nodes with 4 GPUs per node, with sequence packing disabled and parallelism settings (pipeline=8, expert=16, tensor=16 for vLLM) properly aligned for the target topology.

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh (1)

1-39: LGTM!

The script correctly follows the repository's test suite conventions with proper use of uv run, standard configuration variables, and metrics validation. The FP8 async 1-off configuration aligns with the corresponding YAML recipe.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g.yaml (1)

1-58: LGTM!

The GRPO configuration for Llama-3.1-8B-Instruct is comprehensive and correctly structured. The max_new_tokens setting matches the full context window (4096), which is intentional for consistency across GRPO configurations and handled by safeguards in the vLLM worker implementation.

tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh (1)

1-40: LGTM!

The script follows standard test suite patterns and appropriately disables NVLS (line 5) to prevent OOM issues on the Qwen3-235B model at this scale. The configuration correctly reflects 32 nodes with 4 GPUs per node for the async 1-off GRPO setup.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

12-13: LGTM!

Setting pipeline_model_parallel_size: 1 is appropriate for this async 1-off configuration with 2 nodes and 8 GPUs per node, ensuring the model runs without pipeline parallelism.

tests/test_suites/performance_gb200.txt (1)

1-19: LGTM! Well-organized GB200 performance test suite.

The test suite is clearly structured with separate SYNC and ASYNC 1-off sections, making it easy to understand the test organization. The paths are consistent with the test scripts added in this PR.

tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh (1)

1-45: LGTM! Script follows established test patterns.

The test script correctly implements the standard configuration pattern, uses uv run for Python invocations per guidelines, and includes proper NCCL tuning for the DeepSeek v3 model. The configurable checkpoint path via environment variable is a good practice.

Based on learnings, the shellcheck warnings for NUM_NODES, NUM_RUNS, NUM_MINUTES, cd without error handling, and unquoted $@ are false positives—these follow the repository's consistent test script patterns where variables are consumed by external launch tooling.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1)

1-18: LGTM! Configuration follows recipe conventions.

The YAML recipe correctly follows the naming pattern <algo>-<model>-<nodes>n<gpus>g and appropriately adjusts parallelism settings for the 4-GPU-per-node configuration. The checkpoint and log paths are consistent with the recipe name.

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.sh (1)

1-39: LGTM! Async test script follows repository patterns.

The script correctly implements the standard test configuration pattern for asynchronous GRPO runs, uses uv run per guidelines, and includes proper conditional metrics validation.

Based on learnings, the shellcheck warnings are false positives consistent with the repository's test script conventions.

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh (1)

1-40: LGTM! Script follows standard test patterns.

The test script correctly implements the standard configuration and execution pattern, uses uv run per guidelines, and includes appropriate metrics validation.

Based on learnings, the shellcheck warnings are false positives aligned with the repository's consistent test script patterns.

tests/unit/test_recipes_and_test_suites.py (3)

31-34: LGTM! Clear test suite path definitions.

The new performance test suite paths for H100 and GB200 follow the same pattern as existing test suite definitions, maintaining consistency.


76-88: LGTM! Proper aggregation of performance test suites.

The fixture correctly reads from both H100 and GB200 performance test suite files and aggregates them into a single list, maintaining the existing line parsing logic.


110-124: LGTM! Test parameterization properly extended.

The test parameterization now includes both H100 and GB200 performance test suites with appropriate test IDs, ensuring all new test suites are validated.

tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh (1)

1-40: LGTM! Large model test script properly configured.

The script correctly implements the standard test pattern with appropriate NCCL tuning for the 235B parameter model. Uses uv run per guidelines and includes proper conditional metrics validation.

Based on learnings, the shellcheck warnings are false positives consistent with the repository's test script conventions.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n4g-async-1off.yaml (1)

1-32: LGTM! Async GRPO configuration is well-structured.

The recipe correctly configures asynchronous GRPO with 1-off behavior (max_trajectory_age_steps: 1) and enables the necessary features like importance sampling correction and async vLLM engine. The parallelism settings are appropriate for the 8B model size.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (2)

2-25: Configuration structure looks good.

The recipe correctly inherits from the 8-GPU configuration and overrides cluster settings, megatron parallelism, and generation parameters appropriately for the 4-GPU-per-node setup.


1-1: Parent configuration file exists and is correctly referenced.

The defaults reference ./grpo-deepseek-v3-64n8g-async-1off.yaml exists in the same directory, ensuring configuration loading will succeed.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (3)

6-22: FP8 configuration looks well-structured.

The FP8 settings are appropriately configured with e4m3 format, blockwise recipe, and proper environment variables. The exclusion of a_proj and b_proj from quantization via quantization_ignored_layer_kws is a sensible choice for MoE architectures like DeepSeek v3.


2-28: Consistent naming throughout the configuration.

Unlike similar configurations, this file maintains consistent naming (64n8g-fp8-async-1off) across checkpoint_dir, log_dir, and wandb.name, which will aid in debugging and log correlation.


1-1: Parent configuration file exists and is valid.

The referenced parent file ./grpo-deepseek-v3-64n8g-async-1off.yaml is present in the same directory and configuration loading will succeed.

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh (1)

1-45: Test script follows established patterns correctly.

The script properly uses uv run for executing Python scripts (lines 21, 38, 42), follows the standard test configuration pattern with variables consumed by external launch tooling, and correctly implements the test workflow: experiment execution → log conversion → conditional metrics checking.

The Shellcheck warnings can be safely ignored as they conflict with established repository patterns documented in learnings.

Based on learnings, test scripts maintain consistency by:

  • Using standard config variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) for external tooling
  • Using cd $PROJECT_ROOT without error handling
  • Passing arguments with $@ unquoted
tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh (1)

1-45: Test script implementation is correct.

This script follows the same well-established patterns as the sibling test scripts, correctly using uv run and maintaining consistency with the repository's test infrastructure conventions. The script name properly matches the corresponding YAML configuration file.

Based on learnings, the script correctly follows repository patterns for test scripts.

Comment on lines +20 to +22
log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
wandb:
name: grpo-deepseek-v3-64n4g-async-32T32G-1off

⚠️ Potential issue | 🟡 Minor

Inconsistent naming: log paths include "32T32G" suffix not present elsewhere.

The log_dir and wandb.name use grpo-deepseek-v3-64n4g-async-32T32G-1off, but the checkpoint directory (line 3) and filename use grpo-deepseek-v3-64n4g-async-1off. The "32T32G" suffix appears only in logging paths, which may cause confusion when correlating checkpoints with logs.

🔎 Suggested fix for consistent naming
 logger:
-  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
+  log_dir: logs/grpo-deepseek-v3-64n4g-async-1off
   wandb:
-    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
+    name: grpo-deepseek-v3-64n4g-async-1off

@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Dec 21, 2025
@terrykong terrykong enabled auto-merge (squash) December 21, 2025 18:51
@terrykong terrykong merged commit 8c00f6b into r0.5.0 Dec 22, 2025
68 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1667-r0.5.0 branch December 22, 2025 00:40