cp: feat: add dapo recipe and test (1617) into r0.5.0 #1687

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1617-r0.5.0 on Dec 22, 2025

Conversation


@chtruong814 chtruong814 commented Dec 22, 2025

beep boop [🤖]: Hi @ZhiyuLi-Nvidia 👋,

we've cherry-picked #1617 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Tests

    • Added performance testing configuration and test suite for DeepSeek v3 671B model training, including environment setup, metric validation, and tensorboard log conversion.
  • Chores

    • Updated performance test manifest with new DeepSeek v3 test script.


Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@terrykong terrykong added the CI:L0 Run doctests and unit tests label Dec 22, 2025
@terrykong terrykong enabled auto-merge (squash) December 22, 2025 18:08

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

A new DAPO DeepSeek v3 performance configuration and test script are introduced for a 64-node, 8-GPU setup. The YAML defines training parameters, model settings, loss functions, and cluster topology. A corresponding Bash test script launches the experiment and validates training metrics. The test suite manifest is updated to register the new test.

Changes

  • Configuration — examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml
    New YAML configuration file specifying DAPO defaults, loss function parameters, policy/Megatron model settings, checkpointing, data limits, environment worker configuration, logging (GPU monitoring, WandB, MLflow), and cluster topology for the 64-node, 8-GPU DeepSeek v3 671B setup.
  • Test Script — tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
    New Bash test script that configures environment variables, launches GRPO training with DeepSeek v3, converts tensorboard logs to JSON metrics, and conditionally validates mean token error metrics.
  • Test Suite Manifest — tests/test_suites/performance_h100.txt
    Added the new test script path to the GRPO H100 BF16 SYNC performance test section.
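The launch-then-validate pattern described for the test script can be sketched roughly as follows. This is a hedged illustration, not the actual script: the metrics file name, the metric key `train/mean_token_error`, and the 0.1 threshold are all made-up stand-ins, and the conversion step is faked with a literal JSON file.

```shell
#!/bin/bash
# Hypothetical sketch of the validation half of such a driver script.
# File name, metric key, and threshold below are illustrative assumptions.
METRICS_FILE=metrics.json

# Stand-in for the tensorboard-to-JSON conversion step
cat > "$METRICS_FILE" <<'EOF'
{"train/mean_token_error": 0.02}
EOF

# Conditional validation: only check when the metrics file was produced
if [ -f "$METRICS_FILE" ]; then
  python3 - "$METRICS_FILE" <<'PY'
import json
import sys

with open(sys.argv[1]) as f:
    metrics = json.load(f)
err = metrics["train/mean_token_error"]
assert err < 0.1, f"mean token error too high: {err}"
print("validation ok")
PY
fi
```

The real script presumably uses the repository's own converter and launches training with `uv run` first; only the shape of the conditional check is shown here.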

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • YAML configuration validation: Verify parameter correctness, Megatron/tensor parallelism settings, batch sizing, and sequence length configurations
  • Bash script logic: Review environment variable setup, configuration options passed to GRPO runner, metric validation thresholds, and conditional test execution
  • Test suite integration: Confirm proper placement and formatting in performance test manifest

Possibly related PRs

Suggested labels

r0.5.0

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Test Results For Major Changes — ❓ Inconclusive. Explanation: Unable to assess PR changes and test documentation without access to specific PR details, repository context, or pull request information. Resolution: Provide PR number, repository name, or direct access to PR description and changed files for assessment.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately describes the main change: adding a DAPO recipe configuration and corresponding test file for DeepSeek v3.
  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate docstring coverage; the check was skipped.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31a4a72 and 29f981d.

📒 Files selected for processing (3)
  • examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml
  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
  • tests/test_suites/performance_h100.txt
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml
  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
  • tests/test_suites/performance_h100.txt
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
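The ceiling-division formula quoted in this learning can be checked with a small sketch; the step counts here are made-up example values, not the ones used by the DeepSeek v3 recipe.

```shell
# Illustrative values only; real driver scripts set these per recipe
STEPS_PER_RUN=10
MAX_STEPS=25

# Ceiling division: 25 steps at 10 steps per run requires 3 sequential runs.
# Adding (STEPS_PER_RUN - 1) before integer division rounds up.
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "NUM_RUNS=$NUM_RUNS"   # prints NUM_RUNS=3
```

As the learning notes, `NUM_RUNS` often looks unused inside the script itself because external launch tooling or common.env consumes it, which is why ShellCheck's SC2034 warning is expected here.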
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh

[warning] 14-14: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 15-15: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 21-21: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 36-36: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
tests/test_suites/performance_h100.txt (1)

13-13: LGTM!

The new DAPO test script is correctly added to the SYNC section, consistent with the YAML configuration where async_engine: false is set.

examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml (1)

1-91: LGTM!

The YAML configuration follows the naming convention <algo>-<model>-<nodes>n<gpus>g.yaml and is placed correctly under examples/configs/recipes/llm/performance/. The parallelism configuration (TP=8, EP=32, PP=8, CP=4) is appropriate for the 64-node × 8-GPU (512 GPU) cluster topology, and the VLLM tensor parallel size of 64 aligns with the scale. Based on learnings, the naming pattern is correct.
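The reviewer's cluster-size claim can be sanity-checked with simple arithmetic. This sketch covers only the dense parallelism product (TP × PP × CP) against the 512-GPU world size; how expert parallelism (EP=32) folds into the remaining data-parallel dimension depends on Megatron's MoE layout and is not modeled here.

```shell
# Sanity-check sketch of the dense parallelism math from the review comment
TP=8; PP=8; CP=4
NODES=64; GPUS_PER_NODE=8

WORLD=$(( NODES * GPUS_PER_NODE ))   # 64 x 8 = 512 GPUs total
DENSE=$(( TP * PP * CP ))            # 256-way dense model parallelism
DP=$(( WORLD / DENSE ))              # remainder available for data parallelism

echo "world=$WORLD dense=$DENSE dp=$DP"
```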

tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh (1)

1-47: LGTM!

The test script follows the repository's established patterns:

  • Uses uv run to execute Python scripts as required
  • Script name matches the YAML base name with .sh extension (per learnings)
  • Standard configuration variables (NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS, NUM_MINUTES) follow the test infrastructure interface
  • The cd $PROJECT_ROOT and unquoted $@ patterns are consistent with other test scripts in this repository

Regarding the static analysis hints:

  • SC2034 (NUM_RUNS/NUM_MINUTES unused): These are consumed by external launch tooling per learnings
  • SC2164 (cd error handling): Intentionally omitted for consistency with existing scripts per learnings
  • SC2068 ($@ quoting): Intentionally unquoted per repository convention


@terrykong terrykong merged commit b1a1e73 into r0.5.0 Dec 22, 2025
68 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1617-r0.5.0 branch December 22, 2025 19:51

Labels

cherry-pick · CI:L0 Run doctests and unit tests · Run CICD


3 participants