cp: fix: fix several nightly tests that were flaky (1724) into r0.5.0#1735
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
📝 Walkthrough

This pull request updates test suite configurations and metrics thresholds across multiple LLM training test scripts. Changes include adjusting GPU memory thresholds in SFT tests, increasing NUM_MINUTES timeout values in configuration scripts, enabling TensorBoard logging in performance tests, updating DPO training loss thresholds, and raising the nightly GPU-hours budget limit from 1140 to 1180 hours.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)
10-10: LGTM: Timeout increase with clarifying comment.

The timeout increase from 15 to 30 minutes appropriately addresses test flakiness. The comment helpfully explains that the additional time buffers for the initial checkpoint download. Based on learnings, NUM_MINUTES is consumed by external launch tooling.
Optional: Clarify the comment wording
The comment mentions both "3 minutes" and "30min" which could be slightly clearer. Consider:
-NUM_MINUTES=30 # Usually 15 minutes is enough for 20 steps, but we add a buffer of 3 minutes in metrics check (30min to buffer for initial ckpt download)
+NUM_MINUTES=30 # 15 minutes typically suffices for 20 steps; extended to 30 minutes to buffer for initial checkpoint download
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/unit/test_recipes_and_test_suites.py
💤 Files with no reviewable changes (1)
- tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
🧰 Additional context used
📓 Path-based instructions (7)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/unit/test_recipes_and_test_suites.py
- examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/unit/test_recipes_and_test_suites.py
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
tests/unit/test_recipes_and_test_suites.py
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Recipe YAML files should follow the naming pattern: --ng-[-modifiers][-long][.vN].yaml for LLM recipes
Files:
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
🧠 Learnings (2)
📓 Common learnings
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
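The NUM_RUNS formula quoted in this learning is a standard integer ceiling division; a minimal sketch of the header pattern with illustrative values (not taken from any real script):

```shell
#!/bin/sh
# Illustrative values only; real test scripts set these per suite.
STEPS_PER_RUN=20
MAX_STEPS=70

# Ceiling division: adding (divisor - 1) before dividing rounds up,
# so 70 steps at 20 steps/run yields 4 runs rather than 3.
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
NUM_MINUTES=30   # consumed by external launch tooling, not this script
echo "$NUM_RUNS"
```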
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
tracker(265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
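Since these variables are read by external launch tooling rather than by the script itself, the SC2034 warnings can be addressed without behavior changes. A sketch of the two usual options (values illustrative):

```shell
#!/bin/sh
# Option 1: export the variable so it is visible to child processes of the
# launch tooling; ShellCheck counts an export as a use.
export NUM_MINUTES=30

# Option 2: keep a plain assignment and mark it intentional for ShellCheck.
# shellcheck disable=SC2034
NUM_NODES=2

echo "$NUM_MINUTES"
```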
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh (1)
37-37: LGTM: Reasonable threshold adjustment to fix test flakiness.

The threshold increase from 3.60 to 3.65 for the initial training loss at step 1 is a minimal adjustment (a 1.4% increase) that appropriately addresses test flakiness while maintaining test effectiveness. This change is consistent with the PR objective of fixing flaky nightly tests.
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh (1)
37-37: LGTM: Consistent threshold adjustment across test configurations.

The threshold increase from 3.60 to 3.65 matches the identical change in the megatron.v2 test script, demonstrating a systematic and consistent approach to fixing flakiness across both the megatron and fsdp2tp4 configurations.
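A loss-ceiling check of this kind needs floating-point comparison, which POSIX shell arithmetic lacks; a hypothetical sketch using awk (3.65 is the new ceiling from this PR, while the observed value of 3.62 is illustrative only):

```shell
#!/bin/sh
# Hypothetical step-1 loss gate; not the repository's actual metrics check.
OBSERVED_LOSS=3.62
MAX_LOSS=3.65

# awk exits 0 when the condition holds, so the if branches on the result.
if awk -v o="$OBSERVED_LOSS" -v m="$MAX_LOSS" 'BEGIN { exit !(o <= m) }'; then
  RESULT="loss ok"
else
  RESULT="loss too high"
fi
echo "$RESULT"
```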
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (2)
37-37: Good practice: Documenting observed memory behavior.

The comment provides helpful context for the threshold adjustment, noting both the typical memory usage and its variability.
41-41: LGTM: Threshold adjustment addresses test flakiness.

The increase from 70 GB to 75 GB is appropriate given the observed peak of ~72.6 GB (noted at line 37), leaving roughly 2.4 GB of headroom for measurement variability while still catching genuine memory regressions. This change aligns with the PR objective of fixing flaky nightly tests.
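The headroom reasoning above can be sketched in shell; again, floating-point arithmetic is delegated to awk, and the peak value is the illustrative ~72.6 GB figure from the review, not a live measurement:

```shell
#!/bin/sh
# Hypothetical headroom computation: observed peak vs. the new 75 GB limit.
PEAK_GB=72.6
LIMIT_GB=75

# awk performs the floating-point subtraction and formats to one decimal.
HEADROOM=$(awk -v p="$PEAK_GB" -v l="$LIMIT_GB" 'BEGIN { printf "%.1f", l - p }')
echo "headroom: ${HEADROOM} GB"
```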
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml (1)
28-32: This configuration change is architecturally required for DTensorPolicyWorker v1 compatibility.

Setting model_save_format to null is not a workaround but a mandatory constraint enforced by the codebase. When model_save_format is null, the system automatically defaults to the "safetensors" format. This is consistent with other DAPO/GRPO recipes using the _v2: false configuration.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)
10-10: LGTM: Timeout increase to mitigate flaky tests.

Doubling the timeout from 15 to 30 minutes is a reasonable adjustment for addressing test flakiness. Based on learnings, NUM_MINUTES is consumed by external launch tooling and is correctly defined here.
tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh (1)
10-10: LGTM: Consistent timeout increase.

The timeout increase from 15 to 30 minutes aligns with similar adjustments across other test suites in this PR to reduce flakiness. Based on learnings, NUM_MINUTES is correctly configured for consumption by external launch tooling.
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (2)
12-12: LGTM: Conservative timeout increase for large-scale test.

The 15% timeout increase (100 → 115 minutes) provides additional headroom for this 16-node performance test while remaining conservative. Based on learnings, NUM_MINUTES is consumed by external launch tooling.
27-27: LGTM: Explicit TensorBoard logging enabled.

Explicitly enabling TensorBoard logging ensures consistent metrics collection, which the script already depends on (line 34 converts TB logs to JSON). This change improves test reliability.
tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)
27-27: LGTM: Consistent TensorBoard logging enablement.

Explicitly enabling TensorBoard logging aligns with the change in the 16-node configuration (grpo-qwen3-235b-16n8g.sh) and ensures reliable metrics collection for this 32-node async test.
tests/unit/test_recipes_and_test_suites.py (1)
183-218: Threshold increase is consistent and properly implemented across all references.

All three locations correctly reflect the new 1180 GPU-hour limit: the function name, the assertion condition, and the error message. A search confirms no remaining references to the old 1140 threshold, indicating a clean migration. This ~3.5% budget increase is a reasonable accommodation for execution variance when addressing flaky tests, and the tracker.track() call enables continued monitoring of actual usage.
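The budget gate being discussed lives in tests/unit/test_recipes_and_test_suites.py; a hypothetical shell sketch of the same accounting idea, with made-up node/minute pairs rather than the repository's actual test matrix:

```shell
#!/bin/sh
# Hypothetical nightly GPU-hours budget gate; entries are "nodes minutes"
# pairs, with 8 GPUs assumed per node. Values are illustrative only.
BUDGET_GPU_HOURS=1180
TOTAL=0
for entry in "2 30" "4 45" "16 115"; do
  set -- $entry          # split "nodes minutes" on whitespace (intentional)
  nodes=$1 minutes=$2
  # Integer division is fine for a coarse budget check.
  TOTAL=$(( TOTAL + nodes * 8 * minutes / 60 ))
done
if [ "$TOTAL" -le "$BUDGET_GPU_HOURS" ]; then
  echo "total=${TOTAL} GPU-hours: under budget"
else
  echo "total=${TOTAL} GPU-hours: over budget"
fi
```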
Summary by CodeRabbit

Tests
- Adjusted GPU memory and DPO loss thresholds, increased NUM_MINUTES timeouts, and enabled TensorBoard logging in performance tests to reduce nightly flakiness.

Chores
- Raised the nightly GPU-hours budget limit from 1140 to 1180 hours.