FIX integration tests by escaping aalcr and adding judge args by gwarmstrong · Pull Request #1062 · NVIDIA-NeMo/Skills

gwarmstrong · 2025-12-01T21:19:18Z

Summary by CodeRabbit

Tests
- Improved detection of judge-only datasets during evaluation processes
- Expanded dataset filtering to exclude datasets with known tokenization issues
Chores
- Added judge pipeline argument configurations to dataset modules


Signed-off-by: George Armstrong <georgea@nvidia.com>


coderabbitai · 2025-12-02T22:01:11Z

Walkthrough

Four files are modified. Three judge-specific dataset modules gain a new constant JUDGE_PIPELINE_ARGS = {}, and the test evaluation logic now detects judge datasets with hasattr() instead of a getattr() truthiness check. The list of excluded datasets is also expanded.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Judge Pipeline Constants: `nemo_skills/dataset/open-proof-corpus-judge/__init__.py`, `nemo_skills/dataset/proof-arena-judge/__init__.py`, `nemo_skills/dataset/proof-bench-judge/__init__.py` | Added new public constant `JUDGE_PIPELINE_ARGS = {}` to each module, establishing a judge-only pipeline argument placeholder across multiple judge-specific dataset modules. |
| Test Evaluation Logic Update: `tests/gpu-tests/test_eval.py` | Changed judge-dataset detection from `getattr(..., None)` to `hasattr(...)` to properly recognize datasets with `JUDGE_PIPELINE_ARGS` even when the value is an empty dict. Expanded excluded datasets list to include `bfcl_v4` and `aalcr` with a note about tokenization issues. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Areas requiring attention:
- Verify that the hasattr() change in test logic correctly identifies all judge datasets across the codebase
- Confirm that excluding bfcl_v4 and aalcr datasets aligns with project requirements

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing integration tests by excluding problematic datasets (aalcr) and adding judge pipeline arguments to multiple dataset modules. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tests/gpu-tests/test_eval.py (1)

225-244: Expanded excluded_datasets set looks reasonable

Excluding bfcl_v4 and aalcr from the “prepare and eval all datasets” sweep is consistent with the existing pattern of carving out heavy or problematic datasets and should help keep this integration test stable. You might optionally add a short comment for bfcl_v4 similar to aalcr for future maintainability.
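As a rough illustration of the exclusion pattern described above, the sketch below filters a list of dataset names against an exclusion set. Only `bfcl_v4` and `aalcr` come from this PR; the helper `select_datasets` and the placeholder dataset names are hypothetical, and the actual dataset discovery in `tests/gpu-tests/test_eval.py` may look different.

```python
# Hypothetical sketch: only "bfcl_v4" and "aalcr" are taken from the PR.
excluded_datasets = {
    "bfcl_v4",
    "aalcr",  # excluded because of tokenization issues (per the PR summary)
}


def select_datasets(all_datasets, excluded):
    """Return the dataset names to sweep, in order, skipping excluded ones."""
    return [name for name in all_datasets if name not in excluded]


# Placeholder names standing in for whatever the real sweep discovers.
candidates = ["dataset-a", "aalcr", "dataset-b", "bfcl_v4"]
selected = select_datasets(candidates, excluded_datasets)
print(selected)
```

Keeping the exclusions in one set with a short comment per entry makes it easier to see later why a dataset was carved out of the sweep.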

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29674a4 and 79ae6cd.

📒 Files selected for processing (4)

nemo_skills/dataset/open-proof-corpus-judge/__init__.py (1 hunks)
nemo_skills/dataset/proof-arena-judge/__init__.py (1 hunks)
nemo_skills/dataset/proof-bench-judge/__init__.py (1 hunks)
tests/gpu-tests/test_eval.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

🔇 Additional comments (4)

nemo_skills/dataset/open-proof-corpus-judge/__init__.py (1)

19-20: Judge-only marker constant is appropriate

Defining JUDGE_PIPELINE_ARGS = {} with a clear comment is consistent with the new hasattr-based detection and doesn’t alter existing behavior; it cleanly marks this dataset as judge-only.

nemo_skills/dataset/proof-bench-judge/__init__.py (1)

19-20: Consistent judge-only marker for proof-bench-judge

Adding JUDGE_PIPELINE_ARGS = {} with the judge-only comment aligns this module with the other judge datasets and works as intended with the updated detection logic.

tests/gpu-tests/test_eval.py (1)

257-259: hasattr-based judge dataset detection is correct

Switching from a truthiness check to hasattr(dataset_module, "JUDGE_PIPELINE_ARGS") correctly treats an empty dict marker as “judge-enabled” and matches the new constants added in the judge datasets, without changing behavior for modules that don’t define the attribute.
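The difference between the two checks can be sketched with stand-in modules. This is a minimal illustration, not the code from `test_eval.py`: the module names and the two helper functions are hypothetical, and only the attribute name `JUDGE_PIPELINE_ARGS` comes from the PR.

```python
from types import ModuleType

# Stand-in for a judge-only dataset module that sets an empty marker dict.
judge_module = ModuleType("fake_judge_dataset")
judge_module.JUDGE_PIPELINE_ARGS = {}

# Stand-in for a regular dataset module that does not define the attribute.
plain_module = ModuleType("fake_plain_dataset")


def is_judge_by_truthiness(module):
    """Old-style check: an empty dict is falsy, so the marker is missed."""
    return bool(getattr(module, "JUDGE_PIPELINE_ARGS", None))


def is_judge_by_hasattr(module):
    """New-style check: only the presence of the attribute matters."""
    return hasattr(module, "JUDGE_PIPELINE_ARGS")


print(is_judge_by_truthiness(judge_module))  # False: {} is falsy
print(is_judge_by_hasattr(judge_module))     # True: attribute exists
print(is_judge_by_hasattr(plain_module))     # False: attribute absent
```

Both checks agree for modules that lack the attribute, so the switch only changes the outcome for modules whose marker is empty.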

nemo_skills/dataset/proof-arena-judge/__init__.py (1)

19-20: Judge-only marker added consistently

JUDGE_PIPELINE_ARGS = {} plus the clarifying comment brings this dataset in line with the other judge-only modules and integrates cleanly with the updated test detection logic.

gwarmstrong · 2025-12-02T22:07:37Z

@coderabbitai title

coderabbitai · 2025-12-02T22:07:58Z


@gwarmstrong, based on the changes in this PR, here are some more descriptive title suggestions:

Recommended:

Add JUDGE_PIPELINE_ARGS to judge-only datasets and fix detection logic

Alternatives:

Fix judge dataset detection by adding JUDGE_PIPELINE_ARGS constant

Add JUDGE_PIPELINE_ARGS to judge datasets and update hasattr check

The current title "Fix integration tests" is quite generic. The PR specifically adds the JUDGE_PIPELINE_ARGS = {} constant to three judge-only datasets and updates the detection logic in test_eval.py from getattr(..., None) to hasattr(...) to properly recognize datasets with this constant (even when it's an empty dict).

Would you like me to use one of these titles, or would you prefer a different phrasing?


WIP try fixing integration tests

47fa174

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 1, 2025

gwarmstrong commented Dec 2, 2025

tests/gpu-tests/test_eval.py (outdated)

gwarmstrong commented Dec 2, 2025

tests/gpu-tests/test_eval.py (outdated)

Apply suggestions from code review

dab46e0

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 2, 2025

Update tests/gpu-tests/test_eval.py

9344b92

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 2, 2025

Merge branch 'main' into georgea/fix-integration-tests

035c3e6

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 2, 2025

Merge branch 'main' into georgea/fix-integration-tests

79ae6cd

coderabbitai bot reviewed Dec 2, 2025


gwarmstrong changed the title ~~Fix integration tests~~ @coderabbitai title Dec 2, 2025

gwarmstrong changed the title ~~@coderabbitai title~~ FIX integration tests by escaping aalcr and adding judge kwargs Dec 2, 2025

gwarmstrong changed the title ~~FIX integration tests by escaping aalcr and adding judge kwargs~~ FIX integration tests by escaping aalcr and adding judge args Dec 2, 2025

gwarmstrong enabled auto-merge (squash) December 2, 2025 22:40

gwarmstrong disabled auto-merge December 2, 2025 22:40

gwarmstrong merged commit fbdece4 into main Dec 2, 2025
5 checks passed

gwarmstrong deleted the georgea/fix-integration-tests branch December 2, 2025 22:40

gwarmstrong added a commit that referenced this pull request Dec 3, 2025

FIX integration tests by escaping aalcr and adding judge args (#1062)

83af983

Signed-off-by: George Armstrong <georgea@nvidia.com>

melllinia pushed a commit that referenced this pull request Dec 5, 2025

FIX integration tests by escaping aalcr and adding judge args (#1062)

d5557bb

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>

wasiahmad pushed a commit that referenced this pull request Feb 4, 2026

FIX integration tests by escaping aalcr and adding judge args (#1062)

7c44a0d

Signed-off-by: George Armstrong <georgea@nvidia.com>

This was referenced Feb 21, 2026

feat: add custom judge type support for external repo integration #1265

Closed

feat: add custom judge type support for external repo integration #1274

Merged

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

FIX integration tests by escaping aalcr and adding judge args (#1062)

ed88c54

Signed-off-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

FIX integration tests by escaping aalcr and adding judge args (#1062)

2a56523

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>


gwarmstrong merged 5 commits into main from georgea/fix-integration-tests
