Skip to content

FIX integration tests by escaping aalcr and adding judge args#1062

Merged
gwarmstrong merged 5 commits intomainfrom
georgea/fix-integration-tests
Dec 2, 2025
Merged

FIX integration tests by escaping aalcr and adding judge args#1062
gwarmstrong merged 5 commits intomainfrom
georgea/fix-integration-tests

Conversation

@gwarmstrong
Copy link
Collaborator

@gwarmstrong gwarmstrong commented Dec 1, 2025

Summary by CodeRabbit

  • Tests

    • Improved detection of judge-only datasets during evaluation processes
    • Expanded dataset filtering to exclude datasets with known tokenization issues
  • Chores

    • Added judge pipeline argument configurations to dataset modules

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 2, 2025

📝 Walkthrough

Walkthrough

Four files are modified to introduce a new constant JUDGE_PIPELINE_ARGS = {} across three judge-specific dataset modules and update test evaluation logic to use hasattr() for judge dataset detection instead of getattr(), while expanding the excluded datasets list.

Changes

Cohort / File(s) Summary
Judge Pipeline Constants
nemo_skills/dataset/open-proof-corpus-judge/__init__.py, nemo_skills/dataset/proof-arena-judge/__init__.py, nemo_skills/dataset/proof-bench-judge/__init__.py
Added new public constant JUDGE_PIPELINE_ARGS = {} to each module, establishing a judge-only pipeline argument placeholder across multiple judge-specific dataset modules.
Test Evaluation Logic Update
tests/gpu-tests/test_eval.py
Changed judge-dataset detection from getattr(..., None) to hasattr(...) to properly recognize datasets with JUDGE_PIPELINE_ARGS even when the value is an empty dict. Expanded excluded datasets list to include bfcl_v4 and aalcr with a note about tokenization issues.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Areas requiring attention:
    • Verify that the hasattr() change in test logic correctly identifies all judge datasets across the codebase
    • Confirm that excluding bfcl_v4 and aalcr datasets aligns with project requirements

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: fixing integration tests by excluding problematic datasets (aalcr) and adding judge pipeline arguments to multiple dataset modules.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch georgea/fix-integration-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/gpu-tests/test_eval.py (1)

225-244: Expanded excluded_datasets set looks reasonable

Excluding bfcl_v4 and aalcr from the “prepare and eval all datasets” sweep is consistent with the existing pattern of carving out heavy or problematic datasets and should help keep this integration test stable. You might optionally add a short comment for bfcl_v4 similar to aalcr for future maintainability.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29674a4 and 79ae6cd.

📒 Files selected for processing (4)
  • nemo_skills/dataset/open-proof-corpus-judge/__init__.py (1 hunks)
  • nemo_skills/dataset/proof-arena-judge/__init__.py (1 hunks)
  • nemo_skills/dataset/proof-bench-judge/__init__.py (1 hunks)
  • tests/gpu-tests/test_eval.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/open-proof-corpus-judge/__init__.py (1)

19-20: Judge-only marker constant is appropriate

Defining JUDGE_PIPELINE_ARGS = {} with a clear comment is consistent with the new hasattr-based detection and doesn’t alter existing behavior; it cleanly marks this dataset as judge-only.

nemo_skills/dataset/proof-bench-judge/__init__.py (1)

19-20: Consistent judge-only marker for proof-bench-judge

Adding JUDGE_PIPELINE_ARGS = {} with the judge-only comment aligns this module with the other judge datasets and works as intended with the updated detection logic.

tests/gpu-tests/test_eval.py (1)

257-259: hasattr-based judge dataset detection is correct

Switching from a truthiness check to hasattr(dataset_module, "JUDGE_PIPELINE_ARGS") correctly treats an empty dict marker as “judge-enabled” and matches the new constants added in the judge datasets, without changing behavior for modules that don’t define the attribute.

nemo_skills/dataset/proof-arena-judge/__init__.py (1)

19-20: Judge-only marker added consistently

JUDGE_PIPELINE_ARGS = {} plus the clarifying comment brings this dataset in line with the other judge-only modules and integrates cleanly with the updated test detection logic.

@gwarmstrong gwarmstrong changed the title Fix integration tests @coderabbitai title Dec 2, 2025
@gwarmstrong
Copy link
Collaborator Author

@coderabbitai title

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 2, 2025

Tip

For best results, initiate chat on the files or code changes.

@gwarmstrong, based on the changes in this PR, here are some more descriptive title suggestions:

Recommended:

Add JUDGE_PIPELINE_ARGS to judge-only datasets and fix detection logic

Alternatives:

Fix judge dataset detection by adding JUDGE_PIPELINE_ARGS constant
Add JUDGE_PIPELINE_ARGS to judge datasets and update hasattr check

The current title "Fix integration tests" is quite generic. The PR specifically adds the JUDGE_PIPELINE_ARGS = {} constant to three judge-only datasets and updates the detection logic in test_eval.py from getattr(..., None) to hasattr(...) to properly recognize datasets with this constant (even when it's an empty dict).

Would you like me to use one of these titles, or would you prefer a different phrasing?

@gwarmstrong gwarmstrong changed the title @coderabbitai title FIX integration tests by escaping aalcr and adding judge kwargs Dec 2, 2025
@gwarmstrong gwarmstrong changed the title FIX integration tests by escaping aalcr and adding judge kwargs FIX integration tests by escaping aalcr and adding judge args Dec 2, 2025
@gwarmstrong gwarmstrong enabled auto-merge (squash) December 2, 2025 22:40
@gwarmstrong gwarmstrong disabled auto-merge December 2, 2025 22:40
@gwarmstrong gwarmstrong merged commit fbdece4 into main Dec 2, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the georgea/fix-integration-tests branch December 2, 2025 22:40
gwarmstrong added a commit that referenced this pull request Dec 3, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
melllinia pushed a commit that referenced this pull request Dec 5, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant