Introduce regex for small differences of formatting from judge #1082

gwarmstrong merged 2 commits into main from
Conversation
📝 Walkthrough

Enhancement to judgment parsing logic in the metrics utilities module.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/utils.py (1)
37-45: Regex-based judgement parsing looks correct; consider small clarity tweaks

The new regex-based parsing behaves as intended and preserves the previous semantics while being more tolerant (handles `Judgement :`, `*Judgement*:`, and `**Judgement**:`, case-insensitively). The yes/no verdict extraction also remains correct. Two optional improvements for readability and future-proofing:
- Precompile the regex as a module-level constant. Makes the intent explicit and avoids recompiling on every call if this ends up on a hot path.
- Align the comment with actual behavior. The pattern allows 0–2 asterisks, not just the bold markdown form; adjusting the comment would avoid confusion for future readers.

Example refactor (optional):
```diff
+JUDGEMENT_PREFIX_RE = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)
+
 def is_correct_judgement(judgement, return_none=False) -> Union[bool, None]:
-    # Match both plain "Judgement:" and markdown bold "**Judgement**:" formats, this happens for gpt-4o which is AA Judge model.
-    match = re.search(r'\*{0,2}Judgement\*{0,2}\s*:', judgement, re.IGNORECASE)
+    # Match plain "Judgement:" and common markdown variants like "*Judgement*:" / "**Judgement**:" (e.g., from gpt-4o AA Judge).
+    match = JUDGEMENT_PREFIX_RE.search(judgement)
```

These are non-blocking; the current implementation is functionally sound.
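The suggested refactor can be sketched end to end as below. The prefix regex is the one from the diff; the verdict extraction after the prefix (take the remaining text and check for a leading "yes") is a simplified assumption for illustration, not necessarily the exact `nemo_skills` implementation.

```python
import re
from typing import Union

# Precompiled once at module level, as the review suggests.
# Matches "Judgement:", "*Judgement*:", and "**Judgement**:" case-insensitively.
JUDGEMENT_PREFIX_RE = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)


def is_correct_judgement(judgement: str, return_none: bool = False) -> Union[bool, None]:
    match = JUDGEMENT_PREFIX_RE.search(judgement)
    if match is None:
        # No "Judgement:" prefix found at all.
        return None if return_none else False
    # Simplified verdict extraction (assumed, not the exact repo logic):
    # look at the text after the prefix and check for a leading "yes".
    verdict = judgement[match.end():].strip().lower()
    return verdict.startswith('yes')
```

With this shape, both the plain and markdown-formatted judge outputs resolve to the same verdict, which is the behavior the PR is after.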
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/metrics/utils.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/utils.py (1)
14-17: `re` import is appropriate for the new judgement parsing logic

Using the standard library `re` module here is the right choice; no issues with the added import.
Signed-off-by: George Armstrong <georgea@nvidia.com>
4743a12 to 9a88a1d
@wprazuch LGTM, for future reference please check our CONTRIBUTING.md for guidance regarding pre-commit formatting and commit signoff.
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Problem
The gpt-4o LLM judge sometimes returns responses with markdown bold formatting:
`**Judgement**: yes`
However, the `is_correct_judgement` function only looked for `Judgement:` (plain text), causing these valid positive judgements to be incorrectly marked as False.

Solution
Updated the regex pattern in `is_correct_judgement` to handle both plain and markdown-formatted responses. The new pattern matches:
- `Judgement:` (plain)
- `*Judgement*:` (italic)
- `**Judgement**:` (bold)
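The three variants above can be checked directly against the pattern from the PR; this is a minimal sketch using only the regex, with a case-variant line added to show the `re.IGNORECASE` flag at work:

```python
import re

# The pattern introduced by the PR: 0-2 asterisks on each side of
# "Judgement", optional whitespace before the colon, case-insensitive.
pattern = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)

samples = [
    "Judgement: yes",      # plain
    "*Judgement*: yes",    # italic markdown
    "**Judgement**: yes",  # bold markdown
    "judgement : yes",     # lowercase, space before colon
]

for line in samples:
    print(f"{line!r} -> matched: {bool(pattern.search(line))}")
```

All four lines match, which is why the markdown-formatted gpt-4o outputs are no longer dropped.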
Impact
Tested on HLE benchmark results - judge_correct nearly doubled:
| Subset | Before | After | Δ |
|---|---|---|---|
| hle (overall) | 5.70% | 10.24% | +4.54% |
| hle-Math | 8.91% | 16.39% | +7.48% |
| hle-Engineering | 3.13% | 6.25% | +3.12% |
| hle-Chemistry | 4.95% | 7.92% | +2.97% |