Introduce regex for small differences of formatting from judge #1082

Merged: gwarmstrong merged 2 commits into main from wprazuch/loosen-judgement-extraction-for-gpt4-o on Dec 9, 2025

wprazuch (Collaborator) commented Dec 9, 2025

Problem

The gpt-4o LLM judge sometimes returns responses with markdown bold formatting:
**Judgement**: yes
However, the is_correct_judgement function only looked for Judgement: (plain text), so these valid positive judgements were incorrectly marked as False.
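
To see why the old check misses the bold form, here is a minimal sketch. It assumes the judge output is the literal string `**Judgement**: yes`; the sample strings and variable names are illustrative, not taken from real judge logs.

```python
# Illustrative judge outputs (assumed, not real logs).
plain = "Judgement: yes"
bold = "**Judgement**: yes"

# The old check was a plain substring test for "Judgement:".
plain_found = "Judgement:" in plain  # the marker is present verbatim
bold_found = "Judgement:" in bold    # the text contains "Judgement**:", so the test fails

print(plain_found, bold_found)
```

With the bold form, the closing asterisks sit between the word and the colon, so the substring never appears and the verdict is never extracted.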

Solution

Updated the regex pattern in is_correct_judgement to handle both plain and markdown-formatted responses:

# Before
if "Judgement:" in judgement:
    verdict = judgement.split("Judgement:")[-1].strip()


# After  
match = re.search(r'\*{0,2}Judgement\*{0,2}\s*:', judgement, re.IGNORECASE)
if match:
    verdict = judgement[match.end():].strip()

The new pattern matches:
Judgement: (plain)
*Judgement*: (italic)
**Judgement**: (bold)
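
The matching behavior can be checked with a short sketch. The pattern is the one from the description above; the helper name `text_after_judgement` and the sample strings are hypothetical and only show the prefix matching, not the full logic of is_correct_judgement.

```python
import re

# Same pattern as in the PR description, precompiled for reuse.
JUDGEMENT_PREFIX_RE = re.compile(r"\*{0,2}Judgement\*{0,2}\s*:", re.IGNORECASE)


def text_after_judgement(judgement):
    """Hypothetical helper: return the text after the first marker, or None."""
    match = JUDGEMENT_PREFIX_RE.search(judgement)
    if match is None:
        return None
    return judgement[match.end():].strip()


# Illustrative inputs covering the plain, italic, and bold forms,
# a lowercase variant with a space before the colon, and a non-match.
samples = [
    "Judgement: yes",
    "*Judgement*: yes",
    "**Judgement**: yes",
    "judgement : no",
    "no marker here",
]
results = [text_after_judgement(s) for s in samples]
print(results)
```

Because the two asterisk groups are each optional and independent, the pattern also accepts unbalanced forms such as `*Judgement**:`.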

Impact

Tested on HLE benchmark results - judge_correct nearly doubled:

| Subset | Before | After | Δ |
| --- | --- | --- | --- |
| hle (overall) | 5.70% | 10.24% | +4.54% |
| hle-Math | 8.91% | 16.39% | +7.48% |
| hle-Engineering | 3.13% | 6.25% | +3.12% |
| hle-Chemistry | 4.95% | 7.92% | +2.97% |

coderabbitai bot (Contributor) commented Dec 9, 2025

📝 Walkthrough

Enhancement to judgment parsing logic in the metrics utilities module. The is_correct_judgement function now supports both plain text "Judgement:" and markdown-formatted "**Judgement**:" patterns using regex-based matching, while maintaining backward compatibility.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Judgment parsing enhancement (nemo_skills/evaluation/metrics/utils.py) | Added re module import; modified is_correct_judgement function to recognize judgment patterns in both plain text and markdown-formatted styles (case-insensitive) using regex matching; maintains existing control flow for return behavior |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the regex pattern correctly matches both "Judgement:" and "**Judgement**:" formats in various case combinations
  • Confirm backward compatibility with existing judgment formats and return behaviors

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: introducing regex parsing to handle formatting variations in judge responses, specifically for 'Judgement:' markers. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/utils.py (1)

37-45: Regex-based judgement parsing looks correct; consider small clarity tweaks

The new regex-based parsing behaves as intended and preserves the previous semantics while being more tolerant (handles Judgement :, *Judgement*:, **Judgement**:, case-insensitively). The yes/no verdict extraction also remains correct.

Two optional improvements for readability and future-proofing:

  1. Precompile the regex as a module-level constant
    Makes the intent explicit and avoids recompiling on every call if this ends up on a hot path.

  2. Align comment with actual behavior
    The pattern allows 0–2 asterisks, not just the bold markdown form; adjusting the comment would avoid confusion for future readers.

Example refactor (optional):

+JUDGEMENT_PREFIX_RE = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)
+
 def is_correct_judgement(judgement, return_none=False) -> Union[bool, None]:
-    # Match both plain "Judgement:" and markdown bold "**Judgement**:" formats, this happens for gpt-4o which is AA Judge model.
-    match = re.search(r'\*{0,2}Judgement\*{0,2}\s*:', judgement, re.IGNORECASE)
+    # Match plain "Judgement:" and common markdown variants like "*Judgement*:" / "**Judgement**:" (e.g., from gpt-4o AA Judge).
+    match = JUDGEMENT_PREFIX_RE.search(judgement)

These are non-blocking; current implementation is functionally sound.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac0b34d and 4743a12.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/utils.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/utils.py (1)

14-17: re import is appropriate for the new judgement parsing logic

Using the standard library re module here is the right choice; no issues with the added import.

wprazuch and others added 2 commits December 9, 2025 09:07
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong force-pushed the wprazuch/loosen-judgement-extraction-for-gpt4-o branch from 4743a12 to 9a88a1d Compare December 9, 2025 17:09
gwarmstrong (Collaborator) commented

@wprazuch LGTM. For future reference, please check our CONTRIBUTING.md for guidance on pre-commit formatting and commit signoff.

@gwarmstrong gwarmstrong enabled auto-merge (squash) December 9, 2025 17:11
@gwarmstrong gwarmstrong merged commit 48cd54b into main Dec 9, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the wprazuch/loosen-judgement-extraction-for-gpt4-o branch December 9, 2025 17:25
Jorjeous pushed a commit that referenced this pull request Dec 11, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>