Introduce regex for small differences of formatting from judge #1082

gwarmstrong merged 2 commits into main from
Conversation
📝 Walkthrough

Enhancement to judgment parsing logic in the metrics utilities module.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/utils.py (1)
37-45: Regex-based judgement parsing looks correct; consider small clarity tweaks

The new regex-based parsing behaves as intended and preserves the previous semantics while being more tolerant (handles `Judgement :`, `*Judgement*:`, and `**Judgement**:`, case-insensitively). The yes/no verdict extraction also remains correct. Two optional improvements for readability and future-proofing:
- Precompile the regex as a module-level constant. Makes the intent explicit and avoids recompiling on every call if this ends up on a hot path.
- Align the comment with actual behavior. The pattern allows 0–2 asterisks, not just the bold markdown form; adjusting the comment would avoid confusion for future readers.

Example refactor (optional):
```diff
+JUDGEMENT_PREFIX_RE = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)
+
 def is_correct_judgement(judgement, return_none=False) -> Union[bool, None]:
-    # Match both plain "Judgement:" and markdown bold "**Judgement**:" formats, this happens for gpt-4o which is AA Judge model.
-    match = re.search(r'\*{0,2}Judgement\*{0,2}\s*:', judgement, re.IGNORECASE)
+    # Match plain "Judgement:" and common markdown variants like "*Judgement*:" / "**Judgement**:" (e.g., from gpt-4o AA Judge).
+    match = JUDGEMENT_PREFIX_RE.search(judgement)
```

These are non-blocking; the current implementation is functionally sound.
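The suggested refactor can be sketched end to end as below. The prefix regex is the one from the diff; the verdict extraction after the prefix (take the remaining text and check for a leading "yes") is a simplified assumption for illustration, not necessarily the exact `nemo_skills` implementation.

```python
import re
from typing import Union

# Precompiled once at module level, as the review suggests.
# Matches "Judgement:", "*Judgement*:", and "**Judgement**:" case-insensitively.
JUDGEMENT_PREFIX_RE = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)


def is_correct_judgement(judgement: str, return_none: bool = False) -> Union[bool, None]:
    match = JUDGEMENT_PREFIX_RE.search(judgement)
    if match is None:
        # No "Judgement:" prefix found at all.
        return None if return_none else False
    # Simplified verdict extraction (assumed, not the exact repo logic):
    # look at the text after the prefix and check for a leading "yes".
    verdict = judgement[match.end():].strip().lower()
    return verdict.startswith('yes')
```

With this shape, both the plain and markdown-formatted judge outputs resolve to the same verdict, which is the behavior the PR is after.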
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/metrics/utils.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/utils.py (1)
14-17: `re` import is appropriate for the new judgement parsing logic

Using the standard library `re` module here is the right choice; no issues with the added import.
Signed-off-by: George Armstrong <georgea@nvidia.com>
4743a12 to 9a88a1d
@wprazuch LGTM, for future reference please check our CONTRIBUTING.md for guidance regarding pre-commit formatting and commit signoff.
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Problem
The gpt-4o LLM judge sometimes returns responses with markdown bold formatting:
`**Judgement**: yes`
However, the `is_correct_judgement` function only looked for `Judgement:` (plain text), causing these valid positive judgements to be incorrectly marked as False.

Solution
Updated the regex pattern in `is_correct_judgement` to handle both plain and markdown-formatted responses. The new pattern matches:
- `Judgement:` (plain)
- `*Judgement*:` (italic)
- `**Judgement**:` (bold)
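The three variants above can be checked directly against the pattern from the PR; this is a minimal sketch using only the regex, with a case-variant line added to show the `re.IGNORECASE` flag at work:

```python
import re

# The pattern introduced by the PR: 0-2 asterisks on each side of
# "Judgement", optional whitespace before the colon, case-insensitive.
pattern = re.compile(r'\*{0,2}Judgement\*{0,2}\s*:', re.IGNORECASE)

samples = [
    "Judgement: yes",      # plain
    "*Judgement*: yes",    # italic markdown
    "**Judgement**: yes",  # bold markdown
    "judgement : yes",     # lowercase, space before colon
]

for line in samples:
    print(f"{line!r} -> matched: {bool(pattern.search(line))}")
```

All four lines match, which is why the markdown-formatted gpt-4o outputs are no longer dropped.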
Impact
Tested on HLE benchmark results - judge_correct nearly doubled:
| Subset | Before | After | Δ |
|---|---|---|---|
| hle (overall) | 5.70% | 10.24% | +4.54% |
| hle-Math | 8.91% | 16.39% | +7.48% |
| hle-Engineering | 3.13% | 6.25% | +3.12% |
| hle-Chemistry | 4.95% | 7.92% | +2.97% |