FIX math verify handle leading zeros and int literals cases #1074
wedu-nvidia merged 2 commits into main from
Conversation
Signed-off-by: George Armstrong <georgea@nvidia.com>
📝 Walkthrough
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/evaluation/math_grader.py (1)
72-85: Clarify text-literal branch behavior (comment vs. implementation)

The overall control flow here looks good:

- Fast path on `is_normalized_equal` avoids unnecessary symbolic work.
- The `text_literal_pattern` guard keeps plain text out of `math_verify`, which is what we want.

Minor clarity nit: the comment says “For TEXT literals (not numeric), use direct string comparison”, but the branch actually just returns `False`, relying on the earlier `is_normalized_equal` check to have already handled the equal-string case.

To better align intent and implementation, consider one of:

- Adjusting the comment to something like: “For TEXT literals (not numeric), if they weren't already equal after normalization, short-circuit to False instead of calling math_verify.”
- Or, make the behavior explicit in code, e.g.:

```diff
-        if is_text_literal:
-            return False  # Already checked is_normalized_equal above
+        if is_text_literal:
+            # For pure text literals, equality is decided by normalized string match above.
+            # If we reach here, they differ, so avoid symbolic parsing.
+            return False
```

Functionally identical, but it makes the intended semantics obvious to the next reader.
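As a rough illustration of the control flow the review describes, here is a minimal sketch — the names `normalize`, `TEXT_LITERAL_PATTERN`, and `is_equal_sketch` are hypothetical, and a plain numeric comparison stands in for `math_verify`'s symbolic check:

```python
import re

# Hypothetical pattern: treat answers made purely of letters/spaces as text literals.
TEXT_LITERAL_PATTERN = re.compile(r"^[A-Za-z ]+$")


def normalize(ans: str) -> str:
    # Hypothetical normalization: trim whitespace and lowercase.
    return ans.strip().lower()


def is_equal_sketch(predicted: str, expected: str) -> bool:
    # Fast path: normalized string equality handles identical text literals.
    if normalize(predicted) == normalize(expected):
        return True
    # For pure text literals, equality was decided by the normalized match above;
    # if we reach here they differ, so skip symbolic parsing entirely.
    if TEXT_LITERAL_PATTERN.match(predicted) and TEXT_LITERAL_PATTERN.match(expected):
        return False
    # Otherwise fall through to symbolic comparison (math_verify in the real
    # grader); a plain numeric comparison stands in here for illustration.
    try:
        return float(predicted) == float(expected)
    except ValueError:
        return False
```

This is only a sketch of the branch ordering under discussion, not the actual `math_grader.py` implementation.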
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/evaluation/math_grader.py (1 hunks)
- tests/test_math_equal.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (1)
tests/test_math_equal.py (1)
37-38: Nice addition of regression tests for leading-zero answers

The new cases for `"016" == "16"` and `"007" == "7"` directly capture the original bug and are symmetric (both argument orders are exercised), which should keep this behavior from regressing.
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>
Problem
The math grader incorrectly marked answers with leading zeros as wrong:
- `predicted_answer="016"` vs `expected_answer="16"` → False (should be True)

This caused AIME problem #3 (id: aime25-2) to be incorrectly scored.
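The symptom can be reproduced with plain Python: raw string equality (which the buggy short-circuit effectively performed) fails on the leading zero, while a numeric comparison — as in `math_verify`'s symbolic path — agrees the answers match:

```python
# The buggy short-circuit effectively compared the raw strings:
string_match = "016" == "16"
print(string_match)  # False: the leading zero breaks string equality

# Comparing as numbers, which the symbolic path does, succeeds:
numeric_match = int("016") == int("16")
print(numeric_match)  # True
```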
Root Cause
The literal comparison logic short-circuited before reaching `math_verify`'s symbolic comparison, which already handles leading zeros correctly.

Solution
Delegate numeric comparisons to `math_verify` (which already handles leading zeros correctly) instead of short-circuiting with string comparison.

Summary by CodeRabbit
Bug Fixes
Tests