
FIX math verify handle leading zeros and int literals cases#1074

Merged
wedu-nvidia merged 2 commits into main from georgea/fix-math-verify-for-leading-zeros
Dec 4, 2025

Conversation


@gwarmstrong gwarmstrong commented Dec 4, 2025

Problem

The math grader incorrectly marked answers with leading zeros as wrong:

  • `predicted_answer="016"` vs `expected_answer="16"` → returned `False` (should be `True`)

This caused AIME problem #3 (id: aime25-2) to be incorrectly scored.

Root Cause

The literal comparison logic short-circuited before reaching math_verify's symbolic comparison, which already handles leading zeros correctly.

Solution

Delegate numeric comparisons to math_verify (which already handles leading zeros correctly) instead of short-circuiting with string comparison.
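A minimal sketch of the intended comparison order (hypothetical; the actual implementation lives in nemo_skills/evaluation/math_grader.py and delegates the final step to math_verify's symbolic comparison rather than the plain float fallback shown here):

```python
# Hypothetical sketch of the fixed flow: normalize first, then compare
# numerics as numbers instead of short-circuiting on string equality.

def math_equal_sketch(predicted: str, expected: str) -> bool:
    # Fast path: equal after trivial normalization.
    if predicted.strip() == expected.strip():
        return True
    # Numeric literals: compare as numbers so "016" matches "16"
    # instead of failing a character-by-character string comparison.
    try:
        return float(predicted) == float(expected)
    except ValueError:
        pass
    # The real implementation falls through to math_verify here;
    # a bare sketch has nothing more to try.
    return False
```

With this ordering, the leading-zero case from the bug report passes, since `float("016") == float("16")`.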

Summary by CodeRabbit

  • Bug Fixes

    • Improved mathematical answer verification with refined normalization and comparison logic.
    • Fixed handling of numeric answers containing leading zeros in evaluations.
  • Tests

    • Added test coverage for leading-zero number scenarios.


Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong requested a review from i-vainn December 4, 2025 21:43

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough


Refactored the `math_equal` function to improve comparison logic by introducing an early normalization check, adding explicit text-literal detection, and replacing the fallback path with symbolic comparison via math_verify. Added test cases for leading-zero handling. The `**kw` parameter was renamed to `**kwargs`.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Math grader logic refactoring (`nemo_skills/evaluation/math_grader.py`) | Restructured the `math_equal` comparison flow: removed the literal-pattern quick path; added an early exit for normalized equality; introduced dedicated text-literal handling; replaced the fallback with symbolic comparison via math_verify. Parameter `**kw` renamed to `**kwargs`. MCQ and modulo logic paths retained. |
| Test coverage (`tests/test_math_equal.py`) | Added two parameterized test cases to `test_correct_examples`: `("016", "16")` and `("007", "7")`, validating leading-zero normalization behavior. |
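The added regression cases would look roughly like the following (illustrative reconstruction; the real `tests/test_math_equal.py` feeds the pairs through the project's `math_equal` grader via `pytest.mark.parametrize`, and the stand-in grader here is only a placeholder):

```python
# Illustrative reconstruction of the new regression cases; the real test
# file drives these pairs through math_equal, in both argument orders.
LEADING_ZERO_CASES = [("016", "16"), ("007", "7")]

def check_leading_zero_cases(grader=lambda a, b: float(a) == float(b)):
    # Stand-in grader: plain numeric equality, which ignores leading zeros.
    results = []
    for predicted, expected in LEADING_ZERO_CASES:
        # Exercise both argument orders, mirroring the test's symmetry.
        results.append(grader(predicted, expected) and grader(expected, predicted))
    return results
```

Checking both argument orders matters because the grader is not guaranteed to treat predicted and expected answers symmetrically.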

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas:
    • Verify the early normalization path correctly handles all edge cases previously covered by the literal-pattern quick path
    • Confirm text-literal detection logic is distinct from and doesn't overlap with the normalized equality check
    • Review math_verify invocation to ensure LaTeX wrapping and normalization steps are applied consistently
    • Validate that MCQ and modulo handling paths remain functionally equivalent with adjusted control flow

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation / Resolution |
| --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. Run `@coderabbitai generate docstrings` to improve coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately reflects the main fix: handling leading zeros and integer literals in math_verify, which aligns with the core change of removing string-comparison shortcuts and delegating to math_verify. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/evaluation/math_grader.py (1)

72-85: Clarify text‑literal branch behavior (comment vs. implementation)

The overall control flow here looks good:

  • Fast path on is_normalized_equal avoids unnecessary symbolic work.
  • The text_literal_pattern guard keeps plain text out of math_verify, which is what we want.

Minor clarity nit: the comment says “For TEXT literals (not numeric), use direct string comparison”, but the branch actually just returns False, relying on the earlier is_normalized_equal check to have already handled the equal‑string case.

To better align intent and implementation, consider one of:

  • Adjusting the comment to something like:
    “For TEXT literals (not numeric), if they weren’t already equal after normalization, short‑circuit to False instead of calling math_verify.”
  • Or, make the behavior explicit in code, e.g.:
```diff
-    if is_text_literal:
-        return False  # Already checked is_normalized_equal above
+    if is_text_literal:
+        # For pure text literals, equality is decided by normalized string match above.
+        # If we reach here, they differ, so avoid symbolic parsing.
+        return False
```

Functionally identical, but it makes the intended semantics obvious to the next reader.
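The semantics under discussion can be made concrete with a small self-contained sketch (hypothetical; the names and the numeric regex do not correspond to the actual `math_grader.py` code): if both sides are plain text rather than numeric, a failed normalized match is final and symbolic verification is never attempted.

```python
import re

# Hypothetical guard illustrating the reviewer's point, not the real code.
_NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")

def compare(predicted: str, expected: str) -> bool:
    pred, exp = predicted.strip().lower(), expected.strip().lower()
    if pred == exp:
        # Normalized equality, checked first (the "fast path").
        return True
    if not (_NUMERIC.match(pred) or _NUMERIC.match(exp)):
        # Differing pure-text answers: short-circuit to False rather than
        # paying for symbolic parsing that cannot succeed.
        return False
    # Numeric case: compare as numbers (the real code defers to math_verify).
    try:
        return float(pred) == float(exp)
    except ValueError:
        return False
```

Keeping text literals out of the symbolic path avoids both wasted parsing work and spurious matches on prose answers.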

📜 Review details

Configuration used: `.coderabbit.yaml`

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9115aef and 44a7fe8.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/math_grader.py (1 hunks)
  • tests/test_math_equal.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (1)
tests/test_math_equal.py (1)

37-38: Nice addition of regression tests for leading‑zero answers

The new cases for "016" == "16" and "007" == "7" directly capture the original bug and are symmetric (both argument orders are exercised), which should keep this behavior from regressing.

@wedu-nvidia wedu-nvidia left a comment


Thanks

@wedu-nvidia wedu-nvidia merged commit 5cc1dcc into main Dec 4, 2025
5 checks passed
@wedu-nvidia wedu-nvidia deleted the georgea/fix-math-verify-for-leading-zeros branch December 4, 2025 22:22
Jorjeous pushed a commit that referenced this pull request Dec 11, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>