
Fgalko/mcq regex#944

Merged
fgalko-oss merged 7 commits into main from fgalko/mcq-regex
Oct 20, 2025

Conversation

@fgalko-oss
Collaborator

@fgalko-oss fgalko-oss commented Oct 14, 2025

Summary by CodeRabbit

  • New Features

    • Configurable MCQ answer extraction with toggles for boxed parsing and custom final-answer regex.
    • Per-sample override support for extraction settings.
    • Evaluation output now includes the model’s predicted answer and a correctness flag.
  • Refactor

    • Improved answer parsing with multi-stage extraction strategies to increase accuracy.
    • Added logging of parsed answers and extraction settings.

…r MCQ

Signed-off-by: fgalko <fgalko@nvidia.com>
Signed-off-by: fgalko <fgalko@nvidia.com>
Signed-off-by: fgalko <fgalko@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds MCQEvaluatorConfig and updates eval_mcq to build and use it; implements per-sample overrides with fallback to config, refactors multi-stage answer extraction (boxed → regex → textual fallback), logs extraction details, and writes predicted_answer and symbolic_correct into each sample.

Changes

Cohort / File(s) Summary
MCQ evaluation config and extraction refactor
nemo_skills/evaluation/evaluator/mcq.py
- Added MCQEvaluatorConfig (@nested_dataclass) with extract_from_boxed and extract_regex fields
- eval_mcq now instantiates MCQEvaluatorConfig(**cfg.eval_config) and applies per-sample overrides (falling back to eval_config)
- Refactored extraction to compute extracted_answer and derive parsed_letter via boxed → regex → textual fallbacks
- Added LOG.info for parsed_letter, extract_from_boxed, extract_regex, and extracted_answer
- Writes sample["predicted_answer"] and computes/stores sample["symbolic_correct"] before writing output

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as eval_mcq
  participant Config as MCQEvaluatorConfig
  participant Sample as Sample Item
  participant Extract as Extraction Logic
  participant Logger as LOG

  Runner->>Config: Instantiate from cfg.eval_config
  loop For each sample
    Runner->>Sample: Read per-sample overrides (extract_from_boxed, extract_regex)
    Runner->>Extract: Attempt boxed extraction
    alt boxed yields single answer
      Extract-->>Runner: extracted_answer
      Runner->>Extract: set parsed_letter = extracted_answer
    else boxed absent/ambiguous
      Runner->>Extract: try regex extraction (extract_regex)
      alt regex matches
        Extract-->>Runner: parsed_letter
      else no regex match
        Runner->>Extract: try textual fallback ("Answer: X")
        Extract-->>Runner: parsed_letter or None
      end
    end
    Runner->>Logger: info(parsed_letter, extract_from_boxed, extract_regex, extracted_answer)
    Runner->>Sample: set sample["predicted_answer"]
    Runner->>Sample: compute & set sample["symbolic_correct"]
  end
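The multi-stage flow in the diagram can be sketched in a few lines. This is a hypothetical simplification, not the PR's actual code: `extract_boxed` stands in for the evaluator's real `extract_answer` helper, while the stage-2/3 regexes are the ones quoted verbatim in the review comments below.

```python
import re

# Fallback pattern for "Answer: X" (also matches "**answer**: x", "_Answer_: X", etc.).
FALLBACK_RE = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"
# Matches the last standalone capital letter in a string.
LAST_CAPITAL_RE = r"\b[A-Z]\b(?!.*\b[A-Z]\b)"


def extract_boxed(text):
    # Return the contents of the last \boxed{...}, or None if absent.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def parse_mcq_letter(text, extract_from_boxed=True, extract_regex=None):
    parsed_letter = None
    # Stage 1: boxed extraction, e.g. "\boxed{B}".
    if extract_from_boxed:
        extracted = extract_boxed(text)
        if extracted is not None:
            if len(extracted) == 1 and extracted.upper() in "ABCDEFGHIJ":
                parsed_letter = extracted.upper()
            elif len(extracted) > 1:
                # Keep the last standalone capital, matching <A>, {A}, *A*, etc.
                m = re.findall(LAST_CAPITAL_RE, extracted, re.DOTALL)
                if m:
                    parsed_letter = m[-1]
    # Stage 2: user-supplied regex with a single capture group.
    if parsed_letter is None and extract_regex:
        m = re.findall(extract_regex, text)
        if m:
            parsed_letter = m[-1].strip().upper()
    # Stage 3: textual fallback, e.g. "Answer: B" or "**answer**: b".
    if parsed_letter is None:
        m = re.findall(FALLBACK_RE, text)
        if m:
            parsed_letter = m[-1].strip().upper()
    return parsed_letter
```

Each stage only runs when the previous one produced nothing, which is the "boxed → regex → textual fallback" ordering described in the walkthrough.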

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through lines of code,
I sniff boxed answers on the road.
Regex and fallback, I chase each clue,
I log, I save, I sort what's true.
Hooray — a parsed choice, fresh and new. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title "Fgalko/mcq regex" refers to a real aspect of the changeset, specifically the reworked regex-based extraction logic for the MCQ evaluator. However, it is presented in branch-name format and is fairly vague: it fails to capture the main structural changes, such as the addition of the MCQEvaluatorConfig class, the multi-stage extraction approach, or the overall refactoring of eval_mcq, so it does not clearly summarize the primary scope of the changes for a reader scanning the history.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fgalko/mcq-regex

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)

54-56: Consider using LOG.debug for detailed extraction information.

Logging every sample's extraction details at INFO level could be verbose in production. Consider LOG.debug unless detailed logging is required for all runs.

-    LOG.info(
+    LOG.debug(
         f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
     )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad8e8f8 and 8803e14.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  • get_logger_name (130-134)
  • nested_dataclass (49-82)
nemo_skills/evaluation/math_grader.py (1)
  • extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)

27-32: LGTM! Well-structured config class.

The MCQEvaluatorConfig dataclass is properly defined with sensible defaults and clear field purposes. The use of @nested_dataclass follows the pattern from nemo_skills.utils.


34-37: LGTM! Config instantiation follows established patterns.

The initialization of eval_config from cfg.eval_config is consistent with other evaluators in the codebase.


65-72: LGTM! Per-sample overrides and evaluation logic are well-implemented.

The per-sample override mechanism (lines 66-67) provides backward compatibility, and the evaluation flow correctly stores both predicted_answer and symbolic_correct.

@Kipok Kipok requested a review from ekmb October 14, 2025 16:21
@Kipok
Collaborator

Kipok commented Oct 14, 2025

@ekmb could you please help to review this?

@ekmb ekmb requested review from gnalbandyan and removed request for ekmb October 16, 2025 18:39
Collaborator

@shuoyangd shuoyangd left a comment


Overall LGTM. Just two nit comments and we should be able to merge.

@shuoyangd
Collaborator

For clarity: this comment at L46 looks like a legit bug that needs to be fixed. Good job coderabbit.

Signed-off-by: fgalko <fgalko@nvidia.com>
Signed-off-by: fgalko <fgalko@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)

53-56: Normalize case in fallback regex to ensure consistent comparisons.

The fallback regex uses (?i) for case-insensitive matching but doesn't uppercase the captured letter. If the text contains "answer: a" (lowercase), it captures "a", which won't match the uppercase expected_answer on line 75.

This was flagged in a previous review but remains unaddressed.

Apply this diff to normalize the case:

         if parsed_letter is None:
             match = re.findall(r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])", text)
             if match:
-                parsed_letter = match[-1].strip()
+                parsed_letter = match[-1].strip().upper()
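For illustration, here is the mismatch the comment describes, checked against the quoted fallback pattern. This is a standalone demo, not the evaluator's code:

```python
import re

FALLBACK_RE = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"

# (?i) makes matching case-insensitive, but the capture group keeps the
# original case: a lowercase source yields a lowercase letter, which would
# never equal an uppercase expected_answer without normalization.
raw = re.findall(FALLBACK_RE, "My final answer: a")[-1]  # "a"
normalized = raw.upper()                                  # "A"
```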

44-45: Validate single-character extracted answers are valid MCQ letters.

When extracted_answer has length 1, it's used directly without checking if it's a valid letter A-J. If extract_answer returns "X", "1", or other single characters, they'll be incorrectly accepted as valid MCQ answers.

Apply this diff to add validation:

         if extracted_answer is not None:
             if len(extracted_answer) == 1:
-                parsed_letter = extracted_answer
+                if extracted_answer.upper() in 'ABCDEFGHIJ':
+                    parsed_letter = extracted_answer.upper()
             elif len(extracted_answer) > 1:
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)

58-60: Consider using LOG.debug for per-sample extraction details.

Logging extraction details for every sample at INFO level can be very verbose in production, especially for large evaluation runs. Consider using LOG.debug() instead to keep logs cleaner while still allowing detailed diagnostics when needed.

Apply this diff:

-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68dbc31 and 4ce2c65.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/utils.py (2)
  • get_logger_name (130-134)
  • nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  • unroll_files (39-42)
nemo_skills/evaluation/math_grader.py (2)
  • unroll_files (27-30)
  • extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)

27-32: LGTM! Clean configuration dataclass.

The MCQEvaluatorConfig follows the established pattern and uses @nested_dataclass correctly with keyword-only arguments.


34-36: LGTM! Config initialization follows best practices.

The initialization pattern is consistent with other evaluators in the codebase.


69-76: Per-sample override logic is well-implemented.

The implementation correctly prioritizes per-sample values over config defaults, maintaining backward compatibility. The comparison on line 75 assumes case-normalized letters, so fixing the case normalization issue in the fallback regex (lines 53-56) is important for correctness here.
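The override-with-fallback mechanism praised here can be reduced to a short sketch. Field and key names follow the PR summary; the helper name `resolve_settings` is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MCQEvaluatorConfig:
    # Defaults mirror the fields described in the walkthrough.
    extract_from_boxed: bool = True
    extract_regex: Optional[str] = None


def resolve_settings(sample, cfg):
    # Per-sample values win; any key absent from the sample falls back to the
    # shared config, so older data files without these keys keep working.
    extract_from_boxed = sample.get("extract_from_boxed", cfg.extract_from_boxed)
    extract_regex = sample.get("extract_regex", cfg.extract_regex)
    return extract_from_boxed, extract_regex
```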

Signed-off-by: fgalko <fgalko@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)

43-50: Consider validating single-character extraction results.

When extracted_answer is a single character, the code uses it directly without verifying it's a valid letter A-Z. If extract_answer returns "1" or punctuation, it will be used as-is, producing a misleading predicted_answer.

Apply this diff to validate single-character results:

     if extracted_answer is not None:
         if len(extracted_answer) == 1:
-            parsed_letter = extracted_answer
+            if extracted_answer.upper() in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
+                parsed_letter = extracted_answer.upper()
         elif len(extracted_answer) > 1:
             # try to extract the letter from extracted answer, useful to match <A>, {A}, *A*, etc.
             match = re.findall(r"\b[A-Z]\b(?!.*\b[A-Z]\b)", extracted_answer, re.DOTALL)
             if len(match) > 0:
                 parsed_letter = match[-1].strip()
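As a quick illustration of that pattern's behavior: the negative lookahead rejects any standalone capital that is followed by another one, so only the final letter survives. This is a standalone demo, not the evaluator's code:

```python
import re

LAST_CAPITAL_RE = r"\b[A-Z]\b(?!.*\b[A-Z]\b)"


def last_letter(text):
    # Return the last standalone capital letter in text, or None.
    m = re.findall(LAST_CAPITAL_RE, text, re.DOTALL)
    return m[-1] if m else None
```

For example, `last_letter("<A>")` yields "A", while `last_letter("maybe A, but surely B")` yields "B", matching the "take the last letter" intent of the extraction code.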

58-60: Consider DEBUG level for per-sample logging.

Logging extraction details at INFO level for every sample can produce excessive output in production. DEBUG level would be more appropriate for detailed per-sample diagnostics.

Apply this diff:

-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ce2c65 and be112dd.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  • get_logger_name (130-134)
  • nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  • unroll_files (39-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/evaluation/evaluator/mcq.py (4)

22-32: Well-structured configuration class.

The addition of MCQEvaluatorConfig with the @nested_dataclass decorator follows established patterns and provides clean defaults for extraction behavior. The keyword-only argument enforcement enhances API clarity.


69-71: Clean per-sample override pattern.

The fallback mechanism from per-sample values to eval_config defaults provides good backward compatibility while enabling centralized configuration. Well done.


38-76: Excellent refactoring with proper bug fixes.

The multi-stage extraction logic is well-structured and all previously identified bugs have been properly addressed:

  • Single-character answers are now correctly processed
  • Case normalization ensures consistent comparisons
  • Regex patterns use A-Z consistently

The addition of MCQEvaluatorConfig and per-sample overrides significantly improves configurability while maintaining backward compatibility.


35-36: No issues found. The code is safe as implemented.

The review comment assumes cfg.eval_config could be None or missing, but cfg.eval_config is defined with field(default_factory=dict), which guarantees it always initializes to an empty dict rather than None. MCQEvaluatorConfig has all fields with default values, so unpacking an empty dict works correctly. This pattern is intentional and consistent across all other evaluators in the codebase (ruler.py, ojbench.py, bfcl.py, livecodebench.py, scicode.py, ioi.py), as the code comment accurately states.

Likely an incorrect or invalid review comment.

Collaborator

@shuoyangd shuoyangd left a comment


LGTM, thanks!

@fgalko-oss fgalko-oss enabled auto-merge (squash) October 20, 2025 11:30
@fgalko-oss fgalko-oss merged commit 8ed0e83 into main Oct 20, 2025
7 checks passed
@fgalko-oss fgalko-oss deleted the fgalko/mcq-regex branch October 20, 2025 11:53
@coderabbitai coderabbitai bot mentioned this pull request Feb 27, 2026