Conversation
…r MCQ
Signed-off-by: fgalko <fgalko@nvidia.com>
Walkthrough

Adds …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Runner as eval_mcq
    participant Config as MCQEvaluatorConfig
    participant Sample as Sample Item
    participant Extract as Extraction Logic
    participant Logger as LOG
    Runner->>Config: Instantiate from cfg.eval_config
    loop For each sample
        Runner->>Sample: Read per-sample overrides (extract_from_boxed, extract_regex)
        Runner->>Extract: Attempt boxed extraction
        alt boxed yields single answer
            Extract-->>Runner: extracted_answer
            Runner->>Extract: set parsed_letter = extracted_answer
        else boxed absent/ambiguous
            Runner->>Extract: try regex extraction (extract_regex)
            alt regex matches
                Extract-->>Runner: parsed_letter
            else no regex match
                Runner->>Extract: try textual fallback ("Answer: X")
                Extract-->>Runner: parsed_letter or None
            end
        end
        Runner->>Logger: info(parsed_letter, extract_from_boxed, extract_regex, extracted_answer)
        Runner->>Sample: set sample["predicted_answer"]
        Runner->>Sample: compute & set sample["symbolic_correct"]
    end
```
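The flow in the diagram can be read as a few lines of straight-line Python. The sketch below is illustrative only: function and field names mirror the diagram, not the actual `mcq.py` implementation, and the boxed-extraction step is a simplified stand-in for the repo's `extract_answer` helper.

```python
import re


def parse_mcq_letter(text, extract_from_boxed=True, extract_regex=None):
    """Simplified sketch of the multi-stage extraction shown in the diagram."""
    parsed_letter = None

    # Stage 1: boxed extraction (stand-in for the real extract_answer helper).
    if extract_from_boxed:
        boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
        if boxed and len(boxed[-1].strip()) == 1:
            parsed_letter = boxed[-1].strip().upper()

    # Stage 2: caller-supplied regex, if any.
    if parsed_letter is None and extract_regex:
        match = re.findall(extract_regex, text)
        if match:
            parsed_letter = match[-1].strip().upper()

    # Stage 3: textual fallback on "Answer: X".
    if parsed_letter is None:
        match = re.findall(r"(?i)Answer\s*:\s*([A-Z])(?![a-zA-Z0-9])", text)
        if match:
            parsed_letter = match[-1].strip().upper()

    return parsed_letter
```

A result of `None` corresponds to the "parsed_letter or None" branch, where no stage produced a usable letter.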
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)
54-56: Consider using LOG.debug for detailed extraction information.

Logging every sample's extraction details at `INFO` level could be verbose in production. Consider `LOG.debug` unless detailed logging is required for all runs.

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/evaluation/math_grader.py (1)
  extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)
27-32: LGTM! Well-structured config class.

The `MCQEvaluatorConfig` dataclass is properly defined with sensible defaults and clear field purposes. The use of `@nested_dataclass` follows the pattern from `nemo_skills.utils`.
34-37: LGTM! Config instantiation follows established patterns.

The initialization of `eval_config` from `cfg.eval_config` is consistent with other evaluators in the codebase.
65-72: LGTM! Per-sample overrides and evaluation logic are well-implemented.

The per-sample override mechanism (lines 66-67) provides backward compatibility, and the evaluation flow correctly stores both `predicted_answer` and `symbolic_correct`.
@ekmb could you please help to review this?
shuoyangd
left a comment
Overall LGTM. Just two nit comments and we should be able to merge.
For clarity: this comment at L46 looks like it flags a legit bug that needs to be fixed. Good job coderabbit.
Signed-off-by: fgalko <fgalko@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)
53-56: Normalize case in fallback regex to ensure consistent comparisons.

The fallback regex uses `(?i)` for case-insensitive matching but doesn't uppercase the captured letter. If the text contains "answer: a" (lowercase), it captures "a", which won't match the uppercase `expected_answer` on line 75.

This was flagged in a previous review but remains unaddressed.

Apply this diff to normalize the case:

```diff
     if parsed_letter is None:
         match = re.findall(r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])", text)
         if match:
-            parsed_letter = match[-1].strip()
+            parsed_letter = match[-1].strip().upper()
```
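To see the failure mode concretely, here is the quoted fallback pattern run against a lowercase completion (standalone snippet; the `expected_answer` comparison is implied, not shown):

```python
import re

# The fallback pattern quoted in the review comment.
pattern = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"

text = "The final answer: a"
match = re.findall(pattern, text)

raw = match[-1].strip()                 # "a" -- fails an equality check against "A"
normalized = match[-1].strip().upper()  # "A" -- matches an uppercase expected_answer
```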
44-45: Validate single-character extracted answers are valid MCQ letters.

When `extracted_answer` has length 1, it's used directly without checking if it's a valid letter A-J. If `extract_answer` returns "X", "1", or other single characters, they'll be incorrectly accepted as valid MCQ answers.

Apply this diff to add validation:

```diff
     if extracted_answer is not None:
         if len(extracted_answer) == 1:
-            parsed_letter = extracted_answer
+            if extracted_answer.upper() in 'ABCDEFGHIJ':
+                parsed_letter = extracted_answer.upper()
         elif len(extracted_answer) > 1:
```
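A minimal sketch of the suggested guard, pulled out as a standalone function (names are illustrative; the A-J range is the reviewer's suggestion for a ten-option MCQ):

```python
def accept_single_char(extracted_answer):
    """Accept a one-character extraction only if it is a plausible MCQ letter."""
    if extracted_answer is not None and len(extracted_answer) == 1:
        if extracted_answer.upper() in "ABCDEFGHIJ":
            return extracted_answer.upper()
    # "X", "1", punctuation, etc. are rejected rather than accepted blindly.
    return None
```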
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)
58-60: Consider using LOG.debug for per-sample extraction details.

Logging extraction details for every sample at INFO level can be very verbose in production, especially for large evaluation runs. Consider using `LOG.debug()` instead to keep logs cleaner while still allowing detailed diagnostics when needed.

Apply this diff:

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  unroll_files (39-42)
nemo_skills/evaluation/math_grader.py (2)
  unroll_files (27-30)
  extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)
27-32: LGTM! Clean configuration dataclass.

The `MCQEvaluatorConfig` follows the established pattern and uses `@nested_dataclass` correctly with keyword-only arguments.
34-36: LGTM! Config initialization follows best practices.

The initialization pattern is consistent with other evaluators in the codebase.
69-76: Per-sample override logic is well-implemented.

The implementation correctly prioritizes per-sample values over config defaults, maintaining backward compatibility. The comparison on line 75 assumes case-normalized letters, so fixing the case normalization issue in the fallback regex (lines 53-56) is important for correctness here.
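The override pattern being praised amounts to a two-line fallback. A hypothetical sketch (field names match the discussion; the config object is simplified to a plain dict):

```python
def resolve_extraction_options(sample, eval_config):
    """Per-sample values take priority; otherwise fall back to evaluator-wide defaults."""
    extract_from_boxed = sample.get("extract_from_boxed", eval_config["extract_from_boxed"])
    extract_regex = sample.get("extract_regex", eval_config["extract_regex"])
    return extract_from_boxed, extract_regex
```

Samples that carry no override keys behave exactly as before, which is where the backward compatibility comes from.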
Signed-off-by: fgalko <fgalko@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)
43-50: Consider validating single-character extraction results.

When `extracted_answer` is a single character, the code uses it directly without verifying it's a valid letter A-Z. If `extract_answer` returns "1" or punctuation, it will be used as-is, producing a misleading `predicted_answer`.

Apply this diff to validate single-character results:

```diff
     if extracted_answer is not None:
         if len(extracted_answer) == 1:
-            parsed_letter = extracted_answer
+            if extracted_answer.upper() in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
+                parsed_letter = extracted_answer.upper()
         elif len(extracted_answer) > 1:
             # try to extract the letter from extracted answer, useful to match <A>, {A}, *A*, etc.
             match = re.findall(r"\b[A-Z]\b(?!.*\b[A-Z]\b)", extracted_answer, re.DOTALL)
             if len(match) > 0:
                 parsed_letter = match[-1].strip()
```
58-60: Consider DEBUG level for per-sample logging.

Logging extraction details at INFO level for every sample can produce excessive output in production. DEBUG level would be more appropriate for detailed per-sample diagnostics.

Apply this diff:

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
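The effect of the suggested change can be shown with stdlib logging: at a typical INFO threshold, `debug()` calls are filtered out while `info()` calls pass. This is a demo only, not the module's actual logger setup:

```python
import logging

log = logging.getLogger("mcq_demo")
log.setLevel(logging.INFO)  # a typical production threshold
log.propagate = False

emitted = []

class ListHandler(logging.Handler):
    """Collects the level names of records that survive the logger's filter."""
    def emit(self, record):
        emitted.append(record.levelname)

log.addHandler(ListHandler())

log.debug("Final parsed letter: ...")  # per-sample detail: dropped at INFO level
log.info("evaluation summary")         # high-level message: emitted
```

Only the INFO record reaches the handler, which is exactly the verbosity reduction the nitpick is after.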
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  unroll_files (39-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/evaluation/evaluator/mcq.py (4)
22-32: Well-structured configuration class.

The addition of `MCQEvaluatorConfig` with the `@nested_dataclass` decorator follows established patterns and provides clean defaults for extraction behavior. The keyword-only argument enforcement enhances API clarity.
69-71: Clean per-sample override pattern.

The fallback mechanism from per-sample values to `eval_config` defaults provides good backward compatibility while enabling centralized configuration. Well done.
38-76: Excellent refactoring with proper bug fixes.

The multi-stage extraction logic is well-structured and all previously identified bugs have been properly addressed:
- Single-character answers are now correctly processed
- Case normalization ensures consistent comparisons
- Regex patterns use A-Z consistently
The addition of `MCQEvaluatorConfig` and per-sample overrides significantly improves configurability while maintaining backward compatibility.
35-36: No issues found. The code is safe as implemented.

The review comment assumes `cfg.eval_config` could be `None` or missing, but `cfg.eval_config` is defined with `field(default_factory=dict)`, which guarantees it always initializes to an empty dict rather than `None`. `MCQEvaluatorConfig` has all fields with default values, so unpacking an empty dict works correctly. This pattern is intentional and consistent across all other evaluators in the codebase (ruler.py, ojbench.py, bfcl.py, livecodebench.py, scicode.py, ioi.py), as the code comment accurately states.

Likely an incorrect or invalid review comment.
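The `default_factory` argument can be verified with the stdlib `dataclasses` module standing in for the repo's `@nested_dataclass` (class and field names here are assumed from the discussion, not copied from the source):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MCQEvaluatorConfigSketch:
    # Every field has a default, so constructing from an empty dict works.
    extract_from_boxed: bool = True
    extract_regex: Optional[str] = None


@dataclass
class EvalCfgSketch:
    # default_factory=dict guarantees an empty dict, never None.
    eval_config: dict = field(default_factory=dict)


cfg = EvalCfgSketch()
config = MCQEvaluatorConfigSketch(**cfg.eval_config)  # safe: unpacks {}
```

Because `eval_config` is always at least `{}`, the `**`-unpacking never raises, and the config class simply falls back to its own defaults.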
Summary by CodeRabbit
New Features
Refactor