Conversation
…r MCQ
Signed-off-by: fgalko <fgalko@nvidia.com>
Walkthrough

Adds …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Runner as eval_mcq
    participant Config as MCQEvaluatorConfig
    participant Sample as Sample Item
    participant Extract as Extraction Logic
    participant Logger as LOG
    Runner->>Config: Instantiate from cfg.eval_config
    loop For each sample
        Runner->>Sample: Read per-sample overrides (extract_from_boxed, extract_regex)
        Runner->>Extract: Attempt boxed extraction
        alt boxed yields single answer
            Extract-->>Runner: extracted_answer
            Runner->>Extract: set parsed_letter = extracted_answer
        else boxed absent/ambiguous
            Runner->>Extract: try regex extraction (extract_regex)
            alt regex matches
                Extract-->>Runner: parsed_letter
            else no regex match
                Runner->>Extract: try textual fallback ("Answer: X")
                Extract-->>Runner: parsed_letter or None
            end
        end
        Runner->>Logger: info(parsed_letter, extract_from_boxed, extract_regex, extracted_answer)
        Runner->>Sample: set sample["predicted_answer"]
        Runner->>Sample: compute & set sample["symbolic_correct"]
    end
```
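The flow in the diagram can be read as a few lines of straight-line Python. The sketch below is illustrative only: function and field names mirror the diagram, not the actual `mcq.py` implementation, and the boxed-extraction step is a simplified stand-in for the repo's `extract_answer` helper.

```python
import re


def parse_mcq_letter(text, extract_from_boxed=True, extract_regex=None):
    """Simplified sketch of the multi-stage extraction shown in the diagram."""
    parsed_letter = None

    # Stage 1: boxed extraction (stand-in for the real extract_answer helper).
    if extract_from_boxed:
        boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
        if boxed and len(boxed[-1].strip()) == 1:
            parsed_letter = boxed[-1].strip().upper()

    # Stage 2: caller-supplied regex, if any.
    if parsed_letter is None and extract_regex:
        match = re.findall(extract_regex, text)
        if match:
            parsed_letter = match[-1].strip().upper()

    # Stage 3: textual fallback on "Answer: X".
    if parsed_letter is None:
        match = re.findall(r"(?i)Answer\s*:\s*([A-Z])(?![a-zA-Z0-9])", text)
        if match:
            parsed_letter = match[-1].strip().upper()

    return parsed_letter
```

A result of `None` corresponds to the "parsed_letter or None" branch, where no stage produced a usable letter.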
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)
54-56: Consider using LOG.debug for detailed extraction information.

Logging every sample's extraction details at `INFO` level could be verbose in production. Consider `LOG.debug` unless detailed logging is required for all runs.

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/evaluation/math_grader.py (1)
  extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)
27-32: LGTM! Well-structured config class.

The `MCQEvaluatorConfig` dataclass is properly defined with sensible defaults and clear field purposes. The use of `@nested_dataclass` follows the pattern from `nemo_skills.utils`.
34-37: LGTM! Config instantiation follows established patterns.

The initialization of `eval_config` from `cfg.eval_config` is consistent with other evaluators in the codebase.
65-72: LGTM! Per-sample overrides and evaluation logic are well-implemented.

The per-sample override mechanism (lines 66-67) provides backward compatibility, and the evaluation flow correctly stores both `predicted_answer` and `symbolic_correct`.
@ekmb could you please help to review this?
shuoyangd
left a comment
Overall LGTM. Just two nit comments and we should be able to merge.
For clarity: this comment at L46 looks like it flags a legit bug that needs to be fixed. Good job coderabbit.
Signed-off-by: fgalko <fgalko@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)
53-56: Normalize case in fallback regex to ensure consistent comparisons.

The fallback regex uses `(?i)` for case-insensitive matching but doesn't uppercase the captured letter. If the text contains "answer: a" (lowercase), it captures "a", which won't match the uppercase `expected_answer` on line 75.

This was flagged in a previous review but remains unaddressed.

Apply this diff to normalize the case:

```diff
     if parsed_letter is None:
         match = re.findall(r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])", text)
         if match:
-            parsed_letter = match[-1].strip()
+            parsed_letter = match[-1].strip().upper()
```
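To see the failure mode concretely, here is the quoted fallback pattern run against a lowercase completion (standalone snippet; the `expected_answer` comparison is implied, not shown):

```python
import re

# The fallback pattern quoted in the review comment.
pattern = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"

text = "The final answer: a"
match = re.findall(pattern, text)

raw = match[-1].strip()                 # "a" -- fails an equality check against "A"
normalized = match[-1].strip().upper()  # "A" -- matches an uppercase expected_answer
```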
44-45: Validate single-character extracted answers are valid MCQ letters.

When `extracted_answer` has length 1, it's used directly without checking if it's a valid letter A-J. If `extract_answer` returns "X", "1", or other single characters, they'll be incorrectly accepted as valid MCQ answers.

Apply this diff to add validation:

```diff
     if extracted_answer is not None:
         if len(extracted_answer) == 1:
-            parsed_letter = extracted_answer
+            if extracted_answer.upper() in 'ABCDEFGHIJ':
+                parsed_letter = extracted_answer.upper()
         elif len(extracted_answer) > 1:
```
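A minimal sketch of the suggested guard, pulled out as a standalone function (names are illustrative; the A-J range is the reviewer's suggestion for a ten-option MCQ):

```python
def accept_single_char(extracted_answer):
    """Accept a one-character extraction only if it is a plausible MCQ letter."""
    if extracted_answer is not None and len(extracted_answer) == 1:
        if extracted_answer.upper() in "ABCDEFGHIJ":
            return extracted_answer.upper()
    # "X", "1", punctuation, etc. are rejected rather than accepted blindly.
    return None
```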
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/mcq.py (1)
58-60: Consider using LOG.debug for per-sample extraction details.

Logging extraction details for every sample at INFO level can be very verbose in production, especially for large evaluation runs. Consider using `LOG.debug()` instead to keep logs cleaner while still allowing detailed diagnostics when needed.

Apply this diff:

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  unroll_files (39-42)
nemo_skills/evaluation/math_grader.py (2)
  unroll_files (27-30)
  extract_answer (139-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/evaluation/evaluator/mcq.py (3)
27-32: LGTM! Clean configuration dataclass.

The `MCQEvaluatorConfig` follows the established pattern and uses `@nested_dataclass` correctly with keyword-only arguments.
34-36: LGTM! Config initialization follows best practices.

The initialization pattern is consistent with other evaluators in the codebase.
69-76: Per-sample override logic is well-implemented.

The implementation correctly prioritizes per-sample values over config defaults, maintaining backward compatibility. The comparison on line 75 assumes case-normalized letters, so fixing the case normalization issue in the fallback regex (lines 53-56) is important for correctness here.
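The override pattern being praised amounts to a two-line fallback. A hypothetical sketch (field names match the discussion; the config object is simplified to a plain dict):

```python
def resolve_extraction_options(sample, eval_config):
    """Per-sample values take priority; otherwise fall back to evaluator-wide defaults."""
    extract_from_boxed = sample.get("extract_from_boxed", eval_config["extract_from_boxed"])
    extract_regex = sample.get("extract_regex", eval_config["extract_regex"])
    return extract_from_boxed, extract_regex
```

Samples that carry no override keys behave exactly as before, which is where the backward compatibility comes from.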
Signed-off-by: fgalko <fgalko@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/mcq.py (2)
43-50: Consider validating single-character extraction results.

When `extracted_answer` is a single character, the code uses it directly without verifying it's a valid letter A-Z. If `extract_answer` returns "1" or punctuation, it will be used as-is, producing a misleading `predicted_answer`.

Apply this diff to validate single-character results:

```diff
     if extracted_answer is not None:
         if len(extracted_answer) == 1:
-            parsed_letter = extracted_answer
+            if extracted_answer.upper() in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
+                parsed_letter = extracted_answer.upper()
         elif len(extracted_answer) > 1:
             # try to extract the letter from extracted answer, useful to match <A>, {A}, *A*, etc.
             match = re.findall(r"\b[A-Z]\b(?!.*\b[A-Z]\b)", extracted_answer, re.DOTALL)
             if len(match) > 0:
                 parsed_letter = match[-1].strip()
```
58-60: Consider DEBUG level for per-sample logging.

Logging extraction details at INFO level for every sample can produce excessive output in production. DEBUG level would be more appropriate for detailed per-sample diagnostics.

Apply this diff:

```diff
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, extract_regex: {extract_regex}, extracted_answer: {extracted_answer}"
         )
```
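The effect of the suggested change can be shown with stdlib logging: at a typical INFO threshold, `debug()` calls are filtered out while `info()` calls pass. This is a demo only, not the module's actual logger setup:

```python
import logging

log = logging.getLogger("mcq_demo")
log.setLevel(logging.INFO)  # a typical production threshold
log.propagate = False

emitted = []

class ListHandler(logging.Handler):
    """Collects the level names of records that survive the logger's filter."""
    def emit(self, record):
        emitted.append(record.levelname)

log.addHandler(ListHandler())

log.debug("Final parsed letter: ...")  # per-sample detail: dropped at INFO level
log.info("evaluation summary")         # high-level message: emitted
```

Only the INFO record reaches the handler, which is exactly the verbosity reduction the nitpick is after.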
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/mcq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/mcq.py (2)
nemo_skills/utils.py (2)
  get_logger_name (130-134)
  nested_dataclass (49-82)
nemo_skills/code_execution/sandbox.py (1)
  unroll_files (39-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/evaluation/evaluator/mcq.py (4)
22-32: Well-structured configuration class.

The addition of `MCQEvaluatorConfig` with the `@nested_dataclass` decorator follows established patterns and provides clean defaults for extraction behavior. The keyword-only argument enforcement enhances API clarity.
69-71: Clean per-sample override pattern.

The fallback mechanism from per-sample values to `eval_config` defaults provides good backward compatibility while enabling centralized configuration. Well done.
38-76: Excellent refactoring with proper bug fixes.

The multi-stage extraction logic is well-structured and all previously identified bugs have been properly addressed:
- Single-character answers are now correctly processed
- Case normalization ensures consistent comparisons
- Regex patterns use A-Z consistently
The addition of `MCQEvaluatorConfig` and per-sample overrides significantly improves configurability while maintaining backward compatibility.
35-36: No issues found. The code is safe as implemented.

The review comment assumes `cfg.eval_config` could be `None` or missing, but `cfg.eval_config` is defined with `field(default_factory=dict)`, which guarantees it always initializes to an empty dict rather than `None`. `MCQEvaluatorConfig` has all fields with default values, so unpacking an empty dict works correctly. This pattern is intentional and consistent across all other evaluators in the codebase (ruler.py, ojbench.py, bfcl.py, livecodebench.py, scicode.py, ioi.py), as the code comment accurately states.

Likely an incorrect or invalid review comment.
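The `default_factory` argument can be verified with the stdlib `dataclasses` module standing in for the repo's `@nested_dataclass` (class and field names here are assumed from the discussion, not copied from the source):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MCQEvaluatorConfigSketch:
    # Every field has a default, so constructing from an empty dict works.
    extract_from_boxed: bool = True
    extract_regex: Optional[str] = None


@dataclass
class EvalCfgSketch:
    # default_factory=dict guarantees an empty dict, never None.
    eval_config: dict = field(default_factory=dict)


cfg = EvalCfgSketch()
config = MCQEvaluatorConfigSketch(**cfg.eval_config)  # safe: unpacks {}
```

Because `eval_config` is always at least `{}`, the `**`-unpacking never raises, and the config class simply falls back to its own defaults.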
Summary by CodeRabbit
New Features
Refactor