
Add MMMLU benchmark #1281

Merged
Kipok merged 20 commits into main from dkaramyan/mmmlu
Mar 9, 2026

Conversation


@naymaraq naymaraq commented Feb 27, 2026

PR for the MMMLU benchmark, a multilingual version of MMLU. Also added an option in the MCQ evaluation to accept multiple regex patterns. This is important for capturing the model’s answers, which may be provided in different languages.

Summary by CodeRabbit

  • New Features
    • Multilingual MMLU support: automatic download, parsing, and formatting tools for multiple languages with per-language test outputs.
    • Dataset preparation CLI: generate JSONL test datasets for chosen languages, optionally include English.
    • Improved MCQ evaluation: robust extraction and normalization of answers (Arabic, Bengali, Japanese) and support for multiple regex patterns.
    • Added public constants for multichoice metrics and default generation arguments.

naymaraq added 9 commits February 9, 2026 10:33
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@naymaraq changed the title from Dkaramyan/mmmlu to Add MMMLU benchmark Feb 27, 2026

coderabbitai bot commented Feb 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a multilingual MMLU dataset package (utils + prepare script) that downloads and formats per-language CSVs into JSONL, exposes dataset metadata constants, and updates the MCQ evaluator to accept multiple extraction regexes and normalize multilingual answer letters.

Changes

Cohort / File(s) | Summary

- MMMLU Package Init (nemo_skills/dataset/mmmlu/__init__.py): Adds public constants METRICS_TYPE and GENERATION_ARGS (license header added).
- MMMLU Utilities (nemo_skills/dataset/mmmlu/mmmlu_utils.py): New module; adds SUPPORTED_LANGUAGES, subject2category, prompt/regex templates, a Schema class, and the functions download_mmmlu_datasets(), format_multichoice_question(), and get_mcq_fields() to download and format per-language MMLU CSVs.
- Dataset Preparation (nemo_skills/dataset/mmmlu/prepare.py): New script; format_entry() and main() validate languages, call the download utility, and transform entries into a test JSONL. CLI flags: --languages and --include_english.
- MCQ Evaluator Changes (nemo_skills/evaluation/evaluator/mcq.py): Adds normalize_extracted_answer(); expands MCQEvaluatorConfig.extract_regex to `str | list[str]` and normalizes multilingual answer letters.

Sequence Diagram

sequenceDiagram
    participant User as User/Script
    participant Prepare as prepare.py
    participant Utils as mmmlu_utils.py
    participant Remote as Remote Data Source
    participant Disk as Local Filesystem
    participant Eval as mcq.py Evaluator

    User->>Prepare: run prepare.py (languages, include_english?)
    Prepare->>Utils: download_mmmlu_datasets(languages)
    Utils->>Remote: fetch per-language CSVs
    Remote-->>Utils: return CSV data
    Utils->>Disk: save CSVs (skip if exists)
    Utils-->>Prepare: return dict(language -> entries)
    Prepare->>Utils: get_mcq_fields(entry) / format_multichoice_question(entry)
    Utils-->>Prepare: question/options fields
    Prepare->>Disk: write formatted entries to test.jsonl

    User->>Eval: submit model outputs + test.jsonl
    Eval->>Eval: extract_letter(text, extract_regex list)
    Eval->>Eval: normalize_extracted_answer(match)
    Eval-->>User: evaluation results
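The prepare-side flow in the diagram can be sketched as a tiny self-contained example. The Question/Answer/A-D field names follow the Schema constants this PR introduces; format_multichoice_question here is a simplification, not the module's actual implementation:

```python
import json

# Option letters as in the PR's Schema.OPTIONS
OPTIONS = ("A", "B", "C", "D")

def format_multichoice_question(entry: dict) -> str:
    # Render the question followed by the lettered options, one per line
    lines = [entry["Question"]]
    lines += [f"{letter}. {entry[letter]}" for letter in OPTIONS]
    return "\n".join(lines)

# A made-up entry in the CSV schema ("Question", "A".."D", "Answer")
entry = {"Question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "Answer": "B"}
record = {"question": format_multichoice_question(entry), "expected_answer": entry["Answer"]}
# One JSONL line; ensure_ascii=False preserves non-Latin scripts verbatim
print(json.dumps(record, ensure_ascii=False))
```

Writing with ensure_ascii=False matters for this benchmark: the multilingual question text stays human-readable in test.jsonl instead of being escaped to \uXXXX sequences.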

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Fgalko/mcq regex #944: Modifies MCQ extraction and regex handling in nemo_skills/evaluation/evaluator/mcq.py, directly related to evaluator changes here.

Suggested reviewers

  • shuoyangd
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

- Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

- Description check ✅ Passed: Check skipped since CodeRabbit's high-level summary is enabled.
- Title check ✅ Passed: The title "Add MMMLU benchmark" directly and clearly summarizes the main change: adding a new multilingual MMLU benchmark to the codebase.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
nemo_skills/evaluation/evaluator/mcq.py (2)

27-45: Consider adding a # noqa: RUF001 comment to document intent.

The static analysis warnings about "ambiguous" fullwidth characters are false positives here: these fullwidth Japanese characters (Ａ, Ｂ, Ｃ, Ｄ) are intentionally matched for normalization. Adding inline comments would silence the linter and document the intent for future maintainers.

📝 Suggested inline comments
         # In Japanese these are the letters sometimes used for A-D in multiple choice questions
-        .replace("Ａ", " A")
-        .replace("Ｂ", " B")
-        .replace("Ｃ", " C")
-        .replace("Ｄ", " D")
+        .replace("Ａ", " A")  # noqa: RUF001 (fullwidth A)
+        .replace("Ｂ", " B")  # noqa: RUF001 (fullwidth B)
+        .replace("Ｃ", " C")  # noqa: RUF001 (fullwidth C)
+        .replace("Ｄ", " D")  # noqa: RUF001 (fullwidth D)

100-103: Consider using LOG.debug for per-sample logging.

This log statement fires for every sample in the dataset. For large benchmarks like MMMLU with thousands of entries, this will produce excessive log output at INFO level.

📝 Proposed fix
-        LOG.info(
+        LOG.debug(
             f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, "
             f"extract_regex: {extract_regex}, extracted_answer: {extracted_answer}, relaxed: {relaxed}"
         )
nemo_skills/dataset/mmmlu/mmmlu_utils.py (2)

160-165: Mutable class attribute should be a tuple or ClassVar.

Schema.OPTIONS is a mutable list at class level. While it's used as a constant here, using a tuple would prevent accidental mutation and silence the linter warning.

📝 Proposed fix
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, ...] = ("A", "B", "C", "D")

178-180: Redundant existence check after download.

urllib.request.urlretrieve raises an exception (e.g., URLError) on download failure rather than silently failing. If it returns successfully, the file is guaranteed to exist. This check is unreachable in failure scenarios.

📝 Proposed simplification
             url = OPENAI_PUBLIC_URL.format(suffix)
             urllib.request.urlretrieve(url, download_dst_path)
-            if not os.path.exists(download_dst_path):
-                raise RuntimeError(f"Failed to download {suffix}")

The download error will propagate naturally with a descriptive URLError.
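A minimal sketch of the resulting download helper under that simplification (the function and variable names are placeholders, not the module's actual identifiers):

```python
import os
import urllib.request

def download_if_missing(url: str, dst_path: str) -> None:
    # Skip files that are already on disk, matching the
    # "save CSVs (skip if exists)" step in the walkthrough
    if os.path.exists(dst_path):
        return
    # urlretrieve raises URLError/HTTPError on failure, so no
    # post-download existence check is needed
    urllib.request.urlretrieve(url, dst_path)
```

Letting the URLError propagate gives the caller the failing URL and the underlying OS error, which is more diagnostic than a generic RuntimeError.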


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b38cce and 4e52d28.

📒 Files selected for processing (4)
  • nemo_skills/dataset/mmmlu/__init__.py
  • nemo_skills/dataset/mmmlu/mmmlu_utils.py
  • nemo_skills/dataset/mmmlu/prepare.py
  • nemo_skills/evaluation/evaluator/mcq.py

Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e52d28 and f2e1a7f.

📒 Files selected for processing (1)
  • nemo_skills/dataset/mmmlu/prepare.py


def format_entry(entry: dict, language: str) -> dict:
    expected_answer = entry[Schema.ANSWER]
    category = subject2category.get(entry[Schema.SUBJECT], "other")

⚠️ Potential issue | 🟠 Major

Avoid silent fallback for missing subject-category mappings.

Line 31 silently maps unknown subjects to "other", which can hide dataset/schema drift and corrupt category metrics.

💡 Proposed fix
-    category = subject2category.get(entry[Schema.SUBJECT], "other")
+    subject = entry[Schema.SUBJECT]
+    category = subject2category[subject]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".


Comment on lines +53 to +57
    languages = [lang for lang in args.languages if lang != "EN-US"]
    valid_languages = set(SUPPORTED_LANGUAGES)
    if args.include_english:
        valid_languages.add("EN-US")
        languages.append("EN-US")

⚠️ Potential issue | 🟠 Major

Do not silently ignore explicit EN-US input.

Line 53 removes EN-US from user input when --include_english is not set, so a user-passed language can be dropped without error.

💡 Proposed fix
 def main(args):
-    languages = [lang for lang in args.languages if lang != "EN-US"]
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english.")
+
+    languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."


Comment on lines +66 to +71
    with open(output_file, "wt", encoding="utf-8") as fout:
        for language, examples in datasets.items():
            for entry in examples:
                entry = format_entry(entry=entry, language=language)
                json.dump(entry, fout, ensure_ascii=False)
                fout.write("\n")

⚠️ Potential issue | 🟠 Major

Compute all entries before truncating test.jsonl.

Lines 66-71 perform formatting while writing to a truncate-opened file. If formatting fails mid-run, existing output is lost and only a partial file remains.

💡 Proposed fix
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for language, examples in datasets.items():
-            for entry in examples:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+    formatted_entries = []
+    for language, examples in datasets.items():
+        for entry in examples:
+            formatted_entries.append(format_entry(entry=entry, language=language))
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in formatted_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".


Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
nemo_skills/dataset/mmmlu/prepare.py (3)

67-72: ⚠️ Potential issue | 🟠 Major

Compute all entries before writing to avoid data loss on failure.

If format_entry fails mid-iteration, the output file will be truncated with only partial data. Collect all formatted entries first, then write them.

🐛 Proposed fix
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
+
+    formatted_entries = []
+    for language, examples in datasets.items():
+        for entry in examples:
+            formatted_entries.append(format_entry(entry=entry, language=language))
+
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language, examples in datasets.items():
-            for entry in examples:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in formatted_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".


32-32: ⚠️ Potential issue | 🟠 Major

Use direct dictionary access instead of .get() with fallback.

This line silently maps unknown subjects to "other", which could mask dataset schema drift or data issues. If a subject is not in subject2category, the code should fail explicitly.

🐛 Proposed fix
-    category = subject2category.get(entry[Schema.SUBJECT], "other")
+    category = subject2category[entry[Schema.SUBJECT]]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".


54-58: ⚠️ Potential issue | 🟠 Major

Silently dropping EN-US from user input violates coding guidelines.

If a user explicitly passes EN-US in --languages without --include_english, it's silently removed. The code should either error or handle it explicitly.

🐛 Proposed fix
 def main(args):
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english flag.")
+
     languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

🧹 Nitpick comments (2)
nemo_skills/dataset/mmmlu/mmmlu_utils.py (2)

110-112: Consider adding a comment clarifying intentional fullwidth characters.

The static analysis tool flags the fullwidth Unicode characters (Ａ, Ｂ, Ｃ, Ｄ) as ambiguous, but these are intentionally included to match answer letters in CJK text where fullwidth characters are common. A brief inline comment would suppress future confusion.

📝 Suggested comment
 MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
+    # Note: Fullwidth letters (Ａ-Ｄ) and non-Latin scripts are intentional for multilingual matching
     "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[Ａ]|[Ｂ]|[Ｃ]|[Ｄ])"
 )
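To illustrate how a template like the one above gets used (the prefix string and example sentence here are made up; only the pattern shape follows the snippet, in a simplified form):

```python
import re

# Simplified version of the multilingual answer-pattern template: a
# formatted-in answer prefix, optional spacing, then one letter drawn from
# ASCII A-D or the fullwidth Ａ-Ｄ range. The real template covers more scripts.
TEMPLATE = r"(?i){}[ \t]*([A-DＡ-Ｄ])"

pattern = TEMPLATE.format(re.escape("Answer:"))
match = re.search(pattern, "The final Answer: Ｂ")
print(match.group(1))
```

The (?i) inline flag sits at the start of the formatted pattern, so the prefix matches case-insensitively; the captured letter would then go through normalization to collapse fullwidth forms to ASCII.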

160-164: Consider using a tuple for immutable OPTIONS.

OPTIONS is a mutable list as a class attribute. While it's only used for reading, using a tuple would prevent accidental mutation and silence the static analysis warning.

♻️ Suggested change
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, ...] = ("A", "B", "C", "D")

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2e1a7f and 23bae3e.

📒 Files selected for processing (2)
  • nemo_skills/dataset/mmmlu/mmmlu_utils.py
  • nemo_skills/dataset/mmmlu/prepare.py

Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
nemo_skills/dataset/mmmlu/prepare.py (3)

64-69: ⚠️ Potential issue | 🟠 Major

Compute all entries before opening test.jsonl in truncate mode.

If formatting fails mid-loop, existing output is lost and only partial data remains.

💡 Proposed fix
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for language, examples in datasets.items():
-            for entry in examples:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+    formatted_entries = []
+    for language, examples in datasets.items():
+        for entry in examples:
+            formatted_entries.append(format_entry(entry=entry, language=language))
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in formatted_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

Based on learnings: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 64 - 69, Compute and
collect all formatted entries before opening output_file for writing: iterate
datasets and call format_entry(entry, language) to build a list (e.g.,
formatted_entries) and only after that open output_file and write each
precomputed JSON line; this avoids truncating test.jsonl early and losing data
if formatting raises an exception. Ensure you reference the existing
variables/functions: datasets, format_entry, entry, language, and output_file
when implementing the change.

32-32: ⚠️ Potential issue | 🟠 Major

Do not silently fallback unknown subjects to "other".

This masks missing mappings and can skew category metrics. Fail fast instead.

💡 Proposed fix
-    category = subject2category.get(entry[Schema.SUBJECT], "other")
+    category = subject2category[entry[Schema.SUBJECT]]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/prepare.py` at line 32, The line using
subject2category.get to default unknown subjects to "other" should be changed to
fail fast: replace the silent fallback with explicit lookup of subject =
entry[Schema.SUBJECT] and then lookup category = subject2category[subject], or
if you prefer clearer errors, check if subject in subject2category and raise a
descriptive exception (including subject and entry identifiers) when missing;
update references to subject2category, entry and Schema.SUBJECT to locate and
fix the code path.

51-55: ⚠️ Potential issue | 🟠 Major

Don’t silently ignore user-provided EN-US.

If EN-US is passed without --include_english, the current logic drops it silently.

💡 Proposed fix
 def main(args):
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english.")
+
     languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 51 - 55, The code
currently drops "EN-US" from args.languages when args.include_english is False,
silently ignoring a user-provided value; change the logic in prepare.py to
validate instead: do not filter out "EN-US" in the languages comprehension, then
if "EN-US" is present in args.languages and args.include_english is False raise
a clear ValueError (or argparse.ArgumentError) explaining that EN-US requires
--include_english; ensure valid_languages (based on SUPPORTED_LANGUAGES) still
adds "EN-US" only when args.include_english is True and use these validated
lists (languages, valid_languages) thereafter (refer to symbols: args.languages,
args.include_english, languages, valid_languages, SUPPORTED_LANGUAGES).
🧹 Nitpick comments (1)
nemo_skills/dataset/mmmlu/mmmlu_utils.py (1)

162-162: Prefer immutable option constants.

Schema.OPTIONS is a mutable class attribute; using a tuple avoids accidental mutation.

♻️ Proposed refactor
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, str, str, str] = ("A", "B", "C", "D")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` at line 162, Schema.OPTIONS is
currently a mutable list which can be accidentally modified; change it to an
immutable tuple by replacing the list literal with a tuple literal and update
its type annotation to tuple[str, ...] (or typing.Tuple[str, ...] if needed) so
OPTIONS becomes an immutable constant; locate the OPTIONS definition in the
Schema (symbol: OPTIONS) and make this simple type and value change.
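The hazard behind this nitpick is easy to reproduce. A standalone sketch (class names are illustrative, not from the codebase):

```python
class SchemaList:
    OPTIONS = ["A", "B", "C", "D"]

class SchemaTuple:
    OPTIONS = ("A", "B", "C", "D")

# A list-valued class constant can be mutated in place from anywhere
# that imports it, silently changing it for all other users.
SchemaList.OPTIONS.append("E")
assert SchemaList.OPTIONS == ["A", "B", "C", "D", "E"]

# A tuple rejects in-place mutation outright.
mutated = True
try:
    SchemaTuple.OPTIONS.append("E")
except AttributeError:
    mutated = False
assert not mutated
assert SchemaTuple.OPTIONS == ("A", "B", "C", "D")
```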
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py`:
- Line 110: MULTILINGUAL_ANSWER_PATTERN_TEMPLATE currently uses the range [أ-د]
which over-matches Arabic characters; update the regex to list the exact Arabic
option letters instead of a range (e.g. replace [أ-د] with [أبجد]) so only the
intended letters map to A-D; ensure the final pattern for
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE explicitly enumerates each script's option
letters (e.g., [A-D]|[أبجد]|[অবডঢ]|[ABCD]) rather than using character ranges.

In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Around line 36-38: LETTER_REGEX uses the Arabic range `[أ-د]` which matches
unintended characters; change it to list the four Arabic option letters
explicitly (e.g. use `[أبجد]` or include them as alternatives) inside
LETTER_REGEX so only the intended choices are matched, then keep GREEDY_REGEX
and regexes.append(GREEDY_REGEX) as-is; update the LETTER_REGEX definition (the
symbol name is LETTER_REGEX) to use explicit Arabic letters instead of the
range.

In `@nemo_skills/evaluation/evaluator/mcq.py`:
- Around line 28-46: The current normalize_extracted_answer function performs
blind global .replace() calls on extracted_answer which can turn characters
inside longer answers into synthetic " A/B/C/D" tokens; change it to only map
standalone choice markers or markers at the start/end of the answer (e.g., when
the entire string equals an Arabic/Bengali/Japanese marker, when the marker is
surrounded by whitespace/punctuation, or when it appears as the first token) —
update normalize_extracted_answer to use targeted matching (regex or token-based
checks) on extracted_answer so only isolated choice markers are converted to
"A"/"B"/"C"/"D" and embedded characters in longer answers are left unchanged.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23bae3e and 13fb375.

📒 Files selected for processing (4)
  • nemo_skills/dataset/mmmlu/__init__.py
  • nemo_skills/dataset/mmmlu/mmmlu_utils.py
  • nemo_skills/dataset/mmmlu/prepare.py
  • nemo_skills/evaluation/evaluator/mcq.py

D) {D}
""".strip()

MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"

⚠️ Potential issue | 🟠 Major

[أ-د] over-matches unintended Arabic letters.

This is a range, not just option letters. It can capture non-option characters and produce wrong labels.

💡 Proposed fix
-MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
+MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أبجد]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 110-110 (RUF001): String contains ambiguous fullwidth Latin capital letters A, B, C, D. Did you mean the ASCII letters A-D?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` at line 110,
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE currently uses the range [أ-د] which
over-matches Arabic characters; update the regex to list the exact Arabic option
letters instead of a range (e.g. replace [أ-د] with [أبجد]) so only the intended
letters map to A-D; ensure the final pattern for
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE explicitly enumerates each script's option
letters (e.g., [A-D]|[أبجد]|[অবডঢ]|[ABCD]) rather than using character ranges.
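A quick way to see the over-match described above (a standalone sketch, not code from the PR):

```python
import re

# [أ-د] is a Unicode codepoint range (U+0623..U+062F), not just the four
# option letters, so it also matches letters like ت (U+062A) and ح (U+062D).
range_class = re.compile(r"[أ-د]")
explicit_class = re.compile(r"[أبجد]")

assert range_class.search("ت") is not None     # non-option letter matches the range
assert explicit_class.search("ت") is None      # explicit class rejects it
assert explicit_class.search("ب") is not None  # the real option letters still match
```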

Comment on lines +36 to +38
LETTER_REGEX = r"\b\(?\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
GREEDY_REGEX = r"[\s\S]*" + LETTER_REGEX
regexes.append(GREEDY_REGEX) # Matches the last A/B/C/D letter in the response

⚠️ Potential issue | 🟠 Major

Use explicit Arabic option letters instead of [أ-د] range.

[أ-د] matches more than the intended four option letters and can mis-extract answers.

💡 Proposed fix
-    LETTER_REGEX = r"\b\(?\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
+    LETTER_REGEX = r"\b\(?\s*([A-D]|[أبجد]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 36-36 (RUF001): String contains ambiguous fullwidth Latin capital letters A, B, C, D. Did you mean the ASCII letters A-D?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 36 - 38, LETTER_REGEX uses
the Arabic range `[أ-د]` which matches unintended characters; change it to list
the four Arabic option letters explicitly (e.g. use `[أبجد]` or include them as
alternatives) inside LETTER_REGEX so only the intended choices are matched, then
keep GREEDY_REGEX and regexes.append(GREEDY_REGEX) as-is; update the
LETTER_REGEX definition (the symbol name is LETTER_REGEX) to use explicit Arabic
letters instead of the range.

Comment on lines +28 to +46
def normalize_extracted_answer(extracted_answer: str) -> str:
return (
# In arabic these are the letters used for A-D in multiple choice questions
extracted_answer.replace("أ", " A")
.replace("ب", " B")
.replace("ج", " C")
.replace("د", " D")
# In Bengali these are the letters used for A-D in multiple choice questions
.replace("অ", " A")
.replace("ব", " B")
.replace("ড", " C")
.replace("ঢ", " D")
# In Japanese these are the letters sometimes used for A-D in multiple choice questions
.replace("A", " A")
.replace("B", " B")
.replace("C", " C")
.replace("D", " D")
.strip()
)

⚠️ Potential issue | 🟠 Major

Avoid global character replacement across the full extracted text.

This replaces Arabic/Bengali letters everywhere in extracted_answer, so longer answers can gain synthetic A/B/C/D tokens and be misgraded.

💡 Proposed fix
+OPTION_CHAR_MAP = {
+    "A": "A",
+    "B": "B",
+    "C": "C",
+    "D": "D",
+    "أ": "A",
+    "ب": "B",
+    "ج": "C",
+    "د": "D",
+    "অ": "A",
+    "ব": "B",
+    "ড": "C",
+    "ঢ": "D",
+    "A": "A",
+    "B": "B",
+    "C": "C",
+    "D": "D",
+}
+
 def normalize_extracted_answer(extracted_answer: str) -> str:
-    return (
-        # In arabic these are the letters used for A-D in multiple choice questions
-        extracted_answer.replace("أ", " A")
-        .replace("ب", " B")
-        .replace("ج", " C")
-        .replace("د", " D")
-        # In Bengali these are the letters used for A-D in multiple choice questions
-        .replace("অ", " A")
-        .replace("ব", " B")
-        .replace("ড", " C")
-        .replace("ঢ", " D")
-        # In Japanese these are the letters sometimes used for A-D in multiple choice questions
-        .replace("A", " A")
-        .replace("B", " B")
-        .replace("C", " C")
-        .replace("D", " D")
-        .strip()
-    )
+    normalized = extracted_answer.strip()
+    match = re.fullmatch(r"[\s\(\[\{<\*_]*([A-DأبجدঅবডঢABCD])[\s\)\]\}>.\*_]*", normalized)
+    if not match:
+        return normalized
+    return OPTION_CHAR_MAP[match.group(1)]
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 41-44 (RUF001): Strings contain ambiguous fullwidth Latin capital letters A, B, C, D. Did you mean the ASCII letters A-D?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/mcq.py` around lines 28 - 46, The current
normalize_extracted_answer function performs blind global .replace() calls on
extracted_answer which can turn characters inside longer answers into synthetic
" A/B/C/D" tokens; change it to only map standalone choice markers or markers at
the start/end of the answer (e.g., when the entire string equals an
Arabic/Bengali/Japanese marker, when the marker is surrounded by
whitespace/punctuation, or when it appears as the first token) — update
normalize_extracted_answer to use targeted matching (regex or token-based
checks) on extracted_answer so only isolated choice markers are converted to
"A"/"B"/"C"/"D" and embedded characters in longer answers are left unchanged.
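The misgrading risk is easy to reproduce with a single mapping. `normalize_one_letter` below is a stripped-down stand-in for the chained replacements, not the actual function:

```python
# A global .replace() rewrites option characters even when they occur
# inside longer words, so a free-form answer can gain a synthetic "B" token.
def normalize_one_letter(extracted_answer: str) -> str:
    return extracted_answer.replace("ব", " B").strip()

# "বাংলা" ("Bengali") starts with ব but is not a choice marker here.
result = normalize_one_letter("বাংলা")
assert result.startswith("B")  # the word now reads as if option B was chosen
```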

@gwarmstrong gwarmstrong enabled auto-merge (squash) March 3, 2026 21:57
@gwarmstrong gwarmstrong disabled auto-merge March 3, 2026 21:58
@gwarmstrong
Collaborator

@AlexGrinch @shuoyangd let me know if the content looks good to you and we can merge

Collaborator

@Kipok Kipok left a comment


lgtm, just a small comment. Let's run GPU tests.

@shuoyangd @AlexGrinch please have a look if you want; if not, we will merge as is, you can still review later, and we can fix any issues in a follow-up PR.

Kipok added 2 commits March 6, 2026 11:18
Signed-off-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
@Kipok Kipok merged commit f6f7041 into main Mar 9, 2026
6 checks passed
@Kipok Kipok deleted the dkaramyan/mmmlu branch March 9, 2026 22:40
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: naymaraq <dkaramyan@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: naymaraq <dkaramyan@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>