Conversation
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a multilingual MMLU dataset package (utils + prepare script) that downloads and formats per-language CSVs into JSONL, exposes dataset metadata constants, and updates the MCQ evaluator to accept multiple extraction regexes and normalize multilingual answer letters.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Script
    participant Prepare as prepare.py
    participant Utils as mmmlu_utils.py
    participant Remote as Remote Data Source
    participant Disk as Local Filesystem
    participant Eval as mcq.py Evaluator
    User->>Prepare: run prepare.py (languages, include_english?)
    Prepare->>Utils: download_mmmlu_datasets(languages)
    Utils->>Remote: fetch per-language CSVs
    Remote-->>Utils: return CSV data
    Utils->>Disk: save CSVs (skip if exists)
    Utils-->>Prepare: return dict(language -> entries)
    Prepare->>Utils: get_mcq_fields(entry) / format_multichoice_question(entry)
    Utils-->>Prepare: question/options fields
    Prepare->>Disk: write formatted entries to test.jsonl
    User->>Eval: submit model outputs + test.jsonl
    Eval->>Eval: extract_letter(text, extract_regex list)
    Eval->>Eval: normalize_extracted_answer(match)
    Eval-->>User: evaluation results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_skills/evaluation/evaluator/mcq.py (2)
27-45: Consider adding a `# noqa: RUF001` comment to document intent.

The static analysis warnings about "ambiguous" fullwidth characters are false positives here — these fullwidth Japanese characters (Ａ, Ｂ, Ｃ, Ｄ) are intentionally matched for normalization. Adding inline comments would silence the linter and document the intent for future maintainers.
📝 Suggested inline comments

```diff
 # In Japanese these are the letters sometimes used for A-D in multiple choice questions
-.replace("Ａ", " A")
-.replace("Ｂ", " B")
-.replace("Ｃ", " C")
-.replace("Ｄ", " D")
+.replace("Ａ", " A")  # noqa: RUF001 (fullwidth A)
+.replace("Ｂ", " B")  # noqa: RUF001 (fullwidth B)
+.replace("Ｃ", " C")  # noqa: RUF001 (fullwidth C)
+.replace("Ｄ", " D")  # noqa: RUF001 (fullwidth D)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/mcq.py` around lines 27 - 45, The linter flags the fullwidth Japanese characters in normalize_extracted_answer as ambiguous; add an inline "# noqa: RUF001" comment to the lines (or the expression) handling the fullwidth characters (the .replace calls for "A", "B", "C", "D") to document intent and silence the false positive, and include a short inline comment (e.g., "match fullwidth A-D for normalization") so future maintainers understand why these specific replacements are present.
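For context, here is a minimal runnable sketch of the kind of normalization this comment is about. The function body is assembled from the diff above for illustration; it is not the repository's exact implementation:

```python
import re

def normalize_extracted_answer(extracted: str) -> str:
    """Map fullwidth answer letters to ASCII before extracting a letter (sketch)."""
    normalized = (
        extracted
        .replace("Ａ", " A")  # noqa: RUF001 (fullwidth A)
        .replace("Ｂ", " B")  # noqa: RUF001 (fullwidth B)
        .replace("Ｃ", " C")  # noqa: RUF001 (fullwidth C)
        .replace("Ｄ", " D")  # noqa: RUF001 (fullwidth D)
    )
    # After normalization, a plain ASCII search is enough.
    match = re.search(r"[A-D]", normalized)
    return match.group(0) if match else ""
```

With the `# noqa: RUF001` comments in place, the intentional fullwidth literals no longer trip the linter.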
100-103: Consider using `LOG.debug` for per-sample logging.

This log statement fires for every sample in the dataset. For large benchmarks like MMMLU with thousands of entries, this will produce excessive log output at INFO level.

📝 Proposed fix

```diff
-LOG.info(
+LOG.debug(
     f"Final parsed letter: {parsed_letter}, extract_from_boxed: {extract_from_boxed}, "
     f"extract_regex: {extract_regex}, extracted_answer: {extracted_answer}, relaxed: {relaxed}"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/mcq.py` around lines 100 - 103, The per-sample logging currently uses LOG.info and emits parsed_letter, extract_from_boxed, extract_regex, extracted_answer, and relaxed for every sample; change this call to LOG.debug to avoid spamming INFO for large datasets (in nemo_skills/evaluation/evaluator/mcq.py where the LOG.info(...) call appears) and ensure any higher-level summary logs remain at INFO if needed; keep the same message content and variable references but lower the level to debug so developers can enable verbose output when required.

nemo_skills/dataset/mmmlu/mmmlu_utils.py (2)
160-165: Mutable class attribute should be a tuple or ClassVar.
`Schema.OPTIONS` is a mutable list at class level. While it's used as a constant here, using a tuple would prevent accidental mutation and silence the linter warning.

📝 Proposed fix

```diff
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, ...] = ("A", "B", "C", "D")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` around lines 160 - 165, Schema defines OPTIONS as a mutable class-level list; change it to an immutable tuple (or annotate it as a ClassVar[Tuple[str, ...]]) to prevent accidental mutation and satisfy the linter: update Schema.OPTIONS to be a tuple of strings (or add ClassVar typing and use a tuple) so callers treat it as a constant.
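The benefit of the tuple is easy to demonstrate in isolation (the class below is trimmed to the relevant field, and the `mutation_blocked` flag exists only for this demo):

```python
class Schema:
    # A tuple documents the constant intent and cannot be mutated in place.
    OPTIONS: tuple[str, ...] = ("A", "B", "C", "D")

# Reading works exactly as it did with a list...
letters = list(Schema.OPTIONS)

# ...but accidental in-place mutation now fails fast.
try:
    Schema.OPTIONS.append("E")  # type: ignore[attr-defined]
    mutation_blocked = False
except AttributeError:
    mutation_blocked = True
```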
178-180: Redundant existence check after download.
`urllib.request.urlretrieve` raises an exception (e.g., `URLError`) on download failure rather than silently failing. If it returns successfully, the file is guaranteed to exist. This check is unreachable in failure scenarios.

📝 Proposed simplification

```diff
 url = OPENAI_PUBLIC_URL.format(suffix)
 urllib.request.urlretrieve(url, download_dst_path)
-if not os.path.exists(download_dst_path):
-    raise RuntimeError(f"Failed to download {suffix}")
```

The download error will propagate naturally with a descriptive `URLError`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` around lines 178 - 180, Remove the redundant post-download existence check around urllib.request.urlretrieve in mmmlu_utils.py: delete the os.path.exists(download_dst_path) check and the subsequent RuntimeError raise so that failures from urllib.request.urlretrieve (e.g., URLError) propagate naturally; keep the download call using the same download_dst_path variable and any surrounding try/except handling intact if present.
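The propagation behavior can be sketched without touching the real dataset URLs (the `download` helper and the paths below are hypothetical; an unresolvable scheme triggers the same fail-loudly path as a network error):

```python
import urllib.error
import urllib.request

def download(url: str, dst: str) -> None:
    # No post-hoc os.path.exists() check is needed: urlretrieve either
    # succeeds (so the file exists) or raises.
    urllib.request.urlretrieve(url, dst)

try:
    download("bogus-scheme://example.invalid/test.csv", "/tmp/mmmlu_demo.csv")
    failed = False
except (urllib.error.URLError, ValueError):
    failed = True
```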
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Around line 52-59: In main(), the appended "EN-US" from args.include_english
is being validated against SUPPORTED_LANGUAGES (which doesn't include "EN-US"),
causing a false negative; update the validation so that either EN-US is excluded
from the set being checked or validate against SUPPORTED_LANGUAGES union
{"EN-US"} instead—modify the logic around the languages and invalid calculation
(referencing the languages variable, args.include_english flag, and
SUPPORTED_LANGUAGES) so that including English does not trigger the ValueError.
---
Nitpick comments:
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py`:
- Around line 160-165: Schema defines OPTIONS as a mutable class-level list;
change it to an immutable tuple (or annotate it as a ClassVar[Tuple[str, ...]])
to prevent accidental mutation and satisfy the linter: update Schema.OPTIONS to
be a tuple of strings (or add ClassVar typing and use a tuple) so callers treat
it as a constant.
- Around line 178-180: Remove the redundant post-download existence check around
urllib.request.urlretrieve in mmmlu_utils.py: delete the
os.path.exists(download_dst_path) check and the subsequent RuntimeError raise so
that failures from urllib.request.urlretrieve (e.g., URLError) propagate
naturally; keep the download call using the same download_dst_path variable and
any surrounding try/except handling intact if present.
In `@nemo_skills/evaluation/evaluator/mcq.py`:
- Around line 27-45: The linter flags the fullwidth Japanese characters in
normalize_extracted_answer as ambiguous; add an inline "# noqa: RUF001" comment
to the lines (or the expression) handling the fullwidth characters (the .replace
calls for "A", "B", "C", "D") to document intent and silence the false positive,
and include a short inline comment (e.g., "match fullwidth A-D for
normalization") so future maintainers understand why these specific replacements
are present.
- Around line 100-103: The per-sample logging currently uses LOG.info and emits
parsed_letter, extract_from_boxed, extract_regex, extracted_answer, and relaxed
for every sample; change this call to LOG.debug to avoid spamming INFO for large
datasets (in nemo_skills/evaluation/evaluator/mcq.py where the LOG.info(...)
call appears) and ensure any higher-level summary logs remain at INFO if needed;
keep the same message content and variable references but lower the level to
debug so developers can enable verbose output when required.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- nemo_skills/dataset/mmmlu/__init__.py
- nemo_skills/dataset/mmmlu/mmmlu_utils.py
- nemo_skills/dataset/mmmlu/prepare.py
- nemo_skills/evaluation/evaluator/mcq.py
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Line 31: The current code silently maps unknown subjects to "other" using
subject2category.get(entry[Schema.SUBJECT], "other"), which hides missing
mappings; change this to explicitly require the mapping by accessing
subject2category[entry[Schema.SUBJECT]] (or validate with an if and raise a
clear ValueError referencing the missing subject) so that missing subject keys
surface immediately; update the assignment to category and add a descriptive
error message mentioning entry[Schema.SUBJECT] and subject2category to aid
debugging.
- Around line 66-71: The current loop formats and writes entries directly into
the truncate-opened output_file which risks data loss if formatting errors
occur; instead, first iterate datasets and call format_entry(entry=...,
language=...) for every entry, collect the resulting JSON-serializable objects
(e.g., into a list like formatted_entries), and only after that successfully
completes open output_file for writing ("wt") and write the collected entries
with json.dump and fout.write("\n"); reference format_entry, datasets, and
output_file to locate the code to change.
- Around line 53-57: The code currently silently removes "EN-US" from
args.languages when args.include_english is false; change this so user-provided
"EN-US" is not ignored: keep languages built from args.languages (do not
pre-filter out "EN-US"), validate each requested language against
SUPPORTED_LANGUAGES, and if "EN-US" is requested but args.include_english is
False raise an error (or exit) telling the user to pass --include_english; also
if any requested language is not in SUPPORTED_LANGUAGES raise an error. Modify
the logic around languages, valid_languages, args.languages,
args.include_english and SUPPORTED_LANGUAGES accordingly.
```python
def format_entry(entry: dict, language: str) -> dict:
    expected_answer = entry[Schema.ANSWER]
    category = subject2category.get(entry[Schema.SUBJECT], "other")
```
Avoid silent fallback for missing subject-category mappings.
Line 31 silently maps unknown subjects to "other", which can hide dataset/schema drift and corrupt category metrics.
💡 Proposed fix
```diff
-category = subject2category.get(entry[Schema.SUBJECT], "other")
+subject = entry[Schema.SUBJECT]
+category = subject2category[subject]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-category = subject2category.get(entry[Schema.SUBJECT], "other")
+subject = entry[Schema.SUBJECT]
+category = subject2category[subject]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/mmmlu/prepare.py` at line 31, The current code silently
maps unknown subjects to "other" using
subject2category.get(entry[Schema.SUBJECT], "other"), which hides missing
mappings; change this to explicitly require the mapping by accessing
subject2category[entry[Schema.SUBJECT]] (or validate with an if and raise a
clear ValueError referencing the missing subject) so that missing subject keys
surface immediately; update the assignment to category and add a descriptive
error message mentioning entry[Schema.SUBJECT] and subject2category to aid
debugging.
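The difference between the two access styles can be shown with a toy mapping (the subjects below are hypothetical, chosen only to illustrate the finding):

```python
# Hypothetical subject-to-category mapping for illustration only.
subject2category = {"abstract_algebra": "stem", "world_history": "humanities"}
entry = {"Subject": "astronomy"}  # subject missing from the mapping

# Silent fallback: the missing mapping is hidden and the entry mislabeled.
silent_category = subject2category.get(entry["Subject"], "other")

# Direct access: the missing key surfaces immediately as a KeyError.
try:
    strict_category = subject2category[entry["Subject"]]
except KeyError as exc:
    error_message = f"No category mapping for subject {exc}"
```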
```python
languages = [lang for lang in args.languages if lang != "EN-US"]
valid_languages = set(SUPPORTED_LANGUAGES)
if args.include_english:
    valid_languages.add("EN-US")
    languages.append("EN-US")
```
Do not silently ignore explicit EN-US input.
Line 53 removes EN-US from user input when --include_english is not set, so a user-passed language can be dropped without error.
💡 Proposed fix
```diff
 def main(args):
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english.")
+
     languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 53 - 57, The code
currently silently removes "EN-US" from args.languages when args.include_english
is false; change this so user-provided "EN-US" is not ignored: keep languages
built from args.languages (do not pre-filter out "EN-US"), validate each
requested language against SUPPORTED_LANGUAGES, and if "EN-US" is requested but
args.include_english is False raise an error (or exit) telling the user to pass
--include_english; also if any requested language is not in SUPPORTED_LANGUAGES
raise an error. Modify the logic around languages, valid_languages,
args.languages, args.include_english and SUPPORTED_LANGUAGES accordingly.
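A self-contained sketch of the validation this prompt describes (the `resolve_languages` helper and the abbreviated `SUPPORTED_LANGUAGES` list are illustrative, not the module's actual code):

```python
# Abbreviated stand-in for the module's SUPPORTED_LANGUAGES constant.
SUPPORTED_LANGUAGES = ["AR-XY", "BN-BD", "DE-DE", "JA-JP"]

def resolve_languages(requested: list[str], include_english: bool) -> list[str]:
    # Fail loudly instead of silently dropping a user-supplied language.
    if "EN-US" in requested and not include_english:
        raise ValueError("EN-US requires --include_english.")
    invalid = set(requested) - set(SUPPORTED_LANGUAGES) - {"EN-US"}
    if invalid:
        raise ValueError(f"Unsupported languages: {sorted(invalid)}")
    languages = [lang for lang in requested if lang != "EN-US"]
    if include_english:
        languages.append("EN-US")
    return languages
```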
```python
with open(output_file, "wt", encoding="utf-8") as fout:
    for language, examples in datasets.items():
        for entry in examples:
            entry = format_entry(entry=entry, language=language)
            json.dump(entry, fout, ensure_ascii=False)
            fout.write("\n")
```
Compute all entries before truncating test.jsonl.
Lines 66-71 perform formatting while writing to a truncate-opened file. If formatting fails mid-run, existing output is lost and only a partial file remains.
💡 Proposed fix
```diff
-with open(output_file, "wt", encoding="utf-8") as fout:
-    for language, examples in datasets.items():
-        for entry in examples:
-            entry = format_entry(entry=entry, language=language)
-            json.dump(entry, fout, ensure_ascii=False)
-            fout.write("\n")
+formatted_entries = []
+for language, examples in datasets.items():
+    for entry in examples:
+        formatted_entries.append(format_entry(entry=entry, language=language))
+
+with open(output_file, "wt", encoding="utf-8") as fout:
+    for entry in formatted_entries:
+        json.dump(entry, fout, ensure_ascii=False)
+        fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 66 - 71, The current loop
formats and writes entries directly into the truncate-opened output_file which
risks data loss if formatting errors occur; instead, first iterate datasets and
call format_entry(entry=..., language=...) for every entry, collect the
resulting JSON-serializable objects (e.g., into a list like formatted_entries),
and only after that successfully completes open output_file for writing ("wt")
and write the collected entries with json.dump and fout.write("\n"); reference
format_entry, datasets, and output_file to locate the code to change.
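The collect-then-write pattern recommended here can be sketched end to end (the `format_entry` stub and the sample data are illustrative stand-ins for the real formatter and datasets):

```python
import json
import tempfile
from pathlib import Path

def format_entry(entry: dict, language: str) -> dict:
    # Stub standing in for the real formatter, which may raise on bad data.
    return {"language": language, **entry}

datasets = {"DE-DE": [{"question": "q1"}], "JA-JP": [{"question": "q2"}]}

# Phase 1: do all (potentially failing) computation before touching the file.
formatted_entries = [
    format_entry(entry=entry, language=language)
    for language, examples in datasets.items()
    for entry in examples
]

# Phase 2: only now open (and truncate) the output file.
output_file = Path(tempfile.mkdtemp()) / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
    for entry in formatted_entries:
        json.dump(entry, fout, ensure_ascii=False)
        fout.write("\n")
```

If phase 1 raises, the existing `test.jsonl` is never truncated.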
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
♻️ Duplicate comments (3)
nemo_skills/dataset/mmmlu/prepare.py (3)
67-72: ⚠️ Potential issue | 🟠 Major: Compute all entries before writing to avoid data loss on failure.

If `format_entry` fails mid-iteration, the output file will be truncated with only partial data. Collect all formatted entries first, then write them.

🐛 Proposed fix

```diff
 data_dir = Path(__file__).absolute().parent
 output_file = data_dir / "test.jsonl"
+
+formatted_entries = []
+for language, examples in datasets.items():
+    for entry in examples:
+        formatted_entries.append(format_entry(entry=entry, language=language))
+
 with open(output_file, "wt", encoding="utf-8") as fout:
-    for language, examples in datasets.items():
-        for entry in examples:
-            entry = format_entry(entry=entry, language=language)
-            json.dump(entry, fout, ensure_ascii=False)
-            fout.write("\n")
+    for entry in formatted_entries:
+        json.dump(entry, fout, ensure_ascii=False)
+        fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 67 - 72, The loop currently formats and writes entries directly to output_file inside the with-open block which risks truncating the file if format_entry raises; instead, iterate over datasets and call format_entry for every entry first (collect formatted entries into a list, e.g., formatted = []), then open output_file for writing and json.dump each item from formatted to fout; reference symbols: format_entry, datasets, output_file, and the current write loop that uses fout.
32-32: ⚠️ Potential issue | 🟠 Major: Use direct dictionary access instead of `.get()` with a fallback.

This line silently maps unknown subjects to "other", which could mask dataset schema drift or data issues. If a subject is not in `subject2category`, the code should fail explicitly.

🐛 Proposed fix

```diff
-category = subject2category.get(entry[Schema.SUBJECT], "other")
+category = subject2category[entry[Schema.SUBJECT]]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` at line 32, The current line silently falls back to "other"; change it to access the mapping directly so missing keys raise an error: replace the use of subject2category.get(...) with direct indexing using subject2category[entry[Schema.SUBJECT]] (or perform an explicit if-not-in check and raise a clear KeyError/ValueError) so that failures in subject2category lookup (in the category assignment for variable category) surface immediately rather than being mapped to "other".
54-58: ⚠️ Potential issue | 🟠 Major: Silently dropping `EN-US` from user input violates coding guidelines.

If a user explicitly passes `EN-US` in `--languages` without `--include_english`, it's silently removed. The code should either error or handle it explicitly.

🐛 Proposed fix

```diff
 def main(args):
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english flag.")
+
     languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 54 - 58, The code silently drops "EN-US" from args.languages by filtering it out when building languages and only later adding it back if args.include_english; change behavior to validate user input up front: check args.languages for "EN-US" and if present but args.include_english is False raise an explicit error (or argparse.ArgumentError) informing the user to pass --include_english to enable EN-US; keep SUPPORTED_LANGUAGES, valid_languages, languages variables but perform this validation before building languages so that user-supplied parameters are never ignored and the code path in prepare.py that references languages and valid_languages behaves deterministically.
🧹 Nitpick comments (2)
nemo_skills/dataset/mmmlu/mmmlu_utils.py (2)
110-112: Consider adding a comment clarifying intentional fullwidth characters.

The static analysis tool flags the fullwidth Unicode characters (Ａ, Ｂ, Ｃ, Ｄ) as ambiguous, but these are intentionally included to match answer letters in CJK text where fullwidth characters are common. A brief inline comment would suppress future confusion.

📝 Suggested comment

```diff
 MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
+    # Note: Fullwidth letters (Ａ-Ｄ) and non-Latin scripts are intentional for multilingual matching
     "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[Ａ]|[Ｂ]|[Ｃ]|[Ｄ])"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` around lines 110 - 112, Add a clarifying inline comment above MULTILINGUAL_ANSWER_PATTERN_TEMPLATE stating that the fullwidth Unicode characters (A, B, C, D) are intentionally included to match answer letters in CJK text and to avoid static analysis ambiguity; reference the constant name MULTILINGUAL_ANSWER_PATTERN_TEMPLATE so reviewers can locate it and ensure the comment explains the purpose and that these characters are expected.
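To see how such a template behaves, here is a trimmed sketch (the pattern below is a simplified stand-in for the real `MULTILINGUAL_ANSWER_PATTERN_TEMPLATE`, keeping only three of its alternations):

```python
import re

# Simplified stand-in for the real template: ASCII A-D, the Arabic range,
# and fullwidth Ａ-Ｄ (the actual constant covers more scripts).
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = r"(?i){}[ \t]*([A-D]|[أ-د]|[Ａ-Ｄ])"

# The template is formatted with a language-specific "Answer" prefix.
pattern = MULTILINGUAL_ANSWER_PATTERN_TEMPLATE.format("Answer:")

match = re.search(pattern, "answer: Ｂ")
letter = match.group(1) if match else None
```

Note that `(?i)` does not fold fullwidth letters into the ASCII class, which is why the fullwidth alternations must be listed explicitly.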
160-164: Consider using a tuple for immutable `OPTIONS`.

`OPTIONS` is a mutable list as a class attribute. While it's only used for reading, using a tuple would prevent accidental mutation and silence the static analysis warning.

♻️ Suggested change

```diff
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, ...] = ("A", "B", "C", "D")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` around lines 160 - 164, The class attribute OPTIONS on Schema is declared as a mutable list; change it to an immutable tuple to prevent accidental mutation and silence static analysis: replace the list literal and its type annotation with a tuple-based annotation (e.g., tuple[str, ...] or typing.Tuple[str, ...]) and a tuple literal of the options ("A", "B", "C", "D") in the Schema class.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Around line 67-72: The loop currently formats and writes entries directly to
output_file inside the with-open block which risks truncating the file if
format_entry raises; instead, iterate over datasets and call format_entry for
every entry first (collect formatted entries into a list, e.g., formatted = []),
then open output_file for writing and json.dump each item from formatted to
fout; reference symbols: format_entry, datasets, output_file, and the current
write loop that uses fout.
- Line 32: The current line silently falls back to "other"; change it to access
the mapping directly so missing keys raise an error: replace the use of
subject2category.get(...) with direct indexing using
subject2category[entry[Schema.SUBJECT]] (or perform an explicit if-not-in check
and raise a clear KeyError/ValueError) so that failures in subject2category
lookup (in the category assignment for variable category) surface immediately
rather than being mapped to "other".
- Around line 54-58: The code silently drops "EN-US" from args.languages by
filtering it out when building languages and only later adding it back if
args.include_english; change behavior to validate user input up front: check
args.languages for "EN-US" and if present but args.include_english is False
raise an explicit error (or argparse.ArgumentError) informing the user to pass
--include_english to enable EN-US; keep SUPPORTED_LANGUAGES, valid_languages,
languages variables but perform this validation before building languages so
that user-supplied parameters are never ignored and the code path in prepare.py
that references languages and valid_languages behaves deterministically.
---
Nitpick comments:
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py`:
- Around line 110-112: Add a clarifying inline comment above
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE stating that the fullwidth Unicode
characters (A, B, C, D) are intentionally included to match answer letters in
CJK text and to avoid static analysis ambiguity; reference the constant name
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE so reviewers can locate it and ensure the
comment explains the purpose and that these characters are expected.
- Around line 160-164: The class attribute OPTIONS on Schema is declared as a
mutable list; change it to an immutable tuple to prevent accidental mutation and
silence static analysis: replace the list literal and its type annotation with a
tuple-based annotation (e.g., tuple[str, ...] or typing.Tuple[str, ...]) and a
tuple literal of the options ("A", "B", "C", "D") in the Schema class.
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (3)
nemo_skills/dataset/mmmlu/prepare.py (3)
64-69: ⚠️ Potential issue | 🟠 Major: Compute all entries before opening `test.jsonl` in truncate mode.

If formatting fails mid-loop, existing output is lost and only partial data remains.

💡 Proposed fix

```diff
-with open(output_file, "wt", encoding="utf-8") as fout:
-    for language, examples in datasets.items():
-        for entry in examples:
-            entry = format_entry(entry=entry, language=language)
-            json.dump(entry, fout, ensure_ascii=False)
-            fout.write("\n")
+formatted_entries = []
+for language, examples in datasets.items():
+    for entry in examples:
+        formatted_entries.append(format_entry(entry=entry, language=language))
+
+with open(output_file, "wt", encoding="utf-8") as fout:
+    for entry in formatted_entries:
+        json.dump(entry, fout, ensure_ascii=False)
+        fout.write("\n")
```

Based on learnings: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 64 - 69, Compute and collect all formatted entries before opening output_file for writing: iterate datasets and call format_entry(entry, language) to build a list (e.g., formatted_entries) and only after that open output_file and write each precomputed JSON line; this avoids truncating test.jsonl early and losing data if formatting raises an exception. Ensure you reference the existing variables/functions: datasets, format_entry, entry, language, and output_file when implementing the change.
32-32: ⚠️ Potential issue | 🟠 Major: Do not silently fallback unknown subjects to "other".

This masks missing mappings and can skew category metrics. Fail fast instead.

💡 Proposed fix

```diff
-category = subject2category.get(entry[Schema.SUBJECT], "other")
+category = subject2category[entry[Schema.SUBJECT]]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".
.get()for accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` at line 32, The line using subject2category.get to default unknown subjects to "other" should be changed to fail fast: replace the silent fallback with explicit lookup of subject = entry[Schema.SUBJECT] and then lookup category = subject2category[subject], or if you prefer clearer errors, check if subject in subject2category and raise a descriptive exception (including subject and entry identifiers) when missing; update references to subject2category, entry and Schema.SUBJECT to locate and fix the code path.
51-55: ⚠️ Potential issue | 🟠 Major: Don't silently ignore user-provided `EN-US`.

If `EN-US` is passed without `--include_english`, the current logic drops it silently.

💡 Proposed fix

```diff
 def main(args):
+    if "EN-US" in args.languages and not args.include_english:
+        raise ValueError("EN-US requires --include_english.")
+
     languages = [lang for lang in args.languages if lang != "EN-US"]
     valid_languages = set(SUPPORTED_LANGUAGES)
     if args.include_english:
         valid_languages.add("EN-US")
         languages.append("EN-US")
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 51 - 55, The code currently drops "EN-US" from args.languages when args.include_english is False, silently ignoring a user-provided value; change the logic in prepare.py to validate instead: do not filter out "EN-US" in the languages comprehension, then if "EN-US" is present in args.languages and args.include_english is False raise a clear ValueError (or argparse.ArgumentError) explaining that EN-US requires --include_english; ensure valid_languages (based on SUPPORTED_LANGUAGES) still adds "EN-US" only when args.include_english is True and use these validated lists (languages, valid_languages) thereafter (refer to symbols: args.languages, args.include_english, languages, valid_languages, SUPPORTED_LANGUAGES).
🧹 Nitpick comments (1)
nemo_skills/dataset/mmmlu/mmmlu_utils.py (1)
162-162: Prefer immutable option constants.
`Schema.OPTIONS` is a mutable class attribute; using a tuple avoids accidental mutation.

♻️ Proposed refactor

```diff
 class Schema:
     ANSWER: str = "Answer"
     QUESTION: str = "Question"
     SUBJECT: str = "Subject"
-    OPTIONS: list[str] = ["A", "B", "C", "D"]
+    OPTIONS: tuple[str, str, str, str] = ("A", "B", "C", "D")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` at line 162, Schema.OPTIONS is currently a mutable list which can be accidentally modified; change it to an immutable tuple by replacing the list literal with a tuple literal and update its type annotation to tuple[str, ...] (or typing.Tuple[str, ...] if needed) so OPTIONS becomes an immutable constant; locate the OPTIONS definition in the Schema (symbol: OPTIONS) and make this simple type and value change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py`:
- Line 110: MULTILINGUAL_ANSWER_PATTERN_TEMPLATE currently uses the range [أ-د]
which over-matches Arabic characters; update the regex to list the exact Arabic
option letters instead of a range (e.g. replace [أ-د] with [أبجد]) so only the
intended letters map to A-D; ensure the final pattern for
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE explicitly enumerates each script's option
letters (e.g., [A-D]|[أبجد]|[অবডঢ]|[ABCD]) rather than using character ranges.
In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Around line 36-38: LETTER_REGEX uses the Arabic range `[أ-د]` which matches
unintended characters; change it to list the four Arabic option letters
explicitly (e.g. use `[أبجد]` or include them as alternatives) inside
LETTER_REGEX so only the intended choices are matched, then keep GREEDY_REGEX
and regexes.append(GREEDY_REGEX) as-is; update the LETTER_REGEX definition (the
symbol name is LETTER_REGEX) to use explicit Arabic letters instead of the
range.
In `@nemo_skills/evaluation/evaluator/mcq.py`:
- Around line 28-46: The current normalize_extracted_answer function performs
blind global .replace() calls on extracted_answer which can turn characters
inside longer answers into synthetic " A/B/C/D" tokens; change it to only map
standalone choice markers or markers at the start/end of the answer (e.g., when
the entire string equals an Arabic/Bengali/Japanese marker, when the marker is
surrounded by whitespace/punctuation, or when it appears as the first token) —
update normalize_extracted_answer to use targeted matching (regex or token-based
checks) on extracted_answer so only isolated choice markers are converted to
"A"/"B"/"C"/"D" and embedded characters in longer answers are left unchanged.
---
Duplicate comments:
In `@nemo_skills/dataset/mmmlu/prepare.py`:
- Around line 64-69: Compute and collect all formatted entries before opening
output_file for writing: iterate datasets and call format_entry(entry, language)
to build a list (e.g., formatted_entries) and only after that open output_file
and write each precomputed JSON line; this avoids truncating test.jsonl early
and losing data if formatting raises an exception. Ensure you reference the
existing variables/functions: datasets, format_entry, entry, language, and
output_file when implementing the change.
- Line 32: The line using subject2category.get to default unknown subjects to
"other" should be changed to fail fast: replace the silent fallback with
explicit lookup of subject = entry[Schema.SUBJECT] and then lookup category =
subject2category[subject], or if you prefer clearer errors, check if subject in
subject2category and raise a descriptive exception (including subject and entry
identifiers) when missing; update references to subject2category, entry and
Schema.SUBJECT to locate and fix the code path.
- Around line 51-55: The code currently drops "EN-US" from args.languages when
args.include_english is False, silently ignoring a user-provided value; change
the logic in prepare.py to validate instead: do not filter out "EN-US" in the
languages comprehension, then if "EN-US" is present in args.languages and
args.include_english is False raise a clear ValueError (or
argparse.ArgumentError) explaining that EN-US requires --include_english; ensure
valid_languages (based on SUPPORTED_LANGUAGES) still adds "EN-US" only when
args.include_english is True and use these validated lists (languages,
valid_languages) thereafter (refer to symbols: args.languages,
args.include_english, languages, valid_languages, SUPPORTED_LANGUAGES).
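The "format everything first, write after" change described above can be sketched as follows. This is an illustrative sketch only: the function name `write_entries` and the exact `format_entry(entry, language)` signature are assumptions based on the review text, not the repo's actual prepare.py.

```python
import json

def write_entries(datasets, format_entry, output_file):
    # Format everything up front so that an exception raised by
    # format_entry cannot leave behind a truncated test.jsonl.
    formatted_entries = [
        format_entry(entry, language)
        for language, entries in datasets.items()
        for entry in entries
    ]
    # Open the output file only after all formatting succeeded.
    with open(output_file, "w", encoding="utf-8") as fout:
        for formatted in formatted_entries:
            fout.write(json.dumps(formatted, ensure_ascii=False) + "\n")
```

If any entry fails to format, the file is never opened for writing, so a previous run's output survives intact.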
---
Nitpick comments:
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py`:
- Line 162: Schema.OPTIONS is currently a mutable list which can be accidentally
modified; change it to an immutable tuple by replacing the list literal with a
tuple literal and update its type annotation to tuple[str, ...] (or
typing.Tuple[str, ...] if needed) so OPTIONS becomes an immutable constant;
locate the OPTIONS definition in the Schema (symbol: OPTIONS) and make this
simple type and value change.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- nemo_skills/dataset/mmmlu/__init__.py
- nemo_skills/dataset/mmmlu/mmmlu_utils.py
- nemo_skills/dataset/mmmlu/prepare.py
- nemo_skills/evaluation/evaluator/mcq.py
```python
D) {D}
""".strip()


MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
```
[أ-د] over-matches unintended Arabic letters.
This is a range, not just option letters. It can capture non-option characters and produce wrong labels.
💡 Proposed fix
```diff
-MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
+MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أبجد]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
+MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = "(?i){}[ \t]*([A-D]|[أبجد]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 110-110: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
[warning] 110-110: String contains ambiguous B (FULLWIDTH LATIN CAPITAL LETTER B). Did you mean B (LATIN CAPITAL LETTER B)?
(RUF001)
[warning] 110-110: String contains ambiguous C (FULLWIDTH LATIN CAPITAL LETTER C). Did you mean C (LATIN CAPITAL LETTER C)?
(RUF001)
[warning] 110-110: String contains ambiguous D (FULLWIDTH LATIN CAPITAL LETTER D). Did you mean D (LATIN CAPITAL LETTER D)?
(RUF001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/mmmlu/mmmlu_utils.py` at line 110,
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE currently uses the range [أ-د] which
over-matches Arabic characters; update the regex to list the exact Arabic option
letters instead of a range (e.g. replace [أ-د] with [أبجد]) so only the intended
letters map to A-D; ensure the final pattern for
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE explicitly enumerates each script's option
letters (e.g., [A-D]|[أبجد]|[অবডঢ]|[ABCD]) rather than using character ranges.
```python
    LETTER_REGEX = r"\b\(?\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
    GREEDY_REGEX = r"[\s\S]*" + LETTER_REGEX
    regexes.append(GREEDY_REGEX)  # Matches the last A/B/C/D letter in the response
```
Use explicit Arabic option letters instead of [أ-د] range.
[أ-د] matches more than the intended four option letters and can mis-extract answers.
💡 Proposed fix
```diff
-    LETTER_REGEX = r"\b\(?\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
+    LETTER_REGEX = r"\b\(?\s*([A-D]|[أبجد]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])\s*\)?\.?\b"
```

🧰 Tools
🪛 Ruff (0.15.2)
[warning] 36-36: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
[warning] 36-36: String contains ambiguous B (FULLWIDTH LATIN CAPITAL LETTER B). Did you mean B (LATIN CAPITAL LETTER B)?
(RUF001)
[warning] 36-36: String contains ambiguous C (FULLWIDTH LATIN CAPITAL LETTER C). Did you mean C (LATIN CAPITAL LETTER C)?
(RUF001)
[warning] 36-36: String contains ambiguous D (FULLWIDTH LATIN CAPITAL LETTER D). Did you mean D (LATIN CAPITAL LETTER D)?
(RUF001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/mmmlu/prepare.py` around lines 36 - 38, LETTER_REGEX uses
the Arabic range `[أ-د]` which matches unintended characters; change it to list
the four Arabic option letters explicitly (e.g. use `[أبجد]` or include them as
alternatives) inside LETTER_REGEX so only the intended choices are matched, then
keep GREEDY_REGEX and regexes.append(GREEDY_REGEX) as-is; update the
LETTER_REGEX definition (the symbol name is LETTER_REGEX) to use explicit Arabic
letters instead of the range.
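The over-match flagged above is easy to confirm with a standalone check in Python's `re` module (this snippet is independent of the repo code). The class `[أ-د]` is a code-point range from U+0623 to U+062F, so non-option letters that happen to fall inside it, such as ت (ARABIC LETTER TEH, U+062A), match as well; the explicit class `[أبجد]` accepts only the four option letters.

```python
import re

RANGE_CLASS = r"[أ-د]"      # code-point range U+0623 .. U+062F (13 characters)
EXPLICIT_CLASS = r"[أبجد]"  # exactly the four Arabic option letters

# ت (U+062A) is not an option letter, but it sits inside the range,
# so the range class wrongly accepts it.
assert re.fullmatch(RANGE_CLASS, "ت") is not None
assert re.fullmatch(EXPLICIT_CLASS, "ت") is None

# Both classes accept the real option letters.
for letter in "أبجد":
    assert re.fullmatch(RANGE_CLASS, letter) is not None
    assert re.fullmatch(EXPLICIT_CLASS, letter) is not None
```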
```python
def normalize_extracted_answer(extracted_answer: str) -> str:
    return (
        # In arabic these are the letters used for A-D in multiple choice questions
        extracted_answer.replace("أ", " A")
        .replace("ب", " B")
        .replace("ج", " C")
        .replace("د", " D")
        # In Bengali these are the letters used for A-D in multiple choice questions
        .replace("অ", " A")
        .replace("ব", " B")
        .replace("ড", " C")
        .replace("ঢ", " D")
        # In Japanese these are the letters sometimes used for A-D in multiple choice questions
        .replace("A", " A")
        .replace("B", " B")
        .replace("C", " C")
        .replace("D", " D")
        .strip()
    )
```
Avoid global character replacement across the full extracted text.
This replaces Arabic/Bengali letters everywhere in extracted_answer, so longer answers can gain synthetic A/B/C/D tokens and be misgraded.
💡 Proposed fix
+OPTION_CHAR_MAP = {
+ "A": "A",
+ "B": "B",
+ "C": "C",
+ "D": "D",
+ "أ": "A",
+ "ب": "B",
+ "ج": "C",
+ "د": "D",
+ "অ": "A",
+ "ব": "B",
+ "ড": "C",
+ "ঢ": "D",
+ "A": "A",
+ "B": "B",
+ "C": "C",
+ "D": "D",
+}
+
def normalize_extracted_answer(extracted_answer: str) -> str:
- return (
- # In arabic these are the letters used for A-D in multiple choice questions
- extracted_answer.replace("أ", " A")
- .replace("ب", " B")
- .replace("ج", " C")
- .replace("د", " D")
- # In Bengali these are the letters used for A-D in multiple choice questions
- .replace("অ", " A")
- .replace("ব", " B")
- .replace("ড", " C")
- .replace("ঢ", " D")
- # In Japanese these are the letters sometimes used for A-D in multiple choice questions
- .replace("A", " A")
- .replace("B", " B")
- .replace("C", " C")
- .replace("D", " D")
- .strip()
- )
+ normalized = extracted_answer.strip()
+ match = re.fullmatch(r"[\s\(\[\{<\*_]*([A-DأبجدঅবডঢABCD])[\s\)\]\}>.\*_]*", normalized)
+ if not match:
+ return normalized
+    return OPTION_CHAR_MAP[match.group(1)]

🧰 Tools
🪛 Ruff (0.15.2)
[warning] 41-41: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
[warning] 42-42: String contains ambiguous B (FULLWIDTH LATIN CAPITAL LETTER B). Did you mean B (LATIN CAPITAL LETTER B)?
(RUF001)
[warning] 43-43: String contains ambiguous C (FULLWIDTH LATIN CAPITAL LETTER C). Did you mean C (LATIN CAPITAL LETTER C)?
(RUF001)
[warning] 44-44: String contains ambiguous D (FULLWIDTH LATIN CAPITAL LETTER D). Did you mean D (LATIN CAPITAL LETTER D)?
(RUF001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/evaluator/mcq.py` around lines 28 - 46, The current
normalize_extracted_answer function performs blind global .replace() calls on
extracted_answer which can turn characters inside longer answers into synthetic
" A/B/C/D" tokens; change it to only map standalone choice markers or markers at
the start/end of the answer (e.g., when the entire string equals an
Arabic/Bengali/Japanese marker, when the marker is surrounded by
whitespace/punctuation, or when it appears as the first token) — update
normalize_extracted_answer to use targeted matching (regex or token-based
checks) on extracted_answer so only isolated choice markers are converted to
"A"/"B"/"C"/"D" and embedded characters in longer answers are left unchanged.
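The targeted approach described above can be sketched as follows. This is a minimal illustration, not the actual mcq.py code: it maps a string to A-D only when the whole string is one isolated option marker (optionally wrapped in brackets or trailing punctuation) and leaves longer answers untouched.

```python
import re

# Sketch only: the map mirrors the Arabic, Bengali, and fullwidth Latin
# letters listed in the review.
OPTION_CHAR_MAP = {
    "أ": "A", "ب": "B", "ج": "C", "د": "D",
    "অ": "A", "ব": "B", "ড": "C", "ঢ": "D",
    "A": "A", "B": "B", "C": "C", "D": "D",
    "A": "A", "B": "B", "C": "C", "D": "D",
}

def normalize_extracted_answer(extracted_answer: str) -> str:
    normalized = extracted_answer.strip()
    # Match only when the entire string is a single option marker,
    # optionally wrapped in brackets or followed by punctuation.
    match = re.fullmatch(r"[\s(\[{<*_]*([A-DأبجدঅবডঢABCD])[\s)\]}>.*_]*", normalized)
    if match:
        return OPTION_CHAR_MAP[match.group(1)]
    return normalized
```

With this shape, `"(ب)"` normalizes to `"B"`, while a longer Bengali answer such as `"বাংলা ভাষা"` passes through unchanged instead of gaining synthetic option tokens.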
@AlexGrinch @shuoyangd let me know if the content looks good to you and we can merge
Kipok
left a comment
lgtm, just a small comment. Let's run gpu tests.
@shuoyangd @AlexGrinch please have a look if you want, if not we will merge as is and you can still review later and we can fix issues in follow up pr
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: naymaraq <dkaramyan@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: naymaraq <dkaramyan@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
PR for the MMMLU benchmark, a multilingual version of MMLU. Also added an option in the MCQ evaluation to accept multiple regex patterns. This is important for capturing the model’s answers, which may be provided in different languages.
Summary by CodeRabbit