add multilingual aime25, gpqa and lcb #1194
zijiachen95 wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
```python
    "ioi": IOIMetrics,
    "icpc": ICPCMetrics,
    "multichoice": MathMetrics,
    "multichoice_multilingual": MathMultilingualMetrics,
```
math_multilingual metrics type is missing from METRICS_MAP. The AIME25 datasets reference METRICS_TYPE = "math_multilingual" but only multichoice_multilingual is registered here. Add entry:
```diff
  "multichoice_multilingual": MathMultilingualMetrics,
+ "math_multilingual": MathMultilingualMetrics,
```
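The failure mode can be sketched end-to-end; the stub classes and the `get_metrics_class` helper below are illustrative only, not the actual nemo_skills API:

```python
# Stub classes standing in for the real metrics implementations.
class MathMetrics: ...
class MathMultilingualMetrics: ...

METRICS_MAP = {
    "multichoice": MathMetrics,
    "multichoice_multilingual": MathMultilingualMetrics,
    # Without this entry, datasets declaring METRICS_TYPE = "math_multilingual" fail:
    "math_multilingual": MathMultilingualMetrics,
}

def get_metrics_class(metrics_type):
    # Registry lookup: an unregistered METRICS_TYPE surfaces as an error here.
    try:
        return METRICS_MAP[metrics_type]
    except KeyError:
        raise ValueError(f"Unknown metrics type: {metrics_type!r}") from None
```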
```python
# EVAL_SPLIT = 'test_v5_2408_2502'
# EVAL_SPLIT = 'test_v5_2407_2412'
EVAL_SPLIT = 'test_v5_2407_2503'
# EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
```
Incomplete comment - missing closing quote
```diff
- # EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
+ # EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench"
```
```python
from nemo_skills.evaluation.metrics.utils import is_correct_judgement
from nemo_skills.utils import get_logger_name
from langdetect import detect, DetectorFactory, LangDetectException
import sys
```
`import sys` is unused and should be removed.
📝 Walkthrough

This PR adds comprehensive multilingual dataset support to NeMo Skills, including AIME25, GPQA, and LiveCodeBench datasets across multiple languages (German, Spanish, French, Japanese). It introduces dataset-specific evaluation configurations, data preparation scripts, a new multilingual math metrics class, and corresponding prompt templates in YAML format.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 18
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/aime25-de-prompt-de/prepare.py`:
- Line 1: The file prepare.py is marked executable but has no shebang; either
add a proper Python shebang or remove the executable bit. If this script is
intended to be run directly, insert a shebang line (e.g. /usr/bin/env python3)
at the very top of prepare.py; if it’s a module/library not meant to be
executed, remove the executable permission (chmod a-x) so static analysis no
longer flags it as executable.
In `@nemo_skills/dataset/gpqa-de-prompt-de/__init__.py`:
- Line 1: The package __init__.py file is marked executable but lacks a shebang;
either remove the executable permission or make it a true executable module: if
this module is not intended to be run, remove the executable bit (e.g., chmod
-x) on nemo_skills/dataset/gpqa-de-prompt-de/__init__.py, otherwise add an
appropriate shebang (e.g., #!/usr/bin/env python3) at the top of __init__.py so
execution behavior is correct.
In `@nemo_skills/dataset/gpqa-de-prompt-de/prepare.py`:
- Around line 40-60: In format_entry, preprocess is called multiple times and if
preprocessing causes duplicates the index lookup can pick the wrong choice; to
fix, construct choices as pairs like (text=preprocess(...), is_correct=bool) for
each of the four answers (mark the Correct Answer True), shuffle the list of
pairs, then build the final choices list from the pair.text values and compute
expected_answer by finding the pair where is_correct is True (using its shuffled
position => chr(65+index)); update get_mcq_fields call to receive the list of
texts. Ensure preprocess is invoked exactly once per original entry field (e.g.,
in format_entry) and that you reference format_entry, preprocess, and
get_mcq_fields when making the change.
In `@nemo_skills/dataset/gpqa-es-prompt-en/prepare.py`:
- Around line 40-60: In format_entry, preprocessing can make the correct answer
text duplicate another choice so choices.index(preprocess(...)) may return the
wrong index after shuffle; fix by capturing the preprocessed correct answer
(call preprocess on entry["Correct Answer"] once) and either build choices as
structures that preserve identity (e.g., tuples of (text, is_correct)) or record
the correct-answer object before random.shuffle and compute correct_answer_index
by finding the item whose identity/flag matches the captured correct marker;
update get_mcq_fields usage accordingly so expected_answer is derived from that
preserved correct item rather than using choices.index on raw text.
In `@nemo_skills/dataset/gpqa-es-prompt-es/__init__.py`:
- Line 1: This __init__.py file currently has the executable bit set but
contains no shebang; remove the executable permission so it is a normal Python
package init. Locate nemo_skills/dataset/gpqa-es-prompt-es/__init__.py (module
gpqa-es-prompt-es, symbol __init__.py) and clear the executable flag (e.g., run
chmod -x on the file or use git update-index --chmod=-x <path>) and commit the
permission change so Ruff no longer flags it as executable.
In `@nemo_skills/dataset/gpqa-ja-prompt-ja/__init__.py`:
- Line 1: This __init__.py is marked executable but has no shebang; clear the
executable permission on the file (remove the executable bit) so it is a normal
module rather than a script—i.e., update file permissions for
nemo_skills/dataset/gpqa-ja-prompt-ja/__init__.py (or if execution was intended,
instead add an appropriate shebang to the top), but prefer removing the
executable bit for this package __init__.py.
In `@nemo_skills/dataset/livecodebench-de-prompt-de/prepare.py`:
- Around line 98-129: The code in prepare() calls
datetime.strptime(problem['contest_date'], ...) but problem['contest_date'] may
already be a datetime object, causing a TypeError; update clean_data() to ensure
the contest_date field is always a string (e.g., cast problem['contest_date'] =
str(problem['contest_date']) or use a dedicated cast_column for 'contest_date'),
or modify the loop in prepare() to handle both types by converting to string
when necessary (e.g., use str(problem['contest_date']) before calling
datetime.strptime); change should reference the prepare() function's parsing
logic and the clean_data()/parse_data() pipeline so contest_date is consistently
a string before parsing.
In `@nemo_skills/dataset/livecodebench-de-prompt-en/prepare.py`:
- Around line 114-118: The loop that computes input_date from
problem['contest_date'] assumes a string and calls datetime.strptime, which
fails if contest_date is already a datetime; update the logic around
problem['contest_date'] in the block that sets input_date (used with start_date
and end_date) to handle both types: if isinstance(problem['contest_date'],
datetime) call .date() on it, otherwise parse the string with
datetime.strptime(...).date(); ensure the variable name input_date and the
comparison start_date <= input_date <= end_date remain unchanged.
In `@nemo_skills/dataset/livecodebench-es-prompt-en/__init__.py`:
- Around line 16-23: The EVAL_SPLIT 'test_v5_2407_2503' referenced in this
module (see EVAL_SPLIT, PROMPT_CONFIG, DATASET_GROUP, METRICS_TYPE) does not
exist in the dataset preparation DEFAULT_SPLITS, causing runtime failures; fix
by either adding the tuple ('v5','2024-07','2025-03') to DEFAULT_SPLITS in the
dataset's prepare.py (so prepare.py generates test_v5_2407_2503.jsonl) or change
EVAL_SPLIT here to an existing split such as 'test_v5_2408_2502' so the
evaluation points to a generated file.
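For illustration, the split-name convention these options imply (test_&lt;version&gt;_&lt;YYMM&gt;_&lt;YYMM&gt;) can be sketched with a hypothetical helper; the tuple layout is assumed from the suggestion above, not taken from the real prepare.py:

```python
def split_name(version, start_ym, end_ym):
    # "2024-07" -> "2407": keep the two-digit year and the month.
    compact = lambda ym: ym[2:4] + ym[5:7]
    return f"test_{version}_{compact(start_ym)}_{compact(end_ym)}"

# ('v5', '2024-07', '2025-03') would then yield 'test_v5_2407_2503',
# matching the EVAL_SPLIT the module expects.
```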
In `@nemo_skills/dataset/livecodebench-es-prompt-en/prepare.py`:
- Around line 114-118: The code incorrectly treats problem['contest_date'] as a
string and calls datetime.strptime; instead, handle the schema-typed datetime by
obtaining the date directly from the datetime object (use
problem['contest_date'].date()), or defensively check the type and parse only if
it's a str; update the block around contest_date handling used with start_date
and end_date so input_date is derived from the actual datetime object rather
than calling strptime on it.
In `@nemo_skills/dataset/livecodebench-es-prompt-es/prepare.py`:
- Around line 114-118: The loop that parses problem['contest_date'] assumes it's
a string and calls datetime.strptime, which will TypeError if contest_date is
already a datetime; update the parsing in the for-loop that produces input_date
(the code that currently calls datetime.strptime on problem['contest_date'] and
assigns to input_date) to handle both types: if
isinstance(problem['contest_date'], datetime) use its .date() (or .date() after
ensuring tz handling), otherwise cast to str and parse with datetime.strptime;
keep the variable names input_date, start_date, end_date and the same
conditional (if start_date <= input_date <= end_date) so the rest of the logic
is unchanged.
In `@nemo_skills/dataset/livecodebench-fr-prompt-en/prepare.py`:
- Around line 143-144: The help string for the argument parser is wrong: update
the call to parser.add_argument for '--start_date' so its help correctly says
"Start date in YYYY-MM format" (while '--end_date' keeps "End date in YYYY-MM
format"); locate the parser.add_argument invocation for '--start_date' in
prepare.py and replace the incorrect "End date" text with "Start date" to match
the flag name.
In `@nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py`:
- Around line 25-28: PromptConstants contains English prompt strings but this
file is for French prompts; update the two constants
FORMATTING_MESSAGE_WITH_STARTER_CODE and FORMATTING_WITHOUT_STARTER_CODE in
class PromptConstants to French equivalents that preserve the original meaning
and instructions (use starter code and delimiters; read input from stdin, do not
run sample tests, output to stdout, enclose code in delimiters). Ensure the
translations are clear, in French, and keep references to "starter code",
"stdin", "stdout" and "delimiters" so callers relying on those keywords still
understand the intent.
- Around line 143-144: The help strings for the CLI arguments are incorrect:
update the parser.add_argument calls for '--start_date' and '--end_date' so that
'--start_date' help says "Start date in YYYY-MM format" and '--end_date' help
says "End date in YYYY-MM format" (refer to the parser.add_argument declarations
for '--start_date' and '--end_date' to locate and edit the help text).
In `@nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py`:
- Around line 114-118: The code assumes problem['contest_date'] is a string and
uses datetime.strptime, but the schema documents contest_date as a datetime
object; update the block that sets input_date (inside the with open loop) to
normalize both cases: check the type of problem['contest_date'], if it's a str
parse with datetime.strptime(..., '%Y-%m-%dT%H:%M:%S'), if it's a datetime use
.date(), if it's a date use it directly, otherwise raise/skip with a clear
error; ensure the rest of the existing comparison using input_date against
start_date and end_date remains unchanged.
In `@nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py`:
- Around line 25-28: PromptConstants currently contains English strings but
lives in a Japanese prompt module; update the two constants
FORMATTING_MESSAGE_WITH_STARTER_CODE and FORMATTING_WITHOUT_STARTER_CODE to
Japanese equivalents that preserve the original intent (instructions about using
starter code, enclosing code within delimiters, reading stdin/writing stdout,
and the warning about not directly testing on sample inputs), keeping
punctuation and any special phrasing intact so callers of PromptConstants need
no further changes.
- Around line 143-144: The help text for the CLI argument defined by
parser.add_argument('--start_date', type=str, default='all', help="End date in
YYYY-MM format") is incorrect; update the help string for the '--start_date'
argument to accurately describe it (e.g., "Start date in YYYY-MM format") while
keeping the '--end_date' help as the end-date description so both
parser.add_argument('--start_date', ...) and parser.add_argument('--end_date',
...) have correct, non-duplicated help messages.
- Around line 31-34: PromptConstants currently contains English prompt templates
but this variant (livecodebench-ja-prompt-ja) must use Japanese; open the
PromptConstants class in
nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py and replace the
English strings (e.g., instruction/format templates defined around
PromptConstants) with their Japanese equivalents, preserving variable
placeholders and formatting, then run any local tests that embed these constants
to confirm no formatting errors; ensure the class name PromptConstants and any
references (e.g., where parse_data or dataset preparation uses PromptConstants)
remain unchanged.
🧹 Nitpick comments (9)
nemo_skills/evaluation/metrics/math_multilingual_metrics.py (1)
31-36: Please track/resolve the TODO on aggregation parity. Leaving TODOs here can hide missing behavior as the metrics evolve. If you'd like, I can draft an issue or propose a small check to enforce parity.
nemo_skills/dataset/gpqa-fr-prompt-fr/prepare.py (1)
31-37: Consider returning an empty string for None input. Returning a single space " " for None input may cause subtle formatting issues in the generated prompts. An empty string might be more appropriate unless there's a specific reason for the space.

♻️ Suggested change

```diff
 def preprocess(text):
     if text is None:
-        return " "
+        return ""
     text = text.strip()
     text = text.replace(" [title]", ". ")
     text = text.replace(" ", " ")
     return text
```

nemo_skills/dataset/livecodebench-ja-prompt-ja/__init__.py (1)
19-23: Consider removing commented-out code. These commented-out lines appear to be alternative configurations or work-in-progress artifacts. If they serve as documentation for valid split options, consider converting them to a proper comment or docstring. Otherwise, they can be removed to keep the file clean.

♻️ Suggested cleanup

```diff
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
 PROMPT_CONFIG = 'eval/livecodebench/python_codegen_ja'
 DATASET_GROUP = 'code'
 METRICS_TYPE = 'livecodebench'
-# EVAL_SPLIT = 'test_v5_2408_2502'
-# EVAL_SPLIT = 'test_v5_2407_2412'
 EVAL_SPLIT = 'test_v5_2407_2503'
-# EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
 EVAL_ARGS = "++eval_type=livecodebench"
 GENERATION_ARGS = ""
```

nemo_skills/dataset/gpqa-es-prompt-en/prepare.py (1)
31-97: Significant code duplication across GPQA prepare.py files. This file is nearly identical to gpqa-de-prompt-en/prepare.py, gpqa-ja-prompt-en/prepare.py, gpqa-ja-prompt-ja/prepare.py, and others. Consider extracting the common logic into a shared module (e.g., nemo_skills/dataset/gpqa_utils.py) that each language variant can import and optionally customize. This would reduce maintenance burden when changes are needed across all variants.

nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py (2)
nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py (2)
76-77: Remove extraneous f-prefix. Line 77 uses an f-string without any placeholders.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````
50-66: Consider using exception chaining. When re-raising exceptions, use raise ... from err to preserve the exception chain for better debugging.

Suggested fix

```diff
 def get_first_last_day(year_month_str):
     try:
         date_obj = datetime.strptime(year_month_str, "%Y-%m")
         first_day = date_obj.date().replace(day=1)
         last_day = (date_obj + relativedelta(months=1, days=-1)).date()
         return first_day, last_day
-    except ValueError:
-        raise ValueError("Invalid date format. Please use '%Y-%m'.")
+    except ValueError as err:
+        raise ValueError("Invalid date format. Please use '%Y-%m'.") from err

 def parse_month_range(start_date, end_date):
     try:
         start_date, _ = get_first_last_day(start_date)
         _, end_date = get_first_last_day(end_date)
         return start_date, end_date
     except ValueError as e:
-        raise ValueError(str(e))
+        raise ValueError(str(e)) from e
```

nemo_skills/dataset/livecodebench-fr-prompt-en/prepare.py (1)
76-77: Remove extraneous f-prefix.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````

nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py (2)
1-166: Significant code duplication across prepare.py files. This file is nearly identical to livecodebench-ja-prompt-ja/prepare.py and livecodebench-fr-prompt-en/prepare.py. Consider extracting the shared logic (date utilities, clean_data, the prepare function, CLI handling) into a common module, with only the language-specific constants (dataset path, PromptConstants) varying per locale. This would reduce maintenance burden and ensure bug fixes propagate to all variants.
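One possible shape for such a common module, sketched below; all names (`LocaleConfig`, `build_question`) are hypothetical, not the real nemo_skills API, and each locale directory would then keep only its own constants:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocaleConfig:
    dataset_path: str                # per-locale dataset id (placeholder)
    fmt_with_starter_code: str       # PromptConstants text, per language
    fmt_without_starter_code: str

def build_question(cfg: LocaleConfig, statement: str, starter_code: str = "") -> str:
    # Stand-in for the question-assembly logic currently duplicated
    # across the per-locale prepare.py files.
    msg = cfg.fmt_with_starter_code if starter_code else cfg.fmt_without_starter_code
    parts = [statement, msg]
    if starter_code:
        parts.append(starter_code)
    return "\n\n".join(parts)
```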
76-77: Remove extraneous f-prefix.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````
```
@@ -0,0 +1,24 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Add a shebang or drop the executable bit.
Line 1: Static analysis flags this file as executable but it lacks a shebang, which breaks direct execution on Unix. Add a shebang or remove the executable bit if it’s not meant to be run directly.
🛠️ Suggested fix
```diff
+#!/usr/bin/env python3
+
 # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/aime25-de-prompt-de/prepare.py` at line 1, The file
prepare.py is marked executable but has no shebang; either add a proper Python
shebang or remove the executable bit. If this script is intended to be run
directly, insert a shebang line (e.g. /usr/bin/env python3) at the very top of
prepare.py; if it’s a module/library not meant to be executed, remove the
executable permission (chmod a-x) so static analysis no longer flags it as
executable.
```
@@ -0,0 +1,23 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Remove executable bit or add shebang.
Line 1: The file appears to be marked executable but lacks a shebang. If this module isn’t meant to be run directly, please remove the executable bit (e.g., chmod -x). If it is meant to be executable, add an appropriate shebang.
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-de-prompt-de/__init__.py` at line 1, The package
__init__.py file is marked executable but lacks a shebang; either remove the
executable permission or make it a true executable module: if this module is not
intended to be run, remove the executable bit (e.g., chmod -x) on
nemo_skills/dataset/gpqa-de-prompt-de/__init__.py, otherwise add an appropriate
shebang (e.g., #!/usr/bin/env python3) at the top of __init__.py so execution
behavior is correct.
```python
def format_entry(entry):
    choices = [
        preprocess(entry["Incorrect Answer 1"]),
        preprocess(entry["Incorrect Answer 2"]),
        preprocess(entry["Incorrect Answer 3"]),
        preprocess(entry["Correct Answer"]),
    ]

    random.shuffle(choices)
    correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
    return {
        "expected_answer": f"{chr(65 + correct_answer_index)}",
        "explanation": preprocess(entry["Explanation"]),
        "subset_for_metrics": entry["Subdomain"],
        "difficulty": (
            re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
            if entry["Writer's Difficulty Estimate"] is not None
            else None
        ),
        **get_mcq_fields(entry["Question"], choices),
    }
```
Protect expected_answer from duplicate choices after normalization.
If preprocess collapses answers into identical strings, choices.index(...) can point to the wrong option, corrupting expected_answer. Track correctness alongside each choice during shuffle to avoid collisions.
🐛 Proposed fix
def format_entry(entry):
- choices = [
- preprocess(entry["Incorrect Answer 1"]),
- preprocess(entry["Incorrect Answer 2"]),
- preprocess(entry["Incorrect Answer 3"]),
- preprocess(entry["Correct Answer"]),
- ]
-
- random.shuffle(choices)
- correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
+ choices = [
+ (preprocess(entry["Incorrect Answer 1"]), False),
+ (preprocess(entry["Incorrect Answer 2"]), False),
+ (preprocess(entry["Incorrect Answer 3"]), False),
+ (preprocess(entry["Correct Answer"]), True),
+ ]
+
+ random.shuffle(choices)
+ correct_answer_index = next(
+ i for i, (_, is_correct) in enumerate(choices) if is_correct
+ )
+ choice_texts = [choice for choice, _ in choices]
return {
"expected_answer": f"{chr(65 + correct_answer_index)}",
"explanation": preprocess(entry["Explanation"]),
"subset_for_metrics": entry["Subdomain"],
"difficulty": (
re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
if entry["Writer's Difficulty Estimate"] is not None
else None
),
- **get_mcq_fields(entry["Question"], choices),
+ **get_mcq_fields(entry["Question"], choice_texts),
}🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-de-prompt-de/prepare.py` around lines 40 - 60, In
format_entry, preprocess is called multiple times and if preprocessing causes
duplicates the index lookup can pick the wrong choice; to fix, construct choices
as pairs like (text=preprocess(...), is_correct=bool) for each of the four
answers (mark the Correct Answer True), shuffle the list of pairs, then build
the final choices list from the pair.text values and compute expected_answer by
finding the pair where is_correct is True (using its shuffled position =>
chr(65+index)); update get_mcq_fields call to receive the list of texts. Ensure
preprocess is invoked exactly once per original entry field (e.g., in
format_entry) and that you reference format_entry, preprocess, and
get_mcq_fields when making the change.
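A minimal repro of the index-collision hazard this fix addresses, using synthetic strings rather than real GPQA data:

```python
# After normalization, a distractor and the correct answer share the same text.
choices = ["42", "forty-two", "42", "7"]   # the correct answer sits at index 2
correct_text = "42"

# list.index returns the FIRST occurrence, so the recovered label is A, not C:
wrong_label = chr(65 + choices.index(correct_text))

# The review's fix tags each choice with a correctness flag before shuffling,
# so the label survives duplicates:
tagged = [("42", False), ("forty-two", False), ("42", True), ("7", False)]
right_label = chr(65 + next(i for i, (_, ok) in enumerate(tagged) if ok))

print(wrong_label, right_label)  # A C
```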
```python
def format_entry(entry):
    choices = [
        preprocess(entry["Incorrect Answer 1"]),
        preprocess(entry["Incorrect Answer 2"]),
        preprocess(entry["Incorrect Answer 3"]),
        preprocess(entry["Correct Answer"]),
    ]

    random.shuffle(choices)
    correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
    return {
        "expected_answer": f"{chr(65 + correct_answer_index)}",
        "explanation": preprocess(entry["Explanation"]),
        "subset_for_metrics": entry["Subdomain"],
        "difficulty": (
            re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
            if entry["Writer's Difficulty Estimate"] is not None
            else None
        ),
        **get_mcq_fields(entry["Question"], choices),
    }
```
Potential edge case: duplicate answer text after preprocessing.
If two choices have identical text after preprocessing, choices.index() returns the index of the first occurrence, which may not correspond to the actual correct answer position after shuffling. This could lead to an incorrect expected_answer label.
Consider storing the correct answer reference before shuffling:
🐛 Proposed fix
def format_entry(entry):
+ correct_answer = preprocess(entry["Correct Answer"])
choices = [
preprocess(entry["Incorrect Answer 1"]),
preprocess(entry["Incorrect Answer 2"]),
preprocess(entry["Incorrect Answer 3"]),
- preprocess(entry["Correct Answer"]),
+ correct_answer,
]
random.shuffle(choices)
- correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
+ correct_answer_index = choices.index(correct_answer)
return {
"expected_answer": f"{chr(65 + correct_answer_index)}",🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-es-prompt-en/prepare.py` around lines 40 - 60, In
format_entry, preprocessing can make the correct answer text duplicate another
choice so choices.index(preprocess(...)) may return the wrong index after
shuffle; fix by capturing the preprocessed correct answer (call preprocess on
entry["Correct Answer"] once) and either build choices as structures that
preserve identity (e.g., tuples of (text, is_correct)) or record the
correct-answer object before random.shuffle and compute correct_answer_index by
finding the item whose identity/flag matches the captured correct marker; update
get_mcq_fields usage accordingly so expected_answer is derived from that
preserved correct item rather than using choices.index on raw text.
```
@@ -0,0 +1,23 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Remove executable bit (no shebang).
Ruff flags this file as executable without a shebang. Since it’s an __init__.py, it should not be executable—please clear the executable flag (or add a shebang if execution is intended).
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-es-prompt-es/__init__.py` at line 1, This
__init__.py file currently has the executable bit set but contains no shebang;
remove the executable permission so it is a normal Python package init. Locate
nemo_skills/dataset/gpqa-es-prompt-es/__init__.py (module gpqa-es-prompt-es,
symbol __init__.py) and clear the executable flag (e.g., run chmod -x on the
file or use git update-index --chmod=-x <path>) and commit the permission change
so Ruff no longer flags it as executable.
```python
parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")
```
Fix incorrect help text.
Suggested fix
- parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
+ parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")🤖 Prompt for AI Agents
In `@nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py` around lines 143 -
144, The help strings for the CLI arguments are incorrect: update the
parser.add_argument calls for '--start_date' and '--end_date' so that
'--start_date' help says "Start date in YYYY-MM format" and '--end_date' help
says "End date in YYYY-MM format" (refer to the parser.add_argument declarations
for '--start_date' and '--end_date' to locate and edit the help text).
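A quick sanity check of the corrected help strings (argv passed explicitly; this mirrors the suggested fix, not the file's actual CLI wiring):

```python
import argparse

parser = argparse.ArgumentParser()
# Corrected: '--start_date' now describes the start, '--end_date' the end.
parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")
parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")

args = parser.parse_args(['--start_date', '2024-07', '--end_date', '2025-03'])
print(args.start_date, args.end_date)  # 2024-07 2025-03
```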
```python
with open(output_file_path, 'w') as f:
    for problem in data:
        input_date = datetime.strptime(problem['contest_date'], '%Y-%m-%dT%H:%M:%S').date()
        if start_date <= input_date <= end_date:
            json.dump(
```
🧩 Analysis chain

🏁 Scripts executed against NVIDIA-NeMo/Skills to trace how contest_date flows through the pipeline:

```shell
git ls-files | grep -E "(prepare\.py|livecodebench)" | head -20
cat -n nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py | sed -n '100,130p'
rg "contest_date" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -B 3 -A 3
head -30 nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py
rg "def parse_data" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 30
rg "def (parse|load|clean)" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 10
wc -l nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py
cat -n nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py | sed -n '45,95p'
rg "def clean_data" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 50
find nemo_skills/dataset -name "prepare.py" -path "*livecodebench*" | head -10
cat nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py | grep -A 5 "contest_date"
ls -la nemo_skills/dataset/ | grep livecodebench
grep -A 5 "contest_date" nemo_skills/dataset/livecodebench/prepare.py
cat nemo_skills/dataset/livecodebench/prepare.py | grep -B 2 -A 2 "strptime"
cat -n nemo_skills/dataset/livecodebench/prepare.py | sed -n '100,130p'
```

The dataset schema was also checked directly (after `pip install datasets -q`):

```python
from datasets import load_dataset

data = load_dataset("livecodebench/code_generation_lite", split="test", version_tag="release_v5")
row = data[0]
print("Type:", type(row["contest_date"]))
print("Value:", row["contest_date"])
print("Repr:", repr(row["contest_date"]))
```
Handle contest_date when it's a datetime object instead of a string.
The schema comment on line 38 documents contest_date: datetime, but line 116 assumes a string format with datetime.strptime(). This mismatch will cause a runtime error if the dataset yields datetime/date objects as documented. Normalize the input to handle both cases.
Suggested fix
@@
-from datetime import datetime
+from datetime import datetime, date
+from dateutil.parser import isoparse
@@
- input_date = datetime.strptime(problem['contest_date'], '%Y-%m-%dT%H:%M:%S').date()
+ contest_date = problem["contest_date"]
+ if isinstance(contest_date, datetime):
+ input_date = contest_date.date()
+ elif isinstance(contest_date, date):
+ input_date = contest_date
+ else:
+ input_date = isoparse(contest_date).date()🤖 Prompt for AI Agents
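The same normalization can also be sketched stdlib-only (no `dateutil` dependency), assuming the field is either an ISO-8601 string or a `datetime`/`date` object; `normalize_contest_date` is a hypothetical helper name:

```python
from datetime import datetime, date

def normalize_contest_date(value):
    """Return a date whether contest_date is a datetime, a date, or an ISO string."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    # fall back to parsing strings like '2024-05-01T00:00:00'
    return datetime.fromisoformat(value).date()
```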
    class PromptConstants:
        # reference: https://github.com/QwenLM/Qwen2.5-Coder/blob/main/qwencoder-eval/reasoning/livecode_bench_cot/lcb_runner_cq/prompts/code_generation.py#L31
        FORMATTING_MESSAGE_WITH_STARTER_CODE = "You will use the following starter code to write the solution to the problem and enclose your code within delimiters."
        FORMATTING_WITHOUT_STARTER_CODE = "Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT."
English prompts in Japanese-prompt file.
The directory name livecodebench-ja-prompt-ja suggests Japanese prompts, but PromptConstants contains English text. These formatting messages should be translated to Japanese to match the intended localization.
    def parse_data(release_version='release_latest'):
        data = load_dataset(
            "livecodebench/code_generation_lite", split="test", version_tag=release_version, trust_remote_code=True
        )
PromptConstants should be in Japanese for the -ja-prompt-ja variant.
The PromptConstants class contains English formatting messages (lines 27-28), but this file is in the livecodebench-ja-prompt-ja directory, which by naming convention should provide Japanese prompts. These constants are embedded into the dataset and should be translated to Japanese. Compare to the eval/livecodebench/python_codegen_ja config referenced in __init__.py, which suggests Japanese evaluation support is intended.
Note: The dataset source (livecodebench/code_generation_lite) is English-only per official LiveCodeBench design; this is expected and not a problem. However, the instruction text in PromptConstants should be localized.
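For reference, a localized version of the two constants could look roughly like this — the Japanese wording below is an illustrative draft, not a vetted translation, and should be reviewed by a native speaker before shipping:

```python
class PromptConstants:
    # Illustrative Japanese drafts of the two formatting messages;
    # wording should be reviewed before use in the actual dataset.
    FORMATTING_MESSAGE_WITH_STARTER_CODE = (
        "以下のスターターコードを使用して問題の解答を作成し、"
        "コードをデリミタで囲んでください。"
    )
    FORMATTING_WITHOUT_STARTER_CODE = (
        "標準入力から入力を読み取り、問題を解いて答えを標準出力に書き出してください"
        "(サンプル入力で直接テストしないでください)。"
        "プログラムの実行時に入力を読み取り、アルゴリズムを実行して結果をSTDOUTに"
        "書き出すことを確認し、コードは以下のようにデリミタで囲んでください。"
    )
```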
    parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
    parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")
Fix incorrect help text.
The --start_date argument has help text saying "End date in YYYY-MM format" — this appears to be a copy-paste error.
Suggested fix
-    parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
+    parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py` around lines 143 -
144, The help text for the CLI argument defined by
parser.add_argument('--start_date', type=str, default='all', help="End date in
YYYY-MM format") is incorrect; update the help string for the '--start_date'
argument to accurately describe it (e.g., "Start date in YYYY-MM format") while
keeping the '--end_date' help as the end-date description so both
parser.add_argument('--start_date', ...) and parser.add_argument('--end_date',
...) have correct, non-duplicated help messages.
shuoyangd left a comment:
Thanks for adding this PR! I left a few questions, mostly about the metric. We should aim to make this one compatible for both the test sets you added as well as MMLU-ProX.
    class MathMultilingualMetrics(BaseMetrics):
        # TODO: how can we ensure that user-defined aggregations have all the same metrics as in base?
Not sure what this means. Is this still relevant?
    LOG = logging.getLogger(get_logger_name(__file__))


    class MathMultilingualMetrics(BaseMetrics):
I don't think this only covers math datasets? Maybe use a more general name?
    for score_method in score_dicts[0].keys():
        # Get valid answers and their results for this field
        valid_answers_and_results = [
            (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
I've never used evaluation with reward model before -- is this a feature in the public branch? If not we can remove it for now.
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        correctness_dict = {}
        if "target_language" in prediction:
Rather than preparing each individual language as a separate dataset, IMO the cleaner way is to set them as subset_for_metrics field (example). This way, you can have one dataset for all the languages, while still having the model automatically report breakdowns per-language. Plus you get averages across all languages for free.
MMLU-ProX currently uses this scheme. I would recommend we stick to this, and it should be a lightweight change compared to what we have here.
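The idea can be sketched independently of the NeMo-Skills API (key names here are illustrative, not the actual `subset_for_metrics` implementation): each example carries a subset label — its language — and one pass over the combined dataset yields both per-language accuracy and the average across languages:

```python
from collections import defaultdict

def aggregate_by_subset(results):
    """Per-subset accuracy plus the unweighted average across subsets."""
    per_subset = defaultdict(lambda: {"correct": 0, "total": 0})
    for res in results:
        bucket = per_subset[res["subset"]]
        bucket["total"] += 1
        bucket["correct"] += int(res["is_correct"])
    metrics = {s: b["correct"] / b["total"] for s, b in per_subset.items()}
    metrics["average"] = sum(metrics.values()) / len(metrics)
    return metrics
```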
    if "symbolic_correct" in prediction:
        correctness_dict["symbolic_correct"] = prediction["symbolic_correct"]
        correctness_dict["symbolic_and_language_correct"] = language_correct and prediction["symbolic_correct"]
    if "judgement" in prediction:
Same sanity check as for reward models -- is this judgement field also what we use in the main branch? If not, we can also remove it for now.
(Also, I think "judgment" is the more widely used spelling.)
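If the judgement path stays, the per-criterion pattern in the quoted code generalizes cleanly. A sketch — field names follow the quoted diff, but `combine_scores` and the judge output format are assumptions, not the branch's API:

```python
def combine_scores(prediction, language_correct):
    """For each base criterion present, also report a *_and_language_correct variant."""
    scores = {"language_correct": language_correct}
    if "symbolic_correct" in prediction:
        scores["symbolic_correct"] = prediction["symbolic_correct"]
        scores["symbolic_and_language_correct"] = language_correct and prediction["symbolic_correct"]
    if "judgement" in prediction:
        # assumes the judge emits a line starting with 'Judgement: correct' or 'Judgement: incorrect'
        judged = prediction["judgement"].strip().lower().startswith("judgement: correct")
        scores["judge_correct"] = judged
        scores["judge_and_language_correct"] = language_correct and judged
    return scores
```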
    @@ -0,0 +1,20 @@
    # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Maybe we keep those in subfolders? E.g. aime25-X, and then inside you'd have de-prompt-de, etc. Also, there seems to be quite a bit of code duplication across these prepare.py files; maybe we can unify some of it in shared utilities.
@zijiachen95 please re-create PR from a branch, I just sent you an invite
Add multilingual AIME25, GPQA, and LCB, plus the corresponding metrics calculation code.
The source data is not included yet.