add multilingual aime25, gpqa and lcb #1194
zijiachen95 wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
```python
    "ioi": IOIMetrics,
    "icpc": ICPCMetrics,
    "multichoice": MathMetrics,
    "multichoice_multilingual": MathMultilingualMetrics,
```
math_multilingual metrics type is missing from METRICS_MAP. The AIME25 datasets reference METRICS_TYPE = "math_multilingual" but only multichoice_multilingual is registered here. Add entry:
```diff
  "multichoice_multilingual": MathMultilingualMetrics,
+ "math_multilingual": MathMultilingualMetrics,
```
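The failure mode can be sketched end-to-end; the stub classes and the `get_metrics_class` helper below are illustrative only, not the actual nemo_skills API:

```python
# Stub classes standing in for the real metrics implementations.
class MathMetrics: ...
class MathMultilingualMetrics: ...

METRICS_MAP = {
    "multichoice": MathMetrics,
    "multichoice_multilingual": MathMultilingualMetrics,
    # Without this entry, datasets declaring METRICS_TYPE = "math_multilingual" fail:
    "math_multilingual": MathMultilingualMetrics,
}

def get_metrics_class(metrics_type):
    # Registry lookup: an unregistered METRICS_TYPE surfaces as an error here.
    try:
        return METRICS_MAP[metrics_type]
    except KeyError:
        raise ValueError(f"Unknown metrics type: {metrics_type!r}") from None
```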
```python
# EVAL_SPLIT = 'test_v5_2408_2502'
# EVAL_SPLIT = 'test_v5_2407_2412'
EVAL_SPLIT = 'test_v5_2407_2503'
# EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
```
Incomplete comment - missing closing quote
```diff
- # EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
+ # EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench"
```
```python
from nemo_skills.evaluation.metrics.utils import is_correct_judgement
from nemo_skills.utils import get_logger_name
from langdetect import detect, DetectorFactory, LangDetectException
import sys
```
`import sys` is unused and should be removed.
📝 Walkthrough

This PR adds comprehensive multilingual dataset support to NeMo Skills, including AIME25, GPQA, and LiveCodeBench datasets across multiple languages (German, Spanish, French, Japanese). It introduces dataset-specific evaluation configurations, data preparation scripts, a new multilingual math metrics class, and corresponding prompt templates in YAML format.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 18
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/aime25-de-prompt-de/prepare.py`:
- Line 1: The file prepare.py is marked executable but has no shebang; either
add a proper Python shebang or remove the executable bit. If this script is
intended to be run directly, insert a shebang line (e.g. /usr/bin/env python3)
at the very top of prepare.py; if it’s a module/library not meant to be
executed, remove the executable permission (chmod a-x) so static analysis no
longer flags it as executable.
In `@nemo_skills/dataset/gpqa-de-prompt-de/__init__.py`:
- Line 1: The package __init__.py file is marked executable but lacks a shebang;
either remove the executable permission or make it a true executable module: if
this module is not intended to be run, remove the executable bit (e.g., chmod
-x) on nemo_skills/dataset/gpqa-de-prompt-de/__init__.py, otherwise add an
appropriate shebang (e.g., #!/usr/bin/env python3) at the top of __init__.py so
execution behavior is correct.
In `@nemo_skills/dataset/gpqa-de-prompt-de/prepare.py`:
- Around line 40-60: In format_entry, preprocess is called multiple times and if
preprocessing causes duplicates the index lookup can pick the wrong choice; to
fix, construct choices as pairs like (text=preprocess(...), is_correct=bool) for
each of the four answers (mark the Correct Answer True), shuffle the list of
pairs, then build the final choices list from the pair.text values and compute
expected_answer by finding the pair where is_correct is True (using its shuffled
position => chr(65+index)); update get_mcq_fields call to receive the list of
texts. Ensure preprocess is invoked exactly once per original entry field (e.g.,
in format_entry) and that you reference format_entry, preprocess, and
get_mcq_fields when making the change.
In `@nemo_skills/dataset/gpqa-es-prompt-en/prepare.py`:
- Around line 40-60: In format_entry, preprocessing can make the correct answer
text duplicate another choice so choices.index(preprocess(...)) may return the
wrong index after shuffle; fix by capturing the preprocessed correct answer
(call preprocess on entry["Correct Answer"] once) and either build choices as
structures that preserve identity (e.g., tuples of (text, is_correct)) or record
the correct-answer object before random.shuffle and compute correct_answer_index
by finding the item whose identity/flag matches the captured correct marker;
update get_mcq_fields usage accordingly so expected_answer is derived from that
preserved correct item rather than using choices.index on raw text.
In `@nemo_skills/dataset/gpqa-es-prompt-es/__init__.py`:
- Line 1: This __init__.py file currently has the executable bit set but
contains no shebang; remove the executable permission so it is a normal Python
package init. Locate nemo_skills/dataset/gpqa-es-prompt-es/__init__.py (module
gpqa-es-prompt-es, symbol __init__.py) and clear the executable flag (e.g., run
chmod -x on the file or use git update-index --chmod=-x <path>) and commit the
permission change so Ruff no longer flags it as executable.
In `@nemo_skills/dataset/gpqa-ja-prompt-ja/__init__.py`:
- Line 1: This __init__.py is marked executable but has no shebang; clear the
executable permission on the file (remove the executable bit) so it is a normal
module rather than a script—i.e., update file permissions for
nemo_skills/dataset/gpqa-ja-prompt-ja/__init__.py (or if execution was intended,
instead add an appropriate shebang to the top), but prefer removing the
executable bit for this package __init__.py.
In `@nemo_skills/dataset/livecodebench-de-prompt-de/prepare.py`:
- Around line 98-129: The code in prepare() calls
datetime.strptime(problem['contest_date'], ...) but problem['contest_date'] may
already be a datetime object, causing a TypeError; update clean_data() to ensure
the contest_date field is always a string (e.g., cast problem['contest_date'] =
str(problem['contest_date']) or use a dedicated cast_column for 'contest_date'),
or modify the loop in prepare() to handle both types by converting to string
when necessary (e.g., use str(problem['contest_date']) before calling
datetime.strptime); change should reference the prepare() function's parsing
logic and the clean_data()/parse_data() pipeline so contest_date is consistently
a string before parsing.
In `@nemo_skills/dataset/livecodebench-de-prompt-en/prepare.py`:
- Around line 114-118: The loop that computes input_date from
problem['contest_date'] assumes a string and calls datetime.strptime, which
fails if contest_date is already a datetime; update the logic around
problem['contest_date'] in the block that sets input_date (used with start_date
and end_date) to handle both types: if isinstance(problem['contest_date'],
datetime) call .date() on it, otherwise parse the string with
datetime.strptime(...).date(); ensure the variable name input_date and the
comparison start_date <= input_date <= end_date remain unchanged.
In `@nemo_skills/dataset/livecodebench-es-prompt-en/__init__.py`:
- Around line 16-23: The EVAL_SPLIT 'test_v5_2407_2503' referenced in this
module (see EVAL_SPLIT, PROMPT_CONFIG, DATASET_GROUP, METRICS_TYPE) does not
exist in the dataset preparation DEFAULT_SPLITS, causing runtime failures; fix
by either adding the tuple ('v5','2024-07','2025-03') to DEFAULT_SPLITS in the
dataset's prepare.py (so prepare.py generates test_v5_2407_2503.jsonl) or change
EVAL_SPLIT here to an existing split such as 'test_v5_2408_2502' so the
evaluation points to a generated file.
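For illustration, the split-name convention these options imply (test_&lt;version&gt;_&lt;YYMM&gt;_&lt;YYMM&gt;) can be sketched with a hypothetical helper; the tuple layout is assumed from the suggestion above, not taken from the real prepare.py:

```python
def split_name(version, start_ym, end_ym):
    # "2024-07" -> "2407": keep the two-digit year and the month.
    compact = lambda ym: ym[2:4] + ym[5:7]
    return f"test_{version}_{compact(start_ym)}_{compact(end_ym)}"

# ('v5', '2024-07', '2025-03') would then yield 'test_v5_2407_2503',
# matching the EVAL_SPLIT the module expects.
```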
In `@nemo_skills/dataset/livecodebench-es-prompt-en/prepare.py`:
- Around line 114-118: The code incorrectly treats problem['contest_date'] as a
string and calls datetime.strptime; instead, handle the schema-typed datetime by
obtaining the date directly from the datetime object (use
problem['contest_date'].date()), or defensively check the type and parse only if
it's a str; update the block around contest_date handling used with start_date
and end_date so input_date is derived from the actual datetime object rather
than calling strptime on it.
In `@nemo_skills/dataset/livecodebench-es-prompt-es/prepare.py`:
- Around line 114-118: The loop that parses problem['contest_date'] assumes it's
a string and calls datetime.strptime, which will TypeError if contest_date is
already a datetime; update the parsing in the for-loop that produces input_date
(the code that currently calls datetime.strptime on problem['contest_date'] and
assigns to input_date) to handle both types: if
isinstance(problem['contest_date'], datetime) use its .date() (or .date() after
ensuring tz handling), otherwise cast to str and parse with datetime.strptime;
keep the variable names input_date, start_date, end_date and the same
conditional (if start_date <= input_date <= end_date) so the rest of the logic
is unchanged.
In `@nemo_skills/dataset/livecodebench-fr-prompt-en/prepare.py`:
- Around line 143-144: The help string for the argument parser is wrong: update
the call to parser.add_argument for '--start_date' so its help correctly says
"Start date in YYYY-MM format" (while '--end_date' keeps "End date in YYYY-MM
format"); locate the parser.add_argument invocation for '--start_date' in
prepare.py and replace the incorrect "End date" text with "Start date" to match
the flag name.
In `@nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py`:
- Around line 25-28: PromptConstants contains English prompt strings but this
file is for French prompts; update the two constants
FORMATTING_MESSAGE_WITH_STARTER_CODE and FORMATTING_WITHOUT_STARTER_CODE in
class PromptConstants to French equivalents that preserve the original meaning
and instructions (use starter code and delimiters; read input from stdin, do not
run sample tests, output to stdout, enclose code in delimiters). Ensure the
translations are clear, in French, and keep references to "starter code",
"stdin", "stdout" and "delimiters" so callers relying on those keywords still
understand the intent.
- Around line 143-144: The help strings for the CLI arguments are incorrect:
update the parser.add_argument calls for '--start_date' and '--end_date' so that
'--start_date' help says "Start date in YYYY-MM format" and '--end_date' help
says "End date in YYYY-MM format" (refer to the parser.add_argument declarations
for '--start_date' and '--end_date' to locate and edit the help text).
In `@nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py`:
- Around line 114-118: The code assumes problem['contest_date'] is a string and
uses datetime.strptime, but the schema documents contest_date as a datetime
object; update the block that sets input_date (inside the with open loop) to
normalize both cases: check the type of problem['contest_date'], if it's a str
parse with datetime.strptime(..., '%Y-%m-%dT%H:%M:%S'), if it's a datetime use
.date(), if it's a date use it directly, otherwise raise/skip with a clear
error; ensure the rest of the existing comparison using input_date against
start_date and end_date remains unchanged.
In `@nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py`:
- Around line 25-28: PromptConstants currently contains English strings but
lives in a Japanese prompt module; update the two constants
FORMATTING_MESSAGE_WITH_STARTER_CODE and FORMATTING_WITHOUT_STARTER_CODE to
Japanese equivalents that preserve the original intent (instructions about using
starter code, enclosing code within delimiters, reading stdin/writing stdout,
and the warning about not directly testing on sample inputs), keeping
punctuation and any special phrasing intact so callers of PromptConstants need
no further changes.
- Around line 143-144: The help text for the CLI argument defined by
parser.add_argument('--start_date', type=str, default='all', help="End date in
YYYY-MM format") is incorrect; update the help string for the '--start_date'
argument to accurately describe it (e.g., "Start date in YYYY-MM format") while
keeping the '--end_date' help as the end-date description so both
parser.add_argument('--start_date', ...) and parser.add_argument('--end_date',
...) have correct, non-duplicated help messages.
- Around line 31-34: PromptConstants currently contains English prompt templates
but this variant (livecodebench-ja-prompt-ja) must use Japanese; open the
PromptConstants class in
nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py and replace the
English strings (e.g., instruction/format templates defined around
PromptConstants) with their Japanese equivalents, preserving variable
placeholders and formatting, then run any local tests that embed these constants
to confirm no formatting errors; ensure the class name PromptConstants and any
references (e.g., where parse_data or dataset preparation uses PromptConstants)
remain unchanged.
🧹 Nitpick comments (9)
nemo_skills/evaluation/metrics/math_multilingual_metrics.py (1)
31-36: Please track/resolve the TODO on aggregation parity. Leaving TODOs here can hide missing behavior as the metrics evolve. If you'd like, I can draft an issue or propose a small check to enforce parity.
nemo_skills/dataset/gpqa-fr-prompt-fr/prepare.py (1)
31-37: Consider returning an empty string for None input. Returning a single space " " for None input may cause subtle formatting issues in the generated prompts. An empty string might be more appropriate unless there's a specific reason for the space.

♻️ Suggested change

```diff
 def preprocess(text):
     if text is None:
-        return " "
+        return ""
     text = text.strip()
     text = text.replace(" [title]", ". ")
     text = text.replace(" ", " ")
     return text
```

nemo_skills/dataset/livecodebench-ja-prompt-ja/__init__.py (1)
19-23: Consider removing commented-out code. These commented-out lines appear to be alternative configurations or work-in-progress artifacts. If they serve as documentation for valid split options, consider converting them to a proper comment or docstring. Otherwise, they can be removed to keep the file clean.

♻️ Suggested cleanup

```diff
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
 PROMPT_CONFIG = 'eval/livecodebench/python_codegen_ja'
 DATASET_GROUP = 'code'
 METRICS_TYPE = 'livecodebench'
-# EVAL_SPLIT = 'test_v5_2408_2502'
-# EVAL_SPLIT = 'test_v5_2407_2412'
 EVAL_SPLIT = 'test_v5_2407_2503'
-# EVAL_ARGS = "++eval_type=livecodebench ++eval_config.dataset=livecodebench
 EVAL_ARGS = "++eval_type=livecodebench"
 GENERATION_ARGS = ""
```

nemo_skills/dataset/gpqa-es-prompt-en/prepare.py (1)
31-97: Significant code duplication across GPQA prepare.py files. This file is nearly identical to gpqa-de-prompt-en/prepare.py, gpqa-ja-prompt-en/prepare.py, gpqa-ja-prompt-ja/prepare.py, and others. Consider extracting the common logic into a shared module (e.g., nemo_skills/dataset/gpqa_utils.py) that each language variant can import and optionally customize. This would reduce maintenance burden when changes are needed across all variants.

nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py (2)
nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py (2)
76-77: Remove extraneous f-prefix. Line 77 uses an f-string without any placeholders.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````
50-66: Consider using exception chaining. When re-raising exceptions, use raise ... from err to preserve the exception chain for better debugging.

Suggested fix

```diff
 def get_first_last_day(year_month_str):
     try:
         date_obj = datetime.strptime(year_month_str, "%Y-%m")
         first_day = date_obj.date().replace(day=1)
         last_day = (date_obj + relativedelta(months=1, days=-1)).date()
         return first_day, last_day
-    except ValueError:
-        raise ValueError("Invalid date format. Please use '%Y-%m'.")
+    except ValueError as err:
+        raise ValueError("Invalid date format. Please use '%Y-%m'.") from err

 def parse_month_range(start_date, end_date):
     try:
         start_date, _ = get_first_last_day(start_date)
         _, end_date = get_first_last_day(end_date)
         return start_date, end_date
     except ValueError as e:
-        raise ValueError(str(e))
+        raise ValueError(str(e)) from e
```

nemo_skills/dataset/livecodebench-fr-prompt-en/prepare.py (1)
76-77: Remove extraneous f-prefix.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````

nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py (2)
1-166: Significant code duplication across prepare.py files. This file is nearly identical to livecodebench-ja-prompt-ja/prepare.py and livecodebench-fr-prompt-en/prepare.py. Consider extracting the shared logic (date utilities, clean_data, the prepare function, CLI handling) into a common module, with only the language-specific constants (dataset path, PromptConstants) varying per locale. This would reduce maintenance burden and ensure bug fixes propagate to all variants.
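One possible shape for such a common module, sketched below; all names (`LocaleConfig`, `build_question`) are hypothetical, not the real nemo_skills API, and each locale directory would then keep only its own constants:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocaleConfig:
    dataset_path: str                # per-locale dataset id (placeholder)
    fmt_with_starter_code: str       # PromptConstants text, per language
    fmt_without_starter_code: str

def build_question(cfg: LocaleConfig, statement: str, starter_code: str = "") -> str:
    # Stand-in for the question-assembly logic currently duplicated
    # across the per-locale prepare.py files.
    msg = cfg.fmt_with_starter_code if starter_code else cfg.fmt_without_starter_code
    parts = [statement, msg]
    if starter_code:
        parts.append(starter_code)
    return "\n\n".join(parts)
```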
76-77: Remove extraneous f-prefix.

Suggested fix

````diff
- question += f"```python\n# YOUR CODE HERE\n```\n\n"
+ question += "```python\n# YOUR CODE HERE\n```\n\n"
````
```
@@ -0,0 +1,24 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Add a shebang or drop the executable bit.
Line 1: Static analysis flags this file as executable but it lacks a shebang, which breaks direct execution on Unix. Add a shebang or remove the executable bit if it’s not meant to be run directly.
🛠️ Suggested fix
```diff
+#!/usr/bin/env python3
+
 # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/aime25-de-prompt-de/prepare.py` at line 1, The file
prepare.py is marked executable but has no shebang; either add a proper Python
shebang or remove the executable bit. If this script is intended to be run
directly, insert a shebang line (e.g. /usr/bin/env python3) at the very top of
prepare.py; if it’s a module/library not meant to be executed, remove the
executable permission (chmod a-x) so static analysis no longer flags it as
executable.
```
@@ -0,0 +1,23 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Remove executable bit or add shebang.
Line 1: The file appears to be marked executable but lacks a shebang. If this module isn’t meant to be run directly, please remove the executable bit (e.g., chmod -x). If it is meant to be executable, add an appropriate shebang.
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-de-prompt-de/__init__.py` at line 1, The package
__init__.py file is marked executable but lacks a shebang; either remove the
executable permission or make it a true executable module: if this module is not
intended to be run, remove the executable bit (e.g., chmod -x) on
nemo_skills/dataset/gpqa-de-prompt-de/__init__.py, otherwise add an appropriate
shebang (e.g., #!/usr/bin/env python3) at the top of __init__.py so execution
behavior is correct.
```python
def format_entry(entry):
    choices = [
        preprocess(entry["Incorrect Answer 1"]),
        preprocess(entry["Incorrect Answer 2"]),
        preprocess(entry["Incorrect Answer 3"]),
        preprocess(entry["Correct Answer"]),
    ]

    random.shuffle(choices)
    correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
    return {
        "expected_answer": f"{chr(65 + correct_answer_index)}",
        "explanation": preprocess(entry["Explanation"]),
        "subset_for_metrics": entry["Subdomain"],
        "difficulty": (
            re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
            if entry["Writer's Difficulty Estimate"] is not None
            else None
        ),
        **get_mcq_fields(entry["Question"], choices),
    }
```
Protect expected_answer from duplicate choices after normalization.
If preprocess collapses answers into identical strings, choices.index(...) can point to the wrong option, corrupting expected_answer. Track correctness alongside each choice during shuffle to avoid collisions.
🐛 Proposed fix
def format_entry(entry):
- choices = [
- preprocess(entry["Incorrect Answer 1"]),
- preprocess(entry["Incorrect Answer 2"]),
- preprocess(entry["Incorrect Answer 3"]),
- preprocess(entry["Correct Answer"]),
- ]
-
- random.shuffle(choices)
- correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
+ choices = [
+ (preprocess(entry["Incorrect Answer 1"]), False),
+ (preprocess(entry["Incorrect Answer 2"]), False),
+ (preprocess(entry["Incorrect Answer 3"]), False),
+ (preprocess(entry["Correct Answer"]), True),
+ ]
+
+ random.shuffle(choices)
+ correct_answer_index = next(
+ i for i, (_, is_correct) in enumerate(choices) if is_correct
+ )
+ choice_texts = [choice for choice, _ in choices]
return {
"expected_answer": f"{chr(65 + correct_answer_index)}",
"explanation": preprocess(entry["Explanation"]),
"subset_for_metrics": entry["Subdomain"],
"difficulty": (
re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
if entry["Writer's Difficulty Estimate"] is not None
else None
),
- **get_mcq_fields(entry["Question"], choices),
+ **get_mcq_fields(entry["Question"], choice_texts),
}🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-de-prompt-de/prepare.py` around lines 40 - 60, In
format_entry, preprocess is called multiple times and if preprocessing causes
duplicates the index lookup can pick the wrong choice; to fix, construct choices
as pairs like (text=preprocess(...), is_correct=bool) for each of the four
answers (mark the Correct Answer True), shuffle the list of pairs, then build
the final choices list from the pair.text values and compute expected_answer by
finding the pair where is_correct is True (using its shuffled position =>
chr(65+index)); update get_mcq_fields call to receive the list of texts. Ensure
preprocess is invoked exactly once per original entry field (e.g., in
format_entry) and that you reference format_entry, preprocess, and
get_mcq_fields when making the change.
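A minimal repro of the index-collision hazard this fix addresses, using synthetic strings rather than real GPQA data:

```python
# After normalization, a distractor and the correct answer share the same text.
choices = ["42", "forty-two", "42", "7"]   # the correct answer sits at index 2
correct_text = "42"

# list.index returns the FIRST occurrence, so the recovered label is A, not C:
wrong_label = chr(65 + choices.index(correct_text))

# The review's fix tags each choice with a correctness flag before shuffling,
# so the label survives duplicates:
tagged = [("42", False), ("forty-two", False), ("42", True), ("7", False)]
right_label = chr(65 + next(i for i, (_, ok) in enumerate(tagged) if ok))

print(wrong_label, right_label)  # A C
```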
```python
def format_entry(entry):
    choices = [
        preprocess(entry["Incorrect Answer 1"]),
        preprocess(entry["Incorrect Answer 2"]),
        preprocess(entry["Incorrect Answer 3"]),
        preprocess(entry["Correct Answer"]),
    ]

    random.shuffle(choices)
    correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
    return {
        "expected_answer": f"{chr(65 + correct_answer_index)}",
        "explanation": preprocess(entry["Explanation"]),
        "subset_for_metrics": entry["Subdomain"],
        "difficulty": (
            re.split(r'\s*\(', entry["Writer's Difficulty Estimate"])[0]
            if entry["Writer's Difficulty Estimate"] is not None
            else None
        ),
        **get_mcq_fields(entry["Question"], choices),
    }
```
Potential edge case: duplicate answer text after preprocessing.
If two choices have identical text after preprocessing, choices.index() returns the index of the first occurrence, which may not correspond to the actual correct answer position after shuffling. This could lead to an incorrect expected_answer label.
Consider storing the correct answer reference before shuffling:
🐛 Proposed fix
def format_entry(entry):
+ correct_answer = preprocess(entry["Correct Answer"])
choices = [
preprocess(entry["Incorrect Answer 1"]),
preprocess(entry["Incorrect Answer 2"]),
preprocess(entry["Incorrect Answer 3"]),
- preprocess(entry["Correct Answer"]),
+ correct_answer,
]
random.shuffle(choices)
- correct_answer_index = choices.index(preprocess(entry["Correct Answer"]))
+ correct_answer_index = choices.index(correct_answer)
return {
"expected_answer": f"{chr(65 + correct_answer_index)}",🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-es-prompt-en/prepare.py` around lines 40 - 60, In
format_entry, preprocessing can make the correct answer text duplicate another
choice so choices.index(preprocess(...)) may return the wrong index after
shuffle; fix by capturing the preprocessed correct answer (call preprocess on
entry["Correct Answer"] once) and either build choices as structures that
preserve identity (e.g., tuples of (text, is_correct)) or record the
correct-answer object before random.shuffle and compute correct_answer_index by
finding the item whose identity/flag matches the captured correct marker; update
get_mcq_fields usage accordingly so expected_answer is derived from that
preserved correct item rather than using choices.index on raw text.
```
@@ -0,0 +1,23 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Remove executable bit (no shebang).
Ruff flags this file as executable without a shebang. Since it’s an __init__.py, it should not be executable—please clear the executable flag (or add a shebang if execution is intended).
🧰 Tools
🪛 Ruff (0.14.14)
1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/gpqa-es-prompt-es/__init__.py` at line 1, This
__init__.py file currently has the executable bit set but contains no shebang;
remove the executable permission so it is a normal Python package init. Locate
nemo_skills/dataset/gpqa-es-prompt-es/__init__.py (module gpqa-es-prompt-es,
symbol __init__.py) and clear the executable flag (e.g., run chmod -x on the
file or use git update-index --chmod=-x <path>) and commit the permission change
so Ruff no longer flags it as executable.
```python
parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")
```
Fix incorrect help text.
Suggested fix
- parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
+ parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")🤖 Prompt for AI Agents
In `@nemo_skills/dataset/livecodebench-fr-prompt-fr/prepare.py` around lines 143 -
144, The help strings for the CLI arguments are incorrect: update the
parser.add_argument calls for '--start_date' and '--end_date' so that
'--start_date' help says "Start date in YYYY-MM format" and '--end_date' help
says "End date in YYYY-MM format" (refer to the parser.add_argument declarations
for '--start_date' and '--end_date' to locate and edit the help text).
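A quick sanity check of the corrected help strings (argv passed explicitly; this mirrors the suggested fix, not the file's actual CLI wiring):

```python
import argparse

parser = argparse.ArgumentParser()
# Corrected: '--start_date' now describes the start, '--end_date' the end.
parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")
parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")

args = parser.parse_args(['--start_date', '2024-07', '--end_date', '2025-03'])
print(args.start_date, args.end_date)  # 2024-07 2025-03
```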
```python
with open(output_file_path, 'w') as f:
    for problem in data:
        input_date = datetime.strptime(problem['contest_date'], '%Y-%m-%dT%H:%M:%S').date()
        if start_date <= input_date <= end_date:
            json.dump(
```
🧩 Analysis chain

🏁 Scripts executed against NVIDIA-NeMo/Skills to trace how contest_date flows through the pipeline:

```shell
git ls-files | grep -E "(prepare\.py|livecodebench)" | head -20
cat -n nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py | sed -n '100,130p'
rg "contest_date" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -B 3 -A 3
head -30 nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py
rg "def parse_data" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 30
rg "def (parse|load|clean)" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 10
wc -l nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py
cat -n nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py | sed -n '45,95p'
rg "def clean_data" nemo_skills/dataset/livecodebench-ja-prompt-en/prepare.py -A 50
find nemo_skills/dataset -name "prepare.py" -path "*livecodebench*" | head -10
cat nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py | grep -A 5 "contest_date"
ls -la nemo_skills/dataset/ | grep livecodebench
grep -A 5 "contest_date" nemo_skills/dataset/livecodebench/prepare.py
cat nemo_skills/dataset/livecodebench/prepare.py | grep -B 2 -A 2 "strptime"
cat -n nemo_skills/dataset/livecodebench/prepare.py | sed -n '100,130p'
```

The dataset schema was also checked directly (after `pip install datasets -q`):

```python
from datasets import load_dataset

data = load_dataset("livecodebench/code_generation_lite", split="test", version_tag="release_v5")
row = data[0]
print("Type:", type(row["contest_date"]))
print("Value:", row["contest_date"])
print("Repr:", repr(row["contest_date"]))
```
Handle contest_date when it's a datetime object instead of a string.
The schema comment on line 38 documents contest_date: datetime, but line 116 assumes a string format with datetime.strptime(). This mismatch will cause a runtime error if the dataset yields datetime/date objects as documented. Normalize the input to handle both cases.
Suggested fix
@@
-from datetime import datetime
+from datetime import datetime, date
+from dateutil.parser import isoparse
@@
- input_date = datetime.strptime(problem['contest_date'], '%Y-%m-%dT%H:%M:%S').date()
+ contest_date = problem["contest_date"]
+ if isinstance(contest_date, datetime):
+ input_date = contest_date.date()
+ elif isinstance(contest_date, date):
+ input_date = contest_date
+ else:
+ input_date = isoparse(contest_date).date()🤖 Prompt for AI Agents
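The same normalization can also be sketched stdlib-only (no `dateutil` dependency), assuming the field is either an ISO-8601 string or a `datetime`/`date` object; `normalize_contest_date` is a hypothetical helper name:

```python
from datetime import datetime, date

def normalize_contest_date(value):
    """Return a date whether contest_date is a datetime, a date, or an ISO string."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    # fall back to parsing strings like '2024-05-01T00:00:00'
    return datetime.fromisoformat(value).date()
```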
    class PromptConstants:
        # reference: https://github.com/QwenLM/Qwen2.5-Coder/blob/main/qwencoder-eval/reasoning/livecode_bench_cot/lcb_runner_cq/prompts/code_generation.py#L31
        FORMATTING_MESSAGE_WITH_STARTER_CODE = "You will use the following starter code to write the solution to the problem and enclose your code within delimiters."
        FORMATTING_WITHOUT_STARTER_CODE = "Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT."
English prompts in Japanese-prompt file.
The directory name livecodebench-ja-prompt-ja suggests Japanese prompts, but PromptConstants contains English text. These formatting messages should be translated to Japanese to match the intended localization.
    def parse_data(release_version='release_latest'):
        data = load_dataset(
            "livecodebench/code_generation_lite", split="test", version_tag=release_version, trust_remote_code=True
        )
PromptConstants should be in Japanese for the -ja-prompt-ja variant.
The PromptConstants class contains English formatting messages (lines 27-28), but this file is in the livecodebench-ja-prompt-ja directory, which by naming convention should provide Japanese prompts. These constants are embedded into the dataset and should be translated to Japanese. Compare to the eval/livecodebench/python_codegen_ja config referenced in __init__.py, which suggests Japanese evaluation support is intended.
Note: The dataset source (livecodebench/code_generation_lite) is English-only per official LiveCodeBench design; this is expected and not a problem. However, the instruction text in PromptConstants should be localized.
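For reference, a localized version of the two constants could look roughly like this — the Japanese wording below is an illustrative draft, not a vetted translation, and should be reviewed by a native speaker before shipping:

```python
class PromptConstants:
    # Illustrative Japanese drafts of the two formatting messages;
    # wording should be reviewed before use in the actual dataset.
    FORMATTING_MESSAGE_WITH_STARTER_CODE = (
        "以下のスターターコードを使用して問題の解答を作成し、"
        "コードをデリミタで囲んでください。"
    )
    FORMATTING_WITHOUT_STARTER_CODE = (
        "標準入力から入力を読み取り、問題を解いて答えを標準出力に書き出してください"
        "(サンプル入力で直接テストしないでください)。"
        "プログラムの実行時に入力を読み取り、アルゴリズムを実行して結果をSTDOUTに"
        "書き出すことを確認し、コードは以下のようにデリミタで囲んでください。"
    )
```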
    parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
    parser.add_argument('--end_date', type=str, default='all', help="End date in YYYY-MM format")
Fix incorrect help text.
The --start_date argument has help text saying "End date in YYYY-MM format" — this appears to be a copy-paste error.
Suggested fix
-    parser.add_argument('--start_date', type=str, default='all', help="End date in YYYY-MM format")
+    parser.add_argument('--start_date', type=str, default='all', help="Start date in YYYY-MM format")
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/livecodebench-ja-prompt-ja/prepare.py` around lines 143 -
144, The help text for the CLI argument defined by
parser.add_argument('--start_date', type=str, default='all', help="End date in
YYYY-MM format") is incorrect; update the help string for the '--start_date'
argument to accurately describe it (e.g., "Start date in YYYY-MM format") while
keeping the '--end_date' help as the end-date description so both
parser.add_argument('--start_date', ...) and parser.add_argument('--end_date',
...) have correct, non-duplicated help messages.
shuoyangd left a comment:
Thanks for adding this PR! I left a few questions, mostly about the metric. We should aim to make this one compatible for both the test sets you added as well as MMLU-ProX.
    class MathMultilingualMetrics(BaseMetrics):
        # TODO: how can we ensure that user-defined aggregations have all the same metrics as in base?
Not sure what this means. Is this still relevant?
    LOG = logging.getLogger(get_logger_name(__file__))


    class MathMultilingualMetrics(BaseMetrics):
I don't think this only covers math datasets? Maybe use a more general name?
    for score_method in score_dicts[0].keys():
        # Get valid answers and their results for this field
        valid_answers_and_results = [
            (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
I've never used evaluation with reward model before -- is this a feature in the public branch? If not we can remove it for now.
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        correctness_dict = {}
        if "target_language" in prediction:
Rather than preparing each individual language as a separate dataset, IMO the cleaner way is to set them as subset_for_metrics field (example). This way, you can have one dataset for all the languages, while still having the model automatically report breakdowns per-language. Plus you get averages across all languages for free.
MMLU-ProX currently uses this scheme. I would recommend we stick to this, and it should be a lightweight change compared to what we have here.
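The idea can be sketched independently of the NeMo-Skills API (key names here are illustrative, not the actual `subset_for_metrics` implementation): each example carries a subset label — its language — and one pass over the combined dataset yields both per-language accuracy and the average across languages:

```python
from collections import defaultdict

def aggregate_by_subset(results):
    """Per-subset accuracy plus the unweighted average across subsets."""
    per_subset = defaultdict(lambda: {"correct": 0, "total": 0})
    for res in results:
        bucket = per_subset[res["subset"]]
        bucket["total"] += 1
        bucket["correct"] += int(res["is_correct"])
    metrics = {s: b["correct"] / b["total"] for s, b in per_subset.items()}
    metrics["average"] = sum(metrics.values()) / len(metrics)
    return metrics
```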
    if "symbolic_correct" in prediction:
        correctness_dict["symbolic_correct"] = prediction["symbolic_correct"]
        correctness_dict["symbolic_and_language_correct"] = language_correct and prediction["symbolic_correct"]
    if "judgement" in prediction:
Same sanity check as for reward models -- is this judgement field also what we use in the main branch? If not, we can also remove it for now.
(Also, I think "judgment" is the more widely used spelling.)
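If the judgement path stays, the per-criterion pattern in the quoted code generalizes cleanly. A sketch — field names follow the quoted diff, but `combine_scores` and the judge output format are assumptions, not the branch's API:

```python
def combine_scores(prediction, language_correct):
    """For each base criterion present, also report a *_and_language_correct variant."""
    scores = {"language_correct": language_correct}
    if "symbolic_correct" in prediction:
        scores["symbolic_correct"] = prediction["symbolic_correct"]
        scores["symbolic_and_language_correct"] = language_correct and prediction["symbolic_correct"]
    if "judgement" in prediction:
        # assumes the judge emits a line starting with 'Judgement: correct' or 'Judgement: incorrect'
        judged = prediction["judgement"].strip().lower().startswith("judgement: correct")
        scores["judge_correct"] = judged
        scores["judge_and_language_correct"] = language_correct and judged
    return scores
```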
    @@ -0,0 +1,20 @@
    # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Maybe we keep those in subfolders? E.g. aime25-X, and then inside you'd have de-prompt-de, etc. Also, there seems to be quite a bit of code duplication across these prepare.py files; maybe we can unify some of it in shared utilities.
@zijiachen95 please re-create PR from a branch, I just sent you an invite
Add multilingual AIME25, GPQA, and LCB, plus the corresponding metrics calculation code.
The source data is not included yet.