159 changes: 152 additions & 7 deletions docs/evaluation/multilingual.md
@@ -2,7 +2,7 @@

Our multilingual benchmarks cover tasks such as multilingual reasoning and machine translation.

- All benchmarks in this category will have an extra `--language` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
+ All benchmarks in this category will have an extra `--languages` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
Once the data is prepared, `ns eval` runs on all prepared languages, and its summarized results include per-language breakdowns.

## Supported benchmarks
@@ -96,7 +96,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 langu
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

@@ -111,7 +111,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 langu
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

@@ -126,7 +126,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 langu
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

@@ -169,7 +169,7 @@ Some reference numbers for test split (xx corresponds to average over 5 language
--server_type=vllm \
--server_gpus=8 \
--split=test \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

@@ -184,7 +184,7 @@ Some reference numbers for test split (xx corresponds to average over 5 language
--server_type=vllm \
--server_gpus=8 \
--split=test \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

@@ -199,7 +199,7 @@ Some reference numbers for test split (xx corresponds to average over 5 language
--server_type=vllm \
--server_gpus=8 \
--split=test \
- ++inference.tokens_to_generate=512
+ ++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

@@ -217,6 +217,151 @@ Some reference numbers for test split (xx corresponds to average over 5 language
++inference.tokens_to_generate=2048
```

### mmmlu

- Benchmark is defined in [`nemo_skills/dataset/mmmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmmlu/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of “here”.

Line 223 and Line 296 use generic link text, which triggers MD059 and is less accessible in rendered docs.

Suggested fix
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
+- Original benchmark source is [OpenAI MMMLU dataset](https://huggingface.co/datasets/openai/MMMLU).

-- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).
+- Original benchmark source is [Global PIQA nonparallel dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).

Also applies to: 296-296



MMMLU is a multilingual extension of the MMLU benchmark that covers 14 languages: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Chinese, Swahili, and Yoruba. The `--include_english` flag can be used to additionally include the English split (original MMLU dataset).

```bash
ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english
```
Comment on lines +227 to +229
⚠️ Potential issue | 🟡 Minor

Match repository markdown code-block style (MD046).

Line 227 and Line 300 use fenced blocks, but lint expects indented blocks in this doc style. Please convert these two snippets to the expected indentation format.

Also applies to: 300-302


Some reference numbers and commands for reproduction:


| Model | Avg (14 langs) | AR-XY | DE-DE | ES-LA | FR-FR | HI-IN | IT-IT | JA-JP | KO-KR | PT-BR | ZH-CN | BN-BD | ID-ID | SW-KE | YO-NG |
| :-----------------------------: | :------------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| gpt-oss-120b | **82.66** | 83.58 | 84.18 | 86.53 | 86.08 | 83.67 | 85.91 | 84.98 | 83.95 | 86.03 | 85.11 | 81.87 | 85.04 | 75.04 | 65.20 |
| Qwen3.5-122B-A10B | **87.57** | 88.62 | 89.08 | 90.10 | 89.68 | 88.11 | 89.69 | 89.27 | 88.51 | 90.09 | 89.39 | 86.56 | 89.04 | 83.65 | 74.13 |
| Nemotron-3-Super-120B-A12B-BF16 | **81.51** | 86.68 | 84.59 | 88.59 | 88.04 | 86.21 | 88.06 | 86.83 | 86.23 | 88.35 | 87.12 | 80.88 | 86.84 | 71.31 | 31.43 |

=== "gpt-oss-120b"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-120b \
--benchmarks mmmlu \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.reasoning_effort=high
```

=== "Qwen3.5-122B-A10B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3.5-122B-A10B \
--benchmarks mmmlu \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--server_args='--max-model-len 262144 --reasoning-parser qwen3 --language-model-only' \
++chat_template_kwargs.enable_thinking=true \
++inference.tokens_to_generate=81920 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++inference.top_k=20 \
++inference.repetition_penalty=1.0
```

=== "Nemotron-3-Super-120B-A12B-BF16"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-3-Super-120B-A12B-BF16 \
--benchmarks mmmlu \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--server_args='--mamba_ssm_cache_dtype float32' \
++chat_template_kwargs.enable_thinking=true \
++parse_reasoning=true \
++inference.tokens_to_generate=131072 \
++inference.temperature=1.0 \
++inference.top_p=0.95
```


### Global PIQA

- Benchmark is defined in [`nemo_skills/dataset/global_piqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/global_piqa/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).

Global PIQA is a multilingual physical intuition question-answering benchmark focused on commonsense reasoning. Each question presents a situation with two solution options (A/B). It supports 116 languages.

```bash
ns prepare_data global_piqa --languages <lang1> <lang2> ...
```
Some reference numbers and commands for reproduction:

| Model | Avg (116 langs) |
| :-----------------------------: | :-------------: |
| gpt-oss-120b | **84.61** |
| Qwen3.5-122B-A10B | **88.72** |
| Nemotron-3-Super-120B-A12B-BF16 | **82.28** |

=== "gpt-oss-120b"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-120b \
--benchmarks global_piqa \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.reasoning_effort=high
```

=== "Qwen3.5-122B-A10B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3.5-122B-A10B \
--benchmarks global_piqa \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--server_args='--max-model-len 262144 --reasoning-parser qwen3 --language-model-only' \
++chat_template_kwargs.enable_thinking=true \
++inference.tokens_to_generate=81920 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++inference.top_k=20 \
++inference.repetition_penalty=1.0
```

=== "Nemotron-3-Super-120B-A12B-BF16"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-3-Super-120B-A12B-BF16 \
--benchmarks global_piqa \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--server_args='--mamba_ssm_cache_dtype float32' \
++chat_template_kwargs.enable_thinking=true \
++parse_reasoning=true \
++inference.tokens_to_generate=131072 \
++inference.temperature=1.0 \
++inference.top_p=0.95
```


## Supported translation metrics

By default, we compute [BLEU score](https://github.com/mjpost/sacrebleu) to evaluate machine translation. However, we also support COMET, a popular neural metric for machine translation. Computing COMET requires a separate evaluation run that uses [xCOMET-XXL](https://huggingface.co/Unbabel/XCOMET-XXL) model as a judge. This run can be scheduled by adding the following parameters to the evaluation command:
16 changes: 16 additions & 0 deletions nemo_skills/dataset/global_piqa/__init__.py
@@ -0,0 +1,16 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

METRICS_TYPE = "multichoice"
GENERATION_ARGS = "++prompt_config=generic/default ++eval_type=multichoice"
69 changes: 69 additions & 0 deletions nemo_skills/dataset/global_piqa/global_piqa_utils.py
@@ -0,0 +1,69 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datasets import get_dataset_config_names, load_dataset


def supported_languages() -> list[str]:
    return get_dataset_config_names("mrlbenchmarks/global-piqa-nonparallel")


def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, list[dict]]:
    return {lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] for lang in languages}


def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)
Comment on lines +26 to +27
⚠️ Potential issue | 🟡 Minor

Fail fast on labels outside A/B.

This is a two-choice benchmark, but chr(ord("A") + digit) will happily emit C, D, etc. if the source label is ever unexpected. Raising here is safer than writing impossible expected_answer values.

🛡️ Proposed fix
 def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
     return chr(ord("A") + digit)



class Schema:
    PROMPT = "prompt"
    SOLUTION0 = "solution0"
    SOLUTION1 = "solution1"
    LABEL = "label"


# Prompt and Regex are adapted from:
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/global_piqa/prompted/_template

QUERY_TEMPLATE = """
Given the following situation, which option is more likely to be correct?

Situation:
{prompt} ...

Option A: {solution0}

Option B: {solution1}

Your response should end with "The best answer is: [answer_letter]" where [answer_letter] is one of A or B.
""".strip()


ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
Comment on lines +54 to +55
⚠️ Potential issue | 🟠 Major


Fix regex patterns to reject letters inside words like "maybe" and "probably".

With [^A-B]*([A-B]), the regex captures the b in words like maybe and probably, marking correct outputs as B. Use negative lookbehind/lookahead instead to ensure the captured letter is not part of a word:

🔧 Proposed fix
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
+ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"

Additionally, add bounds checking to digit_to_letter() to fail fast if passed values outside [0, 1]:

def digit_to_letter(digit: int) -> str:
    if digit not in (0, 1):
        raise ValueError(f"digit must be 0 or 1, got {digit}")
    return chr(ord("A") + digit)
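As a quick sanity check, the lookaround fix can be exercised directly. In the sketch below, `ORIGINAL` is the pattern currently in the diff and `FIXED` is the proposed replacement; both names and the `extract` helper are illustrative only, not part of the codebase:

```python
import re

# ORIGINAL is ANSWER_REGEX1 as written in the diff; FIXED adds the
# proposed lookarounds. Note (?i) makes [^A-B] exclude a/b as well.
ORIGINAL = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
FIXED = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"

def extract(pattern: str, text: str):
    match = re.search(pattern, text)
    return match.group(1) if match else None

# The original pattern stops at the first a/b it sees, even inside a word:
print(extract(ORIGINAL, "The answer is probably B"))  # -> b (false positive from "probably")
# The lookarounds reject letters embedded inside words, so this pattern
# simply fails to match and extraction falls through to later patterns:
print(extract(FIXED, "The answer is probably B"))     # -> None
print(extract(FIXED, "The best answer is: A"))        # -> A
```

With the fix, an answer letter that only appears after a word containing a/b is deferred to the greedy fallback pattern rather than mis-extracted here.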

ANSWER_REGEX3 = r"(?i)\\boxed\{([A-B])\}"

# Fallback: greedily match everything, then capture the last A/B in the response.
# This is not in lm-evaluation-harness and is where we diverge from the original benchmark.
LETTER_REGEX = r"\b\(?\s*([A-B])\s*\)?\.?\b"
GREEDY_REGEX = r"[\s\S]*" + LETTER_REGEX
EXTRACT_REGEX = [ANSWER_REGEX1, ANSWER_REGEX2, ANSWER_REGEX3, GREEDY_REGEX]
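The fallback relies on regex backtracking to pick out the last standalone letter. A minimal sketch of that behavior, using the patterns copied from the diff (`last_letter` is a hypothetical helper, not part of the module):

```python
import re

# Patterns copied verbatim from the diff above.
LETTER_REGEX = r"\b\(?\s*([A-B])\s*\)?\.?\b"
GREEDY_REGEX = r"[\s\S]*" + LETTER_REGEX

def last_letter(text: str):
    # The greedy [\s\S]* first consumes the whole string, then backtracks
    # just far enough for LETTER_REGEX to match, so the LAST standalone
    # uppercase A/B wins.
    match = re.search(GREEDY_REGEX, text)
    return match.group(1) if match else None

print(last_letter("Option A seems plausible, but the best answer is (B)."))  # -> B
print(last_letter("No option mentioned at all"))                             # -> None
```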


def get_mcq_fields(entry: dict) -> dict:
    options_dict = {digit_to_letter(i): entry[f"solution{i}"] for i in range(2)}
    options_text = "\n".join(f"Option {letter}: {option}" for letter, option in options_dict.items())
    question = QUERY_TEMPLATE.format(**entry)
    return {"question": question, "options": options_text, **options_dict}
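For reference, here is a self-contained sketch of `get_mcq_fields` run on a made-up entry. The template and helpers are copied from the module above; the sample situation is invented for illustration and does not come from the dataset:

```python
# Template and helpers copied from global_piqa_utils; sample entry is invented.
QUERY_TEMPLATE = """
Given the following situation, which option is more likely to be correct?

Situation:
{prompt} ...

Option A: {solution0}

Option B: {solution1}

Your response should end with "The best answer is: [answer_letter]" where [answer_letter] is one of A or B.
""".strip()

def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)

def get_mcq_fields(entry: dict) -> dict:
    options_dict = {digit_to_letter(i): entry[f"solution{i}"] for i in range(2)}
    options_text = "\n".join(f"Option {letter}: {option}" for letter, option in options_dict.items())
    question = QUERY_TEMPLATE.format(**entry)
    return {"question": question, "options": options_text, **options_dict}

entry = {
    "prompt": "To keep ice cream from melting on a hot day",
    "solution0": "store it in a cooler with ice packs",
    "solution1": "leave it on the picnic table",
}
fields = get_mcq_fields(entry)
print(fields["options"])
# Option A: store it in a cooler with ice packs
# Option B: leave it on the picnic table
```

The per-letter keys (`"A"`, `"B"`) returned alongside `question` and `options` are what downstream multichoice evaluation consumes.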
65 changes: 65 additions & 0 deletions nemo_skills/dataset/global_piqa/prepare.py
@@ -0,0 +1,65 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from nemo_skills.dataset.global_piqa.global_piqa_utils import (
    EXTRACT_REGEX,
    Schema,
    digit_to_letter,
    get_mcq_fields,
    load_global_piqa_datasets,
    supported_languages,
)


def format_entry(entry: dict, language: str) -> dict:
    return {
        "expected_answer": digit_to_letter(entry[Schema.LABEL]),
        "extract_from_boxed": False,
        "extract_regex": EXTRACT_REGEX,
        "subset_for_metrics": language,
        "relaxed": False,
        **get_mcq_fields(entry),
    }


def main(args):
    invalid = set(args.languages) - set(supported_languages())
    if invalid:
        raise ValueError(f"Unsupported languages: {invalid}")
    datasets = load_global_piqa_datasets(args.languages)

    data_dir = Path(__file__).absolute().parent
    output_file = data_dir / "test.jsonl"
    with open(output_file, "wt", encoding="utf-8") as fout:
        for language in datasets:
            for entry in datasets[language]:
                entry = format_entry(entry=entry, language=language)
                json.dump(entry, fout, ensure_ascii=False)
                fout.write("\n")
Comment on lines +44 to +53
⚠️ Potential issue | 🟡 Minor

Compute all entries before writing to prevent data loss on failure.

Per coding guidelines, when adding new benchmarks, all computation should complete before file writes to prevent accidental data loss if code fails mid-way. Currently, if format_entry raises an exception during iteration, the file will contain partial data.

🛡️ Proposed fix to prevent partial writes
     datasets = load_global_piqa_datasets(args.languages)

+    # Compute all entries first to prevent partial writes on failure
+    all_entries = [
+        format_entry(entry=entry, language=language)
+        for language in datasets
+        for entry in datasets[language]
+    ]
+
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language in datasets:
-            for entry in datasets[language]:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in all_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".




if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--languages",
        default=supported_languages(),
        nargs="+",
        help="Languages to process.",
    )
    args = parser.parse_args()
    main(args)