Add Global PIQA benchmark #1299
````diff
@@ -2,7 +2,7 @@
 Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation.

-All benchmarks in this category will have an extra `--language` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
+All benchmarks in this category will have an extra `--languages` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
 Once prepared, the `ns eval` command will run on all languages prepared, and the summarized results generated with `ns eval` will include per-language breakdowns.

 ## Supported benchmarks
````
````diff
@@ -96,7 +96,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=devtest \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++system_message='/no_think'
 ```
````
````diff
@@ -111,7 +111,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=devtest \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++prompt_suffix='/no_think'
 ```
````
````diff
@@ -126,7 +126,7 @@ Some reference numbers for devtest split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=devtest \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++prompt_suffix='/no_think'
 ```
````
````diff
@@ -169,7 +169,7 @@ Some reference numbers for test split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=test \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++system_message='/no_think'
 ```
````
````diff
@@ -184,7 +184,7 @@ Some reference numbers for test split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=test \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++prompt_suffix='/no_think'
 ```
````
````diff
@@ -199,7 +199,7 @@ Some reference numbers for test split (xx corresponds to average over 5 languages)
     --server_type=vllm \
     --server_gpus=8 \
     --split=test \
-    ++inference.tokens_to_generate=512
+    ++inference.tokens_to_generate=512 \
+    ++prompt_suffix='/no_think'
 ```
````
````diff
@@ -217,6 +217,151 @@ Some reference numbers for test split (xx corresponds to average over 5 languages)
     ++inference.tokens_to_generate=2048
 ```
````

### mmmlu

- Benchmark is defined in [`nemo_skills/dataset/mmmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmmlu/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).

MMMLU is a multilingual extension of the MMLU benchmark that covers 14 languages: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Chinese, Swahili, and Yoruba. The `--include_english` flag can be used to additionally include the English split (the original MMLU dataset).

```bash
ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english
```
> **Review comment (coderabbitai), on lines 227-229:** Match the repository markdown code-block style (MD046). Lines 227 and 300 use fenced blocks, but lint expects indented blocks in this doc style. Please convert the two snippets (lines 227-229 and 300-302) to the indented format.
Some reference numbers and commands for reproduction:

| Model | Avg (14 langs) | AR-XY | DE-DE | ES-LA | FR-FR | HI-IN | IT-IT | JA-JP | KO-KR | PT-BR | ZH-CN | BN-BD | ID-ID | SW-KE | YO-NG |
| :-----------------------------: | :------------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| gpt-oss-120b | **82.66** | 83.58 | 84.18 | 86.53 | 86.08 | 83.67 | 85.91 | 84.98 | 83.95 | 86.03 | 85.11 | 81.87 | 85.04 | 75.04 | 65.20 |
| Qwen3.5-122B-A10B | **87.57** | 88.62 | 89.08 | 90.10 | 89.68 | 88.11 | 89.69 | 89.27 | 88.51 | 90.09 | 89.39 | 86.56 | 89.04 | 83.65 | 74.13 |
| Nemotron-3-Super-120B-A12B-BF16 | **81.51** | 86.68 | 84.59 | 88.59 | 88.04 | 86.21 | 88.06 | 86.83 | 86.23 | 88.35 | 87.12 | 80.88 | 86.84 | 71.31 | 31.43 |

=== "gpt-oss-120b"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=openai/gpt-oss-120b \
        --benchmarks mmmlu \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        ++inference.tokens_to_generate=120000 \
        ++inference.temperature=1.0 \
        ++inference.top_p=1.0 \
        ++inference.reasoning_effort=high
    ```

=== "Qwen3.5-122B-A10B"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=Qwen/Qwen3.5-122B-A10B \
        --benchmarks mmmlu \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        --server_args='--max-model-len 262144 --reasoning-parser qwen3 --language-model-only' \
        ++chat_template_kwargs.enable_thinking=true \
        ++inference.tokens_to_generate=81920 \
        ++inference.temperature=1.0 \
        ++inference.top_p=0.95 \
        ++inference.top_k=20 \
        ++inference.repetition_penalty=1.0
    ```

=== "Nemotron-3-Super-120B-A12B-BF16"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=NVIDIA/Nemotron-3-Super-120B-A12B-BF16 \
        --benchmarks mmmlu \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        --server_args='--mamba_ssm_cache_dtype float32' \
        ++chat_template_kwargs.enable_thinking=true \
        ++parse_reasoning=true \
        ++inference.tokens_to_generate=131072 \
        ++inference.temperature=1.0 \
        ++inference.top_p=0.95
    ```

### Global PIQA

- Benchmark is defined in [`nemo_skills/dataset/global_piqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/global_piqa/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).

Global PIQA is a multilingual question-answering benchmark for physical commonsense reasoning. Each question presents a situation with two solution options (A/B). It supports 116 languages.

```bash
ns prepare_data global_piqa --languages <lang1> <lang2> ...
```

Some reference numbers and commands for reproduction:

| Model | Avg (116 langs) |
| :-----------------------------: | :-------------: |
| gpt-oss-120b | **84.61** |
| Qwen3.5-122B-A10B | **88.72** |
| Nemotron-3-Super-120B-A12B-BF16 | **82.28** |

=== "gpt-oss-120b"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=openai/gpt-oss-120b \
        --benchmarks global_piqa \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        ++inference.tokens_to_generate=120000 \
        ++inference.temperature=1.0 \
        ++inference.top_p=1.0 \
        ++inference.reasoning_effort=high
    ```
=== "Qwen3.5-122B-A10B"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=Qwen/Qwen3.5-122B-A10B \
        --benchmarks global_piqa \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        --server_args='--max-model-len 262144 --reasoning-parser qwen3 --language-model-only' \
        ++chat_template_kwargs.enable_thinking=true \
        ++inference.tokens_to_generate=81920 \
        ++inference.temperature=1.0 \
        ++inference.top_p=0.95 \
        ++inference.top_k=20 \
        ++inference.repetition_penalty=1.0
    ```

=== "Nemotron-3-Super-120B-A12B-BF16"

    ```bash
    ns eval \
        --cluster=[cluster] \
        --model=NVIDIA/Nemotron-3-Super-120B-A12B-BF16 \
        --benchmarks global_piqa \
        --output_dir=[output dir] \
        --server_type=vllm \
        --server_gpus=8 \
        --server_args='--mamba_ssm_cache_dtype float32' \
        ++chat_template_kwargs.enable_thinking=true \
        ++parse_reasoning=true \
        ++inference.tokens_to_generate=131072 \
        ++inference.temperature=1.0 \
        ++inference.top_p=0.95
    ```

## Supported translation metrics

By default, we compute [BLEU score](https://github.com/mjpost/sacrebleu) to evaluate machine translation. However, we also support COMET, a popular neural metric for machine translation. Computing COMET requires a separate evaluation run that uses the [xCOMET-XXL](https://huggingface.co/Unbabel/XCOMET-XXL) model as a judge. This run can be scheduled by adding the following parameters to the evaluation command:
New file `nemo_skills/dataset/global_piqa/__init__.py` (@@ -0,0 +1,16 @@):

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

METRICS_TYPE = "multichoice"
GENERATION_ARGS = "++prompt_config=generic/default ++eval_type=multichoice"
```
New file `nemo_skills/dataset/global_piqa/global_piqa_utils.py` (@@ -0,0 +1,69 @@):

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datasets import get_dataset_config_names, load_dataset


def supported_languages() -> list[str]:
    return get_dataset_config_names("mrlbenchmarks/global-piqa-nonparallel")


def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, list[dict]]:
    return {lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] for lang in languages}


def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)
```
> **Review comment (coderabbitai), on lines 26-27:** Fail fast on labels outside A/B. This is a two-choice benchmark, so `digit_to_letter` should reject any other label. Proposed fix:
>
> ```python
> def digit_to_letter(digit: int) -> str:
>     if digit not in (0, 1):
>         raise ValueError(f"Unsupported Global PIQA label: {digit}")
>     return chr(ord("A") + digit)
> ```
```python
class Schema:
    PROMPT = "prompt"
    SOLUTION0 = "solution0"
    SOLUTION1 = "solution1"
    LABEL = "label"


# Prompt and regex are adapted from:
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/global_piqa/prompted/_template

QUERY_TEMPLATE = """
Given the following situation, which option is more likely to be correct?

Situation:
{prompt} ...

Option A: {solution0}

Option B: {solution1}

Your response should end with "The best answer is: [answer_letter]" where [answer_letter] is one of A or B.
""".strip()


ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
```
> **Review comment (coderabbitai), on lines 54-55:** Fix the regex patterns to reject letters inside words like "maybe" and "probably". With `(?i)`, `[^A-B]*` also refuses to skip lowercase `a`/`b`, so on "The answer is maybe A" the capture group grabs the `a` inside "maybe". Adding lookarounds keeps valid answers matching while rejecting letters embedded in words:
>
> ```python
> ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
> ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"
> ```
>
> Additionally, add bounds checking to `digit_to_letter`:
>
> ```python
> def digit_to_letter(digit: int) -> str:
>     if digit not in (0, 1):
>         raise ValueError(f"digit must be 0 or 1, got {digit}")
>     return chr(ord("A") + digit)
> ```
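The pitfall flagged in review is easy to reproduce. This standalone sketch contrasts `ANSWER_REGEX1` as defined in the diff with the lookaround variant proposed in the review thread:

```python
import re

# ANSWER_REGEX1 as defined in the diff, plus the review's proposed fix.
ORIGINAL = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
FIXED = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"


def extract(pattern, text):
    """Return the captured answer letter, or None if the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Under (?i), [^A-B] also refuses to skip lowercase a/b, so the scan stops
# inside "maybe" and the capture group grabs its 'a'.
print(extract(ORIGINAL, "The answer is maybe A"))  # a
print(extract(FIXED, "The answer is maybe A"))     # None (hedged answer rejected)
print(extract(FIXED, "The best answer is: A"))     # A
```

With the lookarounds, hedged responses fall through to the later patterns in `EXTRACT_REGEX` (ultimately the greedy fallback) instead of being mis-parsed here.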
```python
ANSWER_REGEX3 = r"(?i)\\boxed\{([A-B])\}"

# Fallback: greedily match everything, then capture the last A/B in the response.
# This is not in lm-evaluation-harness and is where we diverge from the original benchmark.
LETTER_REGEX = r"\b\(?\s*([A-B])\s*\)?\.?\b"

GREEDY_REGEX = r"[\s\S]*" + LETTER_REGEX
EXTRACT_REGEX = [ANSWER_REGEX1, ANSWER_REGEX2, ANSWER_REGEX3, GREEDY_REGEX]


def get_mcq_fields(entry: dict) -> dict:
    options_dict = {digit_to_letter(i): entry[f"solution{i}"] for i in range(2)}
    options_text = "\n".join(f"Option {letter}: {option}" for letter, option in options_dict.items())
    question = QUERY_TEMPLATE.format(**entry)
    return {"question": question, "options": options_text, **options_dict}
```
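As a quick standalone check of how these pieces fit together, the snippet below reproduces the template and helpers (copied from this file so it runs on its own) and builds the MCQ fields for a made-up entry; the situation text is purely illustrative:

```python
# Reproduced from global_piqa_utils.py so this sketch is self-contained.
QUERY_TEMPLATE = """
Given the following situation, which option is more likely to be correct?

Situation:
{prompt} ...

Option A: {solution0}

Option B: {solution1}

Your response should end with "The best answer is: [answer_letter]" where [answer_letter] is one of A or B.
""".strip()


def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)


def get_mcq_fields(entry: dict) -> dict:
    options_dict = {digit_to_letter(i): entry[f"solution{i}"] for i in range(2)}
    options_text = "\n".join(f"Option {letter}: {option}" for letter, option in options_dict.items())
    question = QUERY_TEMPLATE.format(**entry)
    return {"question": question, "options": options_text, **options_dict}


# Hypothetical entry for illustration only.
entry = {
    "prompt": "You need to open a jar with a stuck lid",
    "solution0": "Run the lid under hot water to loosen it",
    "solution1": "Put the jar in the freezer to loosen the lid",
}
fields = get_mcq_fields(entry)
print(fields["options"])
# Option A: Run the lid under hot water to loosen it
# Option B: Put the jar in the freezer to loosen the lid
print(fields["question"].startswith("Given the following situation"))  # True
```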
New file `nemo_skills/dataset/global_piqa/prepare.py` (@@ -0,0 +1,65 @@):

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from nemo_skills.dataset.global_piqa.global_piqa_utils import (
    EXTRACT_REGEX,
    Schema,
    digit_to_letter,
    get_mcq_fields,
    load_global_piqa_datasets,
    supported_languages,
)


def format_entry(entry: dict, language: str) -> dict:
    return {
        "expected_answer": digit_to_letter(entry[Schema.LABEL]),
        "extract_from_boxed": False,
        "extract_regex": EXTRACT_REGEX,
        "subset_for_metrics": language,
        "relaxed": False,
        **get_mcq_fields(entry),
    }


def main(args):
    invalid = set(args.languages) - set(supported_languages())
    if invalid:
        raise ValueError(f"Unsupported languages: {invalid}")
    datasets = load_global_piqa_datasets(args.languages)

    data_dir = Path(__file__).absolute().parent
    output_file = data_dir / "test.jsonl"
    with open(output_file, "wt", encoding="utf-8") as fout:
        for language in datasets:
            for entry in datasets[language]:
                entry = format_entry(entry=entry, language=language)
                json.dump(entry, fout, ensure_ascii=False)
                fout.write("\n")
```
> **Review comment (coderabbitai), on lines 44-53:** Compute all entries before writing to prevent data loss on failure. If `format_entry` raises partway through the loop, `test.jsonl` is left partially written. Build the full list first, then write:
>
> ```python
> datasets = load_global_piqa_datasets(args.languages)
>
> # Compute all entries first to prevent partial writes on failure
> all_entries = [
>     format_entry(entry=entry, language=language)
>     for language in datasets
>     for entry in datasets[language]
> ]
>
> data_dir = Path(__file__).absolute().parent
> output_file = data_dir / "test.jsonl"
> with open(output_file, "wt", encoding="utf-8") as fout:
>     for entry in all_entries:
>         json.dump(entry, fout, ensure_ascii=False)
>         fout.write("\n")
> ```
>
> As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."
```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--languages",
        default=supported_languages(),
        nargs="+",
        help="Languages to process.",
    )
    args = parser.parse_args()
    main(args)
```
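One way to go a step further than the compute-first fix suggested in review is to publish the file atomically: serialize everything up front, write to a temporary file in the same directory, and rename it into place. This is only a sketch, not the repository's convention, and the sample entry (including the language name) is hypothetical:

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomically(entries, output_file: Path) -> None:
    """Serialize all entries first, then publish via one rename so a
    failure mid-way never leaves a truncated file behind."""
    payload = "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)
    fd, tmp_path = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            fout.write(payload)
        os.replace(tmp_path, output_file)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Hypothetical entry mirroring a few of the fields produced by format_entry.
sample = [{"expected_answer": "A", "subset_for_metrics": "swahili", "relaxed": False}]
out = Path(tempfile.gettempdir()) / "global_piqa_demo.jsonl"
write_jsonl_atomically(sample, out)
print(out.read_text(encoding="utf-8").strip())  # prints the one serialized line
```

Readers of `test.jsonl` then always see either the previous complete file or the new complete file, never an in-between state.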
> **Review comment (coderabbitai):** Use descriptive link text instead of "here". Lines 223 and 296 use generic link text, which triggers markdownlint MD059 (descriptive-link-text) and is less accessible in rendered docs.