
Add Global PIQA benchmark#1299

Merged
naymaraq merged 12 commits into main from dkaramyan/global_piqa
Mar 16, 2026

Conversation

Collaborator

@naymaraq naymaraq commented Mar 10, 2026

Add Global PIQA Benchmark

Summary

  • Adds support for the Global PIQA multilingual benchmark, a physical commonsense reasoning dataset (https://arxiv.org/abs/2510.24081). The benchmark covers 116 languages.
  • Implements prepare.py to download and format the dataset from HuggingFace.
  • Allows users to optionally specify which languages to process via the --languages flag.
  • Formats entries as two-choice questions (A/B) using a prompt template adapted from lm-evaluation-harness.
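For illustration, a two-choice prompt in the lm-evaluation-harness style might look like the sketch below. The template text and field names here are assumptions for the example, not the exact QUERY_TEMPLATE from this PR:

```python
# Hypothetical two-choice prompt template; the actual QUERY_TEMPLATE in
# global_piqa_utils.py is adapted from lm-evaluation-harness and may differ.
QUERY_TEMPLATE = (
    "{question}\n\n"
    "A. {option_a}\n"
    "B. {option_b}\n\n"
    "Answer with the letter of the correct choice."
)

prompt = QUERY_TEMPLATE.format(
    question="How do you cool a bowl of soup quickly?",
    option_a="Stir in an ice cube.",
    option_b="Cover it and turn up the heat.",
)
print(prompt)
```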

Summary by CodeRabbit

  • New Features
    • Added Global PIQA multilingual support with dataset loading, MCQ prompt generation, answer extraction, and a CLI to produce per-language evaluation files.
  • Documentation
    • Expanded multilingual evaluation guide: multi-language flags, per-language results, added MMMLU and Global PIQA benchmark pages with per-model examples and command snippets.

Contributor

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a Global PIQA dataset package: module constants, utilities to load/format the mrlbenchmarks/global-piqa-nonparallel dataset into MCQ prompts with extraction regexes, and a CLI to write per-language test.jsonl with expected answers and metadata.

Changes

  • Module Configuration — nemo_skills/dataset/global_piqa/__init__.py
    Adds exported constants METRICS_TYPE = "multichoice" and GENERATION_ARGS = "++prompt_config=generic/default ++eval_type=multichoice".
  • Dataset Utilities — nemo_skills/dataset/global_piqa/global_piqa_utils.py
    New utilities: supported_languages(), load_global_piqa_datasets(), digit_to_letter(), get_mcq_fields(), a Schema class, QUERY_TEMPLATE, and answer-extraction regex constants for MCQ prompt generation and extraction.
  • Dataset Preparation CLI — nemo_skills/dataset/global_piqa/prepare.py
    New CLI script with format_entry(entry, language) and main(args): validates languages, loads datasets, formats entries (expected answer, extraction metadata, MCQ fields), and writes per-language test.jsonl.
  • Documentation — docs/evaluation/multilingual.md
    Expanded multilingual evaluation docs: switched --language → --languages, added MMMLU and Global PIQA sections, per-model examples, and translation metrics examples.
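The prepare flow described above can be sketched as follows. Function bodies and entry field names ("goal", "label") are illustrative assumptions; the real prepare.py in nemo_skills/dataset/global_piqa/ may differ:

```python
import json
from pathlib import Path

# Illustrative sketch of the prepare flow; not the PR's actual implementation.
def format_entry(entry: dict, language: str) -> dict:
    # Field names here ("goal", "label") are assumptions for the example.
    return {
        "language": language,
        "question": entry["goal"],
        "expected_answer": "AB"[entry["label"]],
    }

def write_test_jsonl(datasets: dict, output_file: Path) -> None:
    # Format every entry before opening the output file for writing.
    entries = [
        format_entry(entry, language)
        for language, dataset in datasets.items()
        for entry in dataset
    ]
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
```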

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Prepare as "prepare.py"
    participant Utils as "global_piqa_utils"
    participant HF as "HuggingFace\n(global-piqa-nonparallel)"
    participant FS as "FileSystem\n(test.jsonl)"

    User->>Prepare: run with --languages
    Prepare->>Utils: supported_languages()
    Utils->>HF: list dataset configs
    HF-->>Utils: available languages
    Utils-->>Prepare: supported languages

    Prepare->>Utils: load_global_piqa_datasets(languages, split="test")
    Utils->>HF: load each language split
    HF-->>Utils: dataset entries per language
    Utils-->>Prepare: dict(language -> entries)

    loop per language entry
        Prepare->>Utils: get_mcq_fields(entry)
        Utils-->>Prepare: MCQ prompt/options/labels
        Prepare->>Prepare: format_entry(entry, language)
        Prepare->>FS: append JSON line to test.jsonl
    end

    FS-->>User: test.jsonl ready

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ — The title clearly and concisely summarizes the main addition: a new Global PIQA benchmark implementation spanning utilities, dataset preparation, and documentation.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ac8f52b0-5f48-44f2-bef8-9c8872479777

📥 Commits

Reviewing files that changed from the base of the PR and between fe92aec and 88698ba.

📒 Files selected for processing (3)
  • nemo_skills/dataset/global_piqa/__init__.py
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py
  • nemo_skills/dataset/global_piqa/prepare.py

Comment on lines +44 to +53
datasets = load_global_piqa_datasets(args.languages)

data_dir = Path(__file__).absolute().parent
output_file = data_dir / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
    for language in datasets:
        for entry in datasets[language]:
            entry = format_entry(entry=entry, language=language)
            json.dump(entry, fout, ensure_ascii=False)
            fout.write("\n")
Contributor


⚠️ Potential issue | 🟡 Minor

Compute all entries before writing to prevent data loss on failure.

Per coding guidelines, when adding new benchmarks, all computation should complete before file writes to prevent accidental data loss if code fails mid-way. Currently, if format_entry raises an exception during iteration, the file will contain partial data.

🛡️ Proposed fix to prevent partial writes
     datasets = load_global_piqa_datasets(args.languages)
 
+    # Compute all entries first to prevent partial writes on failure
+    all_entries = [
+        format_entry(entry=entry, language=language)
+        for language in datasets
+        for entry in datasets[language]
+    ]
+
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language in datasets:
-            for entry in datasets[language]:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in all_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
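To see why this matters, here is a small self-contained demonstration; the failing formatter and file layout are made up for the example:

```python
import json
import tempfile
from pathlib import Path

def flaky_format(entry):
    # Stand-in for a format_entry that fails partway through the data.
    if entry["id"] == 2:
        raise ValueError("bad entry")
    return entry

entries = [{"id": 0}, {"id": 1}, {"id": 2}]
out = Path(tempfile.mkdtemp()) / "test.jsonl"

# Writing while formatting: a failure leaves a truncated file behind.
try:
    with open(out, "wt", encoding="utf-8") as fout:
        for e in entries:
            fout.write(json.dumps(flaky_format(e)) + "\n")
except ValueError:
    pass
partial = out.read_text(encoding="utf-8").splitlines()
print(len(partial))  # 2 lines written before the failure

# Formatting first: the same failure happens before the file is touched.
out.unlink()
try:
    formatted = [flaky_format(e) for e in entries]
except ValueError:
    pass
print(out.exists())  # False: no partial file left behind
```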


Collaborator

@shuoyangd shuoyangd left a comment


Just one nit question, otherwise looking good to me.

naymaraq and others added 2 commits March 10, 2026 20:29
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@naymaraq naymaraq force-pushed the dkaramyan/global_piqa branch from 36f080f to 49956cf Compare March 10, 2026 16:30
@shuoyangd shuoyangd self-requested a review March 10, 2026 16:36
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)

48-53: ⚠️ Potential issue | 🟡 Minor

Precompute the rows before truncating test.jsonl.

If format_entry() fails mid-loop, the file has already been opened in write mode and will be left partially written. Build the formatted records first, then open the file once to emit them.

🛡️ Proposed fix
     datasets = load_global_piqa_datasets(args.languages)
 
+    formatted_entries = [
+        format_entry(entry=entry, language=language)
+        for language, dataset in datasets.items()
+        for entry in dataset
+    ]
+
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language in datasets:
-            for entry in datasets[language]:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in formatted_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)

40-45: Avoid duplicate supported_languages() lookup by deferring default to main().

The function is called during parser setup (line 60) before args are parsed, then again in main() (line 41) for validation. Using default=None and handling the default in main() with languages = args.languages or supported_languages() eliminates the unnecessary early lookup on the no-flag path while keeping the validation logic clean.

♻️ Proposed refactor
 def main(args):
-    invalid = set(args.languages) - set(supported_languages())
+    all_languages = supported_languages()
+    languages = args.languages or all_languages
+    invalid = set(languages) - set(all_languages)
     if invalid:
         raise ValueError(f"Unsupported languages: {invalid}")
-    datasets = load_global_piqa_datasets(args.languages)
+    datasets = load_global_piqa_datasets(languages)
 ...
     parser.add_argument(
         "--languages",
-        default=supported_languages(),
+        default=None,
         nargs="+",
-        help="Languages to process.",
+        help="Languages to process. Defaults to all supported languages.",
     )

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 979e869e-f766-4632-b6de-8d0a078dcfb7

📥 Commits

Reviewing files that changed from the base of the PR and between 36f080f and 49956cf.

📒 Files selected for processing (3)
  • nemo_skills/dataset/global_piqa/__init__.py
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py
  • nemo_skills/dataset/global_piqa/prepare.py

Comment on lines +26 to +27
def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)
Contributor


⚠️ Potential issue | 🟡 Minor

Fail fast on labels outside A/B.

This is a two-choice benchmark, but chr(ord("A") + digit) will happily emit C, D, etc. if the source label is ever unexpected. Raising here is safer than writing impossible expected_answer values.

🛡️ Proposed fix
 def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
     return chr(ord("A") + digit)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def digit_to_letter(digit: int) -> str:
-    return chr(ord("A") + digit)
+def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
+    return chr(ord("A") + digit)
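The guard in action, as a quick sanity check (reproducing the suggested function above):

```python
def digit_to_letter(digit: int) -> str:
    # Guard from the suggestion above: Global PIQA is two-choice only.
    if digit not in (0, 1):
        raise ValueError(f"Unsupported Global PIQA label: {digit}")
    return chr(ord("A") + digit)

print(digit_to_letter(0))  # A
print(digit_to_letter(1))  # B
try:
    digit_to_letter(2)
except ValueError as err:
    print(err)  # Unsupported Global PIQA label: 2
```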

Comment on lines +54 to +55
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import re

pattern = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for sample in [
    "The answer is maybe A",
    "The answer is probably B",
    "The final answer is A",
]:
    match = pattern.search(sample)
    print(f"{sample!r} -> {match.group(1) if match else None}")
PY

Repository: NVIDIA-NeMo/Skills

Length of output: 151


🏁 Script executed:

fd "global_piqa_utils.py" --type f

Repository: NVIDIA-NeMo/Skills

Length of output: 114


🏁 Script executed:

cat -n nemo_skills/dataset/global_piqa/global_piqa_utils.py | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 3055


🏁 Script executed:

rg "digit_to_letter" nemo_skills/dataset/global_piqa/global_piqa_utils.py -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 411


🏁 Script executed:

rg "EXTRACT_REGEX" nemo_skills/dataset/global_piqa/ -A 3 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1484


🏁 Script executed:

rg "digit_to_letter" nemo_skills/dataset/global_piqa/ -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1689


🏁 Script executed:

cat nemo_skills/dataset/global_piqa/prepare.py | grep -A 10 "Schema.LABEL"

Repository: NVIDIA-NeMo/Skills

Length of output: 389


🏁 Script executed:

python3 <<'EOF'
import re

# Test the proposed fix from the review comment
proposed_regex1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"
proposed_regex2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"

# Also test a simpler fix: use [^A-Ba-b]
simple_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-Ba-b]*([A-B])"

test_cases = [
    "The answer is maybe A",
    "The answer is probably B",
    "The final answer is A",
    "The best answer is A",
    "Answer: B",
]

print("=== ORIGINAL REGEX (BROKEN) ===")
original_regex1 = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for test in test_cases:
    match = original_regex1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== PROPOSED FIX (COMPLEX) ===")
proposed_pattern1 = re.compile(proposed_regex1)
for test in test_cases:
    match = proposed_pattern1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== SIMPLE FIX [^A-Ba-b] ===")
simple_pattern1 = re.compile(simple_regex1)
for test in test_cases:
    match = simple_pattern1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 575


🏁 Script executed:

python3 <<'EOF'
import re

# Test alternative fixes
word_boundary_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[\s:=\-]*\b([A-B])\b"
word_boundary_regex2 = r"(?i)[Aa]nswer\s*:[\s]*\b([A-B])\b"

# Negative lookbehind/lookahead approach
negative_lb_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"

test_cases = [
    "The answer is maybe A",  # Invalid LLM output
    "The answer is probably B",  # Invalid LLM output
    "The final answer is A",  # Valid
    "The best answer is A",  # Valid
    "The best answer is: A",  # Valid
    "Answer: B",  # Valid
    "The answer is B",  # Valid
]

print("=== WORD BOUNDARY FIX ===")
pattern1 = re.compile(word_boundary_regex1)
pattern2 = re.compile(word_boundary_regex2)
for test in test_cases:
    match = pattern1.search(test)
    if not match:
        match = pattern2.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== NEGATIVE LOOKBEHIND/LOOKAHEAD FIX ===")
pattern_lb = re.compile(negative_lb_regex1)
for test in test_cases:
    match = pattern_lb.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 531


🏁 Script executed:

python3 <<'EOF'
import re

# Test ANSWER_REGEX2 with the same approach
answer_regex2_original = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
answer_regex2_fixed = r"(?i)[Aa]nswer\s*:(?<![a-z])[^A-B]*(?<![a-z])([A-B])(?![a-z])"

test_cases = [
    "answer: maybe A",
    "answer: probably B",
    "Answer: A",
    "Answer: B",
]

print("=== ANSWER_REGEX2 ORIGINAL ===")
pattern = re.compile(answer_regex2_original)
for test in test_cases:
    match = pattern.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== ANSWER_REGEX2 WITH NEGATIVE LOOKBEHIND/LOOKAHEAD ===")
pattern_fixed = re.compile(answer_regex2_fixed)
for test in test_cases:
    match = pattern_fixed.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 322


Fix regex patterns to reject letters inside words like "maybe" and "probably".

With [^A-B]*([A-B]), the regex captures the b in words like maybe and probably, marking correct outputs as B. Use negative lookbehind/lookahead instead to ensure the captured letter is not part of a word:

🔧 Proposed fix
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
+ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"

Additionally, add bounds checking to digit_to_letter() to fail fast if passed values outside [0, 1]:

def digit_to_letter(digit: int) -> str:
    if digit not in (0, 1):
        raise ValueError(f"digit must be 0 or 1, got {digit}")
    return chr(ord("A") + digit)
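A quick check of the tightened pattern, using the lookaround variant proposed above:

```python
import re

# Tightened pattern from the proposed fix above; the lookarounds reject
# a/b inside words like "maybe" so only standalone option letters match.
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"

print(re.search(ANSWER_REGEX1, "The final answer is A").group(1))  # A
print(re.search(ANSWER_REGEX1, "The best answer is: B").group(1))  # B

# "maybe" no longer yields a spurious capture; extraction would fall
# through to the greedy fallback regex instead.
print(re.search(ANSWER_REGEX1, "The answer is maybe A"))  # None
```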

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)

54-55: ⚠️ Potential issue | 🟠 Major

Tighten ANSWER_REGEX1/2 so they only capture standalone option letters.

[^A-B]*([A-B]) can still grab the a/b inside words like maybe or probably, so extract_answer() may return a choice before a real A/B token is reached. Make these two regexes strict and let GREEDY_REGEX remain the fallback for looser outputs.

🔧 Proposed fix
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
+ANSWER_REGEX2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)

22-23: Return Dataset directly and fix the annotation.

This helper advertises dict[str, list[dict]], but the datasets API returns a single Dataset when split is provided, and a DatasetDict only when it is omitted. Switching to load_dataset(..., split=split) makes the contract accurate and removes the extra indexing hop. (huggingface.co)

♻️ Proposed fix
-from datasets import get_dataset_config_names, load_dataset
+from datasets import Dataset, get_dataset_config_names, load_dataset
@@
-def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, list[dict]]:
-    return {lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] for lang in languages}
+def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, Dataset]:
+    return {
+        lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang, split=split)
+        for lang in languages
+    }

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions"


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a6669c8-6bf0-4020-8d56-492fb5fc13cd

📥 Commits

Reviewing files that changed from the base of the PR and between 49956cf and 3b362e5.

📒 Files selected for processing (1)
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py

Collaborator

@Kipok Kipok left a comment


@naymaraq could you add this to the docs? Should be good to merge after that

Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/multilingual.md`:
- Around lines 227-229: Convert the two fenced command snippets (the
"ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english" block
and the similar snippet around line 300) to indented code blocks: remove the
triple backticks and indent each line with four spaces so the document matches
the repository's MD046 style.
- Line 223: Replace the generic link text "here" with descriptive text in both
places that link to the original benchmarks, e.g. change
"[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU
dataset on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and
make a similarly descriptive replacement for the second occurrence), so the
links satisfy MD059 and convey their destination.
- Around lines 313-325: The command block uses the wrong benchmark flag;
replace "--benchmarks mmmlu" with the correct name for Global PIQA (e.g.,
--benchmarks global_piqa) so the invocation runs the intended benchmark.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d68cb002-64ab-4c44-87ce-e3b3a2bf9e59

📥 Commits

Reviewing files that changed from the base of the PR and between 3b362e5 and 3d44942.

📒 Files selected for processing (1)
  • docs/evaluation/multilingual.md

### mmmlu

- Benchmark is defined in [`nemo_skills/dataset/mmmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmmlu/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).

⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of “here”.

Line 223 and Line 296 use generic link text, which triggers MD059 and is less accessible in rendered docs.

Suggested fix
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
+- Original benchmark source is [OpenAI MMMLU dataset](https://huggingface.co/datasets/openai/MMMLU).

-- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).
+- Original benchmark source is [Global PIQA nonparallel dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).

Also applies to: 296-296

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 223-223: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/multilingual.md` at line 223, Replace the generic link text
"here" with descriptive link text for both instances in
docs/evaluation/multilingual.md (lines referencing the original benchmark and
the other occurrence) to satisfy MD059; for example change
"[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU dataset
on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and make a
similarly descriptive replacement for the second occurrence) so the links convey
their destination and purpose.
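
The MD059 check above amounts to flagging links whose text is a generic word; a minimal illustration (not markdownlint's actual implementation, and the word list is an assumption):

```python
import re

# Naive MD059-style check: report URLs whose link text is a generic word.
GENERIC_LINK_TEXT = {"here", "this", "link", "more"}
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def generic_links(markdown: str) -> list[str]:
    """Return the URLs of links with non-descriptive text."""
    return [
        url
        for text, url in LINK_RE.findall(markdown)
        if text.strip().lower() in GENERIC_LINK_TEXT
    ]
```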

Comment on lines +227 to +229
```bash
ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english
```

⚠️ Potential issue | 🟡 Minor

Match repository markdown code-block style (MD046).

Line 227 and Line 300 use fenced blocks, but lint expects indented blocks in this doc style. Please convert these two snippets to the expected indentation format.

Also applies to: 300-302

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 227-227: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/multilingual.md` around lines 227 - 229, Replace the fenced
markdown code blocks with indented code blocks for the two command snippets (the
one containing "ns prepare_data mmmlu --languages <lang1> <lang2> ...
--include_english" and the later snippet around line 300) by removing the triple
backticks and indenting each line of the snippet with four spaces so the
document matches the repository's MD046 style; update both occurrences (the
mmmlu prepare_data snippet and the other snippet referenced) to use the indented
format.
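
The MD046 conversion is mechanical enough to script; a naive sketch (it drops fence info strings and does not handle nested fences):

```python
def fence_to_indented(markdown: str) -> str:
    # Convert fenced code blocks to 4-space-indented blocks (MD046 "indented"
    # style). Fence marker lines are dropped; all other lines pass through.
    out, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        out.append("    " + line if in_fence else line)
    return "\n".join(out)
```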

naymaraq and others added 4 commits March 14, 2026 00:14
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@naymaraq naymaraq merged commit f5c0c53 into main Mar 16, 2026
6 checks passed
@naymaraq naymaraq deleted the dkaramyan/global_piqa branch March 16, 2026 12:45
sgunasekar added a commit that referenced this pull request Mar 24, 2026
commit f5c0c53
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Mon Mar 16 16:45:33 2026 +0400

    Add Global PIQA benchmark (#1299)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 86071c1
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Thu Mar 12 21:16:32 2026 -0700

    fixing sandbox use for livecodebench (#1304)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 4928ef5
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 15:28:41 2026 -0700

    nano v3 math tool calling slurm test (#1303)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d4e4450
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 14:17:03 2026 -0700

    fix: restore SIGINT handler in sandbox shell worker to prevent session resets (#1302)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 2b0a84d
Author: Mahan <25934206+MahanFathi@users.noreply.github.com>
Date:   Thu Mar 12 00:07:49 2026 -0400

    Add HotpotQA multi-hop QA benchmark (#1292)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 75314b6
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Thu Mar 12 08:06:51 2026 +0400

    Gnalbandyan/ugph hle verified (#1293)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8bbf387
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Mar 11 15:48:21 2026 -0700

    build: fix gpu ci (#1301)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 005cd03
Author: Vahid Noroozi <VahidooX@users.noreply.github.com>
Date:   Tue Mar 10 12:52:27 2026 -0700

    Fix 1-hour client timeout in long-running generation jobs (#1297)

    Signed-off-by: vahidoox <vnoroozi@nvidia.com>

commit 596b888
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 19:11:26 2026 +0100

    skip output-rs*_submissions.jsonl files when summarizing critpt (#1300)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

commit fe92aec
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 00:00:57 2026 +0100

    use output-rs prefix when detecting sampling results (#1296)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit f6f7041
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Tue Mar 10 02:40:06 2026 +0400

    Add MMMLU benchmark (#1281)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
