
Add Global PIQA benchmark#1299

Merged
naymaraq merged 12 commits into main from dkaramyan/global_piqa
Mar 16, 2026

Conversation

Collaborator

@naymaraq naymaraq commented Mar 10, 2026

Add Global PIQA Benchmark

Summary

  • Adds support for the Global PIQA multilingual benchmark, a physical commonsense reasoning dataset (https://arxiv.org/abs/2510.24081). The benchmark covers 116 languages.
  • Implements prepare.py to download and format the dataset from HuggingFace.
  • Allows users to optionally specify which languages to process via the --languages flag.
  • Formats entries as two-choice questions (A/B) using a prompt template adapted from lm-evaluation-harness.
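For illustration, a two-choice prompt in the lm-evaluation-harness style might look like the sketch below. The template text and field names here are assumptions for the example, not the exact QUERY_TEMPLATE from this PR:

```python
# Hypothetical two-choice prompt template; the actual QUERY_TEMPLATE in
# global_piqa_utils.py is adapted from lm-evaluation-harness and may differ.
QUERY_TEMPLATE = (
    "{question}\n\n"
    "A. {option_a}\n"
    "B. {option_b}\n\n"
    "Answer with the letter of the correct choice."
)

prompt = QUERY_TEMPLATE.format(
    question="How do you cool a bowl of soup quickly?",
    option_a="Stir in an ice cube.",
    option_b="Cover it and turn up the heat.",
)
print(prompt)
```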

Summary by CodeRabbit

  • New Features
    • Added Global PIQA multilingual support with dataset loading, MCQ prompt generation, answer extraction, and a CLI to produce per-language evaluation files.
  • Documentation
    • Expanded multilingual evaluation guide: multi-language flags, per-language results, added MMMLU and Global PIQA benchmark pages with per-model examples and command snippets.

Contributor

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a Global PIQA dataset package: module constants, utilities to load/format the mrlbenchmarks/global-piqa-nonparallel dataset into MCQ prompts with extraction regexes, and a CLI to write per-language test.jsonl with expected answers and metadata.

Changes

  • Module Configuration — nemo_skills/dataset/global_piqa/__init__.py
    Adds exported constants METRICS_TYPE = "multichoice" and GENERATION_ARGS = "++prompt_config=generic/default ++eval_type=multichoice".
  • Dataset Utilities — nemo_skills/dataset/global_piqa/global_piqa_utils.py
    New utilities: supported_languages(), load_global_piqa_datasets(), digit_to_letter(), get_mcq_fields(), a Schema class, QUERY_TEMPLATE, and answer-extraction regex constants for MCQ prompt generation and extraction.
  • Dataset Preparation CLI — nemo_skills/dataset/global_piqa/prepare.py
    New CLI script with format_entry(entry, language) and main(args): validates languages, loads datasets, formats entries (expected answer, extraction metadata, MCQ fields), and writes per-language test.jsonl.
  • Documentation — docs/evaluation/multilingual.md
    Expanded multilingual evaluation docs: switched --language → --languages, added MMMLU and Global PIQA sections, per-model examples, and translation metrics examples.
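The prepare flow described above can be sketched as follows. Function bodies and entry field names ("goal", "label") are illustrative assumptions; the real prepare.py in nemo_skills/dataset/global_piqa/ may differ:

```python
import json
from pathlib import Path

# Illustrative sketch of the prepare flow; not the PR's actual implementation.
def format_entry(entry: dict, language: str) -> dict:
    # Field names here ("goal", "label") are assumptions for the example.
    return {
        "language": language,
        "question": entry["goal"],
        "expected_answer": "AB"[entry["label"]],
    }

def write_test_jsonl(datasets: dict, output_file: Path) -> None:
    # Format every entry before opening the output file for writing.
    entries = [
        format_entry(entry, language)
        for language, dataset in datasets.items()
        for entry in dataset
    ]
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
```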

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Prepare as "prepare.py"
    participant Utils as "global_piqa_utils"
    participant HF as "HuggingFace\n(global-piqa-nonparallel)"
    participant FS as "FileSystem\n(test.jsonl)"

    User->>Prepare: run with --languages
    Prepare->>Utils: supported_languages()
    Utils->>HF: list dataset configs
    HF-->>Utils: available languages
    Utils-->>Prepare: supported languages

    Prepare->>Utils: load_global_piqa_datasets(languages, split="test")
    Utils->>HF: load each language split
    HF-->>Utils: dataset entries per language
    Utils-->>Prepare: dict(language -> entries)

    loop per language entry
        Prepare->>Utils: get_mcq_fields(entry)
        Utils-->>Prepare: MCQ prompt/options/labels
        Prepare->>Prepare: format_entry(entry, language)
        Prepare->>FS: append JSON line to test.jsonl
    end

    FS-->>User: test.jsonl ready

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ — The title clearly and concisely summarizes the main addition: a new Global PIQA benchmark implementation spanning utilities, dataset preparation, and documentation.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ac8f52b0-5f48-44f2-bef8-9c8872479777

📥 Commits

Reviewing files that changed from the base of the PR and between fe92aec and 88698ba.

📒 Files selected for processing (3)
  • nemo_skills/dataset/global_piqa/__init__.py
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py
  • nemo_skills/dataset/global_piqa/prepare.py

Comment on lines +44 to +53
datasets = load_global_piqa_datasets(args.languages)

data_dir = Path(__file__).absolute().parent
output_file = data_dir / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
    for language in datasets:
        for entry in datasets[language]:
            entry = format_entry(entry=entry, language=language)
            json.dump(entry, fout, ensure_ascii=False)
            fout.write("\n")
Contributor


⚠️ Potential issue | 🟡 Minor

Compute all entries before writing to prevent data loss on failure.

Per coding guidelines, when adding new benchmarks, all computation should complete before file writes to prevent accidental data loss if code fails mid-way. Currently, if format_entry raises an exception during iteration, the file will contain partial data.

🛡️ Proposed fix to prevent partial writes
     datasets = load_global_piqa_datasets(args.languages)
 
+    # Compute all entries first to prevent partial writes on failure
+    all_entries = [
+        format_entry(entry=entry, language=language)
+        for language in datasets
+        for entry in datasets[language]
+    ]
+
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language in datasets:
-            for entry in datasets[language]:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in all_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
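To see why this matters, here is a small self-contained demonstration; the failing formatter and file layout are made up for the example:

```python
import json
import tempfile
from pathlib import Path

def flaky_format(entry):
    # Stand-in for a format_entry that fails partway through the data.
    if entry["id"] == 2:
        raise ValueError("bad entry")
    return entry

entries = [{"id": 0}, {"id": 1}, {"id": 2}]
out = Path(tempfile.mkdtemp()) / "test.jsonl"

# Writing while formatting: a failure leaves a truncated file behind.
try:
    with open(out, "wt", encoding="utf-8") as fout:
        for e in entries:
            fout.write(json.dumps(flaky_format(e)) + "\n")
except ValueError:
    pass
partial = out.read_text(encoding="utf-8").splitlines()
print(len(partial))  # 2 lines written before the failure

# Formatting first: the same failure happens before the file is touched.
out.unlink()
try:
    formatted = [flaky_format(e) for e in entries]
except ValueError:
    pass
print(out.exists())  # False: no partial file left behind
```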


Collaborator

@shuoyangd shuoyangd left a comment


Just one nit question, otherwise looking good to me.

naymaraq and others added 2 commits March 10, 2026 20:29
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@naymaraq naymaraq force-pushed the dkaramyan/global_piqa branch from 36f080f to 49956cf Compare March 10, 2026 16:30
@shuoyangd shuoyangd self-requested a review March 10, 2026 16:36
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)

48-53: ⚠️ Potential issue | 🟡 Minor

Precompute the rows before truncating test.jsonl.

If format_entry() fails mid-loop, the file has already been opened in write mode and will be left partially written. Build the formatted records first, then open the file once to emit them.

🛡️ Proposed fix
     datasets = load_global_piqa_datasets(args.languages)
 
+    formatted_entries = [
+        format_entry(entry=entry, language=language)
+        for language, dataset in datasets.items()
+        for entry in dataset
+    ]
+
     data_dir = Path(__file__).absolute().parent
     output_file = data_dir / "test.jsonl"
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for language in datasets:
-            for entry in datasets[language]:
-                entry = format_entry(entry=entry, language=language)
-                json.dump(entry, fout, ensure_ascii=False)
-                fout.write("\n")
+        for entry in formatted_entries:
+            json.dump(entry, fout, ensure_ascii=False)
+            fout.write("\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)

40-45: Avoid duplicate supported_languages() lookup by deferring default to main().

The function is called during parser setup (line 60) before args are parsed, then again in main() (line 41) for validation. Using default=None and handling the default in main() with languages = args.languages or supported_languages() eliminates the unnecessary early lookup on the no-flag path while keeping the validation logic clean.

♻️ Proposed refactor
 def main(args):
-    invalid = set(args.languages) - set(supported_languages())
+    all_languages = supported_languages()
+    languages = args.languages or all_languages
+    invalid = set(languages) - set(all_languages)
     if invalid:
         raise ValueError(f"Unsupported languages: {invalid}")
-    datasets = load_global_piqa_datasets(args.languages)
+    datasets = load_global_piqa_datasets(languages)
 ...
     parser.add_argument(
         "--languages",
-        default=supported_languages(),
+        default=None,
         nargs="+",
-        help="Languages to process.",
+        help="Languages to process. Defaults to all supported languages.",
     )

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 979e869e-f766-4632-b6de-8d0a078dcfb7

📥 Commits

Reviewing files that changed from the base of the PR and between 36f080f and 49956cf.

📒 Files selected for processing (3)
  • nemo_skills/dataset/global_piqa/__init__.py
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py
  • nemo_skills/dataset/global_piqa/prepare.py

Comment on lines +26 to +27
def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)
Contributor


⚠️ Potential issue | 🟡 Minor

Fail fast on labels outside A/B.

This is a two-choice benchmark, but chr(ord("A") + digit) will happily emit C, D, etc. if the source label is ever unexpected. Raising here is safer than writing impossible expected_answer values.

🛡️ Proposed fix
 def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
     return chr(ord("A") + digit)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def digit_to_letter(digit: int) -> str:
-    return chr(ord("A") + digit)
+def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
+    return chr(ord("A") + digit)
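The guard in action, as a quick sanity check (reproducing the suggested function above):

```python
def digit_to_letter(digit: int) -> str:
    # Guard from the suggestion above: Global PIQA is two-choice only.
    if digit not in (0, 1):
        raise ValueError(f"Unsupported Global PIQA label: {digit}")
    return chr(ord("A") + digit)

print(digit_to_letter(0))  # A
print(digit_to_letter(1))  # B
try:
    digit_to_letter(2)
except ValueError as err:
    print(err)  # Unsupported Global PIQA label: 2
```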

Comment on lines +54 to +55
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import re

pattern = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for sample in [
    "The answer is maybe A",
    "The answer is probably B",
    "The final answer is A",
]:
    match = pattern.search(sample)
    print(f"{sample!r} -> {match.group(1) if match else None}")
PY

Repository: NVIDIA-NeMo/Skills

Length of output: 151


🏁 Script executed:

fd "global_piqa_utils.py" --type f

Repository: NVIDIA-NeMo/Skills

Length of output: 114


🏁 Script executed:

cat -n nemo_skills/dataset/global_piqa/global_piqa_utils.py | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 3055


🏁 Script executed:

rg "digit_to_letter" nemo_skills/dataset/global_piqa/global_piqa_utils.py -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 411


🏁 Script executed:

rg "EXTRACT_REGEX" nemo_skills/dataset/global_piqa/ -A 3 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1484


🏁 Script executed:

rg "digit_to_letter" nemo_skills/dataset/global_piqa/ -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1689


🏁 Script executed:

cat nemo_skills/dataset/global_piqa/prepare.py | grep -A 10 "Schema.LABEL"

Repository: NVIDIA-NeMo/Skills

Length of output: 389


🏁 Script executed:

python3 <<'EOF'
import re

# Test the proposed fix from the review comment
proposed_regex1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"
proposed_regex2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"

# Also test a simpler fix: use [^A-Ba-b]
simple_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-Ba-b]*([A-B])"

test_cases = [
    "The answer is maybe A",
    "The answer is probably B",
    "The final answer is A",
    "The best answer is A",
    "Answer: B",
]

print("=== ORIGINAL REGEX (BROKEN) ===")
original_regex1 = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for test in test_cases:
    match = original_regex1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== PROPOSED FIX (COMPLEX) ===")
proposed_pattern1 = re.compile(proposed_regex1)
for test in test_cases:
    match = proposed_pattern1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== SIMPLE FIX [^A-Ba-b] ===")
simple_pattern1 = re.compile(simple_regex1)
for test in test_cases:
    match = simple_pattern1.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 575


🏁 Script executed:

python3 <<'EOF'
import re

# Test alternative fixes
word_boundary_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[\s:=\-]*\b([A-B])\b"
word_boundary_regex2 = r"(?i)[Aa]nswer\s*:[\s]*\b([A-B])\b"

# Negative lookbehind/lookahead approach
negative_lb_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"

test_cases = [
    "The answer is maybe A",  # Invalid LLM output
    "The answer is probably B",  # Invalid LLM output
    "The final answer is A",  # Valid
    "The best answer is A",  # Valid
    "The best answer is: A",  # Valid
    "Answer: B",  # Valid
    "The answer is B",  # Valid
]

print("=== WORD BOUNDARY FIX ===")
pattern1 = re.compile(word_boundary_regex1)
pattern2 = re.compile(word_boundary_regex2)
for test in test_cases:
    match = pattern1.search(test)
    if not match:
        match = pattern2.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== NEGATIVE LOOKBEHIND/LOOKAHEAD FIX ===")
pattern_lb = re.compile(negative_lb_regex1)
for test in test_cases:
    match = pattern_lb.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 531


🏁 Script executed:

python3 <<'EOF'
import re

# Test ANSWER_REGEX2 with the same approach
answer_regex2_original = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
answer_regex2_fixed = r"(?i)[Aa]nswer\s*:(?<![a-z])[^A-B]*(?<![a-z])([A-B])(?![a-z])"

test_cases = [
    "answer: maybe A",
    "answer: probably B",
    "Answer: A",
    "Answer: B",
]

print("=== ANSWER_REGEX2 ORIGINAL ===")
pattern = re.compile(answer_regex2_original)
for test in test_cases:
    match = pattern.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")

print("\n=== ANSWER_REGEX2 WITH NEGATIVE LOOKBEHIND/LOOKAHEAD ===")
pattern_fixed = re.compile(answer_regex2_fixed)
for test in test_cases:
    match = pattern_fixed.search(test)
    print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 322


Fix regex patterns to reject letters inside words like "maybe" and "probably".

With [^A-B]*([A-B]), the regex captures the b in words like maybe and probably, marking correct outputs as B. Use negative lookbehind/lookahead instead to ensure the captured letter is not part of a word:

🔧 Proposed fix
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
+ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"

Additionally, add bounds checking to digit_to_letter() to fail fast if passed values outside [0, 1]:

def digit_to_letter(digit: int) -> str:
    if digit not in (0, 1):
        raise ValueError(f"digit must be 0 or 1, got {digit}")
    return chr(ord("A") + digit)
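A quick check of the tightened pattern, using the lookaround variant proposed above:

```python
import re

# Tightened pattern from the proposed fix above; the lookarounds reject
# a/b inside words like "maybe" so only standalone option letters match.
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"

print(re.search(ANSWER_REGEX1, "The final answer is A").group(1))  # A
print(re.search(ANSWER_REGEX1, "The best answer is: B").group(1))  # B

# "maybe" no longer yields a spurious capture; extraction would fall
# through to the greedy fallback regex instead.
print(re.search(ANSWER_REGEX1, "The answer is maybe A"))  # None
```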

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)

54-55: ⚠️ Potential issue | 🟠 Major

Tighten ANSWER_REGEX1/2 so they only capture standalone option letters.

[^A-B]*([A-B]) can still grab the a/b inside words like maybe or probably, so extract_answer() may return a choice before a real A/B token is reached. Make these two regexes strict and let GREEDY_REGEX remain the fallback for looser outputs.

🔧 Proposed fix
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
+ANSWER_REGEX2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)

22-23: Return Dataset directly and fix the annotation.

This helper advertises dict[str, list[dict]], but the datasets API returns a single Dataset when split is provided, and a DatasetDict only when it is omitted. Switching to load_dataset(..., split=split) makes the contract accurate and removes the extra indexing hop. (huggingface.co)

♻️ Proposed fix
-from datasets import get_dataset_config_names, load_dataset
+from datasets import Dataset, get_dataset_config_names, load_dataset
@@
-def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, list[dict]]:
-    return {lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] for lang in languages}
+def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, Dataset]:
+    return {
+        lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang, split=split)
+        for lang in languages
+    }

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions"


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a6669c8-6bf0-4020-8d56-492fb5fc13cd

📥 Commits

Reviewing files that changed from the base of the PR and between 49956cf and 3b362e5.

📒 Files selected for processing (1)
  • nemo_skills/dataset/global_piqa/global_piqa_utils.py

Collaborator

@Kipok Kipok left a comment


@naymaraq could you add this to the docs? Should be good to merge after that

Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/multilingual.md`:
- Around lines 227-229: Convert the two fenced command snippets (the
"ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english" block
and the similar snippet around line 300) to indented code blocks: remove the
triple backticks and indent each line with four spaces so the document matches
the repository's MD046 style.
- Line 223: Replace the generic link text "here" with descriptive text in both
places that link to the original benchmarks, e.g. change
"[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU
dataset on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and
make a similarly descriptive replacement for the second occurrence), so the
links satisfy MD059 and convey their destination.
- Around lines 313-325: The command block uses the wrong benchmark flag;
replace "--benchmarks mmmlu" with the correct name for Global PIQA (e.g.,
--benchmarks global_piqa) so the invocation runs the intended benchmark.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d68cb002-64ab-4c44-87ce-e3b3a2bf9e59

📥 Commits

Reviewing files that changed from the base of the PR and between 3b362e5 and 3d44942.

📒 Files selected for processing (1)
  • docs/evaluation/multilingual.md

### mmmlu

- Benchmark is defined in [`nemo_skills/dataset/mmmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmmlu/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).

⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of “here”.

Line 223 and Line 296 use generic link text, which triggers MD059 and is less accessible in rendered docs.

Suggested fix
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
+- Original benchmark source is [OpenAI MMMLU dataset](https://huggingface.co/datasets/openai/MMMLU).

-- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).
+- Original benchmark source is [Global PIQA nonparallel dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).

Also applies to: 296-296

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 223-223: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/multilingual.md` at line 223, Replace the generic link text
"here" with descriptive link text for both instances in
docs/evaluation/multilingual.md (lines referencing the original benchmark and
the other occurrence) to satisfy MD059; for example change
"[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU dataset
on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and make a
similarly descriptive replacement for the second occurrence) so the links convey
their destination and purpose.
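
The MD059 check above amounts to flagging links whose text is a generic word; a minimal illustration (not markdownlint's actual implementation, and the word list is an assumption):

```python
import re

# Naive MD059-style check: report URLs whose link text is a generic word.
GENERIC_LINK_TEXT = {"here", "this", "link", "more"}
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def generic_links(markdown: str) -> list[str]:
    """Return the URLs of links with non-descriptive text."""
    return [
        url
        for text, url in LINK_RE.findall(markdown)
        if text.strip().lower() in GENERIC_LINK_TEXT
    ]
```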

Comment on lines +227 to +229
```bash
ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english
```

⚠️ Potential issue | 🟡 Minor

Match repository markdown code-block style (MD046).

Line 227 and Line 300 use fenced blocks, but lint expects indented blocks in this doc style. Please convert these two snippets to the expected indentation format.

Also applies to: 300-302

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 227-227: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/multilingual.md` around lines 227 - 229, Replace the fenced
markdown code blocks with indented code blocks for the two command snippets (the
one containing "ns prepare_data mmmlu --languages <lang1> <lang2> ...
--include_english" and the later snippet around line 300) by removing the triple
backticks and indenting each line of the snippet with four spaces so the
document matches the repository's MD046 style; update both occurrences (the
mmmlu prepare_data snippet and the other snippet referenced) to use the indented
format.
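
The MD046 conversion is mechanical enough to script; a naive sketch (it drops fence info strings and does not handle nested fences):

```python
def fence_to_indented(markdown: str) -> str:
    # Convert fenced code blocks to 4-space-indented blocks (MD046 "indented"
    # style). Fence marker lines are dropped; all other lines pass through.
    out, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        out.append("    " + line if in_fence else line)
    return "\n".join(out)
```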

naymaraq and others added 4 commits March 14, 2026 00:14
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
@naymaraq naymaraq merged commit f5c0c53 into main Mar 16, 2026
6 checks passed
@naymaraq naymaraq deleted the dkaramyan/global_piqa branch March 16, 2026 12:45
sgunasekar added a commit that referenced this pull request Mar 24, 2026
commit f5c0c53
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Mon Mar 16 16:45:33 2026 +0400

    Add Global PIQA benchmark (#1299)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 86071c1
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Thu Mar 12 21:16:32 2026 -0700

    fixing sandbox use for livecodebench (#1304)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 4928ef5
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 15:28:41 2026 -0700

    nano v3 math tool calling slurm test (#1303)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d4e4450
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 14:17:03 2026 -0700

    fix: restore SIGINT handler in sandbox shell worker to prevent session resets (#1302)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 2b0a84d
Author: Mahan <25934206+MahanFathi@users.noreply.github.com>
Date:   Thu Mar 12 00:07:49 2026 -0400

    Add HotpotQA multi-hop QA benchmark (#1292)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 75314b6
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Thu Mar 12 08:06:51 2026 +0400

    Gnalbandyan/ugph hle verified (#1293)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8bbf387
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Mar 11 15:48:21 2026 -0700

    build: fix gpu ci (#1301)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 005cd03
Author: Vahid Noroozi <VahidooX@users.noreply.github.com>
Date:   Tue Mar 10 12:52:27 2026 -0700

    Fix 1-hour client timeout in long-running generation jobs (#1297)

    Signed-off-by: vahidoox <vnoroozi@nvidia.com>

commit 596b888
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 19:11:26 2026 +0100

    skip output-rs*_submissions.jsonl files when summarizing critpt (#1300)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

commit fe92aec
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 00:00:57 2026 +0100

    use output-rs prefix when detecting sampling results (#1296)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit f6f7041
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Tue Mar 10 02:40:06 2026 +0400

    Add MMMLU benchmark (#1281)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
