Conversation

> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a Global PIQA dataset package: module constants, utilities to load and format the mrlbenchmarks/global-piqa-nonparallel dataset into MCQ prompts with extraction regexes, and a CLI to write per-language entries to a combined test.jsonl.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Prepare as "prepare.py"
    participant Utils as "global_piqa_utils"
    participant HF as "HuggingFace\n(global-piqa-nonparallel)"
    participant FS as "FileSystem\n(test.jsonl)"
    User->>Prepare: run with --languages
    Prepare->>Utils: supported_languages()
    Utils->>HF: list dataset configs
    HF-->>Utils: available languages
    Utils-->>Prepare: supported languages
    Prepare->>Utils: load_global_piqa_datasets(languages, split="test")
    Utils->>HF: load each language split
    HF-->>Utils: dataset entries per language
    Utils-->>Prepare: dict(language -> entries)
    loop per language entry
        Prepare->>Utils: get_mcq_fields(entry)
        Utils-->>Prepare: MCQ prompt/options/labels
        Prepare->>Prepare: format_entry(entry, language)
        Prepare->>FS: append JSON line to test.jsonl
    end
    FS-->>User: test.jsonl ready
```
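The flow in the diagram can be sketched end-to-end in a few lines. This is an illustrative stand-in, not the actual prepare.py: the entry field names (prompt, solution0, solution1, label) and the inline substitute for load_global_piqa_datasets are assumptions.

```python
import json
import tempfile
from pathlib import Path

def format_entry(entry: dict, language: str) -> dict:
    # Mirror of the get_mcq_fields + format_entry steps in the diagram;
    # the field names here are illustrative assumptions, not the real schema.
    return {
        "question": entry["prompt"],
        "options": [f"A. {entry['solution0']}", f"B. {entry['solution1']}"],
        "expected_answer": "AB"[entry["label"]],
        "language": language,
    }

# Stand-in for load_global_piqa_datasets(languages, split="test"),
# which would normally download from HuggingFace.
datasets = {
    "eng_Latn": [
        {
            "prompt": "To open a stuck jar, you should",
            "solution0": "twist the lid with a rubber grip",
            "solution1": "freeze the jar overnight",
            "label": 0,
        },
    ],
}

output_file = Path(tempfile.mkdtemp()) / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
    for language, entries in datasets.items():
        for entry in entries:
            json.dump(format_entry(entry, language), fout, ensure_ascii=False)
            fout.write("\n")

print(output_file.read_text(encoding="utf-8").strip())
```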
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/global_piqa/prepare.py`:
- Around line 44-53: The current loop writes each formatted entry to output_file
as it's produced, risking partial writes if format_entry raises; instead, first
iterate through datasets (using load_global_piqa_datasets and format_entry) and
collect all formatted entries into a list (e.g., formatted_entries), then open
output_file (data_dir / "test.jsonl") once and write the precomputed entries to
the file; ensure you reference datasets, format_entry, and output_file in your
changes so you compute everything before any file I/O.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ac8f52b0-5f48-44f2-bef8-9c8872479777
📒 Files selected for processing (3)
- nemo_skills/dataset/global_piqa/__init__.py
- nemo_skills/dataset/global_piqa/global_piqa_utils.py
- nemo_skills/dataset/global_piqa/prepare.py
```python
datasets = load_global_piqa_datasets(args.languages)

data_dir = Path(__file__).absolute().parent
output_file = data_dir / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
    for language in datasets:
        for entry in datasets[language]:
            entry = format_entry(entry=entry, language=language)
            json.dump(entry, fout, ensure_ascii=False)
            fout.write("\n")
```
Compute all entries before writing to prevent data loss on failure.
Per coding guidelines, when adding new benchmarks, all computation should complete before file writes to prevent accidental data loss if code fails mid-way. Currently, if format_entry raises an exception during iteration, the file will contain partial data.
🛡️ Proposed fix

```diff
 datasets = load_global_piqa_datasets(args.languages)

+# Compute all entries first to prevent partial writes on failure
+all_entries = [
+    format_entry(entry=entry, language=language)
+    for language in datasets
+    for entry in datasets[language]
+]
+
 data_dir = Path(__file__).absolute().parent
 output_file = data_dir / "test.jsonl"
 with open(output_file, "wt", encoding="utf-8") as fout:
-    for language in datasets:
-        for entry in datasets[language]:
-            entry = format_entry(entry=entry, language=language)
-            json.dump(entry, fout, ensure_ascii=False)
-            fout.write("\n")
+    for entry in all_entries:
+        json.dump(entry, fout, ensure_ascii=False)
+        fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/global_piqa/prepare.py` around lines 44 - 53, The current
loop writes each formatted entry to output_file as it's produced, risking
partial writes if format_entry raises; instead, first iterate through datasets
(using load_global_piqa_datasets and format_entry) and collect all formatted
entries into a list (e.g., formatted_entries), then open output_file (data_dir /
"test.jsonl") once and write the precomputed entries to the file; ensure you
reference datasets, format_entry, and output_file in your changes so you compute
everything before any file I/O.
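The fix the reviewer asks for reduces to a small, general pattern: materialize every record (forcing any formatting exception) before the output file is ever truncated. A minimal sketch with generic names, separate from the actual prepare.py code:

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, path: Path) -> None:
    # Serialize everything up front so a bad record raises before
    # the existing file is opened (and truncated) for writing.
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    with open(path, "wt", encoding="utf-8") as fout:
        for line in lines:
            fout.write(line + "\n")

path = Path(tempfile.gettempdir()) / "demo.jsonl"
write_jsonl([{"id": i, "answer": "AB"[i % 2]} for i in range(3)], path)
print(path.read_text(encoding="utf-8"))
```

With this shape, a record that fails to serialize leaves the previously written file untouched, because the exception fires before `open(..., "wt")` runs.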
**shuoyangd** left a comment:

Just one nit question, otherwise looking good to me.
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Force-pushed from 36f080f to 49956cf.
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)
48-53: ⚠️ Potential issue | 🟡 Minor

Precompute the rows before truncating test.jsonl.

If format_entry() fails mid-loop, the file has already been opened in write mode and will be left partially written. Build the formatted records first, then open the file once to emit them.

🛡️ Proposed fix

```diff
 datasets = load_global_piqa_datasets(args.languages)
+formatted_entries = [
+    format_entry(entry=entry, language=language)
+    for language, dataset in datasets.items()
+    for entry in dataset
+]
+
 data_dir = Path(__file__).absolute().parent
 output_file = data_dir / "test.jsonl"
 with open(output_file, "wt", encoding="utf-8") as fout:
-    for language in datasets:
-        for entry in datasets[language]:
-            entry = format_entry(entry=entry, language=language)
-            json.dump(entry, fout, ensure_ascii=False)
-            fout.write("\n")
+    for entry in formatted_entries:
+        json.dump(entry, fout, ensure_ascii=False)
+        fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/global_piqa/prepare.py` around lines 48 - 53, The current loop opens output_file for writing before calling format_entry, risking truncation if format_entry raises; instead, first iterate over datasets and collect formatted entries by calling format_entry(entry, language) into a list (e.g., formatted = []), handling or propagating errors during formatting, and only after all entries are successfully formatted open output_file once and write the JSON lines from that list using json.dump and fout.write("\n"); reference format_entry, datasets, and output_file to locate and modify the code.
🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/prepare.py (1)
40-45: Avoid duplicate supported_languages() lookup by deferring the default to main().

The function is called during parser setup (line 60) before args are parsed, then again in main() (line 41) for validation. Using default=None and handling the default in main() with languages = args.languages or supported_languages() eliminates the unnecessary early lookup on the no-flag path while keeping the validation logic clean.

♻️ Proposed refactor

```diff
 def main(args):
-    invalid = set(args.languages) - set(supported_languages())
+    all_languages = supported_languages()
+    languages = args.languages or all_languages
+    invalid = set(languages) - set(all_languages)
     if invalid:
         raise ValueError(f"Unsupported languages: {invalid}")
-    datasets = load_global_piqa_datasets(args.languages)
+    datasets = load_global_piqa_datasets(languages)
 ...
     parser.add_argument(
         "--languages",
-        default=supported_languages(),
+        default=None,
         nargs="+",
-        help="Languages to process.",
+        help="Languages to process. Defaults to all supported languages.",
     )
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/global_piqa/prepare.py` around lines 40 - 45, Change the argument parser to use default=None for the --languages flag and in main() avoid calling supported_languages() twice by doing: languages = args.languages or supported_languages(); then validate using invalid = set(languages) - set(supported_languages()) and pass languages to load_global_piqa_datasets. Update references from args.languages to the local languages variable so the initial parser setup no longer triggers supported_languages() on the common (no-flag) path.
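The refactor the nitpick describes amounts to the following pattern; supported_languages here is a local stub for the real HuggingFace config lookup, so this sketch runs without the actual dataset:

```python
import argparse

def supported_languages() -> list[str]:
    # Stub for the real (and potentially slow) dataset-config lookup.
    return ["eng_Latn", "deu_Latn", "fra_Latn"]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--languages",
    default=None,  # defer the expensive lookup until main()
    nargs="+",
    help="Languages to process. Defaults to all supported languages.",
)

args = parser.parse_args([])  # simulate the no-flag invocation
all_languages = supported_languages()  # single lookup
languages = args.languages or all_languages
invalid = set(languages) - set(all_languages)
if invalid:
    raise ValueError(f"Unsupported languages: {invalid}")
print(languages)
```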
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py`:
- Around line 26-27: The function digit_to_letter currently maps any integer to
letters via chr(ord("A") + digit) which can produce labels beyond "A"/"B";
update digit_to_letter to validate the input digit (e.g., ensure it's 0 or 1 or
at least within 0..1) and raise a ValueError for out-of-range values, leaving
the mapping of 0->"A" and 1->"B" intact; reference the function name
digit_to_letter when making this change so callers receive a clear exception
instead of invalid labels.
- Around line 54-55: Update ANSWER_REGEX1 and ANSWER_REGEX2 to use word-boundary
or lookaround so captured A/B are not part of larger words (e.g., replace the
[^A-B]*([A-B]) postfix with a pattern that ensures the captured letter is not
preceded or followed by word characters, using constructs like
(?<![A-Za-z0-9_])([AB])(?![A-Za-z0-9_]) or \b([AB])\b so "maybe" or "probably"
won't match); then add bounds checking in digit_to_letter(digit: int) to raise
ValueError when digit is not 0 or 1 (e.g., if digit not in (0,1): raise
ValueError(...)) so invalid inputs fail fast.
---
Duplicate comments:
In `@nemo_skills/dataset/global_piqa/prepare.py`:
- Around line 48-53: The current loop opens output_file for writing before
calling format_entry, risking truncation if format_entry raises; instead, first
iterate over datasets and collect formatted entries by calling
format_entry(entry, language) into a list (e.g., formatted = []), handling or
propagating errors during formatting, and only after all entries are
successfully formatted open output_file once and write the JSON lines from that
list using json.dump and fout.write("\n"); reference format_entry, datasets, and
output_file to locate and modify the code.
---
Nitpick comments:
In `@nemo_skills/dataset/global_piqa/prepare.py`:
- Around line 40-45: Change the argument parser to use default=None for the
--languages flag and in main() avoid calling supported_languages() twice by
doing: languages = args.languages or supported_languages(); then validate using
invalid = set(languages) - set(supported_languages()) and pass languages to
load_global_piqa_datasets. Update references from args.languages to the local
languages variable so the initial parser setup no longer triggers
supported_languages() on the common (no-flag) path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 979e869e-f766-4632-b6de-8d0a078dcfb7
📒 Files selected for processing (3)
- nemo_skills/dataset/global_piqa/__init__.py
- nemo_skills/dataset/global_piqa/global_piqa_utils.py
- nemo_skills/dataset/global_piqa/prepare.py
```python
def digit_to_letter(digit: int) -> str:
    return chr(ord("A") + digit)
```
Fail fast on labels outside A/B.
This is a two-choice benchmark, but chr(ord("A") + digit) will happily emit C, D, etc. if the source label is ever unexpected. Raising here is safer than writing impossible expected_answer values.
🛡️ Proposed fix

```diff
 def digit_to_letter(digit: int) -> str:
+    if digit not in (0, 1):
+        raise ValueError(f"Unsupported Global PIQA label: {digit}")
     return chr(ord("A") + digit)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def digit_to_letter(digit: int) -> str:
    if digit not in (0, 1):
        raise ValueError(f"Unsupported Global PIQA label: {digit}")
    return chr(ord("A") + digit)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py` around lines 26 - 27,
The function digit_to_letter currently maps any integer to letters via
chr(ord("A") + digit) which can produce labels beyond "A"/"B"; update
digit_to_letter to validate the input digit (e.g., ensure it's 0 or 1 or at
least within 0..1) and raise a ValueError for out-of-range values, leaving the
mapping of 0->"A" and 1->"B" intact; reference the function name digit_to_letter
when making this change so callers receive a clear exception instead of invalid
labels.
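A quick check of the guarded helper (same function body as the proposed fix above):

```python
def digit_to_letter(digit: int) -> str:
    # Only the two-choice label space is valid for Global PIQA.
    if digit not in (0, 1):
        raise ValueError(f"Unsupported Global PIQA label: {digit}")
    return chr(ord("A") + digit)

print(digit_to_letter(0), digit_to_letter(1))  # A B
```

Out-of-range labels now fail loudly instead of silently producing an impossible expected_answer like "C".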
```python
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import re
pattern = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for sample in [
"The answer is maybe A",
"The answer is probably B",
"The final answer is A",
]:
match = pattern.search(sample)
print(f"{sample!r} -> {match.group(1) if match else None}")
PY

Repository: NVIDIA-NeMo/Skills
Length of output: 151
🏁 Script executed:
fd "global_piqa_utils.py" --type f

Repository: NVIDIA-NeMo/Skills
Length of output: 114
🏁 Script executed:
cat -n nemo_skills/dataset/global_piqa/global_piqa_utils.py | head -100

Repository: NVIDIA-NeMo/Skills
Length of output: 3055
🏁 Script executed:
rg "digit_to_letter" nemo_skills/dataset/global_piqa/global_piqa_utils.py -A 2 -B 2

Repository: NVIDIA-NeMo/Skills
Length of output: 411
🏁 Script executed:
rg "EXTRACT_REGEX" nemo_skills/dataset/global_piqa/ -A 3 -B 2

Repository: NVIDIA-NeMo/Skills
Length of output: 1484
🏁 Script executed:
rg "digit_to_letter" nemo_skills/dataset/global_piqa/ -B 2 -A 2

Repository: NVIDIA-NeMo/Skills
Length of output: 1689
🏁 Script executed:
cat nemo_skills/dataset/global_piqa/prepare.py | grep -A 10 "Schema.LABEL"

Repository: NVIDIA-NeMo/Skills
Length of output: 389
🏁 Script executed:
python3 <<'EOF'
import re
# Test the proposed fix from the review comment
proposed_regex1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"
proposed_regex2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([AB])\)?(?=[^\w]|$)"
# Also test a simpler fix: use [^A-Ba-b]
simple_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-Ba-b]*([A-B])"
test_cases = [
"The answer is maybe A",
"The answer is probably B",
"The final answer is A",
"The best answer is A",
"Answer: B",
]
print("=== ORIGINAL REGEX (BROKEN) ===")
original_regex1 = re.compile(r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])")
for test in test_cases:
match = original_regex1.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
print("\n=== PROPOSED FIX (COMPLEX) ===")
proposed_pattern1 = re.compile(proposed_regex1)
for test in test_cases:
match = proposed_pattern1.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
print("\n=== SIMPLE FIX [^A-Ba-b] ===")
simple_pattern1 = re.compile(simple_regex1)
for test in test_cases:
match = simple_pattern1.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills
Length of output: 575
🏁 Script executed:
python3 <<'EOF'
import re
# Test alternative fixes
word_boundary_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[\s:=\-]*\b([A-B])\b"
word_boundary_regex2 = r"(?i)[Aa]nswer\s*:[\s]*\b([A-B])\b"
# Negative lookbehind/lookahead approach
negative_lb_regex1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
test_cases = [
"The answer is maybe A", # Invalid LLM output
"The answer is probably B", # Invalid LLM output
"The final answer is A", # Valid
"The best answer is A", # Valid
"The best answer is: A", # Valid
"Answer: B", # Valid
"The answer is B", # Valid
]
print("=== WORD BOUNDARY FIX ===")
pattern1 = re.compile(word_boundary_regex1)
pattern2 = re.compile(word_boundary_regex2)
for test in test_cases:
match = pattern1.search(test)
if not match:
match = pattern2.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
print("\n=== NEGATIVE LOOKBEHIND/LOOKAHEAD FIX ===")
pattern_lb = re.compile(negative_lb_regex1)
for test in test_cases:
match = pattern_lb.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills
Length of output: 531
🏁 Script executed:
python3 <<'EOF'
import re
# Test ANSWER_REGEX2 with the same approach
answer_regex2_original = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
answer_regex2_fixed = r"(?i)[Aa]nswer\s*:(?<![a-z])[^A-B]*(?<![a-z])([A-B])(?![a-z])"
test_cases = [
"answer: maybe A",
"answer: probably B",
"Answer: A",
"Answer: B",
]
print("=== ANSWER_REGEX2 ORIGINAL ===")
pattern = re.compile(answer_regex2_original)
for test in test_cases:
match = pattern.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
print("\n=== ANSWER_REGEX2 WITH NEGATIVE LOOKBEHIND/LOOKAHEAD ===")
pattern_fixed = re.compile(answer_regex2_fixed)
for test in test_cases:
match = pattern_fixed.search(test)
print(f"{test!r} -> {match.group(1) if match else None}")
EOF

Repository: NVIDIA-NeMo/Skills
Length of output: 322
Fix regex patterns to reject letters inside words like "maybe" and "probably".
With [^A-B]*([A-B]), the regex can capture the a/b inside words like maybe and probably, so a response whose real answer is A may be extracted as B (or vice versa). Use negative lookbehind/lookahead instead to ensure the captured letter is not part of a word:
🔧 Proposed fix

```diff
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
+ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"
```

Additionally, add bounds checking to digit_to_letter() to fail fast if passed values outside [0, 1]:
```python
def digit_to_letter(digit: int) -> str:
    if digit not in (0, 1):
        raise ValueError(f"digit must be 0 or 1, got {digit}")
    return chr(ord("A") + digit)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py` around lines 54 - 55,
Update ANSWER_REGEX1 and ANSWER_REGEX2 to use word-boundary or lookaround so
captured A/B are not part of larger words (e.g., replace the [^A-B]*([A-B])
postfix with a pattern that ensures the captured letter is not preceded or
followed by word characters, using constructs like
(?<![A-Za-z0-9_])([AB])(?![A-Za-z0-9_]) or \b([AB])\b so "maybe" or "probably"
won't match); then add bounds checking in digit_to_letter(digit: int) to raise
ValueError when digit is not 0 or 1 (e.g., if digit not in (0,1): raise
ValueError(...)) so invalid inputs fail fast.
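The lookaround variant of the patterns (from the proposed fix above; the exact regexes merged may differ) can be sanity-checked in isolation. This consolidates the verification scripts into one self-contained snippet:

```python
import re

# Lookaround-hardened patterns from the proposed fix; GREEDY_REGEX
# (not shown here) remains the fallback for looser model outputs.
ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*(?<![a-z])([A-B])(?![a-z])"
ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*(?<![a-z])([A-B])(?![a-z])"

def first_match(text: str):
    # Try the strict patterns in order; return the captured option letter.
    for pattern in (ANSWER_REGEX1, ANSWER_REGEX2):
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None

# Standalone option letters are still captured.
assert first_match("The final answer is A") == "A"
assert first_match("Answer: B") == "B"
# Letters buried inside "maybe"/"probably" no longer match; the greedy
# fallback would handle these instead of returning a bogus choice.
assert first_match("The answer is maybe A") is None
assert first_match("The answer is probably B") is None
print("all regex checks passed")
```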
♻️ Duplicate comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)
54-55: ⚠️ Potential issue | 🟠 Major

Tighten ANSWER_REGEX1/2 so they only capture standalone option letters.

[^A-B]*([A-B]) can still grab the a/b inside words like maybe or probably, so extract_answer() may return a choice before a real A/B token is reached. Make these two regexes strict and let GREEDY_REGEX remain the fallback for looser outputs.

🔧 Proposed fix

```diff
-ANSWER_REGEX1 = r"(?i)[Tt]he (?:[Bb]est [Aa]nswer|[Ff]inal [Aa]nswer|[Aa]nswer)[^A-B]*([A-B])"
-ANSWER_REGEX2 = r"(?i)[Aa]nswer\s*:[^A-B]*([A-B])"
+ANSWER_REGEX1 = r"(?i)\bthe (?:best answer|final answer|answer)\b(?:\s+is)?\s*[:\-]?\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
+ANSWER_REGEX2 = r"(?i)\banswer\s*:\s*(?:option\s*)?\(?([A-B])\)?(?=[^\w]|$)"
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py` around lines 54 - 55, The current ANSWER_REGEX1 and ANSWER_REGEX2 allow matching letters inside words (e.g., "maybe"), so update those two patterns used by extract_answer() to only capture standalone choice tokens A or B (use word boundaries or negative/positive lookarounds so the captured group is an isolated letter possibly followed/preceded by punctuation or whitespace, and keep GREEDY_REGEX as the fallback). Locate ANSWER_REGEX1, ANSWER_REGEX2 and the extract_answer() usage in global_piqa_utils.py and replace the loose [^A-B]*([A-B]) capture with a stricter pattern that enforces boundaries around the captured A/B while preserving case-insensitive matching.
🧹 Nitpick comments (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py (1)
22-23: Return Dataset directly and fix the annotation.

This helper advertises dict[str, list[dict]], but the datasets API returns a single Dataset when split is provided, and a DatasetDict only when it is omitted. Switching to load_dataset(..., split=split) makes the contract accurate and removes the extra indexing hop. (huggingface.co)

♻️ Proposed fix

```diff
-from datasets import get_dataset_config_names, load_dataset
+from datasets import Dataset, get_dataset_config_names, load_dataset
@@
-def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, list[dict]]:
-    return {lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] for lang in languages}
+def load_global_piqa_datasets(languages: list[str], split: str = "test") -> dict[str, Dataset]:
+    return {
+        lang: load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang, split=split)
+        for lang in languages
+    }
```

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py` around lines 22 - 23, The function load_global_piqa_datasets currently uses load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] and declares return type dict[str, list[dict]]; change it to call load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang, split=split) so each value is a datasets.Dataset, update the return type annotation to dict[str, Dataset], and add the necessary import for Dataset from the datasets package; keep the mapping behavior (one entry per language) so callers still get a Dataset per language without the extra indexing hop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py`:
- Around line 54-55: The current ANSWER_REGEX1 and ANSWER_REGEX2 allow matching
letters inside words (e.g., "maybe"), so update those two patterns used by
extract_answer() to only capture standalone choice tokens A or B (use word
boundaries or negative/positive lookarounds so the captured group is an isolated
letter possibly followed/preceded by punctuation or whitespace, and keep
GREEDY_REGEX as the fallback). Locate ANSWER_REGEX1, ANSWER_REGEX2 and the
extract_answer() usage in global_piqa_utils.py and replace the loose
[^A-B]*([A-B]) capture with a stricter pattern that enforces boundaries around
the captured A/B while preserving case-insensitive matching.
---
Nitpick comments:
In `@nemo_skills/dataset/global_piqa/global_piqa_utils.py`:
- Around line 22-23: The function load_global_piqa_datasets currently uses
load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang)[split] and declares
return type dict[str, list[dict]]; change it to call
load_dataset("mrlbenchmarks/global-piqa-nonparallel", lang, split=split) so each
value is a datasets.Dataset, update the return type annotation to dict[str,
Dataset], and add the necessary import for Dataset from the datasets package;
keep the mapping behavior (one entry per language) so callers still get a
Dataset per language without the extra indexing hop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6a6669c8-6bf0-4020-8d56-492fb5fc13cd
📒 Files selected for processing (1)
nemo_skills/dataset/global_piqa/global_piqa_utils.py
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/multilingual.md`:
- Around line 227-229: Replace the fenced markdown code blocks with indented
code blocks for the two command snippets (the one containing "ns prepare_data
mmmlu --languages <lang1> <lang2> ... --include_english" and the later snippet
around line 300) by removing the triple backticks and indenting each line of the
snippet with four spaces so the document matches the repository's MD046 style;
update both occurrences (the mmmlu prepare_data snippet and the other snippet
referenced) to use the indented format.
- Line 223: Replace the generic link text "here" with descriptive link text for
both instances in docs/evaluation/multilingual.md (lines referencing the
original benchmark and the other occurrence) to satisfy MD059; for example
change "[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU
dataset on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and
make a similarly descriptive replacement for the second occurrence) so the links
convey their destination and purpose.
- Around line 313-325: The benchmark flag in the command block uses the wrong
value; replace the --benchmarks mmmlu entry in the shown CLI snippet with the
correct benchmark name for Global PIQA (e.g., --benchmarks global_piqa) so the
invocation runs the intended Global PIQA benchmark; update the relevant fenced
code block where the CLI options are defined (the line containing "--benchmarks
mmmlu") to use the correct benchmark name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d68cb002-64ab-4c44-87ce-e3b3a2bf9e59
📒 Files selected for processing (1)
docs/evaluation/multilingual.md
### mmmlu

- Benchmark is defined in [`nemo_skills/dataset/mmmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmmlu/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
Use descriptive link text instead of “here”.
Line 223 and Line 296 use generic link text, which triggers MD059 and is less accessible in rendered docs.
Suggested fix

```diff
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/MMMLU).
+- Original benchmark source is [OpenAI MMMLU dataset](https://huggingface.co/datasets/openai/MMMLU).
```

```diff
-- Original benchmark source is [here](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).
+- Original benchmark source is [Global PIQA nonparallel dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel).
```

Also applies to: 296-296
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 223-223: Link text should be descriptive
(MD059, descriptive-link-text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/multilingual.md` at line 223, Replace the generic link text
"here" with descriptive link text for both instances in
docs/evaluation/multilingual.md (lines referencing the original benchmark and
the other occurrence) to satisfy MD059; for example change
"[here](https://huggingface.co/datasets/openai/MMMLU)" to "[OpenAI MMMLU dataset
on Hugging Face](https://huggingface.co/datasets/openai/MMMLU)" (and make a
similarly descriptive replacement for the second occurrence) so the links convey
their destination and purpose.
```bash
ns prepare_data mmmlu --languages <lang1> <lang2> ... --include_english
```
Match repository markdown code-block style (MD046).
Line 227 and Line 300 use fenced blocks, but lint expects indented blocks in this doc style. Please convert these two snippets to the expected indentation format.
Also applies to: 300-302
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 227-227: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/multilingual.md` around lines 227 - 229, Replace the fenced
markdown code blocks with indented code blocks for the two command snippets (the
one containing "ns prepare_data mmmlu --languages <lang1> <lang2> ...
--include_english" and the later snippet around line 300) by removing the triple
backticks and indenting each line of the snippet with four spaces so the
document matches the repository's MD046 style; update both occurrences (the
mmmlu prepare_data snippet and the other snippet referenced) to use the indented
format.
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: naymaraq <dkaramyan@nvidia.com>
commit f5c0c53
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date: Mon Mar 16 16:45:33 2026 +0400

    Add Global PIQA benchmark (#1299)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
Add Global PIQA Benchmark
Summary
- Adds `prepare.py` to download and format the dataset from HuggingFace.
- Languages to prepare can be selected via the `--languages` flag.
- Prompt formatting follows `lm-evaluation-harness`.

Summary by CodeRabbit
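The per-entry formatting step that `prepare.py` performs can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the field names `goal`/`sol1`/`sol2`/`label` follow the original PIQA schema, and the helper name and output keys here are hypothetical.

```python
import json

# Hypothetical sketch: turn one PIQA-style entry into an MCQ record,
# similar to what prepare.py writes into test.jsonl (one JSON object per line).
def format_entry(entry, language):
    letters = ["A", "B"]
    options = [entry["sol1"], entry["sol2"]]
    # Build the multiple-choice prompt: goal followed by lettered options.
    question = entry["goal"] + "\n" + "\n".join(
        f"{letter}. {option}" for letter, option in zip(letters, options)
    )
    return {
        "question": question,
        "expected_answer": letters[entry["label"]],
        "language": language,
    }

entry = {
    "goal": "To dry fresh herbs quickly,",
    "sol1": "microwave them between paper towels.",
    "sol2": "soak them in cold water.",
    "label": 0,
}
record = format_entry(entry, "eng_Latn")
print(json.dumps(record, ensure_ascii=False))
```

In the real pipeline the entries come from `load_global_piqa_datasets(...)` per language, and one such record is appended per line to each language's `test.jsonl`.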