Conversation
CPU test failures are not related to this PR.
Greptile Summary

This PR adds support for the Numb3rs speech dataset for text normalization (TN) and inverse text normalization (ITN) evaluation. The changes enable evaluating ASR models against multiple reference fields (written vs. spoken forms) and reporting per-reference WER metrics.

Key Changes:
Implementation:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HuggingFace as HF Dataset
    participant Generate as generate.py
    participant AudioEval as audio.py
    participant Metrics as audio_metrics.py
    User->>PrepareScript: ns prepare_data numb3rs
    PrepareScript->>HuggingFace: load_dataset("NNstuff/Numb3rs")
    HuggingFace-->>PrepareScript: Dataset with audio samples
    PrepareScript->>PrepareScript: Format entries with text_tn/text_itn
    PrepareScript->>PrepareScript: Save audio files as FLAC
    PrepareScript->>PrepareScript: Create test.jsonl with messages
    User->>Generate: ns generate ++prompt_field=prompt_neutral
    Generate->>Generate: fill_prompt() replaces <PLACEHOLDER>
    Generate->>Generate: LLM generates transcription
    Generate->>AudioEval: evaluate_sample() with reference_fields
    AudioEval->>AudioEval: evaluate_asr(expected_answer, generation)
    loop For each reference field
        AudioEval->>AudioEval: evaluate_asr(sample[ref_field], generation)
        AudioEval->>AudioEval: Store as wer_tn, wer_itn
    end
    AudioEval-->>Generate: Evaluation results with multiple WERs
    Generate->>Metrics: update() with predictions
    Metrics->>Metrics: Collect dynamic_wer_scores
    Metrics->>Metrics: Aggregate wer_tn, wer_itn
    Metrics-->>User: Final metrics report
```
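The per-reference WER flow shown in the diagram can be sketched in isolation. This is a minimal illustration, not the actual nemo_skills API: `word_error_rate` and `evaluate_per_reference` are hypothetical names, and the real `evaluate_asr` additionally applies text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


def evaluate_per_reference(sample: dict, generation: str, reference_fields: list[str]) -> dict:
    """Compute one WER per reference field, stored as wer_<suffix>."""
    results = {}
    for ref_field in reference_fields:
        suffix = ref_field.removeprefix("text_")  # e.g. "text_tn" -> "tn"
        results[f"wer_{suffix}"] = word_error_rate(sample[ref_field], generation)
    return results
```

With `reference_fields=["text_tn", "text_itn"]`, a generation in the spoken form scores well against `text_tn` and poorly against `text_itn`, which is exactly the signal this benchmark is after.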
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough

Adds a new Numb3rs dataset integration (configs + prepare script) and extends audio evaluation/metrics to support a "no_tn_itn" normalization mode and per-reference WER (e.g., TN/ITN) with reference_fields-driven aggregation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant HF as HF Dataset
    participant Formatter as Prepare Script
    participant AudioFS as Audio Storage
    participant Evaluator as Audio Evaluator
    participant Metrics as Metrics Aggregator
    participant Writer as JSONL Writer
    HF->>Formatter: load test split & sample
    Formatter->>AudioFS: save audio file (if with_audio)
    AudioFS-->>Formatter: audio_filepath, duration, metadata
    Formatter->>Writer: emit variants (neutral, tn, itn) with dual references
    Writer-->>Formatter: write confirmation
    Formatter->>Evaluator: submit sample + model generation
    Evaluator->>Evaluator: preprocess (no_tn_itn or other normalization)
    Evaluator->>Evaluator: compute WER per reference field (wer_tn, wer_itn)
    Evaluator->>Metrics: push per-field WER scores
    Metrics->>Writer: aggregate and output final metrics report
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 535-544: The reference-field WER calculations call evaluate_asr
without the normalization_mode, causing inconsistent WERs versus the main call;
update the loop that handles config.reference_fields to pass
normalization_mode=mode into evaluate_asr (i.e., call
evaluate_asr(sample[ref_field], generation, normalization_mode=mode)), keeping
the same metric naming logic that derives metric_suffix and updates
wer_{metric_suffix} and is_correct_{metric_suffix}.
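The fix described by the agent prompt can be sketched as follows. This is a standalone illustration, not the actual `audio.py` code: the wrapper name `apply_reference_field_wers` is hypothetical, and `evaluate_asr`'s signature is assumed from the prompt (it accepts a `normalization_mode` keyword and returns a dict with `wer` and `is_correct`).

```python
def apply_reference_field_wers(sample, generation, config, mode, evaluate_asr):
    """Forward normalization_mode to every per-reference WER call."""
    for ref_field in config.reference_fields:
        # e.g. "text_tn" -> "tn", so results land in wer_tn / is_correct_tn
        metric_suffix = ref_field.removeprefix("text_")
        result = evaluate_asr(sample[ref_field], generation, normalization_mode=mode)
        sample[f"wer_{metric_suffix}"] = result["wer"]
        sample[f"is_correct_{metric_suffix}"] = result["is_correct"]
    return sample
```

The point of the fix is simply that every per-reference call now uses the same `normalization_mode` as the main `evaluate_asr` call, so the WERs are comparable.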
🧹 Nitpick comments (4)
nemo_skills/inference/generate.py (1)
576-583: In-place mutation of `data_point["messages"]` may cause side effects.

The code mutates `message["content"]` directly within `data_point["messages"]`. If the same `data_point` is reused elsewhere (e.g., for retries or logging), the placeholder will already be replaced. Consider whether a deep copy is needed here, similar to how `deepcopy(data_point)` is used at line 600 for the non-openai path.

♻️ Suggested fix using deepcopy for safety
```diff
 if self.cfg.prompt_format == "openai":
     # Replace placeholder content with prompt from specified field
     if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
         prompt_value = data_point[self.cfg.prompt_field]
+        # Work on a copy to avoid mutating original data_point
+        data_point = deepcopy(data_point)
         # Find and replace placeholder in messages
         for message in data_point["messages"]:
             if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                 message["content"] = prompt_value
                 break
```

nemo_skills/dataset/numb3rs/prepare.py (3)
82-82: Hardcoded container path may need documentation.

The path `/dataset/numb3rs/data/` is hardcoded, which assumes a specific container/deployment structure. Consider adding a comment explaining when/how this path is used, or making it configurable.
187-192: Consider more specific exception handling for dataset loading.

While catching a broad `Exception` is pragmatic for HuggingFace dataset loading (which can fail in various ways), logging the exception type could help with debugging.

♻️ Optional: More informative error logging

```diff
 try:
     dataset = load_dataset("NNstuff/Numb3rs", split="train", trust_remote_code=True)
     print(f"Loaded {len(dataset)} total samples")
 except Exception as e:
-    print(f"Error loading dataset: {e}")
+    print(f"Error loading dataset ({type(e).__name__}): {e}")
     return
```
218-232: Potential inclusion of stale `test.jsonl` from previous runs.

If `test.jsonl` exists from a previous run and is not deleted before the glob, it could be included in `all_jsonl_files`. While the filter at line 223 removes it, this relies on the exact filename match. Consider explicitly deleting the combined file before processing, or using a more robust approach.

♻️ Suggested fix to ensure clean state

```diff
 # Combine all category JSONLs into test.jsonl
 combined_file = output_dir / "test.jsonl"
+if combined_file.exists():
+    combined_file.unlink()
 print(f"\nCreating combined file: {combined_file}")
 all_jsonl_files = sorted(output_dir.glob("*.jsonl"))
 category_files = [f for f in all_jsonl_files if f.name != "test.jsonl"]
```
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 55-70: The code currently computes duration = len(audio_array) /
sampling_rate without validating sampling_rate; add a guard in the prepare logic
(around where audio_info, audio_array and sampling_rate are read) to ensure
sampling_rate is a positive non-zero number (e.g., check
isinstance(sampling_rate, (int, float)) and sampling_rate > 0) and return None
for malformed entries if not, before computing duration and comparing to
MIN_AUDIO_DURATION.
In `@nemo_skills/inference/generate.py`:
- Around line 575-584: The placeholder-substitution block under the if
self.cfg.prompt_format == "openai": branch must fail fast when substitution
cannot be applied: if self.cfg.prompt_field is set but that key is missing in
data_point, or if data_point["messages"] contains no message with role "user"
and content "<PLACEHOLDER>", raise a clear exception (or call the existing error
handling) instead of silently continuing; update the logic around
self.cfg.prompt_field/data_point[self.cfg.prompt_field] and the loop over
data_point["messages"] to detect these two failure modes and surface a
descriptive error that includes the missing field name or the fact that
"<PLACEHOLDER>" was not found.
🧹 Nitpick comments (2)
nemo_skills/inference/generate.py (1)
221-224: Validate `prompt_field` usage for non-OpenAI prompts.

`prompt_field` is only used in the OpenAI path; when `prompt_format != "openai"` it is silently ignored. Consider validating this in `_post_init_validate_params` (e.g., require `prompt_field is None` unless `prompt_format == "openai"`) to prevent accidental misconfiguration.

nemo_skills/dataset/numb3rs/prepare.py (1)
186-192: Avoid swallowing unexpected dataset load errors.

Catching all exceptions can mask real problems. Consider narrowing the exception type or re-raising after logging, so unexpected failures aren't silently ignored.
```python
# Get audio info
audio_info = entry.get("audio", {})
if not isinstance(audio_info, dict) or "array" not in audio_info or "sampling_rate" not in audio_info:
    return None

audio_array = audio_info["array"]
sampling_rate = audio_info["sampling_rate"]

# Skip if audio array is empty or invalid
if audio_array is None or len(audio_array) == 0:
    return None

duration = len(audio_array) / sampling_rate

if duration < MIN_AUDIO_DURATION:
    return None
```
Guard against invalid `sampling_rate`.

If `sampling_rate` is 0 or None, this will throw a division error. A small guard avoids hard failures on malformed entries.
🐛 Proposed guard

```diff
-    duration = len(audio_array) / sampling_rate
+    if not sampling_rate or sampling_rate <= 0:
+        return None
+    duration = len(audio_array) / sampling_rate
```

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/numb3rs/prepare.py` around lines 55 - 70, The code
currently computes duration = len(audio_array) / sampling_rate without
validating sampling_rate; add a guard in the prepare logic (around where
audio_info, audio_array and sampling_rate are read) to ensure sampling_rate is a
positive non-zero number (e.g., check isinstance(sampling_rate, (int, float))
and sampling_rate > 0) and return None for malformed entries if not, before
computing duration and comparing to MIN_AUDIO_DURATION.
nemo_skills/inference/generate.py
Outdated
```python
if self.cfg.prompt_format == "openai":
    # Replace placeholder content with prompt from specified field
    if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
        prompt_value = data_point[self.cfg.prompt_field]
        # Find and replace placeholder in messages
        for message in data_point["messages"]:
            if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                message["content"] = prompt_value
                break
```
Fail fast when placeholder substitution can’t be applied.
If `prompt_field` is set but the field is missing, or no `<PLACEHOLDER>` is found, the current logic silently leaves the placeholder in the prompt. This can waste runs and produce misleading results; a clear error (or at least a warning) is safer.
💡 Proposed fail-fast guard

```diff
 if self.cfg.prompt_format == "openai":
     # Replace placeholder content with prompt from specified field
-    if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
-        prompt_value = data_point[self.cfg.prompt_field]
-        # Find and replace placeholder in messages
-        for message in data_point["messages"]:
-            if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
-                message["content"] = prompt_value
-                break
+    if self.cfg.prompt_field:
+        if self.cfg.prompt_field not in data_point:
+            raise KeyError(f"prompt_field '{self.cfg.prompt_field}' not found in data point")
+        prompt_value = data_point[self.cfg.prompt_field]
+        replaced = False
+        # Find and replace placeholder in messages
+        for message in data_point.get("messages", []):
+            if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
+                message["content"] = prompt_value
+                replaced = True
+                break
+        if not replaced:
+            raise ValueError("No '<PLACEHOLDER>' found in user messages for prompt_field replacement")
```
+ raise ValueError("No '<PLACEHOLDER>' found in user messages for prompt_field replacement")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self.cfg.prompt_format == "openai":
    # Replace placeholder content with prompt from specified field
    if self.cfg.prompt_field:
        if self.cfg.prompt_field not in data_point:
            raise KeyError(f"prompt_field '{self.cfg.prompt_field}' not found in data point")
        prompt_value = data_point[self.cfg.prompt_field]
        replaced = False
        # Find and replace placeholder in messages
        for message in data_point.get("messages", []):
            if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                message["content"] = prompt_value
                replaced = True
                break
        if not replaced:
            raise ValueError("No '<PLACEHOLDER>' found in user messages for prompt_field replacement")
```
🤖 Prompt for AI Agents
In `@nemo_skills/inference/generate.py` around lines 575 - 584, The
placeholder-substitution block under the if self.cfg.prompt_format == "openai":
branch must fail fast when substitution cannot be applied: if
self.cfg.prompt_field is set but that key is missing in data_point, or if
data_point["messages"] contains no message with role "user" and content
"<PLACEHOLDER>", raise a clear exception (or call the existing error handling)
instead of silently continuing; update the logic around
self.cfg.prompt_field/data_point[self.cfg.prompt_field] and the loop over
data_point["messages"] to detect these two failure modes and surface a
descriptive error that includes the missing field name or the fact that
"<PLACEHOLDER>" was not found.
update source of data, add special normalization mode

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Force-pushed from 14f1c57 to 676254d
Additional Comments (1)
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Additional Comments (2)
Additional Comments (1)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Line 301: The print statement uses an unnecessary f-string prefix; change the
call print(f"\nCreating combined test files for each variant...") to a regular
string print("\nCreating combined test files for each variant...") by removing
the leading "f" so there are no unused formatting placeholders (locate this
statement in nemo_skills/dataset/numb3rs/prepare.py where the combined test
files are created).
- Around line 269-274: Remove the broad try/except around the load_dataset call
so failures propagate instead of being swallowed; specifically delete the except
block that prints "Error loading dataset" and the early return, and let the call
to load_dataset("nvidia/Numb3rs", split="test", trust_remote_code=True) (and the
subsequent print of len(dataset)) raise its errors normally; keep the dataset
variable assignment and the success print as-is so callers see clear failures
from load_dataset.
- Around line 86-108: Replace lax .get() usage with direct dict access on the
expected Numb3rs fields to fail fast: access entry["original_text"],
entry["text"], and entry["file_name"] (instead of entry.get(...)) and let
KeyError surface for malformed entries; continue to strip the values and return
None if original_text or text are empty. Do not default duration to 1.0 — use
entry["duration"] and validate it exists, then compare against
MIN_AUDIO_DURATION and return None if too short. Preserve the existing
sample_id/audio_filename logic but read file_name from entry["file_name"] and
keep the .stem/.name handling and ".json" trimming as before so
malformed/missing keys raise clear errors.
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 37-42: Remove the duplicate declaration of the dataclass field
apply_whisper_normalization (it appears twice surrounding normalization_mode and
reference_fields); keep a single declaration with the intended default value
(True) and eliminate the shadowing duplicate so the dataclass has only one
apply_whisper_normalization field.
🧹 Nitpick comments (2)
nemo_skills/dataset/numb3rs/prepare.py (2)
110-123: Silent default for `sampling_rate` may hide data issues.

Line 117 defaults `sampling_rate` to 16000 if missing. If the audio metadata is expected to contain this field, a missing value likely indicates a malformed entry that should be flagged rather than silently assumed.

♻️ Suggested fix

```diff
-    sampling_rate = audio_info.get("sampling_rate", 16000)
+    sampling_rate = audio_info["sampling_rate"]
```

As per coding guidelines, "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors".
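A minimal illustration of that guideline, using toy data rather than the actual prepare.py entries:

```python
entry = {"text": "one hundred", "file_name": "sample_001.flac"}

# .get() silently yields None; the failure surfaces later, far from the cause
missing = entry.get("original_text")
print(missing)  # None, no error until the value is actually used

# Direct access fails fast with a clear KeyError naming the missing field
try:
    original = entry["original_text"]
except KeyError as exc:
    print(f"malformed entry, missing key: {exc}")
```

The `KeyError` points straight at the malformed entry, whereas the `None` from `.get()` typically surfaces as a confusing `TypeError`/`AttributeError` several calls downstream.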
64-72: Add type hints to function signatures.

Functions lack type hints for their parameters and return types. This applies to `build_messages_with_prompt`, `save_audio_and_format_entry`, `prepare_category`, and `main`.

Example for `save_audio_and_format_entry`:

```python
def save_audio_and_format_entry(
    entry: dict,
    category: str,
    audio_dir: Path,
    sample_idx: int,
    with_audio: bool = True,
    audio_prefix: str = "/data/numb3rs",
) -> dict | None:
```

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Also applies to: 75-75, 152-152, 233-233
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Additional Comments (1)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)
321-328: ⚠️ Potential issue | 🟡 Minor

Docstring doesn't list `"no_tn_itn"` as a valid `normalization_mode`.

Line 327 only mentions `"standard"`, `"audiobench"`, `"hf_leaderboard"`, and `"none"`. The new `"no_tn_itn"` mode should be documented here as well for consistency with `preprocess_asr_text`.

📝 Proposed fix

```diff
 Args:
     reference: Ground truth transcription.
     hypothesis: Model output transcription.
-    normalization_mode: "standard", "audiobench", "hf_leaderboard", or "none".
+    normalization_mode: "standard", "audiobench", "hf_leaderboard", "none", or "no_tn_itn".
 """
```
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Kipok
left a comment
please add this dataset to docs with some example commands / reference numbers https://github.com/NVIDIA-NeMo/Skills/blob/main/CONTRIBUTING.md#when-adding-new-benchmarks
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
done, no ref numbers for now
Kipok
left a comment
since the changes are fully local to audio logic / the new benchmark, I'm ok with merging this if you believe the code is good and if you tested it well. It seems like there is a breaking change in hf_leaderboard normalization, make sure it's intentional. Please have someone from your team review and approve this
also, I highly recommend running /review in Codex or Claude Code - I think they have some helpful comments there. E.g., here is the output from Claude:
Overview

This PR adds the Numb3rs benchmark, a speech dataset for evaluating text normalization (TN) and inverse text normalization (ITN). It includes:

- New dataset definition (__init__.py) and preparation script (prepare.py)
- A new no_tn_itn normalization mode in the audio evaluator
- Multi-reference WER evaluation via reference_fields
- Improvements to the strip_helpful_prefixes function (colon-quote and contraction handling)
- New wer_tn / wer_itn metrics in AudioMetrics
- Documentation for the new benchmark
Critical Issue: Breaking change to hf_leaderboard normalization mode

audio.py:288-298: The hf_leaderboard mode's dedicated code path was removed. Previously it did:

```python
# OLD (main branch)
if mode == "hf_leaderboard":
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

Now it falls through to whisper normalization, which behaves very differently (e.g., converts number words to digits, has its own English-specific transformations). This silently breaks asr-leaderboard, which uses `++eval_config.normalization_mode=hf_leaderboard`. The new no_tn_itn mode is basically what hf_leaderboard used to do (minus unicode NFC normalization).

This should be fixed: either restore the hf_leaderboard path, or, if the change was intentional, make it a separate commit with a clear rationale and update the asr-leaderboard dataset accordingly.
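For reference, the relationship between the two modes can be captured in a self-contained sketch. The `hf_leaderboard` branch reproduces the old main-branch snippet quoted above; the `no_tn_itn` behavior is inferred from the review's description ("the same minus NFC"), not taken from the actual audio.py:

```python
import re
import unicodedata


def normalize_text(text: str, mode: str) -> str:
    """Sketch of the two normalization modes discussed in the review."""
    if mode == "hf_leaderboard":
        # old main-branch behavior adds unicode NFC normalization first
        text = unicodedata.normalize("NFC", text)
    elif mode != "no_tn_itn":
        raise ValueError(f"unsupported mode: {mode}")
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # strip punctuation, keep word chars and spaces
    return re.sub(r"\s+", " ", text).strip()
```

Crucially, neither mode rewrites number words to digits, which is exactly what the whisper normalizer does and why falling through to it changes hf_leaderboard scores.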
Other Issues

1. audio.py:38 — Type annotation style: `list[str] | None` uses PEP 604 union syntax. The rest of the codebase (and CONTRIBUTING.md) says to avoid "complicated types" and use simple types. Consider checking whether this project targets Python 3.9 compatibility, where `list[str] | None` would fail at runtime unless `from __future__ import annotations` is used.

2. audio_metrics.py — Hardcoded wer_tn / wer_itn field names: The evaluator's reference_fields feature is generic (any field name), but the metrics collector only looks for the hardcoded wer_tn and wer_itn keys. If someone uses `reference_fields=['text_written', 'text_spoken']`, the metrics would be wer_written and wer_spoken, but AudioMetrics would silently ignore them. Either make the metrics collection dynamic (iterate over all wer_* keys in predictions), or document that only text_tn/text_itn field names are supported.
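One way to make the collection dynamic, as point 2 suggests, is to aggregate every `wer_*` key found in the predictions. This is a sketch; the real AudioMetrics `update`/aggregation interface may differ:

```python
from collections import defaultdict


def aggregate_dynamic_wers(predictions: list[dict]) -> dict:
    """Average every 'wer_*' key across predictions instead of hardcoding wer_tn/wer_itn."""
    sums: dict = defaultdict(float)
    counts: dict = defaultdict(int)
    for pred in predictions:
        for key, value in pred.items():
            if key.startswith("wer_"):
                sums[key] += value
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

With this shape, `reference_fields=['text_written', 'text_spoken']` would automatically produce `wer_written` and `wer_spoken` aggregates with no metrics-side changes.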
3. prepare.py:95 — audio_metadata key leaks into the entry dict: The audio_metadata key is added to the formatted entry and then del'd in the loop (line 180). This is a fragile pattern: if any code path writes the base entry before messages are added, audio_metadata would leak into the JSONL. Consider building messages inline instead of storing intermediate state.

4. prepare.py:80 — sample_id field: The sample_id is derived from the filename stem, but other audio datasets use a simple integer index for sample_id. This inconsistency could cause issues with deduplication or ordering assumptions downstream.
Minor Suggestions

5. prepare.py:55 — SYSTEM_MESSAGE with /no_think: The /no_think suffix in the system message is model-specific behavior. Other audio datasets in the repo use the same pattern, so this is consistent; just noting it.

6. prepare.py:232-237 — Category validation: Unknown categories print a warning and are silently skipped. Per the project's "don't be overly defensive" / "fail fast" style, this should probably raise an error instead.

7. strip_helpful_prefixes changes (audio.py:74-89): The rewrite to handle contractions like o'clock is a nice improvement. However, the non-greedy `'(.+?)'` pattern could still be too eager for some edge cases. The change is also not specific to numb3rs; it affects all audio evaluation. Worth testing against existing benchmarks.

8. Documentation duplication: The Numb3rs section appears twice in speech-audio.md — once as a brief summary at the top (lines ~35-48), and again as a full section at the bottom (lines ~419+). This is consistent with how other benchmarks are documented in that file, so it's fine.

Summary

The main concern is the silent breaking change to hf_leaderboard normalization; this should be addressed before merge. The rest of the implementation follows project conventions well and adds a clean, well-documented benchmark.
docs/evaluation/speech-audio.md
Outdated

```markdown
- Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py)
- Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)

### Numb3rs
```

I think there is something wrong with headers and placement of docs in this file (not just for numb3rs). We should probably only have one header for each benchmark, with all information in subheaders.
```python
from datasets import load_dataset
from tqdm import tqdm

SYSTEM_MESSAGE = "You are a helpful assistant. /no_think"
```

do we need this? Especially the /no_think part, which is Qwen-specific

I'd suggest not setting a system message; users can always add one via the explicit ++system_message parameter
The hf_leaderboard change is intentional.
…ulation, refactored normalization process Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@melllinia plz retry exps
@Jorjeous After the current changes the scores are back to normal. Approved!
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
```text
a5da597 Revert "Eval kit support (#1239)" (#1294) (Igor Gitman)
b237e33 Eval kit support (#1239) (George)
dc28bbf Python direct tool calling without MCP (#1286) (George Armstrong)
12454dd Allow het servers for nemo-rl jobs (#1223) (Sadegh Mahdavi)
8884a68 Support source_lang param for translation recipe (#1290) (Prasoon Varshney)
4618b19 Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285) (Meriem B.)
5ac8609 Add SPEED-Bench (within repo) (#1279) (Talor Abramovich)
c31eec5 Fix os.getlogin() crash in ns setup (#1289) (George Armstrong)
c228e66 Fix streaming TypeError when delta.content is None (#1267) (#1288) (George Armstrong)
aa47923 Add LibTrace recipe for generating domain-specific reasoning data (#1224) (Matvei Novikov)
313cad7 fix: clean parse-failure retries in prover (#1284) (Stephen Ge)
813cfa3 tst: rollback inference-api to integrate (#1287) (George Armstrong)
31735f9 Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250) (Valentin Mendelev)
d4ef8c0 Update promt_config to working with openai format + inline setup (#1210) (George)
e879cbc Update noc tutorial (#1282) (George Armstrong)
f6e3505 Add noc reasoning tutorial (#1278) (George Armstrong)
fc2072a CritPt generation add prompt_format=None (#1280) (Jiacheng Xu)
c8abe5d New slurm customization parameters (account, containers) (#1209) (Igor Gitman)
2b38cce Add nemo-skills-core subpackage for lightweight installs (#1229) (George Armstrong)
9fa8e83 feat: add custom judge type support for external repo integration (#1274) (Dheeraj Peri)
8a32b13 Exclude numb3rs form test_eval.py (#1275) (Igor Gitman)
6da2219 Numb3rs ds addition (#1174) (George)
ad034b5 Add DSBench-DA evaluation (#1254) (Suriya Gunasekar)
7593ab3 Add CritPt benchmark (#1200) (Jiacheng Xu)
58c31b2 Fix no_answer metric overcounting in _compute_pass_at_k (#1245) (Suriya Gunasekar)
1f1a2e7 Fix incorrect prompt tokens count due to HF api update (#1264) (Igor Gitman)
8ebc6f5 Remove deprecated dataset group (#1263) (Igor Gitman)
ea4177f fix deps (#1258) (Yongqiang Wang)
60905a7 Add aime26 (#1256) (Minho Ryu)
b28afc5 Rename custom -> external benchmarks (#1262) (Igor Gitman)
6cc9c45 Add reference to internal benchmarks repo (#1261) (Igor Gitman)
5202af6 Remove incorrect presence-penalty setting (#1259) (Igor Gitman)
144c70b Adding an option to store benchmarks in external repo (#1240) (Igor Gitman)
10e6e39 update vllm miltimodal for api calls convenience (#1213) (George)
1ba4219 Fix --server_container not being applied to dependent jobs (#1244) (Nick Ludwig)
9517614 Support mini-swe-agent as agent harness (#1212) (Wasi Ahmad)
```
<chtruong@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> commit a3d44dc Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Fri Feb 13 22:32:15 2026 -0800 Add --installation_command support to prepare_data (#1243) Signed-off-by: suriya <sgunasekar@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> commit e80d524 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 12 17:26:00 2026 -0800 Fix CI disk space for Docker image builds (#1241) Signed-off-by: George Armstrong <georgea@nvidia.com> commit d22236c Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Wed Feb 11 17:55:00 2026 -0800 Fix answerbench prompt parsing (#1235) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 2401628 Author: George Armstrong <georgea@nvidia.com> Date: Wed Feb 11 14:56:43 2026 -0800 feat: add lockfiles for 
reproducible sandbox builds (#1233) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5a0a84d Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Wed Feb 11 13:30:03 2026 -0800 removing datasets version restriction for LCB eval (#1230) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit ef0a890 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Wed Feb 11 12:03:16 2026 +0400 Gnalbandyan/add physics (#1214) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> commit bd9d30c Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Tue Feb 10 15:13:27 2026 -0800 LCB generic prompting (#1215) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 7d6c49a Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Sat Feb 7 08:45:46 2026 -0800 Add support for different variations of nemo-rl (#1220) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit b19ba96 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 6 21:40:56 2026 -0800 Add multi-node sandbox support for SLURM clusters (#1218) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 8950bb0 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Sat Feb 7 01:38:00 2026 +0100 support structured outputs in hle judge for optional AA compatibility (#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b84f7a2 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 6 14:51:02 2026 -0800 A small update on running tests docs (#1219) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8e838e1 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 5 18:01:35 2026 -0800 feat: add flag to disable sandbox replay (#1217) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5fd9085 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 5 
15:57:01 2026 -0800 Add an option to limit number of tool calls (#1216) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit d820200 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 3 10:43:55 2026 -0800 Add arena-hard v2 (#1205) Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> commit a30920e Author: Igor Gitman <igitman@nvidia.com> Date: Mon Feb 2 10:53:55 2026 -0800 Fix mkdocs warnings (#1204) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 19d7788 Author: Ivan <imoshkov@nvidia.com> Date: Mon Feb 2 23:25:13 2026 +0500 Fix infinite wait in sandbox.wait_for_sandbox (#1206) Signed-off-by: i-vainn <imoshkov@nvidia.com> commit 3e65fbf Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Fri Jan 30 19:38:38 2026 -0800 Improve tts (#1203) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 250c862 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Fri Jan 30 22:12:29 2026 +0400 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> commit 7ded756 Author: Ivan <imoshkov@nvidia.com> Date: Fri Jan 30 09:57:41 2026 +0500 Add proper token counting to code execution model (#1184) Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b986304 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Jan 29 17:57:07 2026 -0800 Upgrade containers (#1198) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 3b44f02 Author: Dan Lord <blahblahasdf@gmail.com> Date: Thu Jan 29 16:40:47 2026 -0800 Fix incorrect string format (#1199) Signed-off-by: dlord <dlord@nvidia.com> commit c4854b8 Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Thu Jan 29 13:43:36 2026 -0800 Update nemo-rl to latest (#1087) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor 
Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Summary by CodeRabbit

- New Features
- Enhancements
- Documentation