
Numb3rs ds addition #1174

Merged

Jorjeous merged 13 commits into main from add_numb3rs_dataset on Feb 23, 2026
Conversation

@Jorjeous Jorjeous commented Jan 19, 2026

Summary by CodeRabbit

  • New Features

    • Added Numb3rs dataset support with automated audio preparation and neutral/tn/itn prompt variants.
    • Audio evaluation now supports multi-reference/per-field scoring and reports per-reference WER metrics.
  • Enhancements

    • New normalization option ("no_tn_itn") for evaluation preprocessing to preserve numeric forms during scoring.
    • Aggregation and display of TN/ITN WER metrics alongside existing metrics.
  • Documentation

    • Usage guidance and dataset description for Numb3rs preparation and evaluation workflows.
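The "no_tn_itn" normalization mode summarized above can be pictured as a cleaner that lowercases and drops punctuation for scoring while leaving numeric forms untouched, so "$25" and "twenty five dollars" stay distinguishable in the WER. This is a hypothetical standalone sketch, not the repo's preprocess_asr_text; the function name and the set of kept symbols are assumptions:

```python
def normalize_no_tn_itn(text: str) -> str:
    """Lowercase and drop punctuation for scoring, but keep digits and
    number-adjacent symbols untouched so TN/ITN differences
    ("$25" vs "twenty five dollars") remain visible in the WER."""
    kept_symbols = set("$%.,:/-")  # assumed set; preserves numeric forms
    text = text.lower()
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace() or ch in kept_symbols)
    # Collapse runs of whitespace left behind by removed characters
    return " ".join(text.split())

print(normalize_no_tn_itn("The total is $25.50!"))  # -> "the total is $25.50"
print(normalize_no_tn_itn("twenty five dollars"))   # -> "twenty five dollars"
```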

@Jorjeous (Member Author)

CPU test failures are not related to this PR.

@Jorjeous Jorjeous marked this pull request as ready for review January 22, 2026 15:41
@Jorjeous Jorjeous changed the title from "Draft Numb3rs ds addition" to "Numb3rs ds addition" Jan 22, 2026

greptile-apps bot commented Jan 22, 2026

Greptile Overview

Greptile Summary

This PR adds support for the Numb3rs speech dataset for text normalization (TN) and inverse text normalization (ITN) evaluation. The changes enable evaluating ASR models against multiple reference fields (written vs spoken forms) and reporting per-reference WER metrics.

Key Changes:

  • Added Numb3rs dataset configuration and preparation tooling with audio export and per-category JSONL outputs
  • Enhanced audio evaluator to compute WER against multiple reference fields specified via reference_fields config
  • Extended metrics to dynamically aggregate and report WER variants (e.g., wer_tn, wer_itn)
  • Added prompt_field parameter to enable runtime prompt substitution in OpenAI-format messages

Implementation:

  • Dataset preparation loads from HuggingFace, formats entries with dual references (text_tn/text_itn), and creates placeholder-based messages
  • Audio evaluation loops through specified reference fields and computes per-reference WER metrics with proper naming
  • Metrics tracking collects dynamic WER scores and aggregates them alongside standard metrics
  • Generation replaces <PLACEHOLDER> content with dataset field values based on prompt_field setting

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured and follows existing codebase patterns. The multi-reference evaluation feature cleanly extends the existing audio evaluation framework without breaking changes. The dataset preparation script properly handles edge cases (empty audio, invalid data) and includes appropriate validation. All changes are self-contained within the audio evaluation domain and properly documented.
  • No files require special attention

Important Files Changed

Filename Overview
nemo_skills/dataset/numb3rs/prepare.py Dataset preparation script for Numb3rs with audio export, JSONL formatting, and category filtering
nemo_skills/evaluation/evaluator/audio.py Added multi-reference WER evaluation for TN/ITN tasks; computes WER against multiple ground truths
nemo_skills/inference/generate.py Added prompt_field support to substitute dataset field values into OpenAI-format prompts at runtime

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HuggingFace as HF Dataset
    participant Generate as generate.py
    participant AudioEval as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareScript: ns prepare_data numb3rs
    PrepareScript->>HuggingFace: load_dataset("NNstuff/Numb3rs")
    HuggingFace-->>PrepareScript: Dataset with audio samples
    PrepareScript->>PrepareScript: Format entries with text_tn/text_itn
    PrepareScript->>PrepareScript: Save audio files as FLAC
    PrepareScript->>PrepareScript: Create test.jsonl with messages

    User->>Generate: ns generate ++prompt_field=prompt_neutral
    Generate->>Generate: fill_prompt() replaces <PLACEHOLDER>
    Generate->>Generate: LLM generates transcription
    Generate->>AudioEval: evaluate_sample() with reference_fields
    AudioEval->>AudioEval: evaluate_asr(expected_answer, generation)
    loop For each reference field
        AudioEval->>AudioEval: evaluate_asr(sample[ref_field], generation)
        AudioEval->>AudioEval: Store as wer_tn, wer_itn
    end
    AudioEval-->>Generate: Evaluation results with multiple WERs
    Generate->>Metrics: update() with predictions
    Metrics->>Metrics: Collect dynamic_wer_scores
    Metrics->>Metrics: Aggregate wer_tn, wer_itn
    Metrics-->>User: Final metrics report

@greptile-apps bot left a comment

5 files reviewed, 1 comment


coderabbitai bot commented Jan 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new Numb3rs dataset integration (configs + prepare script) and extends audio evaluation/metrics to support a "no_tn_itn" normalization mode and per-reference WER (e.g., TN/ITN) with reference_fields-driven aggregation.

Changes

Changes by cohort / file(s):

  • Dataset Config (nemo_skills/dataset/numb3rs/__init__.py): New dataset constants and defaults: DATASET_GROUP, METRICS_TYPE, DEFAULT_SPLIT, EVAL_SPLIT, EVAL_ARGS (references: text_tn,text_itn, no_tn_itn), and GENERATION_ARGS.
  • Dataset Preparation (nemo_skills/dataset/numb3rs/prepare.py): New prepare CLI and utilities: loads HF test split, filters by category, optional audio saving, formats samples into neutral/tn/itn JSONL variants with fields like audio_filepath, duration, text_tn, text_itn, sample_id, audio_metadata; exports save_audio_and_format_entry, prepare_category, and main.
  • Audio Evaluator (nemo_skills/evaluation/evaluator/audio.py): Adds reference_fields to AudioEvaluatorConfig; adds no_tn_itn to VALID_NORMALIZATION_MODES; introduces a no_tn_itn preprocessing path and updates the ASR preprocessing flow; computes per-field WER/is_correct when reference_fields is present.
  • Audio Metrics (nemo_skills/evaluation/metrics/audio_metrics.py): Adds per-reference score collectors (e.g., wer_tn_scores, wer_itn_scores), updates update to collect per-field WERs, and get_metrics/metrics_to_print to report averaged wer_tn/wer_itn metrics.
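The per-reference scoring described above can be sketched as a loop over reference_fields that emits one WER per field. The helper names (wer, score_references) and the suffix derivation from the field names are illustrative assumptions based on this summary; the repo's evaluate_asr is not reproduced here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def score_references(sample: dict, generation: str, reference_fields: list) -> dict:
    """One WER per reference field, named wer_<suffix> as in the PR summary."""
    results = {}
    for ref_field in reference_fields:            # e.g. ["text_tn", "text_itn"]
        suffix = ref_field.removeprefix("text_")  # -> "tn" / "itn" (assumed naming)
        results[f"wer_{suffix}"] = wer(sample[ref_field], generation)
    return results


sample = {"text_tn": "twenty five dollars", "text_itn": "$25"}
print(score_references(sample, "twenty five dollars", ["text_tn", "text_itn"]))
# -> {'wer_tn': 0.0, 'wer_itn': 3.0}
```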

Sequence Diagram

sequenceDiagram
    participant HF as HF Dataset
    participant Formatter as Prepare Script
    participant AudioFS as Audio Storage
    participant Evaluator as Audio Evaluator
    participant Metrics as Metrics Aggregator
    participant Writer as JSONL Writer

    HF->>Formatter: load test split & sample
    Formatter->>AudioFS: save audio file (if with_audio)
    AudioFS-->>Formatter: audio_filepath, duration, metadata
    Formatter->>Writer: emit variants (neutral, tn, itn) with dual references
    Writer-->>Formatter: write confirmation
    Formatter->>Evaluator: submit sample + model generation
    Evaluator->>Evaluator: preprocess (no_tn_itn or other normalization)
    Evaluator->>Evaluator: compute WER per reference field (wer_tn, wer_itn)
    Evaluator->>Metrics: push per-field WER scores
    Metrics->>Writer: aggregate and output final metrics report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #1093: Related — modifies audio evaluator/metrics and overlaps on per-field WER and normalization changes.
  • PR #1140: Related — adjusts dataset/eval config and touches similar normalization/evaluation config logic.

Suggested labels

run GPU tests

Suggested reviewers

  • melllinia
  • gwarmstrong
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive. The title 'Numb3rs ds addition' is vague and uses the non-descriptive abbreviation 'ds'. While it refers to a dataset addition, it lacks clarity about what specifically is being added or its purpose. Resolution: clarify the title to describe the main change more explicitly, such as 'Add Numb3rs dataset configuration and preparation for TN/ITN evaluation' or 'Add Numb3rs dataset with audio evaluation support'.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped since CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 87.50%, which is sufficient. The required threshold is 80.00%.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 535-544: The reference-field WER calculations call evaluate_asr
without the normalization_mode, causing inconsistent WERs versus the main call;
update the loop that handles config.reference_fields to pass
normalization_mode=mode into evaluate_asr (i.e., call
evaluate_asr(sample[ref_field], generation, normalization_mode=mode)), keeping
the same metric naming logic that derives metric_suffix and updates
wer_{metric_suffix} and is_correct_{metric_suffix}.
🧹 Nitpick comments (4)
nemo_skills/inference/generate.py (1)

576-583: In-place mutation of data_point["messages"] may cause side effects.

The code mutates message["content"] directly within data_point["messages"]. If the same data_point is reused elsewhere (e.g., for retries or logging), the placeholder will already be replaced. Consider whether a deep copy is needed here, similar to how deepcopy(data_point) is used at line 600 for the non-openai path.

♻️ Suggested fix using deepcopy for safety
         if self.cfg.prompt_format == "openai":
             # Replace placeholder content with prompt from specified field
             if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
                 prompt_value = data_point[self.cfg.prompt_field]
+                # Work on a copy to avoid mutating original data_point
+                data_point = deepcopy(data_point)
                 # Find and replace placeholder in messages
                 for message in data_point["messages"]:
                     if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                         message["content"] = prompt_value
                         break
nemo_skills/dataset/numb3rs/prepare.py (3)

82-82: Hardcoded container path may need documentation.

The path /dataset/numb3rs/data/ is hardcoded, which assumes a specific container/deployment structure. Consider adding a comment explaining when/how this path is used, or making it configurable.


187-192: Consider more specific exception handling for dataset loading.

While catching a broad Exception is pragmatic for HuggingFace dataset loading (which can fail in various ways), logging the exception type could help with debugging.

♻️ Optional: More informative error logging
     try:
         dataset = load_dataset("NNstuff/Numb3rs", split="train", trust_remote_code=True)
         print(f"Loaded {len(dataset)} total samples")
     except Exception as e:
-        print(f"Error loading dataset: {e}")
+        print(f"Error loading dataset ({type(e).__name__}): {e}")
         return

218-232: Potential inclusion of stale test.jsonl from previous runs.

If test.jsonl exists from a previous run and is not deleted before the glob, it could be included in all_jsonl_files. While the filter at line 223 removes it, this relies on the exact filename match. Consider explicitly deleting the combined file before processing, or using a more robust approach.

♻️ Suggested fix to ensure clean state
     # Combine all category JSONLs into test.jsonl
     combined_file = output_dir / "test.jsonl"
+    if combined_file.exists():
+        combined_file.unlink()
     print(f"\nCreating combined file: {combined_file}")
 
     all_jsonl_files = sorted(output_dir.glob("*.jsonl"))
     category_files = [f for f in all_jsonl_files if f.name != "test.jsonl"]

@Jorjeous Jorjeous requested a review from melllinia January 22, 2026 15:49
@Jorjeous (Member Author)

@coderabbitai full review


coderabbitai bot commented Jan 22, 2026

✅ Actions performed

Full review triggered.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 55-70: The code currently computes duration = len(audio_array) /
sampling_rate without validating sampling_rate; add a guard in the prepare logic
(around where audio_info, audio_array and sampling_rate are read) to ensure
sampling_rate is a positive non-zero number (e.g., check
isinstance(sampling_rate, (int, float)) and sampling_rate > 0) and return None
for malformed entries if not, before computing duration and comparing to
MIN_AUDIO_DURATION.

In `@nemo_skills/inference/generate.py`:
- Around line 575-584: The placeholder-substitution block under the if
self.cfg.prompt_format == "openai": branch must fail fast when substitution
cannot be applied: if self.cfg.prompt_field is set but that key is missing in
data_point, or if data_point["messages"] contains no message with role "user"
and content "<PLACEHOLDER>", raise a clear exception (or call the existing error
handling) instead of silently continuing; update the logic around
self.cfg.prompt_field/data_point[self.cfg.prompt_field] and the loop over
data_point["messages"] to detect these two failure modes and surface a
descriptive error that includes the missing field name or the fact that
"<PLACEHOLDER>" was not found.
🧹 Nitpick comments (2)
nemo_skills/inference/generate.py (1)

221-224: Validate prompt_field usage for non-OpenAI prompts.

prompt_field is only used in the OpenAI path; when prompt_format != "openai" it is silently ignored. Consider validating this in _post_init_validate_params (e.g., require prompt_field is None unless prompt_format == "openai"), to prevent accidental misconfiguration.

nemo_skills/dataset/numb3rs/prepare.py (1)

186-192: Avoid swallowing unexpected dataset load errors.

Catching all exceptions can mask real problems. Consider narrowing the exception type or re-raising after logging, so unexpected failures aren’t silently ignored.

Comment on lines +55 to +70

    # Get audio info
    audio_info = entry.get("audio", {})
    if not isinstance(audio_info, dict) or "array" not in audio_info or "sampling_rate" not in audio_info:
        return None

    audio_array = audio_info["array"]
    sampling_rate = audio_info["sampling_rate"]

    # Skip if audio array is empty or invalid
    if audio_array is None or len(audio_array) == 0:
        return None

    duration = len(audio_array) / sampling_rate

    if duration < MIN_AUDIO_DURATION:
        return None

⚠️ Potential issue | 🟡 Minor

Guard against invalid sampling_rate.

If sampling_rate is 0/None, this will throw a division error. A small guard avoids hard failures on malformed entries.

🐛 Proposed guard
-    duration = len(audio_array) / sampling_rate
+    if not sampling_rate or sampling_rate <= 0:
+        return None
+    duration = len(audio_array) / sampling_rate

Comment on lines +575 to +584

        if self.cfg.prompt_format == "openai":
            # Replace placeholder content with prompt from specified field
            if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
                prompt_value = data_point[self.cfg.prompt_field]
                # Find and replace placeholder in messages
                for message in data_point["messages"]:
                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                        message["content"] = prompt_value
                        break


⚠️ Potential issue | 🟡 Minor

Fail fast when placeholder substitution can’t be applied.

If prompt_field is set but the field is missing or no <PLACEHOLDER> is found, the current logic silently leaves the placeholder in the prompt. This can waste runs and produce misleading results; a clear error (or at least a warning) is safer.

💡 Proposed fail-fast guard
         if self.cfg.prompt_format == "openai":
             # Replace placeholder content with prompt from specified field
-            if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
-                prompt_value = data_point[self.cfg.prompt_field]
-                # Find and replace placeholder in messages
-                for message in data_point["messages"]:
-                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
-                        message["content"] = prompt_value
-                        break
+            if self.cfg.prompt_field:
+                if self.cfg.prompt_field not in data_point:
+                    raise KeyError(f"prompt_field '{self.cfg.prompt_field}' not found in data point")
+                prompt_value = data_point[self.cfg.prompt_field]
+                replaced = False
+                # Find and replace placeholder in messages
+                for message in data_point.get("messages", []):
+                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
+                        message["content"] = prompt_value
+                        replaced = True
+                        break
+                if not replaced:
+                    raise ValueError("No '<PLACEHOLDER>' found in user messages for prompt_field replacement")

@greptile-apps bot left a comment

No files reviewed, no comments

update source of data
add special normalization mode

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous force-pushed the add_numb3rs_dataset branch from 14f1c57 to 676254d on February 9, 2026 13:35
@greptile-apps bot left a comment

1 file reviewed, 1 comment


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
hf_leaderboard normalization changed

preprocess_asr_text(..., mode="hf_leaderboard") previously did NFC normalization + lowercase + punctuation stripping (see prior code), but this PR routes hf_leaderboard through Whisper normalization (and removes the NFC step). That’s a backwards-incompatible behavior change that will alter WER for any existing evaluations that rely on normalization_mode=hf_leaderboard. Consider restoring the old hf_leaderboard branch (including unicodedata.normalize("NFC", ...)) and keeping Whisper normalization only for standard/audiobench (or introduce a new mode name if the new behavior is desired).
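For context, the pre-PR hf_leaderboard behavior this comment describes (NFC unicode normalization, lowercasing, punctuation stripping) would look roughly like the sketch below. The function name is hypothetical and the exact punctuation set in the original code may differ:

```python
import string
import unicodedata


def hf_leaderboard_normalize_old(text: str) -> str:
    """Pre-PR behavior as described in the review comment: NFC unicode
    normalization, lowercasing, and punctuation stripping."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Strip ASCII punctuation; the original set may have been broader
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


print(hf_leaderboard_normalize_old("Héllo,  World!"))  # -> "héllo world"
```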

@greptile-apps bot left a comment

2 files reviewed, 2 comments


greptile-apps bot commented Feb 9, 2026

Additional Comments (2)

nemo_skills/evaluation/evaluator/audio.py
Missing per-reference metrics

In the missing_generation early return for ASR tasks, only the generic wer is set. When eval_config.reference_fields is used, those derived metrics (e.g., wer_tn, wer_itn, and is_correct_*) are never populated for empty generations, so aggregates will silently exclude these samples and bias per-reference WER downward. This affects any run where reference_fields is provided and generation is empty.


nemo_skills/evaluation/metrics/audio_metrics.py
Dynamic WER double counting

AudioMetrics.update() already iterates predictions and appends pred["wer"] to self.wer_scores, but the new loop also collects any key starting with wer_. That includes the main wer key (since it also starts with wer_), so self.dynamic_wer_scores["wer"] will be populated and then get_metrics() will overwrite agg_metrics["wer"] with the dynamic average computed from the same underlying values. This changes behavior and can diverge if rounding/filters differ. The exclusion list should also exclude the plain "wer" key (and likely any other core WER keys already handled).
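The double-counting risk can be reproduced with a toy collector. The exact predicate in audio_metrics.py is not shown in this thread, but any prefix check loose enough to also match the plain "wer" key exhibits the problem:

```python
predictions = [{"wer": 0.10, "wer_tn": 0.20, "wer_itn": 0.30}]

wer_scores = []  # core metric, collected explicitly
dynamic = {}     # per-reference metrics, collected by key prefix

for pred in predictions:
    wer_scores.append(pred["wer"])
    for key, value in pred.items():
        if key.startswith("wer"):  # too loose: also matches the plain "wer" key
            dynamic.setdefault(key, []).append(value)

# "wer" ends up in both collections, so a later aggregation pass that writes
# an average for every dynamic key can overwrite the core "wer" metric.
print(sorted(dynamic))  # -> ['wer', 'wer_itn', 'wer_tn']
```

Checking key.startswith("wer_") plus an explicit exclusion of the core "wer" key avoids the collision.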

@greptile-apps bot left a comment

1 file reviewed, 2 comments


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
Duplicate config field
AudioEvaluatorConfig defines apply_whisper_normalization twice (lines 37 and 39). In a dataclass this will silently override the first definition and can break config schema/introspection (and makes defaults/metadata ambiguous). Remove the duplicate so the config has a single source of truth.
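The pitfall can be demonstrated in isolation: a repeated annotated field is not an error in Python, the later default silently wins, and introspection reports a single field. The field names below mirror the ones named in the comment; the surrounding config class is a toy stand-in:

```python
from dataclasses import dataclass, fields


@dataclass
class AudioEvaluatorConfig:
    apply_whisper_normalization: bool = True   # first declaration
    normalization_mode: str = "standard"
    apply_whisper_normalization: bool = False  # duplicate: silently overrides

# Only one field survives, carrying the *last* default; no error is raised.
print([f.name for f in fields(AudioEvaluatorConfig)])
print(AudioEvaluatorConfig().apply_whisper_normalization)  # -> False
```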

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Line 301: The print statement uses an unnecessary f-string prefix; change the
call print(f"\nCreating combined test files for each variant...") to a regular
string print("\nCreating combined test files for each variant...") by removing
the leading "f" so there are no unused formatting placeholders (locate this
statement in nemo_skills/dataset/numb3rs/prepare.py where the combined test
files are created).
- Around line 269-274: Remove the broad try/except around the load_dataset call
so failures propagate instead of being swallowed; specifically delete the except
block that prints "Error loading dataset" and the early return, and let the call
to load_dataset("nvidia/Numb3rs", split="test", trust_remote_code=True) (and the
subsequent print of len(dataset)) raise its errors normally; keep the dataset
variable assignment and the success print as-is so callers see clear failures
from load_dataset.
- Around line 86-108: Replace lax .get() usage with direct dict access on the
expected Numb3rs fields to fail fast: access entry["original_text"],
entry["text"], and entry["file_name"] (instead of entry.get(...)) and let
KeyError surface for malformed entries; continue to strip the values and return
None if original_text or text are empty. Do not default duration to 1.0 — use
entry["duration"] and validate it exists, then compare against
MIN_AUDIO_DURATION and return None if too short. Preserve the existing
sample_id/audio_filename logic but read file_name from entry["file_name"] and
keep the .stem/.name handling and ".json" trimming as before so
malformed/missing keys raise clear errors.

In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 37-42: Remove the duplicate declaration of the dataclass field
apply_whisper_normalization (it appears twice surrounding normalization_mode and
reference_fields); keep a single declaration with the intended default value
(True) and eliminate the shadowing duplicate so the dataclass has only one
apply_whisper_normalization field.
🧹 Nitpick comments (2)
nemo_skills/dataset/numb3rs/prepare.py (2)

110-123: Silent default for sampling_rate may hide data issues.

Line 117 defaults sampling_rate to 16000 if missing. If the audio metadata is expected to contain this field, a missing value likely indicates a malformed entry that should be flagged rather than silently assumed.

♻️ Suggested fix
-        sampling_rate = audio_info.get("sampling_rate", 16000)
+        sampling_rate = audio_info["sampling_rate"]

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".


64-72: Add type hints to function signatures.

Functions lack type hints for their parameters and return types. This applies to build_messages_with_prompt, save_audio_and_format_entry, prepare_category, and main.

Example for save_audio_and_format_entry:

def save_audio_and_format_entry(
    entry: dict, category: str, audio_dir: Path, sample_idx: int,
    with_audio: bool = True, audio_prefix: str = "/data/numb3rs",
) -> dict | None:

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Also applies to: 75-75, 152-152, 233-233

@greptile-apps bot left a comment

1 file reviewed, 1 comment


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
Docstring mismatch

evaluate_asr’s docstring still says normalization_mode is only "standard", "audiobench", "hf_leaderboard", or "none", but this PR adds and uses the new "no_tn_itn" mode. This makes the public contract misleading and can cause users to pass the new mode and think it’s unsupported (or vice versa). Please update the docstring to include "no_tn_itn" (and keep it consistent with VALID_NORMALIZATION_MODES).

@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)

321-328: ⚠️ Potential issue | 🟡 Minor

Docstring doesn't list "no_tn_itn" as a valid normalization_mode.

Line 327 only mentions "standard", "audiobench", "hf_leaderboard", and "none". The new "no_tn_itn" mode should be documented here as well for consistency with preprocess_asr_text.

📝 Proposed fix
     Args:
         reference: Ground truth transcription.
         hypothesis: Model output transcription.
-        normalization_mode: "standard", "audiobench", "hf_leaderboard", or "none".
+        normalization_mode: "standard", "audiobench", "hf_leaderboard", "none", or "no_tn_itn".
     """

@greptile-apps bot left a comment

No files reviewed, no comments

@greptile-apps bot left a comment

No files reviewed, no comments

@Kipok (Collaborator) left a comment

please add this dataset to docs with some example commands / reference numbers https://github.com/NVIDIA-NeMo/Skills/blob/main/CONTRIBUTING.md#when-adding-new-benchmarks

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

Jorjeous commented Feb 11, 2026

please add this dataset to docs with some example commands / reference numbers https://github.com/NVIDIA-NeMo/Skills/blob/main/CONTRIBUTING.md#when-adding-new-benchmarks

done, no ref numbers for now

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Jorjeous Jorjeous enabled auto-merge (squash) February 12, 2026 11:21
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Jorjeous Jorjeous requested a review from Kipok February 13, 2026 09:27
@Jorjeous
Member Author

@melllinia

@Kipok Kipok left a comment

since the changes are fully local to the audio logic / the new benchmark, I'm ok with merging this if you believe the code is good and you tested it well. It seems like there is a breaking change in hf_leaderboard normalization; make sure it's intentional. Please have someone from your team review and approve this

also, I highly recommend running /review in Codex or Claude Code - I think they have some helpful comments there. E.g. here is the output from Claude:

Overview

  This PR adds the Numb3rs benchmark: a speech dataset for evaluating text normalization (TN) and inverse text normalization (ITN). It includes:
  - New dataset definition (__init__.py) and preparation script (prepare.py)
  - A new no_tn_itn normalization mode in the audio evaluator
  - Multi-reference WER evaluation via reference_fields
  - Improvements to the strip_helpful_prefixes function (colon-quote and contraction handling)
  - New wer_tn / wer_itn metrics in AudioMetrics
  - Documentation for the new benchmark

  Critical Issue: Breaking change to hf_leaderboard normalization mode

  audio.py:288-298 - The hf_leaderboard mode's dedicated code path was removed. Previously it did:

  # OLD (main branch)
  if mode == "hf_leaderboard":
      text = unicodedata.normalize("NFC", text)
      text = text.lower()
      text = re.sub(r"[^\w\s]", "", text)
      return re.sub(r"\s+", " ", text).strip()

  Now it falls through to whisper normalization, which behaves very differently (e.g., it converts number words to digits and applies its own English-specific transformations). This silently breaks asr-leaderboard, which uses ++eval_config.normalization_mode=hf_leaderboard. The new no_tn_itn mode is basically what hf_leaderboard used to do (minus unicode NFC normalization).

  This should be fixed: either restore the hf_leaderboard path, or, if the change was intentional, make it a separate commit with a clear rationale and update the asr-leaderboard dataset accordingly.

  Other Issues

  1. audio.py:38 - Type annotation style: list[str] | None uses PEP 604 union syntax. The rest of the codebase (and CONTRIBUTING.md) says to avoid "complicated types" and use simple types. Consider checking whether this project targets Python 3.9 compatibility, where list[str] | None would fail at runtime unless from __future__ import annotations is used.
  2. audio_metrics.py - Hardcoded wer_tn / wer_itn field names: The evaluator's reference_fields feature is generic (any field name), but the metrics collector only looks for the hardcoded wer_tn and wer_itn keys. If someone uses reference_fields=['text_written', 'text_spoken'], the metrics would be wer_written and wer_spoken, but AudioMetrics would silently ignore them. Either make the metrics collection dynamic (iterate over all wer_* keys in predictions), or document that only text_tn/text_itn field names are supported.
  3. prepare.py:95 - audio_metadata key leaks into the entry dict: The audio_metadata key is added to the formatted entry and then del'd in the loop (line 180). This is a fragile pattern: if any code path writes the base entry before messages are added, audio_metadata would leak into the JSONL. Consider building messages inline instead of storing intermediate state.
  4. prepare.py:80 - sample_id field: The sample_id is derived from the filename stem, but other audio datasets use a simple integer index for sample_id. This inconsistency could cause issues with deduplication or ordering assumptions downstream.

  Minor Suggestions

  5. prepare.py:55 - SYSTEM_MESSAGE with /no_think: The /no_think suffix in the system message is model-specific behavior. Other audio datasets in the repo use the same pattern, so this is consistent; just noting it.
  6. prepare.py:232-237 - Category validation: Unknown categories print a warning and are silently skipped. Per the project's "don't be overly defensive" / "fail fast" style, this should probably raise an error instead.
  7. strip_helpful_prefixes changes (audio.py:74-89): The rewrite to handle contractions like o'clock is a nice improvement. However, the non-greedy '(.+?)' pattern could still be too eager for some edge cases. The change is also not specific to numb3rs: it affects all audio evaluation. Worth testing against existing benchmarks.
  8. Documentation duplication: The Numb3rs section appears twice in speech-audio.md: once as a brief summary at the top (lines ~35-48) and again as a full section at the bottom (lines ~419+). This is consistent with how other benchmarks are documented in that file, so it's fine.

  Summary

  The main concern is the silent breaking change to hf_leaderboard normalization; it should be addressed before merge. The rest of the implementation follows project conventions well and adds a clean, well-documented benchmark.
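The hardcoded-field-names issue (point 2 above) can be addressed with dynamic aggregation. A minimal sketch, assuming per-sample prediction dicts carry wer_<suffix> keys; the function name and data shapes are illustrative, not the actual AudioMetrics API:

```python
from collections import defaultdict

def aggregate_wer(predictions: list[dict]) -> dict:
    """Average every wer_* key found in per-sample prediction dicts."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pred in predictions:
        for key, value in pred.items():
            if key.startswith("wer"):
                sums[key] += value
                counts[key] += 1
    # Each metric is averaged only over samples that actually report it.
    return {key: sums[key] / counts[key] for key in sums}

preds = [
    {"wer_tn": 0.25, "wer_itn": 0.5},
    {"wer_tn": 0.75, "wer_itn": 0.25},
]
print(aggregate_wer(preds))  # {'wer_tn': 0.5, 'wer_itn': 0.375}
```

With this shape, reference_fields=['text_written', 'text_spoken'] would surface as wer_written / wer_spoken automatically instead of being silently dropped.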

- Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py)
- Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)

### Numb3rs
Collaborator

I think there is something wrong with the headers and placement of docs in this file (not just for numb3rs). We should probably have only one header per benchmark, with all information under subheaders

from datasets import load_dataset
from tqdm import tqdm

SYSTEM_MESSAGE = "You are a helpful assistant. /no_think"
Collaborator

do we need this? Especially the /no_think part which is qwen specific

Collaborator

I'd suggest not setting a system message; users can always add one via the explicit ++system_message parameter

@Jorjeous
Member Author

The hf_leaderboard change is intentional.
@melllinia please check, but I believe there should be normalization

…ulation, refactored normalization process

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

@melllinia plz retry exps

@melllinia
Member

@Jorjeous After the current changes the scores are back to normal. Approved!

@Jorjeous Jorjeous merged commit 6da2219 into main Feb 23, 2026
5 checks passed
@Jorjeous Jorjeous deleted the add_numb3rs_dataset branch February 23, 2026 14:37
Kipok pushed a commit that referenced this pull request Feb 24, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>