
Numb3rs ds addition #1174

Merged

Jorjeous merged 13 commits into main from add_numb3rs_dataset on Feb 23, 2026
Conversation

@Jorjeous Jorjeous commented Jan 19, 2026

Summary by CodeRabbit

  • New Features

    • Added Numb3rs dataset support with automated audio preparation and neutral/tn/itn prompt variants.
    • Audio evaluation now supports multi-reference/per-field scoring and reports per-reference WER metrics.
  • Enhancements

    • New normalization option ("no_tn_itn") for evaluation preprocessing to preserve numeric forms during scoring.
    • Aggregation and display of TN/ITN WER metrics alongside existing metrics.
  • Documentation

    • Usage guidance and dataset description for Numb3rs preparation and evaluation workflows.
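The "no_tn_itn" normalization mode summarized above can be pictured as a cleaner that lowercases and drops punctuation for scoring while leaving numeric forms untouched, so "$25" and "twenty five dollars" stay distinguishable in the WER. This is a hypothetical standalone sketch, not the repo's preprocess_asr_text; the function name and the set of kept symbols are assumptions:

```python
def normalize_no_tn_itn(text: str) -> str:
    """Lowercase and drop punctuation for scoring, but keep digits and
    number-adjacent symbols untouched so TN/ITN differences
    ("$25" vs "twenty five dollars") remain visible in the WER."""
    kept_symbols = set("$%.,:/-")  # assumed set; preserves numeric forms
    text = text.lower()
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace() or ch in kept_symbols)
    # Collapse runs of whitespace left behind by removed characters
    return " ".join(text.split())

print(normalize_no_tn_itn("The total is $25.50!"))  # -> "the total is $25.50"
print(normalize_no_tn_itn("twenty five dollars"))   # -> "twenty five dollars"
```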

@Jorjeous (Member Author)

CPU test failures are not related to this PR.

@Jorjeous Jorjeous marked this pull request as ready for review January 22, 2026 15:41
@Jorjeous Jorjeous changed the title from "Draft Numb3rs ds addition" to "Numb3rs ds addition" Jan 22, 2026

greptile-apps bot commented Jan 22, 2026

Greptile Overview

Greptile Summary

This PR adds support for the Numb3rs speech dataset for text normalization (TN) and inverse text normalization (ITN) evaluation. The changes enable evaluating ASR models against multiple reference fields (written vs spoken forms) and reporting per-reference WER metrics.

Key Changes:

  • Added Numb3rs dataset configuration and preparation tooling with audio export and per-category JSONL outputs
  • Enhanced audio evaluator to compute WER against multiple reference fields specified via reference_fields config
  • Extended metrics to dynamically aggregate and report WER variants (e.g., wer_tn, wer_itn)
  • Added prompt_field parameter to enable runtime prompt substitution in OpenAI-format messages

Implementation:

  • Dataset preparation loads from HuggingFace, formats entries with dual references (text_tn/text_itn), and creates placeholder-based messages
  • Audio evaluation loops through specified reference fields and computes per-reference WER metrics with proper naming
  • Metrics tracking collects dynamic WER scores and aggregates them alongside standard metrics
  • Generation replaces <PLACEHOLDER> content with dataset field values based on prompt_field setting

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured and follows existing codebase patterns. The multi-reference evaluation feature cleanly extends the existing audio evaluation framework without breaking changes. The dataset preparation script properly handles edge cases (empty audio, invalid data) and includes appropriate validation. All changes are self-contained within the audio evaluation domain and properly documented.
  • No files require special attention

Important Files Changed

Filename Overview
nemo_skills/dataset/numb3rs/prepare.py Dataset preparation script for Numb3rs with audio export, JSONL formatting, and category filtering
nemo_skills/evaluation/evaluator/audio.py Added multi-reference WER evaluation for TN/ITN tasks; computes WER against multiple ground truths
nemo_skills/inference/generate.py Added prompt_field support to substitute dataset field values into OpenAI-format prompts at runtime

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HuggingFace as HF Dataset
    participant Generate as generate.py
    participant AudioEval as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareScript: ns prepare_data numb3rs
    PrepareScript->>HuggingFace: load_dataset("NNstuff/Numb3rs")
    HuggingFace-->>PrepareScript: Dataset with audio samples
    PrepareScript->>PrepareScript: Format entries with text_tn/text_itn
    PrepareScript->>PrepareScript: Save audio files as FLAC
    PrepareScript->>PrepareScript: Create test.jsonl with messages

    User->>Generate: ns generate ++prompt_field=prompt_neutral
    Generate->>Generate: fill_prompt() replaces <PLACEHOLDER>
    Generate->>Generate: LLM generates transcription
    Generate->>AudioEval: evaluate_sample() with reference_fields
    AudioEval->>AudioEval: evaluate_asr(expected_answer, generation)
    loop For each reference field
        AudioEval->>AudioEval: evaluate_asr(sample[ref_field], generation)
        AudioEval->>AudioEval: Store as wer_tn, wer_itn
    end
    AudioEval-->>Generate: Evaluation results with multiple WERs
    Generate->>Metrics: update() with predictions
    Metrics->>Metrics: Collect dynamic_wer_scores
    Metrics->>Metrics: Aggregate wer_tn, wer_itn
    Metrics-->>User: Final metrics report

@greptile-apps bot left a comment

5 files reviewed, 1 comment


coderabbitai bot commented Jan 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new Numb3rs dataset integration (configs + prepare script) and extends audio evaluation/metrics to support a "no_tn_itn" normalization mode and per-reference WER (e.g., TN/ITN) with reference_fields-driven aggregation.

Changes

Changes by cohort / file(s):

  • Dataset Config (nemo_skills/dataset/numb3rs/__init__.py): New dataset constants and defaults: DATASET_GROUP, METRICS_TYPE, DEFAULT_SPLIT, EVAL_SPLIT, EVAL_ARGS (references: text_tn,text_itn, no_tn_itn), and GENERATION_ARGS.
  • Dataset Preparation (nemo_skills/dataset/numb3rs/prepare.py): New prepare CLI and utilities: loads HF test split, filters by category, optional audio saving, formats samples into neutral/tn/itn JSONL variants with fields like audio_filepath, duration, text_tn, text_itn, sample_id, audio_metadata; exports save_audio_and_format_entry, prepare_category, and main.
  • Audio Evaluator (nemo_skills/evaluation/evaluator/audio.py): Adds reference_fields to AudioEvaluatorConfig; adds no_tn_itn to VALID_NORMALIZATION_MODES; introduces a no_tn_itn preprocessing path and updates the ASR preprocessing flow; computes per-field WER/is_correct when reference_fields is present.
  • Audio Metrics (nemo_skills/evaluation/metrics/audio_metrics.py): Adds per-reference score collectors (e.g., wer_tn_scores, wer_itn_scores), updates update to collect per-field WERs, and get_metrics/metrics_to_print to report averaged wer_tn/wer_itn metrics.
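The per-reference scoring described above can be sketched as a loop over reference_fields that emits one WER per field. The helper names (wer, score_references) and the suffix derivation from the field names are illustrative assumptions based on this summary; the repo's evaluate_asr is not reproduced here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def score_references(sample: dict, generation: str, reference_fields: list) -> dict:
    """One WER per reference field, named wer_<suffix> as in the PR summary."""
    results = {}
    for ref_field in reference_fields:            # e.g. ["text_tn", "text_itn"]
        suffix = ref_field.removeprefix("text_")  # -> "tn" / "itn" (assumed naming)
        results[f"wer_{suffix}"] = wer(sample[ref_field], generation)
    return results


sample = {"text_tn": "twenty five dollars", "text_itn": "$25"}
print(score_references(sample, "twenty five dollars", ["text_tn", "text_itn"]))
# -> {'wer_tn': 0.0, 'wer_itn': 3.0}
```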

Sequence Diagram

sequenceDiagram
    participant HF as HF Dataset
    participant Formatter as Prepare Script
    participant AudioFS as Audio Storage
    participant Evaluator as Audio Evaluator
    participant Metrics as Metrics Aggregator
    participant Writer as JSONL Writer

    HF->>Formatter: load test split & sample
    Formatter->>AudioFS: save audio file (if with_audio)
    AudioFS-->>Formatter: audio_filepath, duration, metadata
    Formatter->>Writer: emit variants (neutral, tn, itn) with dual references
    Writer-->>Formatter: write confirmation
    Formatter->>Evaluator: submit sample + model generation
    Evaluator->>Evaluator: preprocess (no_tn_itn or other normalization)
    Evaluator->>Evaluator: compute WER per reference field (wer_tn, wer_itn)
    Evaluator->>Metrics: push per-field WER scores
    Metrics->>Writer: aggregate and output final metrics report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #1093: Related — modifies audio evaluator/metrics and overlaps on per-field WER and normalization changes.
  • PR #1140: Related — adjusts dataset/eval config and touches similar normalization/evaluation config logic.

Suggested labels

run GPU tests

Suggested reviewers

  • melllinia
  • gwarmstrong
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive. The title 'Numb3rs ds addition' is vague and uses the non-descriptive abbreviation 'ds'. While it refers to a dataset addition, it lacks clarity about what specifically is being added or its purpose. Resolution: clarify the title to describe the main change more explicitly, such as 'Add Numb3rs dataset configuration and preparation for TN/ITN evaluation' or 'Add Numb3rs dataset with audio evaluation support'.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped since CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 87.50%, which is sufficient. The required threshold is 80.00%.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 535-544: The reference-field WER calculations call evaluate_asr
without the normalization_mode, causing inconsistent WERs versus the main call;
update the loop that handles config.reference_fields to pass
normalization_mode=mode into evaluate_asr (i.e., call
evaluate_asr(sample[ref_field], generation, normalization_mode=mode)), keeping
the same metric naming logic that derives metric_suffix and updates
wer_{metric_suffix} and is_correct_{metric_suffix}.
🧹 Nitpick comments (4)
nemo_skills/inference/generate.py (1)

576-583: In-place mutation of data_point["messages"] may cause side effects.

The code mutates message["content"] directly within data_point["messages"]. If the same data_point is reused elsewhere (e.g., for retries or logging), the placeholder will already be replaced. Consider whether a deep copy is needed here, similar to how deepcopy(data_point) is used at line 600 for the non-openai path.

♻️ Suggested fix using deepcopy for safety
         if self.cfg.prompt_format == "openai":
             # Replace placeholder content with prompt from specified field
             if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
                 prompt_value = data_point[self.cfg.prompt_field]
+                # Work on a copy to avoid mutating original data_point
+                data_point = deepcopy(data_point)
                 # Find and replace placeholder in messages
                 for message in data_point["messages"]:
                     if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                         message["content"] = prompt_value
                         break
nemo_skills/dataset/numb3rs/prepare.py (3)

82-82: Hardcoded container path may need documentation.

The path /dataset/numb3rs/data/ is hardcoded, which assumes a specific container/deployment structure. Consider adding a comment explaining when/how this path is used, or making it configurable.


187-192: Consider more specific exception handling for dataset loading.

While catching a broad Exception is pragmatic for HuggingFace dataset loading (which can fail in various ways), logging the exception type could help with debugging.

♻️ Optional: More informative error logging
     try:
         dataset = load_dataset("NNstuff/Numb3rs", split="train", trust_remote_code=True)
         print(f"Loaded {len(dataset)} total samples")
     except Exception as e:
-        print(f"Error loading dataset: {e}")
+        print(f"Error loading dataset ({type(e).__name__}): {e}")
         return

218-232: Potential inclusion of stale test.jsonl from previous runs.

If test.jsonl exists from a previous run and is not deleted before the glob, it could be included in all_jsonl_files. While the filter at line 223 removes it, this relies on the exact filename match. Consider explicitly deleting the combined file before processing, or using a more robust approach.

♻️ Suggested fix to ensure clean state
     # Combine all category JSONLs into test.jsonl
     combined_file = output_dir / "test.jsonl"
+    if combined_file.exists():
+        combined_file.unlink()
     print(f"\nCreating combined file: {combined_file}")
 
     all_jsonl_files = sorted(output_dir.glob("*.jsonl"))
     category_files = [f for f in all_jsonl_files if f.name != "test.jsonl"]

@Jorjeous Jorjeous requested a review from melllinia January 22, 2026 15:49
@Jorjeous (Member Author)

@coderabbitai full review


coderabbitai bot commented Jan 22, 2026

✅ Actions performed

Full review triggered.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 55-70: The code currently computes duration = len(audio_array) /
sampling_rate without validating sampling_rate; add a guard in the prepare logic
(around where audio_info, audio_array and sampling_rate are read) to ensure
sampling_rate is a positive non-zero number (e.g., check
isinstance(sampling_rate, (int, float)) and sampling_rate > 0) and return None
for malformed entries if not, before computing duration and comparing to
MIN_AUDIO_DURATION.

In `@nemo_skills/inference/generate.py`:
- Around line 575-584: The placeholder-substitution block under the if
self.cfg.prompt_format == "openai": branch must fail fast when substitution
cannot be applied: if self.cfg.prompt_field is set but that key is missing in
data_point, or if data_point["messages"] contains no message with role "user"
and content "<PLACEHOLDER>", raise a clear exception (or call the existing error
handling) instead of silently continuing; update the logic around
self.cfg.prompt_field/data_point[self.cfg.prompt_field] and the loop over
data_point["messages"] to detect these two failure modes and surface a
descriptive error that includes the missing field name or the fact that
"<PLACEHOLDER>" was not found.
🧹 Nitpick comments (2)
nemo_skills/inference/generate.py (1)

221-224: Validate prompt_field usage for non-OpenAI prompts.

prompt_field is only used in the OpenAI path; when prompt_format != "openai" it is silently ignored. Consider validating this in _post_init_validate_params (e.g., require prompt_field is None unless prompt_format == "openai"), to prevent accidental misconfiguration.

nemo_skills/dataset/numb3rs/prepare.py (1)

186-192: Avoid swallowing unexpected dataset load errors.

Catching all exceptions can mask real problems. Consider narrowing the exception type or re-raising after logging, so unexpected failures aren’t silently ignored.

Comment on lines +55 to +70

    # Get audio info
    audio_info = entry.get("audio", {})
    if not isinstance(audio_info, dict) or "array" not in audio_info or "sampling_rate" not in audio_info:
        return None

    audio_array = audio_info["array"]
    sampling_rate = audio_info["sampling_rate"]

    # Skip if audio array is empty or invalid
    if audio_array is None or len(audio_array) == 0:
        return None

    duration = len(audio_array) / sampling_rate

    if duration < MIN_AUDIO_DURATION:
        return None

⚠️ Potential issue | 🟡 Minor

Guard against invalid sampling_rate.

If sampling_rate is 0/None, this will throw a division error. A small guard avoids hard failures on malformed entries.

🐛 Proposed guard
-    duration = len(audio_array) / sampling_rate
+    if not sampling_rate or sampling_rate <= 0:
+        return None
+    duration = len(audio_array) / sampling_rate

Comment on lines +575 to +584

        if self.cfg.prompt_format == "openai":
            # Replace placeholder content with prompt from specified field
            if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
                prompt_value = data_point[self.cfg.prompt_field]
                # Find and replace placeholder in messages
                for message in data_point["messages"]:
                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
                        message["content"] = prompt_value
                        break


⚠️ Potential issue | 🟡 Minor

Fail fast when placeholder substitution can’t be applied.

If prompt_field is set but the field is missing or no <PLACEHOLDER> is found, the current logic silently leaves the placeholder in the prompt. This can waste runs and produce misleading results; a clear error (or at least a warning) is safer.

💡 Proposed fail-fast guard
         if self.cfg.prompt_format == "openai":
             # Replace placeholder content with prompt from specified field
-            if self.cfg.prompt_field and self.cfg.prompt_field in data_point:
-                prompt_value = data_point[self.cfg.prompt_field]
-                # Find and replace placeholder in messages
-                for message in data_point["messages"]:
-                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
-                        message["content"] = prompt_value
-                        break
+            if self.cfg.prompt_field:
+                if self.cfg.prompt_field not in data_point:
+                    raise KeyError(f"prompt_field '{self.cfg.prompt_field}' not found in data point")
+                prompt_value = data_point[self.cfg.prompt_field]
+                replaced = False
+                # Find and replace placeholder in messages
+                for message in data_point.get("messages", []):
+                    if message.get("role") == "user" and message.get("content") == "<PLACEHOLDER>":
+                        message["content"] = prompt_value
+                        replaced = True
+                        break
+                if not replaced:
+                    raise ValueError("No '<PLACEHOLDER>' found in user messages for prompt_field replacement")

@greptile-apps bot left a comment

No files reviewed, no comments

update source of data
add special normalization mode

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous force-pushed the add_numb3rs_dataset branch from 14f1c57 to 676254d on February 9, 2026 13:35
@greptile-apps bot left a comment

1 file reviewed, 1 comment


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
hf_leaderboard normalization changed

preprocess_asr_text(..., mode="hf_leaderboard") previously did NFC normalization + lowercase + punctuation stripping (see prior code), but this PR routes hf_leaderboard through Whisper normalization (and removes the NFC step). That’s a backwards-incompatible behavior change that will alter WER for any existing evaluations that rely on normalization_mode=hf_leaderboard. Consider restoring the old hf_leaderboard branch (including unicodedata.normalize("NFC", ...)) and keeping Whisper normalization only for standard/audiobench (or introduce a new mode name if the new behavior is desired).
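For context, the pre-PR hf_leaderboard behavior this comment describes (NFC unicode normalization, lowercasing, punctuation stripping) would look roughly like the sketch below. The function name is hypothetical and the exact punctuation set in the original code may differ:

```python
import string
import unicodedata


def hf_leaderboard_normalize_old(text: str) -> str:
    """Pre-PR behavior as described in the review comment: NFC unicode
    normalization, lowercasing, and punctuation stripping."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Strip ASCII punctuation; the original set may have been broader
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


print(hf_leaderboard_normalize_old("Héllo,  World!"))  # -> "héllo world"
```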

@greptile-apps bot left a comment

2 files reviewed, 2 comments


greptile-apps bot commented Feb 9, 2026

Additional Comments (2)

nemo_skills/evaluation/evaluator/audio.py
Missing per-reference metrics

In the missing_generation early return for ASR tasks, only the generic wer is set. When eval_config.reference_fields is used, those derived metrics (e.g., wer_tn, wer_itn, and is_correct_*) are never populated for empty generations, so aggregates will silently exclude these samples and bias per-reference WER downward. This affects any run where reference_fields is provided and generation is empty.


nemo_skills/evaluation/metrics/audio_metrics.py
Dynamic WER double counting

AudioMetrics.update() already iterates predictions and appends pred["wer"] to self.wer_scores, but the new loop also collects any key starting with wer_. That includes the main wer key (since it also starts with wer_), so self.dynamic_wer_scores["wer"] will be populated and then get_metrics() will overwrite agg_metrics["wer"] with the dynamic average computed from the same underlying values. This changes behavior and can diverge if rounding/filters differ. The exclusion list should also exclude the plain "wer" key (and likely any other core WER keys already handled).
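The double-counting risk can be reproduced with a toy collector. The exact predicate in audio_metrics.py is not shown in this thread, but any prefix check loose enough to also match the plain "wer" key exhibits the problem:

```python
predictions = [{"wer": 0.10, "wer_tn": 0.20, "wer_itn": 0.30}]

wer_scores = []  # core metric, collected explicitly
dynamic = {}     # per-reference metrics, collected by key prefix

for pred in predictions:
    wer_scores.append(pred["wer"])
    for key, value in pred.items():
        if key.startswith("wer"):  # too loose: also matches the plain "wer" key
            dynamic.setdefault(key, []).append(value)

# "wer" ends up in both collections, so a later aggregation pass that writes
# an average for every dynamic key can overwrite the core "wer" metric.
print(sorted(dynamic))  # -> ['wer', 'wer_itn', 'wer_tn']
```

Checking key.startswith("wer_") plus an explicit exclusion of the core "wer" key avoids the collision.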

@greptile-apps bot left a comment

1 file reviewed, 2 comments


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
Duplicate config field
AudioEvaluatorConfig defines apply_whisper_normalization twice (lines 37 and 39). In a dataclass this will silently override the first definition and can break config schema/introspection (and makes defaults/metadata ambiguous). Remove the duplicate so the config has a single source of truth.
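The pitfall can be demonstrated in isolation: a repeated annotated field is not an error in Python, the later default silently wins, and introspection reports a single field. The field names below mirror the ones named in the comment; the surrounding config class is a toy stand-in:

```python
from dataclasses import dataclass, fields


@dataclass
class AudioEvaluatorConfig:
    apply_whisper_normalization: bool = True   # first declaration
    normalization_mode: str = "standard"
    apply_whisper_normalization: bool = False  # duplicate: silently overrides

# Only one field survives, carrying the *last* default; no error is raised.
print([f.name for f in fields(AudioEvaluatorConfig)])
print(AudioEvaluatorConfig().apply_whisper_normalization)  # -> False
```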

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Line 301: The print statement uses an unnecessary f-string prefix; change the
call print(f"\nCreating combined test files for each variant...") to a regular
string print("\nCreating combined test files for each variant...") by removing
the leading "f" so there are no unused formatting placeholders (locate this
statement in nemo_skills/dataset/numb3rs/prepare.py where the combined test
files are created).
- Around line 269-274: Remove the broad try/except around the load_dataset call
so failures propagate instead of being swallowed; specifically delete the except
block that prints "Error loading dataset" and the early return, and let the call
to load_dataset("nvidia/Numb3rs", split="test", trust_remote_code=True) (and the
subsequent print of len(dataset)) raise its errors normally; keep the dataset
variable assignment and the success print as-is so callers see clear failures
from load_dataset.
- Around line 86-108: Replace lax .get() usage with direct dict access on the
expected Numb3rs fields to fail fast: access entry["original_text"],
entry["text"], and entry["file_name"] (instead of entry.get(...)) and let
KeyError surface for malformed entries; continue to strip the values and return
None if original_text or text are empty. Do not default duration to 1.0 — use
entry["duration"] and validate it exists, then compare against
MIN_AUDIO_DURATION and return None if too short. Preserve the existing
sample_id/audio_filename logic but read file_name from entry["file_name"] and
keep the .stem/.name handling and ".json" trimming as before so
malformed/missing keys raise clear errors.

In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 37-42: Remove the duplicate declaration of the dataclass field
apply_whisper_normalization (it appears twice surrounding normalization_mode and
reference_fields); keep a single declaration with the intended default value
(True) and eliminate the shadowing duplicate so the dataclass has only one
apply_whisper_normalization field.
🧹 Nitpick comments (2)
nemo_skills/dataset/numb3rs/prepare.py (2)

110-123: Silent default for sampling_rate may hide data issues.

Line 117 defaults sampling_rate to 16000 if missing. If the audio metadata is expected to contain this field, a missing value likely indicates a malformed entry that should be flagged rather than silently assumed.

♻️ Suggested fix
-        sampling_rate = audio_info.get("sampling_rate", 16000)
+        sampling_rate = audio_info["sampling_rate"]

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".


64-72: Add type hints to function signatures.

Functions lack type hints for their parameters and return types. This applies to build_messages_with_prompt, save_audio_and_format_entry, prepare_category, and main.

Example for save_audio_and_format_entry:

def save_audio_and_format_entry(
    entry: dict, category: str, audio_dir: Path, sample_idx: int,
    with_audio: bool = True, audio_prefix: str = "/data/numb3rs",
) -> dict | None:

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Also applies to: 75-75, 152-152, 233-233

@greptile-apps bot left a comment

1 file reviewed, 1 comment


greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/evaluation/evaluator/audio.py
Docstring mismatch

evaluate_asr’s docstring still says normalization_mode is only "standard", "audiobench", "hf_leaderboard", or "none", but this PR adds and uses the new "no_tn_itn" mode. This makes the public contract misleading and can cause users to pass the new mode and think it’s unsupported (or vice versa). Please update the docstring to include "no_tn_itn" (and keep it consistent with VALID_NORMALIZATION_MODES).

@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)

321-328: ⚠️ Potential issue | 🟡 Minor

Docstring doesn't list "no_tn_itn" as a valid normalization_mode.

Line 327 only mentions "standard", "audiobench", "hf_leaderboard", and "none". The new "no_tn_itn" mode should be documented here as well for consistency with preprocess_asr_text.

📝 Proposed fix
     Args:
         reference: Ground truth transcription.
         hypothesis: Model output transcription.
-        normalization_mode: "standard", "audiobench", "hf_leaderboard", or "none".
+        normalization_mode: "standard", "audiobench", "hf_leaderboard", "none", or "no_tn_itn".
     """

@greptile-apps bot left a comment

No files reviewed, no comments

@greptile-apps bot left a comment

No files reviewed, no comments

@Kipok (Collaborator) left a comment

please add this dataset to docs with some example commands / reference numbers https://github.com/NVIDIA-NeMo/Skills/blob/main/CONTRIBUTING.md#when-adding-new-benchmarks

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

Jorjeous commented Feb 11, 2026

please add this dataset to docs with some example commands / reference numbers https://github.com/NVIDIA-NeMo/Skills/blob/main/CONTRIBUTING.md#when-adding-new-benchmarks

done, no ref numbers for now

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Jorjeous Jorjeous enabled auto-merge (squash) February 12, 2026 11:21
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Jorjeous Jorjeous requested a review from Kipok February 13, 2026 09:27
@Jorjeous
Member Author

@melllinia

@Kipok Kipok left a comment

since the changes are fully local to the audio logic / the new benchmark, I'm ok with merging this if you believe the code is good and you tested it well. It seems like there is a breaking change in hf_leaderboard normalization; make sure it's intentional. Please have someone from your team review and approve this

also, I highly recommend running /review in Codex or Claude Code - I think they have some helpful comments there. E.g. here is the output from Claude:

Overview

  This PR adds the Numb3rs benchmark: a speech dataset for evaluating text normalization (TN) and inverse text normalization (ITN). It includes:
  - New dataset definition (__init__.py) and preparation script (prepare.py)
  - A new no_tn_itn normalization mode in the audio evaluator
  - Multi-reference WER evaluation via reference_fields
  - Improvements to the strip_helpful_prefixes function (colon-quote and contraction handling)
  - New wer_tn / wer_itn metrics in AudioMetrics
  - Documentation for the new benchmark

  Critical Issue: Breaking change to hf_leaderboard normalization mode

  audio.py:288-298 - The hf_leaderboard mode's dedicated code path was removed. Previously it did:

  # OLD (main branch)
  if mode == "hf_leaderboard":
      text = unicodedata.normalize("NFC", text)
      text = text.lower()
      text = re.sub(r"[^\w\s]", "", text)
      return re.sub(r"\s+", " ", text).strip()

  Now it falls through to whisper normalization, which behaves very differently (e.g., it converts number words to digits and applies its own English-specific transformations). This silently breaks asr-leaderboard, which uses ++eval_config.normalization_mode=hf_leaderboard. The new no_tn_itn mode is basically what hf_leaderboard used to do (minus unicode NFC normalization).

  This should be fixed: either restore the hf_leaderboard path, or, if the change was intentional, make it a separate commit with a clear rationale and update the asr-leaderboard dataset accordingly.

  Other Issues

  1. audio.py:38 - Type annotation style: list[str] | None uses PEP 604 union syntax. The rest of the codebase (and CONTRIBUTING.md) says to avoid "complicated types" and use simple types. Consider checking whether this project targets Python 3.9 compatibility, where list[str] | None would fail at runtime unless from __future__ import annotations is used.
  2. audio_metrics.py - Hardcoded wer_tn / wer_itn field names: The evaluator's reference_fields feature is generic (any field name), but the metrics collector only looks for the hardcoded wer_tn and wer_itn keys. If someone uses reference_fields=['text_written', 'text_spoken'], the metrics would be wer_written and wer_spoken, but AudioMetrics would silently ignore them. Either make the metrics collection dynamic (iterate over all wer_* keys in predictions), or document that only text_tn/text_itn field names are supported.
  3. prepare.py:95 - audio_metadata key leaks into the entry dict: The audio_metadata key is added to the formatted entry and then del'd in the loop (line 180). This is a fragile pattern: if any code path writes the base entry before messages are added, audio_metadata would leak into the JSONL. Consider building messages inline instead of storing intermediate state.
  4. prepare.py:80 - sample_id field: The sample_id is derived from the filename stem, but other audio datasets use a simple integer index for sample_id. This inconsistency could cause issues with deduplication or ordering assumptions downstream.

  Minor Suggestions

  5. prepare.py:55 - SYSTEM_MESSAGE with /no_think: The /no_think suffix in the system message is model-specific behavior. Other audio datasets in the repo use the same pattern, so this is consistent; just noting it.
  6. prepare.py:232-237 - Category validation: Unknown categories print a warning and are silently skipped. Per the project's "don't be overly defensive" / "fail fast" style, this should probably raise an error instead.
  7. strip_helpful_prefixes changes (audio.py:74-89): The rewrite to handle contractions like o'clock is a nice improvement. However, the non-greedy '(.+?)' pattern could still be too eager for some edge cases. The change is also not specific to numb3rs: it affects all audio evaluation. Worth testing against existing benchmarks.
  8. Documentation duplication: The Numb3rs section appears twice in speech-audio.md: once as a brief summary at the top (lines ~35-48) and again as a full section at the bottom (lines ~419+). This is consistent with how other benchmarks are documented in that file, so it's fine.

  Summary

  The main concern is the silent breaking change to hf_leaderboard normalization; it should be addressed before merge. The rest of the implementation follows project conventions well and adds a clean, well-documented benchmark.
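The hardcoded-field-names issue (point 2 above) can be addressed with dynamic aggregation. A minimal sketch, assuming per-sample prediction dicts carry wer_<suffix> keys; the function name and data shapes are illustrative, not the actual AudioMetrics API:

```python
from collections import defaultdict

def aggregate_wer(predictions: list[dict]) -> dict:
    """Average every wer_* key found in per-sample prediction dicts."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pred in predictions:
        for key, value in pred.items():
            if key.startswith("wer"):
                sums[key] += value
                counts[key] += 1
    # Each metric is averaged only over samples that actually report it.
    return {key: sums[key] / counts[key] for key in sums}

preds = [
    {"wer_tn": 0.25, "wer_itn": 0.5},
    {"wer_tn": 0.75, "wer_itn": 0.25},
]
print(aggregate_wer(preds))  # {'wer_tn': 0.5, 'wer_itn': 0.375}
```

With this shape, reference_fields=['text_written', 'text_spoken'] would surface as wer_written / wer_spoken automatically instead of being silently dropped.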

- Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py)
- Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)

### Numb3rs
Collaborator

I think there is something wrong with the headers and placement of docs in this file (not just for numb3rs). We should probably have only one header per benchmark, with all information under subheaders

from datasets import load_dataset
from tqdm import tqdm

SYSTEM_MESSAGE = "You are a helpful assistant. /no_think"
Collaborator

do we need this? Especially the /no_think part which is qwen specific

Collaborator

I'd suggest not setting a system message; users can always add one via the explicit ++system_message parameter

@Jorjeous
Member Author

The hf_leaderboard change is intentional.
@melllinia please check, but I believe there should be normalization

…ulation, refactored normalization process

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

@melllinia plz retry exps

@melllinia
Member

@Jorjeous After the current changes the scores are back to normal. Approved!

@Jorjeous Jorjeous merged commit 6da2219 into main Feb 23, 2026
5 checks passed
@Jorjeous Jorjeous deleted the add_numb3rs_dataset branch February 23, 2026 14:37
Kipok pushed a commit that referenced this pull request Feb 24, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>