
Audiobench and LibriSpeech-PC Benchmarks Evaluation#1060

Closed
melllinia wants to merge 26 commits into main from audiobench_libri-pc

Conversation


@melllinia melllinia commented Dec 1, 2025

The PR is the implementation of AudioBench and LibriSpeech-PC benchmarks evaluation.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added AudioBench benchmark support with automatic and judge-based evaluation variants
    • Added LibriSpeech-PC ASR benchmark support with test splits
    • Introduced speech/language metrics for ASR, pronunciation, and translation evaluation
  • Documentation

    • Enhanced speech-audio evaluation documentation with benchmark details and configuration guidance
    • Added LibriSpeech-PC documentation
  • Chores

    • Expanded dataset registry and updated evaluation configurations
    • Improved SWE-bench execution with dynamic repository installation pipelines


Jorjeous and others added 23 commits December 1, 2025 10:44
Resolve conflict
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Author did not sign commit
This reverts commit ecfafd1.

- Add comprehensive documentation for LibriSpeech-PC benchmark in speech-audio.md
- Fix jiwer import to be lazy (only import when needed for ASR evaluation)

This reverts commit e5f0c2f.

@Jorjeous Jorjeous force-pushed the audiobench_libri-pc branch from 91423e6 to a908156 Compare December 1, 2025 18:44
@karpnv karpnv requested a review from ludwig-n December 3, 2025 01:49

@gwarmstrong gwarmstrong left a comment


A few comments/questions about the general approach. Also it looks like there are a lot of unrelated changes. Can you fix those?

Returns:
True if successful, False otherwise
"""
audiobench_url = "https://github.com/AudioLLMs/AudioBench.git"
Collaborator

Is it possible to clone in the dockerfile instead? Dockerfile.nemo-skills.

Member

Probably possible, but why?
We use it as a source for dataset preparation; what's the point of cloning it in the dockerfile?

Collaborator

It's a more predictable user and dev experience.

From the user side, e.g.

except ImportError as e:
raise ImportError(
f"Failed to import AudioBench Dataset class: {e}\n"
f"AudioBench path: {audiobench_path}\n"
f"Make sure AudioBench repository is properly set up."
)

If it is in the dockerfile, we wouldn't need the user to check the dependencies, and we'd have a better idea whether it's set up properly.

From the dev side, if this repo changes and we need to pin it, or it has a compatibility issue with other dependencies, it's much easier to track down if all the dependencies are in one place (like the dockerfile and requirements.txt). I don't usually expect my scripts to start pulling git repos and failing with import errors if those don't get pulled properly. And reproducing for debug wouldn't be great either, because the dependencies aren't static in the image; I would have to start the container and git pull with the script here, which is a trickier workflow than just having it saved to a known location from the start.

Member

Actually, we do not use deps from this repo; if a more stable solution is needed, I can hardcode the sources.

PS
Moved this PR to a new one [https://github.com//pull/1103], as too many irrelevant changes emerged here.

Leaving this open just to keep the comments and conversations.

Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough

This PR introduces AudioBench and LibriSpeech-PC benchmark support to the nemo-skills evaluation framework. It adds dataset modules with configuration constants, data preparation scripts for downloading/processing audio data, evaluation logic for ASR and translation tasks with automatic metrics and LLM judge scoring, and integrates these into the existing evaluator and metrics registries.

Changes

Cohort / File(s) / Summary

  • AudioBench Dataset Configuration (nemo_skills/dataset/audiobench/__init__.py, nemo_skills/dataset/audiobench/judge/__init__.py, nemo_skills/dataset/audiobench/nonjudge/__init__.py): Adds module initialization with dataset group, metrics type, and pipeline configuration constants for LLM judge-based and automatic metric-based AudioBench evaluation.
  • AudioBench Data Preparation (nemo_skills/dataset/audiobench/prepare.py): Introduces a script to clone the AudioBench repository, extract audio data, build manifest entries with audio paths and messages, and write per-dataset JSON manifests for judge and nonjudge tasks.
  • LibriSpeech-PC Dataset Configuration & Preparation (nemo_skills/dataset/librispeech-pc/__init__.py, nemo_skills/dataset/librispeech-pc/prepare.py): Adds a LibriSpeech-PC module with dataset configuration constants and a preparation script to download manifests/audio splits from OpenSLR and convert them to nemo-skills JSONL format.
  • AudioBench & LibriSpeech-PC Evaluation (nemo_skills/evaluation/evaluator/audiobench.py, nemo_skills/evaluation/metrics/speechlm_metrics.py): Implements evaluation logic with ASR (standard WER, punctuation-corrected WER/PER) and translation metrics (BLEU) via sacrebleu, judge parsing for LLM-based evaluation, and a SpeechLMMetrics class extending BaseMetrics with score tracking and aggregation.
  • Evaluator & Metrics Registry Updates (nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/metrics/map_metrics.py): Registers the eval_audiobench function for "audiobench" and "librispeech-pc" in EVALUATOR_MAP, and registers SpeechLMMetrics for "speechlm" in METRICS_MAP.
  • Judge Prompt Configuration (nemo_skills/prompt/config/judge/audiobench.yaml): Adds a YAML judge prompt template for Yes/No style evaluation of audio model responses with structured reasoning and judgement output format.
  • Documentation Updates (docs/evaluation/code.md, docs/evaluation/speech-audio.md): Updates SWE-bench parameter references, adds a global warning for audio-less evaluation, and documents LibriSpeech-PC and the simplified MMAU-Pro evaluation CLI.
  • Data Pipeline Configuration (nemo_skills/pipeline/prepare_data.py): Extends the DATASETS_REQUIRE_DATA_DIR constant to include "librispeech-pc" and "audiobench" for data directory safety checks.
  • MMAU-Pro Updates (nemo_skills/dataset/mmau-pro/prepare.py): Prefixes all audio paths with "/dataset/mmau-pro/" for absolute path resolution and prepends a system message with /no_think to the message chain.
  • SWE-Bench Refactoring (nemo_skills/inference/eval/swebench.py): Removes the setup_timeout configuration and local execution method; replaces the copy-from-mount strategy with on-demand repository cloning and virtual environment setup using uv/conda inside the container.
  • Miscellaneous Updates (nemo_skills/dataset/icpc25/__init__.py, nemo_skills/inference/generate.py, .gitignore, tests/test_datasets.py, nemo_skills/evaluation/metrics/icpc_metrics.py): Minor docstring update for ICPC25, removes internal reasoning warning tracking, updates .gitignore for AudioBench/, adds test cases for new datasets, and removes the tokens field from ICPC metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50-70 minutes

  • nemo_skills/evaluation/evaluator/audiobench.py: Dense evaluation logic with multiple task-type dispatch paths (ASR-PC, ASR, translation), punctuation error rate calculation using DP-based alignment, and judgment text parsing.
  • nemo_skills/dataset/audiobench/prepare.py: Audio handling, manifest entry construction, repository cloning with error recovery, and per-dataset processing orchestration.
  • nemo_skills/inference/eval/swebench.py: Significant refactoring of container execution flow—replacing copy-based deployment with dynamic repository cloning and environment setup via uv/conda; changes control flow for agent/harness installation.
  • nemo_skills/evaluation/metrics/speechlm_metrics.py: Complex metrics aggregation across multiple per-metric trackers (WER, WER-C, WER-PC, PER, BLEU) with pass-at-k and majority voting logic.
  • Integration complexity: New evaluator and metrics registry entries span multiple files and depend on correct wiring of task types to evaluation functions.

Possibly related PRs

  • Evaluation on human-eval infilling #877: Modifies nemo_skills/evaluation/evaluator/init.py to register new evaluator functions (eval_human_eval_infilling) in EVALUATOR_MAP—same registry pattern as this PR's eval_audiobench registration.
  • SWE-bench: add Slurm tests & other improvements #920: Updates nemo_skills/inference/eval/swebench.py with SweBenchGenerationConfig and container execution flow changes—overlaps with this PR's swebench.py refactoring.
  • Evaluation on OJBench #848: Extends both EVALUATOR_MAP and METRICS_MAP registries with new benchmark support—analogous pattern to this PR's evaluator and metrics registry updates.

Suggested reviewers

  • Kipok

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Audiobench and LibriSpeech-PC Benchmarks Evaluation' directly and clearly summarizes the main changes—implementing evaluation support for two new audio benchmarks.
Docstring Coverage ✅ Passed Docstring coverage is 86.84% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (12)
nemo_skills/inference/eval/swebench.py (3)

402-403: Redundant null check.

This check is already performed in __init__ at line 222. If reached here, agent_framework_repo is already set. Consider removing this duplicate check.


471-472: Redundant null check.

Same as above—this check is already performed in __init__ at line 236.


580-583: Duplicate output_dir initialization.

This logic duplicates what's already done in __init__ (lines 191-194). Since process_single_datapoint is called after initialization, self.output_dir should already be set. If this is intentional (e.g., for async safety), consider adding a comment to clarify.

     async def process_single_datapoint(self, data_point, data):
         """Will do all necessary generations to get a single answer for the data point."""
-        self.output_dir = Path(self.cfg.output_file).parent
-        if self.cfg.inference.random_seed is not None:
-            self.output_dir = self.output_dir / f"rs{self.cfg.inference.random_seed}"
-            self.output_dir.mkdir(exist_ok=True)
+        # Note: self.output_dir is already set in __init__
tests/test_datasets.py (1)

60-61: New audio datasets wired into DATASETS correctly

Adding audiobench and librispeech-pc here will ensure the default init checks run for them, which is good. One minor note: LibriSpeech‑PC’s DEFAULT_SPLIT is test-clean, while this list uses "test"; if these split names ever get used by tests or tooling in this file, consider aligning them to avoid confusion.

nemo_skills/pipeline/prepare_data.py (1)

34-34: Adding audiobench and librispeech-pc to DATASETS_REQUIRE_DATA_DIR is reasonable

Marking these new audio benchmarks as requiring data_dir is consistent with how other large datasets (e.g., mmau-pro) are handled and should prevent accidentally copying large blobs into each job. You may eventually want to honor the existing TODO and derive this list from dataset configs instead of hard‑coding.

nemo_skills/dataset/librispeech-pc/__init__.py (1)

15-26: LibriSpeech-PC dataset config aligns with new speechlm stack

DATASET_GROUP="speechlm" and METRICS_TYPE="speechlm" match the new metrics registration, and routing via ++eval_type=audiobench is consistent with the evaluator map. Using test-clean as DEFAULT_SPLIT also matches the documented LibriSpeech-PC splits; just keep in mind tests list "test" as the only split for this dataset, so if those ever start validating splits, you may want to align the naming.

nemo_skills/dataset/audiobench/prepare.py (3)

231-233: Type hint inconsistency: mixing built-in tuple with typing.List.

The return type uses lowercase tuple (Python 3.9+ built-in generic) but List from typing. For consistency, either use tuple[int, list[dict]] or Tuple[int, List[Dict]].

-) -> tuple[int, List[Dict]]:
+) -> tuple[int, list[dict]]:

250-259: Consider using exception chaining for better debugging.

When re-raising exceptions within an except block, use raise ... from e to preserve the original traceback context.

     try:
         from dataset import Dataset
     except ImportError as e:
         raise ImportError(
             f"Failed to import AudioBench Dataset class: {e}\n"
             f"AudioBench path: {audiobench_path}\n"
             f"Make sure AudioBench repository is properly set up."
-        )
+        ) from e

265-267: Use exception chaining for re-raised exceptions.

     except Exception as e:
-        raise Exception(f"Failed to load dataset {dataset_name}: {e}")
+        raise RuntimeError(f"Failed to load dataset {dataset_name}: {e}") from e
nemo_skills/evaluation/metrics/speechlm_metrics.py (1)

107-107: Rename unused loop variables.

Per Python convention, prefix unused loop variables with underscore.

-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():

And in compute_score:

-        for benchmark_name, benchmark_data in benchmarks.items():
+        for _benchmark_name, benchmark_data in benchmarks.items():

Also applies to: 199-199

nemo_skills/evaluation/evaluator/audiobench.py (2)

240-240: Unused config parameter.

The config parameter is passed but never used in evaluate_sample. Either use it or remove it from the signature if it's reserved for future use.

-def evaluate_sample(sample: dict[str, Any], config: AudioBenchEvaluatorConfig) -> dict[str, Any]:
+def evaluate_sample(sample: dict[str, Any], _config: AudioBenchEvaluatorConfig) -> dict[str, Any]:

Or if intended for future use, add a comment:

+    # config is reserved for future judge evaluation settings
+    _ = config

159-164: Empty string handling with "empty" placeholder.

Using "empty" as a placeholder to avoid jiwer errors is a pragmatic workaround. Consider adding a brief comment explaining why this is done.

-    # Handle empty strings
+    # Handle empty strings - jiwer requires non-empty inputs
     if not ref:
         ref = "empty"
     if not hyp:
         hyp = "empty"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac0b34d and 245743b.

📒 Files selected for processing (21)
  • .gitignore (1 hunks)
  • docs/evaluation/code.md (1 hunks)
  • docs/evaluation/speech-audio.md (4 hunks)
  • nemo_skills/dataset/audiobench/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/judge/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/nonjudge/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/prepare.py (1 hunks)
  • nemo_skills/dataset/icpc25/__init__.py (1 hunks)
  • nemo_skills/dataset/librispeech-pc/__init__.py (1 hunks)
  • nemo_skills/dataset/librispeech-pc/prepare.py (1 hunks)
  • nemo_skills/dataset/mmau-pro/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/audiobench.py (1 hunks)
  • nemo_skills/evaluation/metrics/icpc_metrics.py (0 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/speechlm_metrics.py (1 hunks)
  • nemo_skills/inference/eval/swebench.py (7 hunks)
  • nemo_skills/inference/generate.py (0 hunks)
  • nemo_skills/pipeline/prepare_data.py (1 hunks)
  • nemo_skills/prompt/config/judge/audiobench.yaml (1 hunks)
  • tests/test_datasets.py (1 hunks)
💤 Files with no reviewable changes (2)
  • nemo_skills/inference/generate.py
  • nemo_skills/evaluation/metrics/icpc_metrics.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Update documentation to reflect changes made in the codebase

Applied to files:

  • docs/evaluation/code.md
  • nemo_skills/dataset/icpc25/__init__.py
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Follow the existing code style and conventions in the Nemo-Skills project

Applied to files:

  • nemo_skills/dataset/audiobench/prepare.py
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/audiobench.py (1)
nemo_skills/utils.py (2)
  • get_logger_name (39-43)
  • nested_dataclass (69-102)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/speechlm_metrics.py (1)
  • SpeechLMMetrics (23-163)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/audiobench.py (1)
  • eval_audiobench (200-237)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
  • cfg (181-182)
🪛 LanguageTool
docs/evaluation/speech-audio.md

[style] ~321-~321: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 5206 characters long)
Context: ... ## Running LibriSpeech-PC Evaluation !!! note Currently supports only Megatr...

(EN_EXCESSIVE_EXCLAMATION)

🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/speech-audio.md

280-280: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


288-288: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


313-313: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


320-320: Link text should be descriptive

(MD059, descriptive-link-text)


327-327: Link text should be descriptive

(MD059, descriptive-link-text)


332-332: Link text should be descriptive

(MD059, descriptive-link-text)


337-337: Link text should be descriptive

(MD059, descriptive-link-text)


342-342: Link text should be descriptive

(MD059, descriptive-link-text)


347-347: Link text should be descriptive

(MD059, descriptive-link-text)


352-352: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.14.8)
nemo_skills/evaluation/metrics/speechlm_metrics.py

107-107: Loop control variable agg_mode not used within loop body

Rename unused agg_mode to _agg_mode

(B007)


199-199: Loop control variable benchmark_name not used within loop body

Rename unused benchmark_name to _benchmark_name

(B007)

nemo_skills/evaluation/evaluator/audiobench.py

188-191: Consider moving this statement to an else block

(TRY300)


192-192: Do not catch blind exception: Exception

(BLE001)


240-240: Unused function argument: config

(ARG001)

nemo_skills/dataset/audiobench/prepare.py

210-210: subprocess call: check for execution of untrusted input

(S603)


211-211: Starting a process with a partial executable path

(S607)


217-217: Consider moving this statement to an else block

(TRY300)


231-231: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


255-259: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


255-259: Avoid specifying long messages outside the exception class

(TRY003)


266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


267-267: Create your own exception

(TRY002)


267-267: Avoid specifying long messages outside the exception class

(TRY003)


333-333: Do not catch blind exception: Exception

(BLE001)


351-351: Do not catch blind exception: Exception

(BLE001)


498-498: Do not catch blind exception: Exception

(BLE001)

nemo_skills/dataset/librispeech-pc/prepare.py

39-39: Unused function argument: blocknum

(ARG001)


44-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (23)
.gitignore (1)

48-49: LGTM!

The .gitignore entry for AudioBench/ with a descriptive comment is appropriate to prevent the auto-cloned benchmark repository from being tracked.

nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)

15-31: LGTM!

The configuration constants are well-defined with clear documentation. The trailing spaces in EVAL_ARGS and GENERATION_ARGS appear intentional for argument concatenation.

nemo_skills/dataset/audiobench/judge/__init__.py (1)

31-39: LGTM with a note on hardcoded GPU count.

The judge configuration aligns with the AudioBench official implementation. The server_gpus: 8 default is documented as overridable in run scripts, which is acceptable for flexibility.

nemo_skills/inference/eval/swebench.py (3)

414-424: Verify SWE-agent installation commands.

The installation flow looks correct: installs uv, clones repo, checks out commit, creates venv, and installs dependencies. This duplicates installation per container execution but is necessary since containers start fresh.


513-525: OpenHands installation uses make build which may be heavyweight.

The comment at line 262 in __init__ states "we no longer use 'make build' because it installs lots of unnecessary dependencies," yet this container command still uses make build. Consider aligning the approach or verifying this discrepancy is intentional.


621-631: SWE-bench evaluation harness installation in container.

The installation commands for the eval harness look correct and follow the same pattern as agent installation.

nemo_skills/dataset/icpc25/__init__.py (1)

15-17: LGTM!

Documentation update correctly references "ICPC25" to match the module name.

nemo_skills/evaluation/metrics/map_metrics.py (1)

40-41: SpeechLM metrics registration looks consistent

Importing SpeechLMMetrics and adding the "speechlm" entry to METRICS_MAP aligns with the new METRICS_TYPE="speechlm" datasets and keeps the registry pattern consistent. No issues from this file.

Also applies to: 70-70

nemo_skills/prompt/config/judge/audiobench.yaml (1)

1-29: AudioBench judge prompt is consistent with yes/no scoring

The prompt cleanly defines the reference/model/question sections and asks for a Judgement: Yes or No, which matches the expected yes/no parsing in SpeechLMMetrics. This should work well for LLM‑judge evaluation.
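The yes/no parsing this prompt relies on can be sketched roughly as below. This is a minimal illustration, not the PR's actual parser; the function name is hypothetical:

```python
import re

def parse_judgement(text):
    # Look for a line like "Judgement: Yes" / "Judgment: No" anywhere in the
    # judge output (both spellings accepted, case-insensitive).
    m = re.search(r"judge?ment\s*:\s*(yes|no)\b", text, re.IGNORECASE)
    if m is None:
        return None  # no parseable verdict; callers can count this as no_answer
    return m.group(1).lower() == "yes"
```

Returning `None` for malformed outputs keeps the "no answer" case distinct from an explicit "No" verdict.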

nemo_skills/evaluation/evaluator/__init__.py (1)

18-18: Audiobench / LibriSpeech-PC evaluator wiring looks correct

Importing eval_audiobench and registering both "audiobench" and "librispeech-pc" in EVALUATOR_MAP is consistent with the new datasets and avoids any clash with class‑based evaluators. This should make both eval types available via the unified evaluate entrypoint.

Also applies to: 44-62
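The registry wiring described above can be sketched as follows. Names and the return value are illustrative, not the repo's exact code:

```python
def eval_audiobench(cfg):
    # In the real code this dispatches to ASR / translation / judge scoring
    # based on the task type; here it just marks that evaluation ran.
    return "evaluated"

# Both eval types reuse the same function, as noted in the review comment.
EVALUATOR_MAP = {
    "audiobench": eval_audiobench,
    "librispeech-pc": eval_audiobench,
}

def evaluate(eval_type, cfg):
    # Unified entrypoint: look up the evaluator by eval_type and call it.
    return EVALUATOR_MAP[eval_type](cfg)
```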

nemo_skills/dataset/audiobench/__init__.py (1)

15-36: LGTM!

The module metadata and documentation are well-structured. The docstring clearly explains the AudioBench benchmark tasks and evaluation categories, and the constants are appropriately defined for integration with the nemo-skills framework.

docs/evaluation/speech-audio.md (2)

5-8: LGTM!

The consolidated warning block at the top is well-positioned and clearly explains the --no-audio and --skip_data_dir_check flags.


273-400: Comprehensive LibriSpeech-PC documentation.

The documentation section is thorough, covering dataset location, available splits, data preparation, evaluation examples, metrics explanation, and output format. This provides users with everything needed to use the benchmark.

nemo_skills/dataset/audiobench/prepare.py (2)

369-412: CLI argument structure is well-designed.

The argument parser provides flexible options for processing specific datasets, categories, or all datasets with appropriate defaults and help text.


480-501: Robust error handling for dataset processing loop.

The try/except pattern here appropriately catches failures for individual datasets while allowing the script to continue processing remaining datasets, with proper tracking and summary reporting.

nemo_skills/dataset/librispeech-pc/prepare.py (3)

35-44: LGTM!

The download progress implementation is clean. The blocknum parameter is required by urllib.request.urlretrieve's reporthook signature even if not used directly.
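A reporthook of that shape can be sketched as below (illustrative, not the PR's exact code; the helper names are hypothetical). `urlretrieve` calls the hook as `reporthook(blocknum, blocksize, totalsize)`, which is why all three parameters must appear in the signature:

```python
import urllib.request  # urlretrieve(url, dest, reporthook=...) drives the callback

def progress_line(blocknum, blocksize, totalsize):
    # blocknum * blocksize approximates the bytes downloaded so far.
    downloaded = blocknum * blocksize
    if totalsize <= 0:
        return f"Downloading: {downloaded} bytes"
    pct = min(100.0, downloaded * 100.0 / totalsize)
    return f"Downloading: {pct:5.1f}%"

def progress_hook(blocknum, blocksize, totalsize):
    # Overwrite the same terminal line on each callback.
    print("\r" + progress_line(blocknum, blocksize, totalsize), end="", flush=True)

# usage: urllib.request.urlretrieve(url, dest_path, reporthook=progress_hook)
```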


89-144: Well-structured data processing with proper validation.

The function handles missing entries gracefully, creates proper nemo-skills format with messages structure, and cleans up intermediate manifest files after successful processing.


75-86: No action required. The audio paths are handled consistently: the underscore conversion in download_audio (line 77) matches the directory structure created by tar extraction, and the manifest's audio_filepath values already reference these correct paths from the OpenSLR source.

nemo_skills/evaluation/metrics/speechlm_metrics.py (3)

107-109: Verify the intent of dividing no_answer by 2.0.

This halving of no_answer appears unusual and may be a bug. If this is intentional (e.g., compensating for double-counting), please add a clarifying comment explaining the rationale.


117-128: WER/BLEU metrics computation looks correct.

The averaging and percentage conversion are properly implemented, and the metrics are conditionally added only when scores exist.


166-223: Aggregation function handles weighted averaging correctly.

The compute_score function properly weights metrics by num_entries when combining sub-benchmarks, which is the correct approach for aggregate statistics.
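The weighting scheme described here can be sketched as follows. This is a simplified illustration of num_entries-weighted averaging, not the PR's `compute_score` itself:

```python
def aggregate(benchmarks):
    # Weight each sub-benchmark's average metric by its num_entries so that
    # larger subsets contribute proportionally to the combined score.
    agg = {}
    names = {k for b in benchmarks.values() for k in b if k != "num_entries"}
    for name in names:
        subset = [b for b in benchmarks.values() if name in b]
        total = sum(b["num_entries"] for b in subset)
        agg[name] = sum(b[name] * b["num_entries"] for b in subset) / total
    agg["num_entries"] = sum(b["num_entries"] for b in benchmarks.values())
    return agg
```

A naive unweighted mean would let a 10-sample subset pull the score as hard as a 1000-sample one, which is exactly what the weighting avoids.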

nemo_skills/evaluation/evaluator/audiobench.py (2)

56-96: PER calculation using edit distance is correctly implemented.

The dynamic programming approach for computing Punctuation Error Rate follows the arXiv:2310.02943 formula correctly. The handling of empty punctuation sequences (returning 0.0) is appropriate.
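The DP-based alignment praised here can be sketched as below. This is a minimal illustration of one common PER formulation (edit distance over punctuation tokens, normalized by the reference punctuation count), not the PR's implementation or the exact arXiv:2310.02943 formula:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming (two-row variant).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def punctuation_error_rate(ref_text, hyp_text, puncts=".,?!"):
    # Align only the punctuation marks of the two texts, then normalize
    # the edit distance by the reference punctuation count.
    ref_p = [c for c in ref_text if c in puncts]
    hyp_p = [c for c in hyp_text if c in puncts]
    if not ref_p and not hyp_p:
        return 0.0  # nothing to punctuate: treat as a perfect match
    return edit_distance(ref_p, hyp_p) / max(len(ref_p), 1)
```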


99-131: Comprehensive ASR-PC evaluation with multiple WER variants.

The implementation correctly computes WER (standard), WER_C (with capitalization), WER_PC (full punctuation + capitalization), and PER as separate metrics, providing granular insight into model performance.
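The relationship between the variants can be sketched via normalization choices, roughly as below. The PR computes WER with jiwer; this self-contained sketch substitutes a plain word-level edit distance so the idea is visible without the dependency:

```python
import string

def _edit_distance(a, b):
    # Word-level Levenshtein distance (two-row DP).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(ref, hyp, keep_case=False, keep_punct=False):
    # WER: lowercase, punctuation stripped; WER_C: keep_case=True;
    # WER_PC: keep_case=True and keep_punct=True.
    def norm(text):
        if not keep_punct:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if not keep_case:
            text = text.lower()
        return text.split()
    r, h = norm(ref), norm(hyp)
    return _edit_distance(r, h) / max(len(r), 1)
```

Scoring the same pair under all three normalizations shows where a model loses points: raw transcription, capitalization, or punctuation.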

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.

- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
Contributor

⚠️ Potential issue | 🟡 Minor

Fix parameter name in ++eval_harness_commit description

++eval_harness_commit currently says “after cloning agent_harness_repo”, but there is no such parameter; the actual flag is ++eval_harness_repo. This is likely a copy‑paste typo and can confuse users; suggest changing it to:

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

The updated ++max_retries description (covering inference and evaluation retries) looks aligned with the new SWE‑bench flow.

Based on learnings, keeping docs tightly aligned with flags helps avoid user confusion.

Also applies to: 89-89

🤖 Prompt for AI Agents
In docs/evaluation/code.md around line 85 (also apply same change at line 89),
the description for ++eval_harness_commit incorrectly references cloning
"agent_harness_repo" — change that reference to "eval_harness_repo" so the text
reads that the commit/branch/tag is checked out after cloning eval_harness_repo;
keep the rest of the sentence (Defaults to HEAD) unchanged and ensure both
occurrences (lines 85 and 89) are updated to avoid the copy‑paste confusion.

Comment on lines 85 to +96
     if entry.get("audio_path"):
         audio_path = entry["audio_path"]
-
-        if isinstance(audio_path, list) and audio_path:
-            user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
-        elif isinstance(audio_path, str):
-            user_message["audio"] = {"path": audio_path, "duration": 10.0}
-
-    formatted_entry["messages"] = [user_message]
+        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+        if len(audio_path) == 1:
+            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+        else:
+            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
+
+    formatted_entry["messages"] = [
+        {"role": "system", "content": "You are a helpful assistant. /no_think"},
+        user_message,
+    ]

⚠️ Potential issue | 🟠 Major

Honor the with_audio flag when attaching audio paths

Right now format_entry ignores the with_audio argument: even when --no-audio is used (so with_audio=False), you still add audio/audios fields pointing to /dataset/mmau-pro/.... Since audio download is skipped in that mode, these paths are likely invalid and can break downstream evaluation or inference that assumes existing audio when the field is present.

You can keep the new absolute-path behavior while respecting the flag by gating this block:

-    if entry.get("audio_path"):
-        audio_path = entry["audio_path"]
-        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
-        if len(audio_path) == 1:
-            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
-        else:
-            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
+    if with_audio and entry.get("audio_path"):
+        audio_path = entry["audio_path"]
+        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+        if len(audio_path) == 1:
+            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+        else:
+            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]

This restores the intent of --no-audio while preserving the cluster‑friendly absolute paths when audio is enabled.

🤖 Prompt for AI Agents
In nemo_skills/dataset/mmau-pro/prepare.py around lines 85 to 96, the code
always attaches audio paths to user_message even when with_audio is False;
update the block so it only adds the "audio" or "audios" fields when with_audio
is True. Specifically, wrap the existing if entry.get("audio_path") ... logic
inside a conditional checking if with_audio is truthy, preserving the current
absolute-path construction and single-vs-multiple handling, and leaving
formatted_entry["messages"] unchanged outside that gated block.
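
The gated behavior described above can be sketched as a standalone function. This is a minimal, hypothetical version: the real `format_entry` in `nemo_skills/dataset/mmau-pro/prepare.py` builds more fields, and the `/dataset/mmau-pro` mount point and the list-valued `audio_path` are assumptions taken from the review diff.

```python
# Hypothetical sketch of format_entry honoring the with_audio flag.
# Assumes audio_path is a list of relative paths, per the review diff.
DATASET_ROOT = "/dataset/mmau-pro"  # assumed cluster mount point

def format_entry(entry, with_audio=True):
    user_message = {"role": "user", "content": entry.get("question", "")}
    audio_path = entry.get("audio_path")
    # Only attach audio fields when audio was actually downloaded.
    if with_audio and audio_path:
        if len(audio_path) == 1:
            user_message["audio"] = {"path": f"{DATASET_ROOT}/{audio_path[0]}"}
        else:
            user_message["audios"] = [{"path": f"{DATASET_ROOT}/{p}"} for p in audio_path]
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. /no_think"},
            user_message,
        ]
    }
```

With `with_audio=False`, the returned message carries no `audio`/`audios` keys, so downstream code cannot trip over paths that were never downloaded.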

Comment on lines +224 to +230
    samples_already_evaluated = sum(1 for sample in data if "is_correct" in sample)

    if samples_already_evaluated > 0:
        LOG.info(f"Resuming evaluation: {samples_already_evaluated}/{len(data)} samples already evaluated")

    for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
        data[idx] = evaluate_sample(sample, eval_config)

⚠️ Potential issue | 🟡 Minor

Resume logic logs progress but doesn't skip already-evaluated samples.

The code counts already-evaluated samples and logs resume info, but the loop still re-evaluates all samples. If resumption is intended, samples with is_correct already set should be skipped.

     for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
+        if "is_correct" in sample:
+            continue  # Skip already evaluated samples
         data[idx] = evaluate_sample(sample, eval_config)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/audiobench.py around lines 224-230, the
resume log notes already-evaluated samples but the loop still re-processes them;
change the loop to skip any sample that already contains "is_correct" (or other
evaluation markers) so previously evaluated entries are left unchanged, e.g.,
check if "is_correct" in sample at start of the loop and continue if true,
keeping the same tqdm description and index handling so data[idx] is only
overwritten for newly evaluated samples.
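
The fix suggested above can be sketched as a self-contained loop. This is an illustrative stand-in, not the actual evaluator: `evaluate_sample` is passed in as a callable, and the `tqdm`/`LOG` calls from the real `audiobench.py` are replaced with a plain loop and `print` to keep the sketch dependency-free.

```python
# Minimal sketch of resume-aware evaluation: samples that already carry an
# "is_correct" marker are counted for the progress message and then skipped,
# so only unevaluated entries are overwritten.
def evaluate_all(data, evaluate_sample):
    already = sum(1 for s in data if "is_correct" in s)
    if already:
        print(f"Resuming evaluation: {already}/{len(data)} samples already evaluated")
    for idx, sample in enumerate(data):
        if "is_correct" in sample:
            continue  # leave previously evaluated samples untouched
        data[idx] = evaluate_sample(sample)
    return data
```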

@Jorjeous (Member) commented Dec 16, 2025

A new version of this PR exists: #1103

@Jorjeous Jorjeous closed this Dec 16, 2025