
Audiobench and LibriSpeech-PC Benchmarks Evaluation#1060

Closed
melllinia wants to merge 26 commits into main from audiobench_libri-pc

Conversation


@melllinia melllinia commented Dec 1, 2025

The PR is the implementation of AudioBench and LibriSpeech-PC benchmarks evaluation.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added AudioBench benchmark support with automatic and judge-based evaluation variants
    • Added LibriSpeech-PC ASR benchmark support with test splits
    • Introduced speech/language metrics for ASR, pronunciation, and translation evaluation
  • Documentation

    • Enhanced speech-audio evaluation documentation with benchmark details and configuration guidance
    • Added LibriSpeech-PC documentation
  • Chores

    • Expanded dataset registry and updated evaluation configurations
    • Improved SWE-bench execution with dynamic repository installation pipelines


Jorjeous and others added 23 commits December 1, 2025 10:44
Resolve conflict
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Author did not sign commit
This reverts commit ecfafd1.

- Add comprehensive documentation for LibriSpeech-PC benchmark in speech-audio.md
- Fix jiwer import to be lazy (only import when needed for ASR evaluation)

This reverts commit e5f0c2f.

@Jorjeous Jorjeous force-pushed the audiobench_libri-pc branch from 91423e6 to a908156 Compare December 1, 2025 18:44
@karpnv karpnv requested a review from ludwig-n December 3, 2025 01:49

@gwarmstrong gwarmstrong left a comment


A few comments/questions about the general approach. Also it looks like there are a lot of unrelated changes. Can you fix those?

Returns:
True if successful, False otherwise
"""
audiobench_url = "https://github.com/AudioLLMs/AudioBench.git"
Collaborator

Is it possible to clone in the dockerfile instead? Dockerfile.nemo-skills.

Member

Probably possible, but why?
We use it as a source for dataset preparation; what's the point of cloning it in the dockerfile?

Collaborator

It's a more predictable user and dev experience.

From the user side, e.g.

except ImportError as e:
raise ImportError(
f"Failed to import AudioBench Dataset class: {e}\n"
f"AudioBench path: {audiobench_path}\n"
f"Make sure AudioBench repository is properly set up."
)

If it is in the dockerfile, we wouldn't need the user to check the dependencies, and we'd have a better idea whether it's set up properly.

From the dev side, if this repo changes and we need to pin it, or it has a compatibility issue with other dependencies, it's much easier to track down if all the dependencies are in one place (like the dockerfile and requirements.txt). I don't usually expect my scripts to start pulling git repos and failing with import errors if those don't get pulled properly. And reproducing for debug wouldn't be great either, because the dependencies aren't static in the image; I would have to start the container and git pull with the script here, which is a trickier workflow than just having it saved to a known location from the start.

Member

Actually, we do not use deps from this repo; if a more stable solution is needed, I can hardcode the sources.

PS
Moved this PR to a new one [https://github.com//pull/1103], as too many irrelevant changes emerged here.

Leaving this open just to keep the comments and conversations.

Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough

This PR introduces AudioBench and LibriSpeech-PC benchmark support to the nemo-skills evaluation framework. It adds dataset modules with configuration constants, data preparation scripts for downloading/processing audio data, evaluation logic for ASR and translation tasks with automatic metrics and LLM judge scoring, and integrates these into the existing evaluator and metrics registries.

Changes

Cohort / File(s) / Summary

  • AudioBench Dataset Configuration (nemo_skills/dataset/audiobench/__init__.py, nemo_skills/dataset/audiobench/judge/__init__.py, nemo_skills/dataset/audiobench/nonjudge/__init__.py): Adds module initialization with dataset group, metrics type, and pipeline configuration constants for LLM judge-based and automatic metric-based AudioBench evaluation.
  • AudioBench Data Preparation (nemo_skills/dataset/audiobench/prepare.py): Introduces a script to clone the AudioBench repository, extract audio data, build manifest entries with audio paths and messages, and write per-dataset JSON manifests for judge and nonjudge tasks.
  • LibriSpeech-PC Dataset Configuration & Preparation (nemo_skills/dataset/librispeech-pc/__init__.py, nemo_skills/dataset/librispeech-pc/prepare.py): Adds a LibriSpeech-PC module with dataset configuration constants and a preparation script to download manifests/audio splits from OpenSLR and convert them to nemo-skills JSONL format.
  • AudioBench & LibriSpeech-PC Evaluation (nemo_skills/evaluation/evaluator/audiobench.py, nemo_skills/evaluation/metrics/speechlm_metrics.py): Implements evaluation logic with ASR (standard WER, punctuation-corrected WER/PER) and translation metrics (BLEU) via sacrebleu, judge parsing for LLM-based evaluation, and a SpeechLMMetrics class extending BaseMetrics with score tracking and aggregation.
  • Evaluator & Metrics Registry Updates (nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/metrics/map_metrics.py): Registers the eval_audiobench function for "audiobench" and "librispeech-pc" in EVALUATOR_MAP, and registers SpeechLMMetrics for "speechlm" in METRICS_MAP.
  • Judge Prompt Configuration (nemo_skills/prompt/config/judge/audiobench.yaml): Adds a YAML judge prompt template for Yes/No style evaluation of audio model responses with structured reasoning and judgement output format.
  • Documentation Updates (docs/evaluation/code.md, docs/evaluation/speech-audio.md): Updates SWE-bench parameter references, adds a global warning for audio-less evaluation, and documents LibriSpeech-PC and the simplified MMAU-Pro evaluation CLI.
  • Data Pipeline Configuration (nemo_skills/pipeline/prepare_data.py): Extends the DATASETS_REQUIRE_DATA_DIR constant to include "librispeech-pc" and "audiobench" for data directory safety checks.
  • MMAU-Pro Updates (nemo_skills/dataset/mmau-pro/prepare.py): Prefixes all audio paths with "/dataset/mmau-pro/" for absolute path resolution and prepends a system message with /no_think to the message chain.
  • SWE-Bench Refactoring (nemo_skills/inference/eval/swebench.py): Removes the setup_timeout configuration and local execution method; replaces the copy-from-mount strategy with on-demand repository cloning and virtual environment setup using uv/conda inside the container.
  • Miscellaneous Updates (nemo_skills/dataset/icpc25/__init__.py, nemo_skills/inference/generate.py, .gitignore, tests/test_datasets.py, nemo_skills/evaluation/metrics/icpc_metrics.py): Minor docstring update for ICPC25, removes internal reasoning warning tracking, updates .gitignore for AudioBench/, adds test cases for new datasets, and removes the tokens field from ICPC metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50-70 minutes

  • nemo_skills/evaluation/evaluator/audiobench.py: Dense evaluation logic with multiple task-type dispatch paths (ASR-PC, ASR, translation), punctuation error rate calculation using DP-based alignment, and judgment text parsing.
  • nemo_skills/dataset/audiobench/prepare.py: Audio handling, manifest entry construction, repository cloning with error recovery, and per-dataset processing orchestration.
  • nemo_skills/inference/eval/swebench.py: Significant refactoring of container execution flow—replacing copy-based deployment with dynamic repository cloning and environment setup via uv/conda; changes control flow for agent/harness installation.
  • nemo_skills/evaluation/metrics/speechlm_metrics.py: Complex metrics aggregation across multiple per-metric trackers (WER, WER-C, WER-PC, PER, BLEU) with pass-at-k and majority voting logic.
  • Integration complexity: New evaluator and metrics registry entries span multiple files and depend on correct wiring of task types to evaluation functions.

Possibly related PRs

  • Evaluation on human-eval infilling #877: Modifies nemo_skills/evaluation/evaluator/init.py to register new evaluator functions (eval_human_eval_infilling) in EVALUATOR_MAP—same registry pattern as this PR's eval_audiobench registration.
  • SWE-bench: add Slurm tests & other improvements #920: Updates nemo_skills/inference/eval/swebench.py with SweBenchGenerationConfig and container execution flow changes—overlaps with this PR's swebench.py refactoring.
  • Evaluation on OJBench #848: Extends both EVALUATOR_MAP and METRICS_MAP registries with new benchmark support—analogous pattern to this PR's evaluator and metrics registry updates.

Suggested reviewers

  • Kipok

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Audiobench and LibriSpeech-PC Benchmarks Evaluation' directly and clearly summarizes the main changes—implementing evaluation support for two new audio benchmarks.
Docstring Coverage ✅ Passed Docstring coverage is 86.84% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (12)
nemo_skills/inference/eval/swebench.py (3)

402-403: Redundant null check.

This check is already performed in __init__ at line 222. If reached here, agent_framework_repo is already set. Consider removing this duplicate check.


471-472: Redundant null check.

Same as above—this check is already performed in __init__ at line 236.


580-583: Duplicate output_dir initialization.

This logic duplicates what's already done in __init__ (lines 191-194). Since process_single_datapoint is called after initialization, self.output_dir should already be set. If this is intentional (e.g., for async safety), consider adding a comment to clarify.

     async def process_single_datapoint(self, data_point, data):
         """Will do all necessary generations to get a single answer for the data point."""
-        self.output_dir = Path(self.cfg.output_file).parent
-        if self.cfg.inference.random_seed is not None:
-            self.output_dir = self.output_dir / f"rs{self.cfg.inference.random_seed}"
-            self.output_dir.mkdir(exist_ok=True)
+        # Note: self.output_dir is already set in __init__
tests/test_datasets.py (1)

60-61: New audio datasets wired into DATASETS correctly

Adding audiobench and librispeech-pc here will ensure the default init checks run for them, which is good. One minor note: LibriSpeech‑PC’s DEFAULT_SPLIT is test-clean, while this list uses "test"; if these split names ever get used by tests or tooling in this file, consider aligning them to avoid confusion.

nemo_skills/pipeline/prepare_data.py (1)

34-34: Adding audiobench and librispeech-pc to DATASETS_REQUIRE_DATA_DIR is reasonable

Marking these new audio benchmarks as requiring data_dir is consistent with how other large datasets (e.g., mmau-pro) are handled and should prevent accidentally copying large blobs into each job. You may eventually want to honor the existing TODO and derive this list from dataset configs instead of hard‑coding.

nemo_skills/dataset/librispeech-pc/__init__.py (1)

15-26: LibriSpeech-PC dataset config aligns with new speechlm stack

DATASET_GROUP="speechlm" and METRICS_TYPE="speechlm" match the new metrics registration, and routing via ++eval_type=audiobench is consistent with the evaluator map. Using test-clean as DEFAULT_SPLIT also matches the documented LibriSpeech-PC splits; just keep in mind tests list "test" as the only split for this dataset, so if those ever start validating splits, you may want to align the naming.

nemo_skills/dataset/audiobench/prepare.py (3)

231-233: Type hint inconsistency: mixing built-in tuple with typing.List.

The return type uses lowercase tuple (Python 3.9+ built-in generic) but List from typing. For consistency, either use tuple[int, list[dict]] or Tuple[int, List[Dict]].

-) -> tuple[int, List[Dict]]:
+) -> tuple[int, list[dict]]:

250-259: Consider using exception chaining for better debugging.

When re-raising exceptions within an except block, use raise ... from e to preserve the original traceback context.

     try:
         from dataset import Dataset
     except ImportError as e:
         raise ImportError(
             f"Failed to import AudioBench Dataset class: {e}\n"
             f"AudioBench path: {audiobench_path}\n"
             f"Make sure AudioBench repository is properly set up."
-        )
+        ) from e

265-267: Use exception chaining for re-raised exceptions.

     except Exception as e:
-        raise Exception(f"Failed to load dataset {dataset_name}: {e}")
+        raise RuntimeError(f"Failed to load dataset {dataset_name}: {e}") from e
nemo_skills/evaluation/metrics/speechlm_metrics.py (1)

107-107: Rename unused loop variables.

Per Python convention, prefix unused loop variables with underscore.

-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():

And in compute_score:

-        for benchmark_name, benchmark_data in benchmarks.items():
+        for _benchmark_name, benchmark_data in benchmarks.items():

Also applies to: 199-199

nemo_skills/evaluation/evaluator/audiobench.py (2)

240-240: Unused config parameter.

The config parameter is passed but never used in evaluate_sample. Either use it or remove it from the signature if it's reserved for future use.

-def evaluate_sample(sample: dict[str, Any], config: AudioBenchEvaluatorConfig) -> dict[str, Any]:
+def evaluate_sample(sample: dict[str, Any], _config: AudioBenchEvaluatorConfig) -> dict[str, Any]:

Or if intended for future use, add a comment:

+    # config is reserved for future judge evaluation settings
+    _ = config

159-164: Empty string handling with "empty" placeholder.

Using "empty" as a placeholder to avoid jiwer errors is a pragmatic workaround. Consider adding a brief comment explaining why this is done.

-    # Handle empty strings
+    # Handle empty strings - jiwer requires non-empty inputs
     if not ref:
         ref = "empty"
     if not hyp:
         hyp = "empty"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac0b34d and 245743b.

📒 Files selected for processing (21)
  • .gitignore (1 hunks)
  • docs/evaluation/code.md (1 hunks)
  • docs/evaluation/speech-audio.md (4 hunks)
  • nemo_skills/dataset/audiobench/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/judge/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/nonjudge/__init__.py (1 hunks)
  • nemo_skills/dataset/audiobench/prepare.py (1 hunks)
  • nemo_skills/dataset/icpc25/__init__.py (1 hunks)
  • nemo_skills/dataset/librispeech-pc/__init__.py (1 hunks)
  • nemo_skills/dataset/librispeech-pc/prepare.py (1 hunks)
  • nemo_skills/dataset/mmau-pro/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/audiobench.py (1 hunks)
  • nemo_skills/evaluation/metrics/icpc_metrics.py (0 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/speechlm_metrics.py (1 hunks)
  • nemo_skills/inference/eval/swebench.py (7 hunks)
  • nemo_skills/inference/generate.py (0 hunks)
  • nemo_skills/pipeline/prepare_data.py (1 hunks)
  • nemo_skills/prompt/config/judge/audiobench.yaml (1 hunks)
  • tests/test_datasets.py (1 hunks)
💤 Files with no reviewable changes (2)
  • nemo_skills/inference/generate.py
  • nemo_skills/evaluation/metrics/icpc_metrics.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Update documentation to reflect changes made in the codebase

Applied to files:

  • docs/evaluation/code.md
  • nemo_skills/dataset/icpc25/__init__.py
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Follow the existing code style and conventions in the Nemo-Skills project

Applied to files:

  • nemo_skills/dataset/audiobench/prepare.py
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/audiobench.py (1)
nemo_skills/utils.py (2)
  • get_logger_name (39-43)
  • nested_dataclass (69-102)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/speechlm_metrics.py (1)
  • SpeechLMMetrics (23-163)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/audiobench.py (1)
  • eval_audiobench (200-237)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
  • cfg (181-182)
🪛 LanguageTool
docs/evaluation/speech-audio.md

[style] ~321-~321: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 5206 characters long)
Context: ... ## Running LibriSpeech-PC Evaluation !!! note Currently supports only Megatr...

(EN_EXCESSIVE_EXCLAMATION)

🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/speech-audio.md

280-280: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


288-288: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


313-313: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


320-320: Link text should be descriptive

(MD059, descriptive-link-text)


327-327: Link text should be descriptive

(MD059, descriptive-link-text)


332-332: Link text should be descriptive

(MD059, descriptive-link-text)


337-337: Link text should be descriptive

(MD059, descriptive-link-text)


342-342: Link text should be descriptive

(MD059, descriptive-link-text)


347-347: Link text should be descriptive

(MD059, descriptive-link-text)


352-352: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.14.8)
nemo_skills/evaluation/metrics/speechlm_metrics.py

107-107: Loop control variable agg_mode not used within loop body

Rename unused agg_mode to _agg_mode

(B007)


199-199: Loop control variable benchmark_name not used within loop body

Rename unused benchmark_name to _benchmark_name

(B007)

nemo_skills/evaluation/evaluator/audiobench.py

188-191: Consider moving this statement to an else block

(TRY300)


192-192: Do not catch blind exception: Exception

(BLE001)


240-240: Unused function argument: config

(ARG001)

nemo_skills/dataset/audiobench/prepare.py

210-210: subprocess call: check for execution of untrusted input

(S603)


211-211: Starting a process with a partial executable path

(S607)


217-217: Consider moving this statement to an else block

(TRY300)


231-231: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


255-259: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


255-259: Avoid specifying long messages outside the exception class

(TRY003)


266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


267-267: Create your own exception

(TRY002)


267-267: Avoid specifying long messages outside the exception class

(TRY003)


333-333: Do not catch blind exception: Exception

(BLE001)


351-351: Do not catch blind exception: Exception

(BLE001)


498-498: Do not catch blind exception: Exception

(BLE001)

nemo_skills/dataset/librispeech-pc/prepare.py

39-39: Unused function argument: blocknum

(ARG001)


44-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (23)
.gitignore (1)

48-49: LGTM!

The .gitignore entry for AudioBench/ with a descriptive comment is appropriate to prevent the auto-cloned benchmark repository from being tracked.

nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)

15-31: LGTM!

The configuration constants are well-defined with clear documentation. The trailing spaces in EVAL_ARGS and GENERATION_ARGS appear intentional for argument concatenation.

nemo_skills/dataset/audiobench/judge/__init__.py (1)

31-39: LGTM with a note on hardcoded GPU count.

The judge configuration aligns with the AudioBench official implementation. The server_gpus: 8 default is documented as overridable in run scripts, which is acceptable for flexibility.

nemo_skills/inference/eval/swebench.py (3)

414-424: Verify SWE-agent installation commands.

The installation flow looks correct: installs uv, clones repo, checks out commit, creates venv, and installs dependencies. This duplicates installation per container execution but is necessary since containers start fresh.


513-525: OpenHands installation uses make build which may be heavyweight.

The comment at line 262 in __init__ states "we no longer use 'make build' because it installs lots of unnecessary dependencies," yet this container command still uses make build. Consider aligning the approach or verifying this discrepancy is intentional.


621-631: SWE-bench evaluation harness installation in container.

The installation commands for the eval harness look correct and follow the same pattern as agent installation.

nemo_skills/dataset/icpc25/__init__.py (1)

15-17: LGTM!

Documentation update correctly references "ICPC25" to match the module name.

nemo_skills/evaluation/metrics/map_metrics.py (1)

40-41: SpeechLM metrics registration looks consistent

Importing SpeechLMMetrics and adding the "speechlm" entry to METRICS_MAP aligns with the new METRICS_TYPE="speechlm" datasets and keeps the registry pattern consistent. No issues from this file.

Also applies to: 70-70

nemo_skills/prompt/config/judge/audiobench.yaml (1)

1-29: AudioBench judge prompt is consistent with yes/no scoring

The prompt cleanly defines the reference/model/question sections and asks for a Judgement: Yes or No, which matches the expected yes/no parsing in SpeechLMMetrics. This should work well for LLM‑judge evaluation.
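The yes/no parsing this prompt relies on can be sketched roughly as below. This is a minimal illustration, not the PR's actual parser; the function name is hypothetical:

```python
import re

def parse_judgement(text):
    # Look for a line like "Judgement: Yes" / "Judgment: No" anywhere in the
    # judge output (both spellings accepted, case-insensitive).
    m = re.search(r"judge?ment\s*:\s*(yes|no)\b", text, re.IGNORECASE)
    if m is None:
        return None  # no parseable verdict; callers can count this as no_answer
    return m.group(1).lower() == "yes"
```

Returning `None` for malformed outputs keeps the "no answer" case distinct from an explicit "No" verdict.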

nemo_skills/evaluation/evaluator/__init__.py (1)

18-18: Audiobench / LibriSpeech-PC evaluator wiring looks correct

Importing eval_audiobench and registering both "audiobench" and "librispeech-pc" in EVALUATOR_MAP is consistent with the new datasets and avoids any clash with class‑based evaluators. This should make both eval types available via the unified evaluate entrypoint.

Also applies to: 44-62
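The registry wiring described above can be sketched as follows. Names and the return value are illustrative, not the repo's exact code:

```python
def eval_audiobench(cfg):
    # In the real code this dispatches to ASR / translation / judge scoring
    # based on the task type; here it just marks that evaluation ran.
    return "evaluated"

# Both eval types reuse the same function, as noted in the review comment.
EVALUATOR_MAP = {
    "audiobench": eval_audiobench,
    "librispeech-pc": eval_audiobench,
}

def evaluate(eval_type, cfg):
    # Unified entrypoint: look up the evaluator by eval_type and call it.
    return EVALUATOR_MAP[eval_type](cfg)
```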

nemo_skills/dataset/audiobench/__init__.py (1)

15-36: LGTM!

The module metadata and documentation are well-structured. The docstring clearly explains the AudioBench benchmark tasks and evaluation categories, and the constants are appropriately defined for integration with the nemo-skills framework.

docs/evaluation/speech-audio.md (2)

5-8: LGTM!

The consolidated warning block at the top is well-positioned and clearly explains the --no-audio and --skip_data_dir_check flags.


273-400: Comprehensive LibriSpeech-PC documentation.

The documentation section is thorough, covering dataset location, available splits, data preparation, evaluation examples, metrics explanation, and output format. This provides users with everything needed to use the benchmark.

nemo_skills/dataset/audiobench/prepare.py (2)

369-412: CLI argument structure is well-designed.

The argument parser provides flexible options for processing specific datasets, categories, or all datasets with appropriate defaults and help text.


480-501: Robust error handling for dataset processing loop.

The try/except pattern here appropriately catches failures for individual datasets while allowing the script to continue processing remaining datasets, with proper tracking and summary reporting.

nemo_skills/dataset/librispeech-pc/prepare.py (3)

35-44: LGTM!

The download progress implementation is clean. The blocknum parameter is required by urllib.request.urlretrieve's reporthook signature even if not used directly.
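A reporthook of that shape can be sketched as below (illustrative, not the PR's exact code; the helper names are hypothetical). `urlretrieve` calls the hook as `reporthook(blocknum, blocksize, totalsize)`, which is why all three parameters must appear in the signature:

```python
import urllib.request  # urlretrieve(url, dest, reporthook=...) drives the callback

def progress_line(blocknum, blocksize, totalsize):
    # blocknum * blocksize approximates the bytes downloaded so far.
    downloaded = blocknum * blocksize
    if totalsize <= 0:
        return f"Downloading: {downloaded} bytes"
    pct = min(100.0, downloaded * 100.0 / totalsize)
    return f"Downloading: {pct:5.1f}%"

def progress_hook(blocknum, blocksize, totalsize):
    # Overwrite the same terminal line on each callback.
    print("\r" + progress_line(blocknum, blocksize, totalsize), end="", flush=True)

# usage: urllib.request.urlretrieve(url, dest_path, reporthook=progress_hook)
```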


89-144: Well-structured data processing with proper validation.

The function handles missing entries gracefully, creates proper nemo-skills format with messages structure, and cleans up intermediate manifest files after successful processing.


75-86: No action required. The audio paths are handled consistently: the underscore conversion in download_audio (line 77) matches the directory structure created by tar extraction, and the manifest's audio_filepath values already reference these correct paths from the OpenSLR source.

nemo_skills/evaluation/metrics/speechlm_metrics.py (3)

107-109: Verify the intent of dividing no_answer by 2.0.

This halving of no_answer appears unusual and may be a bug. If this is intentional (e.g., compensating for double-counting), please add a clarifying comment explaining the rationale.


117-128: WER/BLEU metrics computation looks correct.

The averaging and percentage conversion are properly implemented, and the metrics are conditionally added only when scores exist.


166-223: Aggregation function handles weighted averaging correctly.

The compute_score function properly weights metrics by num_entries when combining sub-benchmarks, which is the correct approach for aggregate statistics.
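The weighting scheme described here can be sketched as follows. This is a simplified illustration of num_entries-weighted averaging, not the PR's `compute_score` itself:

```python
def aggregate(benchmarks):
    # Weight each sub-benchmark's average metric by its num_entries so that
    # larger subsets contribute proportionally to the combined score.
    agg = {}
    names = {k for b in benchmarks.values() for k in b if k != "num_entries"}
    for name in names:
        subset = [b for b in benchmarks.values() if name in b]
        total = sum(b["num_entries"] for b in subset)
        agg[name] = sum(b[name] * b["num_entries"] for b in subset) / total
    agg["num_entries"] = sum(b["num_entries"] for b in benchmarks.values())
    return agg
```

A naive unweighted mean would let a 10-sample subset pull the score as hard as a 1000-sample one, which is exactly what the weighting avoids.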

nemo_skills/evaluation/evaluator/audiobench.py (2)

56-96: PER calculation using edit distance is correctly implemented.

The dynamic programming approach for computing Punctuation Error Rate follows the arXiv:2310.02943 formula correctly. The handling of empty punctuation sequences (returning 0.0) is appropriate.
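The DP-based alignment praised here can be sketched as below. This is a minimal illustration of one common PER formulation (edit distance over punctuation tokens, normalized by the reference punctuation count), not the PR's implementation or the exact arXiv:2310.02943 formula:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming (two-row variant).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def punctuation_error_rate(ref_text, hyp_text, puncts=".,?!"):
    # Align only the punctuation marks of the two texts, then normalize
    # the edit distance by the reference punctuation count.
    ref_p = [c for c in ref_text if c in puncts]
    hyp_p = [c for c in hyp_text if c in puncts]
    if not ref_p and not hyp_p:
        return 0.0  # nothing to punctuate: treat as a perfect match
    return edit_distance(ref_p, hyp_p) / max(len(ref_p), 1)
```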


99-131: Comprehensive ASR-PC evaluation with multiple WER variants.

The implementation correctly computes WER (standard), WER_C (with capitalization), WER_PC (full punctuation + capitalization), and PER as separate metrics, providing granular insight into model performance.
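The relationship between the variants can be sketched via normalization choices, roughly as below. The PR computes WER with jiwer; this self-contained sketch substitutes a plain word-level edit distance so the idea is visible without the dependency:

```python
import string

def _edit_distance(a, b):
    # Word-level Levenshtein distance (two-row DP).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(ref, hyp, keep_case=False, keep_punct=False):
    # WER: lowercase, punctuation stripped; WER_C: keep_case=True;
    # WER_PC: keep_case=True and keep_punct=True.
    def norm(text):
        if not keep_punct:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if not keep_case:
            text = text.lower()
        return text.split()
    r, h = norm(ref), norm(hyp)
    return _edit_distance(r, h) / max(len(r), 1)
```

Scoring the same pair under all three normalizations shows where a model loses points: raw transcription, capitalization, or punctuation.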

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.

- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
Contributor

⚠️ Potential issue | 🟡 Minor

Fix parameter name in ++eval_harness_commit description

++eval_harness_commit currently says “after cloning agent_harness_repo”, but there is no such parameter; the actual flag is ++eval_harness_repo. This is likely a copy‑paste typo and can confuse users; suggest changing it to:

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

The updated ++max_retries description (covering inference and evaluation retries) looks aligned with the new SWE‑bench flow.

Based on learnings, keeping docs tightly aligned with flags helps avoid user confusion.

Also applies to: 89-89

🤖 Prompt for AI Agents
In docs/evaluation/code.md around line 85 (also apply same change at line 89),
the description for ++eval_harness_commit incorrectly references cloning
"agent_harness_repo" — change that reference to "eval_harness_repo" so the text
reads that the commit/branch/tag is checked out after cloning eval_harness_repo;
keep the rest of the sentence (Defaults to HEAD) unchanged and ensure both
occurrences (lines 85 and 89) are updated to avoid the copy‑paste confusion.

Comment on lines 85 to +96
     if entry.get("audio_path"):
         audio_path = entry["audio_path"]
-
-        if isinstance(audio_path, list) and audio_path:
-            user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
-        elif isinstance(audio_path, str):
-            user_message["audio"] = {"path": audio_path, "duration": 10.0}
-
-    formatted_entry["messages"] = [user_message]
+        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+        if len(audio_path) == 1:
+            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+        else:
+            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
+
+    formatted_entry["messages"] = [
+        {"role": "system", "content": "You are a helpful assistant. /no_think"},
+        user_message,
+    ]

⚠️ Potential issue | 🟠 Major

Honor the with_audio flag when attaching audio paths

Right now format_entry ignores the with_audio argument: even when --no-audio is used (so with_audio=False), you still add audio/audios fields pointing to /dataset/mmau-pro/.... Since audio download is skipped in that mode, these paths are likely invalid and can break downstream evaluation or inference that assumes existing audio when the field is present.

You can keep the new absolute-path behavior while respecting the flag by gating this block:

-    if entry.get("audio_path"):
-        audio_path = entry["audio_path"]
-        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
-        if len(audio_path) == 1:
-            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
-        else:
-            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
+    if with_audio and entry.get("audio_path"):
+        audio_path = entry["audio_path"]
+        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+        if len(audio_path) == 1:
+            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+        else:
+            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]

This restores the intent of --no-audio while preserving the cluster‑friendly absolute paths when audio is enabled.

🤖 Prompt for AI Agents
In nemo_skills/dataset/mmau-pro/prepare.py around lines 85 to 96, the code
always attaches audio paths to user_message even when with_audio is False;
update the block so it only adds the "audio" or "audios" fields when with_audio
is True. Specifically, wrap the existing if entry.get("audio_path") ... logic
inside a conditional checking if with_audio is truthy, preserving the current
absolute-path construction and single-vs-multiple handling, and leaving
formatted_entry["messages"] unchanged outside that gated block.
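
The gated behavior described above can be sketched as a standalone function. This is a minimal, hypothetical version: the real `format_entry` in `nemo_skills/dataset/mmau-pro/prepare.py` builds more fields, and the `/dataset/mmau-pro` mount point and the list-valued `audio_path` are assumptions taken from the review diff.

```python
# Hypothetical sketch of format_entry honoring the with_audio flag.
# Assumes audio_path is a list of relative paths, per the review diff.
DATASET_ROOT = "/dataset/mmau-pro"  # assumed cluster mount point

def format_entry(entry, with_audio=True):
    user_message = {"role": "user", "content": entry.get("question", "")}
    audio_path = entry.get("audio_path")
    # Only attach audio fields when audio was actually downloaded.
    if with_audio and audio_path:
        if len(audio_path) == 1:
            user_message["audio"] = {"path": f"{DATASET_ROOT}/{audio_path[0]}"}
        else:
            user_message["audios"] = [{"path": f"{DATASET_ROOT}/{p}"} for p in audio_path]
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. /no_think"},
            user_message,
        ]
    }
```

With `with_audio=False`, the returned message carries no `audio`/`audios` keys, so downstream code cannot trip over paths that were never downloaded.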

Comment on lines +224 to +230
    samples_already_evaluated = sum(1 for sample in data if "is_correct" in sample)

    if samples_already_evaluated > 0:
        LOG.info(f"Resuming evaluation: {samples_already_evaluated}/{len(data)} samples already evaluated")

    for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
        data[idx] = evaluate_sample(sample, eval_config)

⚠️ Potential issue | 🟡 Minor

Resume logic logs progress but doesn't skip already-evaluated samples.

The code counts already-evaluated samples and logs resume info, but the loop still re-evaluates all samples. If resumption is intended, samples with is_correct already set should be skipped.

     for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
+        if "is_correct" in sample:
+            continue  # Skip already evaluated samples
         data[idx] = evaluate_sample(sample, eval_config)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/audiobench.py around lines 224-230, the
resume log notes already-evaluated samples but the loop still re-processes them;
change the loop to skip any sample that already contains "is_correct" (or other
evaluation markers) so previously evaluated entries are left unchanged, e.g.,
check if "is_correct" in sample at start of the loop and continue if true,
keeping the same tqdm description and index handling so data[idx] is only
overwritten for newly evaluated samples.
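
The fix suggested above can be sketched as a self-contained loop. This is an illustrative stand-in, not the actual evaluator: `evaluate_sample` is passed in as a callable, and the `tqdm`/`LOG` calls from the real `audiobench.py` are replaced with a plain loop and `print` to keep the sketch dependency-free.

```python
# Minimal sketch of resume-aware evaluation: samples that already carry an
# "is_correct" marker are counted for the progress message and then skipped,
# so only unevaluated entries are overwritten.
def evaluate_all(data, evaluate_sample):
    already = sum(1 for s in data if "is_correct" in s)
    if already:
        print(f"Resuming evaluation: {already}/{len(data)} samples already evaluated")
    for idx, sample in enumerate(data):
        if "is_correct" in sample:
            continue  # leave previously evaluated samples untouched
        data[idx] = evaluate_sample(sample)
    return data
```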

@Jorjeous (Member) commented Dec 16, 2025

A new version of this PR exists: #1103

@Jorjeous Jorjeous closed this Dec 16, 2025