26 commits
73a757e
Add AudioBench benchmark for speech and audio language models
Jorjeous Nov 14, 2025
9225fab
update prepare.py for audiobench
Jorjeous Nov 14, 2025
de4914a
Fix on mmau-pro prepare.py
Jorjeous Nov 14, 2025
0997361
add absolute path's to prepare.py
Jorjeous Nov 14, 2025
e2a876b
update names
Jorjeous Nov 18, 2025
0a543be
update destination for downloading
Jorjeous Nov 18, 2025
ca3ffec
LibriSpeech PC Benchmark Evaluation
melllinia Nov 18, 2025
3d963f6
Testline
Jorjeous Nov 21, 2025
189a47f
revert
Jorjeous Nov 21, 2025
3dd21ff
upd strtucture
Jorjeous Nov 21, 2025
35f3666
Change judge config to align with Audiobench's
Jorjeous Nov 21, 2025
fccb644
upd __init__ files
Jorjeous Nov 21, 2025
0c10924
changed organization of sets + minor additions
Jorjeous Nov 21, 2025
7694a6b
Revert "Converting ICPC25 to ICPC evaluation (#1045)"
Jorjeous Nov 21, 2025
59b4f1d
linter
Jorjeous Nov 21, 2025
fd9838b
update .gitignore
Jorjeous Nov 21, 2025
907b9fb
add LS-PnC
Jorjeous Nov 21, 2025
e059482
Add LibriSpeech-PC documentation and fix jiwer import
melllinia Nov 25, 2025
c885b73
Improving mmau-pro metric calculation
melllinia Nov 25, 2025
2807505
Revert last two commits
melllinia Dec 1, 2025
7abea9e
Lint fix and merge
melllinia Dec 1, 2025
74e21f9
test
Jorjeous Dec 1, 2025
a908156
Revert "test"
Jorjeous Dec 1, 2025
245743b
Merge branch 'main' into audiobench_libri-pc
Jorjeous Dec 9, 2025
5c289f0
Merge branch 'main' into audiobench_libri-pc
gwarmstrong Dec 9, 2025
9cd886a
Merge branch 'main' into audiobench_libri-pc
Jorjeous Dec 11, 2025
3 changes: 2 additions & 1 deletion .gitignore
```diff
@@ -45,4 +45,5 @@ nemo_skills/dataset/aalcr/lcr/
 .idea/*
 CLAUDE.md
 
-.idea
+# AudioBench repository (auto-cloned during data preparation)
+AudioBench/
```
6 changes: 2 additions & 4 deletions docs/evaluation/code.md
```diff
@@ -82,13 +82,11 @@ There are a few parameters specific to SWE-bench. They have to be specified with
 
 - **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation.
 
-- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
-
-- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
+- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
```
**Review comment** (Contributor):

⚠️ Potential issue | 🟡 Minor

**Fix parameter name in `++eval_harness_commit` description**

`++eval_harness_commit` currently says “after cloning agent_harness_repo”, but there is no such parameter; the actual flag is `++eval_harness_repo`. This is likely a copy‑paste typo and can confuse users; suggest changing it to:

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

The updated `++max_retries` description (covering inference and evaluation retries) looks aligned with the new SWE‑bench flow.

Based on learnings, keeping docs tightly aligned with flags helps avoid user confusion.

Also applies to: 89-89

🤖 Prompt for AI Agents
In docs/evaluation/code.md around line 85 (also apply same change at line 89),
the description for ++eval_harness_commit incorrectly references cloning
"agent_harness_repo" — change that reference to "eval_harness_repo" so the text
reads that the commit/branch/tag is checked out after cloning eval_harness_repo;
keep the rest of the sentence (Defaults to HEAD) unchanged and ensure both
occurrences (lines 85 and 89) are updated to avoid the copy‑paste confusion.


```diff
 - **++swebench_tests_timeout:** The timeout for tests after applying the generated patch during evaluation, in seconds. Defaults to 1800, i.e. 30 minutes.
 
-- **++max_retries:** How many times to try running setup, inference and evaluation until a valid output file is produced. Defaults to 3.
+- **++max_retries:** How many times to try running inference and evaluation until a valid output file is produced. Defaults to 3.
 
 - **++min_retry_interval, ++max_retry_interval:** The interval between retries, in seconds. Selected randomly between min and max on each retry. Defaults to 60 and 180 respectively.
```
149 changes: 138 additions & 11 deletions docs/evaluation/speech-audio.md
@@ -2,6 +2,11 @@

This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription.

!!! warning "Running without audio files"
    If you want to evaluate without audio files (not recommended), use the
    `--no-audio` flag. In this case you can also set `--skip_data_dir_check`,
    as the data is very lightweight when audio files aren't being used.

## Supported benchmarks

### MMAU-Pro
@@ -21,11 +26,6 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for

MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.

### Data Preparation

To prepare the dataset with audio files:
@@ -46,7 +46,7 @@ ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
If you need to prepare without audio files:

```bash
ns prepare_data mmau-pro --no-audio --skip_data_dir_check
```

Note: The git repository check is automatically skipped with `--no-audio`.
@@ -100,12 +100,9 @@ eval(
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu" \
++prompt_suffix='/no_think' \
++max_concurrent_requests=1 \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml"
```

## How Evaluation Works
@@ -271,3 +268,133 @@ pass@1 | 0 | 6580 | 55.52% | 0.00% | 290
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries
pass@1 | 11 | 6879 | 31.44% | 0.00% | 5305
```


### LibriSpeech-PC

LibriSpeech-PC is an Automatic Speech Recognition (ASR) benchmark that evaluates models' ability to transcribe speech with proper punctuation and capitalization. It builds upon the original LibriSpeech corpus with enhanced reference transcripts.

#### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/librispeech-pc/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeech-pc/__init__.py)
- Manifests (with punctuation/capitalization) from [OpenSLR-145](https://www.openslr.org/145/)
- Audio files from original [LibriSpeech OpenSLR-12](https://www.openslr.org/12/)

#### Available Splits

- `test-clean`: Clean speech recordings (easier subset)
- `test-other`: More challenging recordings with varied acoustic conditions

## Preparing LibriSpeech-PC Data

LibriSpeech-PC requires audio files for ASR evaluation. **Audio files are downloaded by default**.

### Data Preparation

To prepare the dataset with audio files:

```bash
ns prepare_data librispeech-pc --data_dir=/path/to/data --cluster=<cluster_name>
```

**What happens:**

- Downloads manifests with punctuation/capitalization from OpenSLR-145
- Downloads audio files from original LibriSpeech (OpenSLR-12)
- Prepares both `test-clean` and `test-other` splits

### Preparing Specific Splits

To prepare only one split:

```bash
ns prepare_data librispeech-pc --split test-clean --data_dir=/path/to/data
```

or

```bash
ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data
```

## Running LibriSpeech-PC Evaluation

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

### Evaluation Example

```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/librispeech-pc-eval",
benchmarks="librispeech-pc",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu whisper jiwer",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```

??? note "Alternative: Command-line usage"

If you prefer using the command-line interface, you can run:

```bash
export MEGATRON_PATH=/workspace/path/to/megatron-lm

ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/librispeech-pc-eval \
--benchmarks=librispeech-pc \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint-tp1 \
--server_entrypoint=$MEGATRON_PATH/path/to/server.py \
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu whisper jiwer" \
++max_concurrent_requests=1 \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml"
```

## How LibriSpeech-PC Evaluation Works

The evaluation measures ASR accuracy using multiple Word Error Rate (WER) metrics:

| Metric | Description |
|--------|-------------|
| **WER** | Word Error Rate - measures transcription accuracy ignoring punctuation and capitalization |
| **WER_C** | Word Error Rate with Capitalization - measures accuracy including capitalization |
| **WER_PC** | Word Error Rate with Punctuation and Capitalization - measures full accuracy including both |
| **PER** | Punctuation Error Rate - measures how well the model predicts punctuation marks |
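
As a rough illustration of how the first three metrics relate (a minimal sketch, not the benchmark's exact recipe — the real scoring lives in `nemo_skills.evaluation.metrics`, and the punctuation-stripping regex below is an assumption), using the `jiwer` package from the `installation_command` above:

```python
import re

import jiwer  # pip install jiwer


def wer_variants(reference: str, hypothesis: str) -> dict:
    """Score one utterance three ways, mirroring the table above."""

    def strip_punct(s: str) -> str:
        # Keep word characters, whitespace, and apostrophes; drop punctuation.
        # (Illustrative normalization, not necessarily the benchmark's.)
        return re.sub(r"[^\w\s']", "", s)

    return {
        # WER: punctuation and capitalization both ignored
        "wer": jiwer.wer(strip_punct(reference).lower(), strip_punct(hypothesis).lower()),
        # WER_C: capitalization kept, punctuation dropped
        "wer_c": jiwer.wer(strip_punct(reference), strip_punct(hypothesis)),
        # WER_PC: raw strings, punctuation and capitalization both scored
        "wer_pc": jiwer.wer(reference, hypothesis),
    }


print(wer_variants("Hello, world!", "hello world"))
# -> {'wer': 0.0, 'wer_c': 0.5, 'wer_pc': 1.0}
```

PER is computed over punctuation marks only and is not sketched here.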

### Sub-benchmarks

Evaluate individual splits:

- `librispeech-pc.test-clean` - Easier, clean speech subset
- `librispeech-pc.test-other` - More challenging subset with varied conditions

```python
eval(benchmarks="librispeech-pc.test-clean", ...)
```

### Evaluation Output Format

**test-clean Split:**

```
------------------------------- librispeech-pc.test-clean -----------------------------
evaluation_mode | avg_tokens | gen_seconds | wer | wer_c | wer_pc | per | num_entries
pass@1 | 15 | 120 | 4.23% | 4.85% | 5.12% | 2.34% | 2620
```
36 changes: 36 additions & 0 deletions nemo_skills/dataset/audiobench/__init__.py
@@ -0,0 +1,36 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench: A comprehensive benchmark for speech and audio language models.

AudioBench evaluates models across multiple tasks:
- ASR (Automatic Speech Recognition)
- Translation (speech-to-text translation)
- Speech QA (question answering based on audio)
- Audio understanding (emotion, gender, accent recognition, etc.)

The benchmark is organized into two main categories:
- nonjudge: Tasks evaluated with automatic metrics (WER, BLEU)
- judge: Tasks requiring LLM-as-a-judge evaluation
"""

DATASET_GROUP = "speechlm"
IS_BENCHMARK_GROUP = True
SCORE_MODULE = "nemo_skills.evaluation.metrics.speechlm_metrics"

# Top-level benchmarks: evaluate all judge or all nonjudge datasets
BENCHMARKS = {
"audiobench.nonjudge": {},
"audiobench.judge": {},
}
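
Given the `BENCHMARKS` mapping above, a group run would presumably mirror the `eval` examples earlier in this PR. A minimal sketch, assuming the same pipeline flags apply (cluster name, paths, and server settings below are placeholders, not values from this PR):

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

# Hedged sketch: evaluate all automatically-scored AudioBench tasks as a group.
eval(
    ctx=wrap_arguments(""),
    cluster="my_cluster",              # placeholder
    output_dir="/workspace/audiobench-eval",
    benchmarks="audiobench.nonjudge",  # or "audiobench.judge" for LLM-judged tasks
    server_type="megatron",
    server_gpus=1,
    model="/workspace/checkpoint",     # placeholder
    data_dir="/dataset",
)
```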
39 changes: 39 additions & 0 deletions nemo_skills/dataset/audiobench/judge/__init__.py
@@ -0,0 +1,39 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench judge tasks dataset configuration.

This dataset includes tasks that require LLM-based evaluation such as:
- Audio captioning
- Spoken question answering
- Audio understanding and reasoning

These tasks require an LLM judge for evaluation, matching MMAU-Pro evaluation setup.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"
DEFAULT_SPLIT = "test"
GENERATION_ARGS = "++prompt_format=openai "

# Judge configuration matching AudioBench official implementation
# Using Llama-3.1-70B with vllm (can be overridden in run scripts)
JUDGE_PIPELINE_ARGS = {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"server_type": "vllm",
"server_gpus": 8,
"server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95",
}
JUDGE_ARGS = "++prompt_config=judge/audiobench ++generation_key=judgement"
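
If the 8-GPU judge above is too heavy for a given cluster, a run script could presumably scale it down by overriding these defaults. A minimal sketch of such an override, assuming plain dict merging (the hook that consumes the merged dict is not shown in this PR):

```python
from nemo_skills.dataset.audiobench.judge import JUDGE_PIPELINE_ARGS

# Illustrative override only; the smaller judge model here is a hypothetical
# substitution, not something this PR configures.
judge_pipeline_overrides = {
    **JUDGE_PIPELINE_ARGS,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "server_gpus": 1,
    "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.9",
}
```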
31 changes: 31 additions & 0 deletions nemo_skills/dataset/audiobench/nonjudge/__init__.py
@@ -0,0 +1,31 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench non-judge tasks dataset configuration.

This dataset includes ASR, translation, and other tasks that use
automatic metrics (WER, BLEU, WER-PC) instead of judge evaluation.

NO JUDGE REQUIRED - Metrics computed automatically from model outputs.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"

# Evaluation settings
EVAL_ARGS = "++eval_type=audiobench "

# Generation settings - OpenAI format for audio-language models
GENERATION_ARGS = "++prompt_format=openai "