26 commits
73a757e
Add AudioBench benchmark for speech and audio language models
Jorjeous Nov 14, 2025
9225fab
update prepare.py for audiobench
Jorjeous Nov 14, 2025
de4914a
Fix on mmau-pro prepare.py
Jorjeous Nov 14, 2025
0997361
add absolute path's to prepare.py
Jorjeous Nov 14, 2025
e2a876b
update names
Jorjeous Nov 18, 2025
0a543be
update destination for downloading
Jorjeous Nov 18, 2025
ca3ffec
LibriSpeech PC Benchmark Evaluation
melllinia Nov 18, 2025
3d963f6
Testline
Jorjeous Nov 21, 2025
189a47f
revert
Jorjeous Nov 21, 2025
3dd21ff
upd strtucture
Jorjeous Nov 21, 2025
35f3666
Change judge config to align with Audiobench's
Jorjeous Nov 21, 2025
fccb644
upd __init__ files
Jorjeous Nov 21, 2025
0c10924
changed organization of sets + minor additions
Jorjeous Nov 21, 2025
7694a6b
Revert "Converting ICPC25 to ICPC evaluation (#1045)"
Jorjeous Nov 21, 2025
59b4f1d
linter
Jorjeous Nov 21, 2025
fd9838b
update .gitignore
Jorjeous Nov 21, 2025
907b9fb
add LS-PnC
Jorjeous Nov 21, 2025
e059482
Add LibriSpeech-PC documentation and fix jiwer import
melllinia Nov 25, 2025
c885b73
Improving mmau-pro metric calculation
melllinia Nov 25, 2025
2807505
Revert last two commits
melllinia Dec 1, 2025
7abea9e
Lint fix and merge
melllinia Dec 1, 2025
74e21f9
test
Jorjeous Dec 1, 2025
a908156
Revert "test"
Jorjeous Dec 1, 2025
245743b
Merge branch 'main' into audiobench_libri-pc
Jorjeous Dec 9, 2025
5c289f0
Merge branch 'main' into audiobench_libri-pc
gwarmstrong Dec 9, 2025
9cd886a
Merge branch 'main' into audiobench_libri-pc
Jorjeous Dec 11, 2025
3 changes: 2 additions & 1 deletion .gitignore
```diff
@@ -45,4 +45,5 @@ nemo_skills/dataset/aalcr/lcr/
 .idea/*
 CLAUDE.md
 
-.idea
+# AudioBench repository (auto-cloned during data preparation)
+AudioBench/
```
6 changes: 2 additions & 4 deletions docs/evaluation/code.md
```diff
@@ -82,13 +82,11 @@ There are a few parameters specific to SWE-bench. They have to be specified with
 
 - **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation.
 
-- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
-
-- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
+- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
```
**Review comment** (Contributor):

⚠️ Potential issue | 🟡 Minor

**Fix parameter name in `++eval_harness_commit` description**

`++eval_harness_commit` currently says “after cloning agent_harness_repo”, but there is no such parameter; the actual flag is `++eval_harness_repo`. This is likely a copy‑paste typo and can confuse users; suggest changing it to:

- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

The updated `++max_retries` description (covering inference and evaluation retries) looks aligned with the new SWE‑bench flow.

Based on learnings, keeping docs tightly aligned with flags helps avoid user confusion.

Also applies to: 89-89

🤖 Prompt for AI Agents
In docs/evaluation/code.md around line 85 (also apply same change at line 89),
the description for ++eval_harness_commit incorrectly references cloning
"agent_harness_repo" — change that reference to "eval_harness_repo" so the text
reads that the commit/branch/tag is checked out after cloning eval_harness_repo;
keep the rest of the sentence (Defaults to HEAD) unchanged and ensure both
occurrences (lines 85 and 89) are updated to avoid the copy‑paste confusion.


```diff
 - **++swebench_tests_timeout:** The timeout for tests after applying the generated patch during evaluation, in seconds. Defaults to 1800, i.e. 30 minutes.
 
-- **++max_retries:** How many times to try running setup, inference and evaluation until a valid output file is produced. Defaults to 3.
+- **++max_retries:** How many times to try running inference and evaluation until a valid output file is produced. Defaults to 3.
 
 - **++min_retry_interval, ++max_retry_interval:** The interval between retries, in seconds. Selected randomly between min and max on each retry. Defaults to 60 and 180 respectively.
```
149 changes: 138 additions & 11 deletions docs/evaluation/speech-audio.md
@@ -2,6 +2,11 @@

This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription.

!!! warning "Running without audio files"
    If you want to evaluate without audio files (not recommended), use the
    `--no-audio` flag. In this case you can also set `--skip_data_dir_check`,
    as the data is very lightweight when audio files aren't being used.

## Supported benchmarks

### MMAU-Pro
@@ -21,11 +26,6 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for

MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.

### Data Preparation

To prepare the dataset with audio files:
@@ -46,7 +46,7 @@ ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
If you need to prepare without audio files:

```bash
ns prepare_data mmau-pro --no-audio --skip_data_dir_check
```

Note: The git repository check is automatically skipped with `--no-audio`.
@@ -100,12 +100,9 @@ eval(
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu" \
++prompt_suffix='/no_think' \
++max_concurrent_requests=1 \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml"
```

## How Evaluation Works
@@ -271,3 +268,133 @@ pass@1 | 0 | 6580 | 55.52% | 0.00% | 290
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries
pass@1 | 11 | 6879 | 31.44% | 0.00% | 5305
```


### LibriSpeech-PC

LibriSpeech-PC is an Automatic Speech Recognition (ASR) benchmark that evaluates models' ability to transcribe speech with proper punctuation and capitalization. It builds upon the original LibriSpeech corpus with enhanced reference transcripts.

#### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/librispeech-pc/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeech-pc/__init__.py)
- Manifests (with punctuation/capitalization) from [OpenSLR-145](https://www.openslr.org/145/)
- Audio files from original [LibriSpeech OpenSLR-12](https://www.openslr.org/12/)

#### Available Splits

- `test-clean`: Clean speech recordings (easier subset)
- `test-other`: More challenging recordings with varied acoustic conditions

## Preparing LibriSpeech-PC Data

LibriSpeech-PC requires audio files for ASR evaluation. **Audio files are downloaded by default**.

### Data Preparation

To prepare the dataset with audio files:

```bash
ns prepare_data librispeech-pc --data_dir=/path/to/data --cluster=<cluster_name>
```

**What happens:**

- Downloads manifests with punctuation/capitalization from OpenSLR-145
- Downloads audio files from original LibriSpeech (OpenSLR-12)
- Prepares both `test-clean` and `test-other` splits

### Preparing Specific Splits

To prepare only one split:

```bash
ns prepare_data librispeech-pc --split test-clean --data_dir=/path/to/data
```

or

```bash
ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data
```

## Running LibriSpeech-PC Evaluation

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

### Evaluation Example

```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/librispeech-pc-eval",
benchmarks="librispeech-pc",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu whisper jiwer",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```

??? note "Alternative: Command-line usage"

If you prefer using the command-line interface, you can run:

```bash
export MEGATRON_PATH=/workspace/path/to/megatron-lm

ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/librispeech-pc-eval \
--benchmarks=librispeech-pc \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint-tp1 \
--server_entrypoint=$MEGATRON_PATH/path/to/server.py \
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu whisper jiwer" \
++max_concurrent_requests=1 \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml"
```

## How LibriSpeech-PC Evaluation Works

The evaluation measures ASR accuracy using multiple Word Error Rate (WER) metrics:

| Metric | Description |
|--------|-------------|
| **WER** | Word Error Rate - measures transcription accuracy ignoring punctuation and capitalization |
| **WER_C** | Word Error Rate with Capitalization - measures accuracy including capitalization |
| **WER_PC** | Word Error Rate with Punctuation and Capitalization - measures full accuracy including both |
| **PER** | Punctuation Error Rate - measures how well the model predicts punctuation marks |
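
As a rough illustration of how the first three metrics relate (a minimal sketch, not the benchmark's exact recipe — the real scoring lives in `nemo_skills.evaluation.metrics`, and the punctuation-stripping regex below is an assumption), using the `jiwer` package from the `installation_command` above:

```python
import re

import jiwer  # pip install jiwer


def wer_variants(reference: str, hypothesis: str) -> dict:
    """Score one utterance three ways, mirroring the table above."""

    def strip_punct(s: str) -> str:
        # Keep word characters, whitespace, and apostrophes; drop punctuation.
        # (Illustrative normalization, not necessarily the benchmark's.)
        return re.sub(r"[^\w\s']", "", s)

    return {
        # WER: punctuation and capitalization both ignored
        "wer": jiwer.wer(strip_punct(reference).lower(), strip_punct(hypothesis).lower()),
        # WER_C: capitalization kept, punctuation dropped
        "wer_c": jiwer.wer(strip_punct(reference), strip_punct(hypothesis)),
        # WER_PC: raw strings, punctuation and capitalization both scored
        "wer_pc": jiwer.wer(reference, hypothesis),
    }


print(wer_variants("Hello, world!", "hello world"))
# -> {'wer': 0.0, 'wer_c': 0.5, 'wer_pc': 1.0}
```

PER is computed over punctuation marks only and is not sketched here.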

### Sub-benchmarks

Evaluate individual splits:

- `librispeech-pc.test-clean` - Easier, clean speech subset
- `librispeech-pc.test-other` - More challenging subset with varied conditions

```python
eval(benchmarks="librispeech-pc.test-clean", ...)
```

### Evaluation Output Format

**test-clean Split:**

```
------------------------------- librispeech-pc.test-clean -----------------------------
evaluation_mode | avg_tokens | gen_seconds | wer | wer_c | wer_pc | per | num_entries
pass@1 | 15 | 120 | 4.23% | 4.85% | 5.12% | 2.34% | 2620
```
36 changes: 36 additions & 0 deletions nemo_skills/dataset/audiobench/__init__.py
@@ -0,0 +1,36 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench: A comprehensive benchmark for speech and audio language models.

AudioBench evaluates models across multiple tasks:
- ASR (Automatic Speech Recognition)
- Translation (speech-to-text translation)
- Speech QA (question answering based on audio)
- Audio understanding (emotion, gender, accent recognition, etc.)

The benchmark is organized into two main categories:
- nonjudge: Tasks evaluated with automatic metrics (WER, BLEU)
- judge: Tasks requiring LLM-as-a-judge evaluation
"""

DATASET_GROUP = "speechlm"
IS_BENCHMARK_GROUP = True
SCORE_MODULE = "nemo_skills.evaluation.metrics.speechlm_metrics"

# Top-level benchmarks: evaluate all judge or all nonjudge datasets
BENCHMARKS = {
"audiobench.nonjudge": {},
"audiobench.judge": {},
}
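
Given the `BENCHMARKS` mapping above, a group run would presumably mirror the `eval` examples earlier in this PR. A minimal sketch, assuming the same pipeline flags apply (cluster name, paths, and server settings below are placeholders, not values from this PR):

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

# Hedged sketch: evaluate all automatically-scored AudioBench tasks as a group.
eval(
    ctx=wrap_arguments(""),
    cluster="my_cluster",              # placeholder
    output_dir="/workspace/audiobench-eval",
    benchmarks="audiobench.nonjudge",  # or "audiobench.judge" for LLM-judged tasks
    server_type="megatron",
    server_gpus=1,
    model="/workspace/checkpoint",     # placeholder
    data_dir="/dataset",
)
```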
39 changes: 39 additions & 0 deletions nemo_skills/dataset/audiobench/judge/__init__.py
@@ -0,0 +1,39 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench judge tasks dataset configuration.

This dataset includes tasks that require LLM-based evaluation such as:
- Audio captioning
- Spoken question answering
- Audio understanding and reasoning

These tasks require an LLM judge for evaluation, matching MMAU-Pro evaluation setup.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"
DEFAULT_SPLIT = "test"
GENERATION_ARGS = "++prompt_format=openai "

# Judge configuration matching AudioBench official implementation
# Using Llama-3.1-70B with vllm (can be overridden in run scripts)
JUDGE_PIPELINE_ARGS = {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"server_type": "vllm",
"server_gpus": 8,
"server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95",
}
JUDGE_ARGS = "++prompt_config=judge/audiobench ++generation_key=judgement"
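
If the 8-GPU judge above is too heavy for a given cluster, a run script could presumably scale it down by overriding these defaults. A minimal sketch of such an override, assuming plain dict merging (the hook that consumes the merged dict is not shown in this PR):

```python
from nemo_skills.dataset.audiobench.judge import JUDGE_PIPELINE_ARGS

# Illustrative override only; the smaller judge model here is a hypothetical
# substitution, not something this PR configures.
judge_pipeline_overrides = {
    **JUDGE_PIPELINE_ARGS,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "server_gpus": 1,
    "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.9",
}
```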
31 changes: 31 additions & 0 deletions nemo_skills/dataset/audiobench/nonjudge/__init__.py
@@ -0,0 +1,31 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench non-judge tasks dataset configuration.

This dataset includes ASR, translation, and other tasks that use
automatic metrics (WER, BLEU, WER-PC) instead of judge evaluation.

NO JUDGE REQUIRED - Metrics computed automatically from model outputs.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"

# Evaluation settings
EVAL_ARGS = "++eval_type=audiobench "

# Generation settings - OpenAI format for audio-language models
GENERATION_ARGS = "++prompt_format=openai "