Merged
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ Here are some of the features we support:
- [**Long-context**](https://nvidia-nemo.github.io/Skills/evaluation/long-context): e.g. [ruler](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#aalcr)
- [**Tool-calling**](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling/#bfcl_v3)
- [**Multilingual**](https://nvidia-nemo.github.io/Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#FLORES-200), [wmt24pp](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#wmt24pp)
- [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro)
- [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [asr-leaderboard](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#asr-leaderboard), [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro)
- Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](https://nvidia-nemo.github.io/Skills/pipelines/training): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl).

2 changes: 1 addition & 1 deletion docs/evaluation/index.md
@@ -10,7 +10,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#FLORES-200), [wmt24pp](./multilingual.md#wmt24pp)
- [**Speech & Audio**](./speech-audio.md): e.g. [mmau-pro](./speech-audio.md#mmau-pro)
- [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro)

See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

197 changes: 135 additions & 62 deletions docs/evaluation/speech-audio.md
@@ -2,8 +2,22 @@

This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription.

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

## Supported benchmarks

### ASR Leaderboard

ASR benchmark based on the [HuggingFace Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). Evaluates transcription quality using Word Error Rate (WER).

**Datasets:** `librispeech_clean`, `librispeech_other`, `voxpopuli`, `tedlium`, `gigaspeech`, `spgispeech`, `earnings22`, `ami`
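
WER is the word-level edit (Levenshtein) distance between hypothesis and reference, divided by the number of reference words. The actual pipeline computes it with `jiwer` and HuggingFace-leaderboard text normalization (hence the `jiwer` dependency in the evaluation's `installation_command`); this minimal sketch omits normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.333
```

Lower is better; a WER of 0 means a perfect transcript, and values above 1.0 are possible when the hypothesis contains many insertions.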

#### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/asr-leaderboard/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/asr-leaderboard/__init__.py)
- Original datasets are hosted on HuggingFace (downloaded automatically during preparation)

### MMAU-Pro

MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for evaluating audio understanding capabilities across three different task categories:
@@ -17,108 +31,101 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for
- Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py)
- Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)

## Preparing MMAU-Pro Data
## Preparing Data

MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.
These benchmarks require audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.

!!! warning "Running without audio files"
If you want to evaluation without audio files (not recommended) use
If you want to evaluate without audio files (not recommended) use
`--no-audio` flag. In this case you can also set `--skip_data_dir_check`
as data is very lightweight when audio files aren't being used.

### Data Preparation

To prepare the dataset with audio files:
### ASR Leaderboard

```bash
export HF_TOKEN=your_huggingface_token
ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
ns prepare_data asr-leaderboard --data_dir=/path/to/data --cluster=<cluster>
```

**What happens:**

- Requires authentication (HuggingFace token via `HF_TOKEN` environment variable)
- Downloads audio archive from HuggingFace and extracts
- Prepares the dataset files for evaluation
Prepare specific datasets only:

### Text-Only Mode (Not Recommended)
```bash
ns prepare_data asr-leaderboard --datasets librispeech_clean ami
```

If you need to prepare without audio files:
### MMAU-Pro

```bash
ns prepare_data mmau-pro --no-audio
ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
```

Note: The git repository check is automatically skipped with `--no-audio`.

## Running Evaluation

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

### Evaluation Example
### ASR Leaderboard

```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge

eval(
ctx=wrap_arguments("++prompt_suffix='/no_think'"),
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/mmau-pro-eval",
benchmarks="mmau-pro",
output_dir="/workspace/asr-leaderboard-eval",
benchmarks="asr-leaderboard",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu",
    installation_command="pip install sacrebleu jiwer openai-whisper",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```

??? note "Alternative: Command-line usage"
Evaluate a specific dataset:

If you prefer using the command-line interface, you can run:
```python
eval(benchmarks="asr-leaderboard", split="librispeech_clean", ...)
```

```bash
export HF_TOKEN=your_huggingface_token
export NVIDIA_API_KEY=your_nvidia_api_key
export MEGATRON_PATH=/workspace/path/to/megatron-lm
??? note "Alternative: Command-line usage"

```bash
ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/mmau-pro-eval \
--benchmarks=mmau-pro \
--output_dir=/workspace/path/to/asr-leaderboard-eval \
--benchmarks=asr-leaderboard \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint-tp1 \
--server_entrypoint=$MEGATRON_PATH/path/to/server.py \
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu" \
++prompt_suffix='/no_think' \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml \
--num-tokens-to-generate 256 \
--temperature 1.0 \
--top_p 1.0"
--model=/workspace/path/to/checkpoint \
--server_entrypoint=/workspace/megatron-lm/server.py \
--server_container=/path/to/container.sqsh \
        --data_dir=/dataset \
--installation_command="pip install sacrebleu jiwer openai-whisper"
```

## How Evaluation Works
### MMAU-Pro

Each category uses a different evaluation strategy:
```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

| Category | Evaluation Method | How It Works |
|----------|-------------------|--------------|
| **Closed-Form** | NVEmbed similarity matching | Model generates short answer; compared to expected answer using embeddings |
| **Open-Ended** | LLM-as-a-judge (Qwen 2.5 7B) | Model generates detailed response; Qwen 2.5 judges quality and correctness |
| **Instruction Following** | Custom evaluation logic | Model follows instructions; evaluator checks adherence |
os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge

### Sub-benchmarks
eval(
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/mmau-pro-eval",
benchmarks="mmau-pro",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```
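
For the closed-form split, the evaluator embeds the model's short free-form answer (with NVEmbed) and matches it to the closest answer choice by similarity. A schematic sketch of that idea — `pick_choice` and the toy 2-d vectors are illustrative stand-ins, not the evaluator's actual API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_choice(pred_embedding, choice_embeddings):
    # Return the index of the answer choice whose embedding is most
    # similar to the embedding of the model's generated answer.
    sims = [cosine(pred_embedding, c) for c in choice_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2-d "embeddings" stand in for real NVEmbed vectors.
choices = [[1.0, 0.1], [0.1, 1.0]]
print(pick_choice([0.9, 0.2], choices))  # 0
```

This is why closed-form scoring tolerates paraphrased answers: the match is by semantic similarity rather than exact string comparison.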

Evaluate individual categories:

@@ -130,6 +137,24 @@ Evaluate individual categories:

```python
eval(benchmarks="mmau-pro.closed_form", ...)
```

??? note "Alternative: Command-line usage"

```bash
export NVIDIA_API_KEY=your_nvidia_api_key

ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/mmau-pro-eval \
--benchmarks=mmau-pro \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint \
--server_entrypoint=/workspace/megatron-lm/server.py \
--server_container=/path/to/container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu"
```

### Using Custom Judge Models

The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B via NVIDIA API) to evaluate responses. You can customize the judge model for this subset:
Expand All @@ -143,7 +168,7 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B
os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key"

eval(
ctx=wrap_arguments("++prompt_suffix='/no_think'"),
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/path/to/mmau-pro-eval",
benchmarks="mmau-pro.open_ended", # Only open-ended uses LLM judge
@@ -180,7 +205,58 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B

## Understanding Results

After evaluation completes, results are saved in your output directory under `eval-results/`:
After evaluation completes, results are saved in your output directory under `eval-results/`.

### ASR Leaderboard Results

```
<output_dir>/
└── eval-results/
    └── asr-leaderboard/
        └── metrics.json
```

Example output:

```
------------------------------------- asr-leaderboard --------------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 233522 | 86.70% | 0.00% | 7.82% | 143597

----------------------------------- asr-leaderboard-ami ------------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 732 | 3680 | 81.27% | 0.00% | 18.45% | 12620

-------------------------------- asr-leaderboard-earnings22 --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 3522 | 83.97% | 0.00% | 14.72% | 57390

-------------------------------- asr-leaderboard-gigaspeech --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 233469 | 71.86% | 0.00% | 12.34% | 25376

---------------------------- asr-leaderboard-librispeech_clean ----------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 735 | 3607 | 99.62% | 0.00% | 2.06% | 2620

---------------------------- asr-leaderboard-librispeech_other ----------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 733 | 3927 | 98.67% | 0.00% | 4.34% | 2939

-------------------------------- asr-leaderboard-spgispeech -------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 740 | 4510 | 99.99% | 0.00% | 3.81% | 39341

--------------------------------- asr-leaderboard-tedlium ----------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 732 | 3878 | 77.74% | 0.00% | 7.89% | 1469

-------------------------------- asr-leaderboard-voxpopuli --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 741 | 4007 | 99.51% | 0.00% | 6.47% | 1842
```

### MMAU-Pro Results

```
<output_dir>/
@@ -195,9 +271,7 @@
│ └── metrics.json
```

### Evaluation Output Format

When evaluation completes, results are displayed in formatted tables in the logs:
Example output:

**Open-Ended Questions:**

Expand All @@ -213,7 +287,6 @@ pass@1 | 82 | 196 | 14.88% | 0.00% | 625
-------------------------- mmau-pro.instruction_following -------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries
pass@1 | 0 | 102 | 21.84% | 0.00% | 87

```

**Closed-Form Questions (Main Category + Sub-categories):**
2 changes: 1 addition & 1 deletion docs/index.md
@@ -22,7 +22,7 @@ Here are some of the features we support:
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
- [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#FLORES-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
- [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [mmau-pro](./evaluation/speech-audio.md#mmau-pro)
- [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [asr-leaderboard](./evaluation/speech-audio.md#asr-leaderboard), [mmau-pro](./evaluation/speech-audio.md#mmau-pro)
- [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](pipelines/training.md): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl).
21 changes: 21 additions & 0 deletions nemo_skills/dataset/asr-leaderboard/__init__.py
@@ -0,0 +1,21 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Settings that define how evaluation should be done by default (all can be changed from cmdline)
# Uses the audio evaluator which computes WER with HuggingFace leaderboard preprocessing
# Data samples should have task_type="ASR_LEADERBOARD" for proper WER calculation

DATASET_GROUP = "speechlm"
METRICS_TYPE = "audio"
GENERATION_ARGS = "++prompt_format=openai ++eval_type=audio"
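
The comments above note that samples need `task_type="ASR_LEADERBOARD"` for proper WER calculation. A hypothetical JSONL sample shape for illustration — every field other than `task_type` is an assumption, not the dataset's documented schema:

```python
import json

# Hypothetical ASR sample; only task_type is documented above.
sample = {
    "task_type": "ASR_LEADERBOARD",            # required for WER-based scoring
    "audio_filepath": "audio/sample_0001.wav",  # assumed field name
    "expected_answer": "the quick brown fox",   # assumed reference transcript field
}
print(json.dumps(sample))
```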