Merged
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ Here are some of the features we support:
- [**Long-context**](https://nvidia-nemo.github.io/Skills/evaluation/long-context): e.g. [ruler](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#aalcr)
- [**Tool-calling**](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling/#bfcl_v3)
- [**Multilingual**](https://nvidia-nemo.github.io/Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#FLORES-200), [wmt24pp](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#wmt24pp)
- [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro)
- [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [asr-leaderboard](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#asr-leaderboard), [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro)
- Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](https://nvidia-nemo.github.io/Skills/pipelines/training): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl).

2 changes: 1 addition & 1 deletion docs/evaluation/index.md
@@ -10,7 +10,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#FLORES-200), [wmt24pp](./multilingual.md#wmt24pp)
- [**Speech & Audio**](./speech-audio.md): e.g. [mmau-pro](./speech-audio.md#mmau-pro)
- [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro)

See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

197 changes: 135 additions & 62 deletions docs/evaluation/speech-audio.md
@@ -2,8 +2,22 @@

This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription.

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

## Supported benchmarks

### ASR Leaderboard

ASR benchmark based on the [HuggingFace Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). Evaluates transcription quality using Word Error Rate (WER).

**Datasets:** `librispeech_clean`, `librispeech_other`, `voxpopuli`, `tedlium`, `gigaspeech`, `spgispeech`, `earnings22`, `ami`
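
WER is the word-level edit (Levenshtein) distance between hypothesis and reference, divided by the number of reference words. The actual pipeline computes it with `jiwer` and HuggingFace-leaderboard text normalization (hence the `jiwer` dependency in the evaluation's `installation_command`); this minimal sketch omits normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.333
```

Lower is better; a WER of 0 means a perfect transcript, and values above 1.0 are possible when the hypothesis contains many insertions.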

#### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/asr-leaderboard/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/asr-leaderboard/__init__.py)
- Original datasets are hosted on HuggingFace (downloaded automatically during preparation)

### MMAU-Pro

MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for evaluating audio understanding capabilities across three different task categories:
@@ -17,108 +31,101 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for
- Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py)
- Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)

## Preparing MMAU-Pro Data
## Preparing Data

MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.
These benchmarks require audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation.

!!! warning "Running without audio files"
If you want to evaluation without audio files (not recommended) use
If you want to evaluate without audio files (not recommended) use
`--no-audio` flag. In this case you can also set `--skip_data_dir_check`
as data is very lightweight when audio files aren't being used.

### Data Preparation

To prepare the dataset with audio files:
### ASR Leaderboard

```bash
export HF_TOKEN=your_huggingface_token
ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
ns prepare_data asr-leaderboard --data_dir=/path/to/data --cluster=<cluster>
```

**What happens:**

- Requires authentication (HuggingFace token via `HF_TOKEN` environment variable)
- Downloads audio archive from HuggingFace and extracts
- Prepares the dataset files for evaluation
Prepare specific datasets only:

### Text-Only Mode (Not Recommended)
```bash
ns prepare_data asr-leaderboard --datasets librispeech_clean ami
```

If you need to prepare without audio files:
### MMAU-Pro

```bash
ns prepare_data mmau-pro --no-audio
ns prepare_data mmau-pro --data_dir=/path/to/data --cluster=<cluster_name>
```

Note: The git repository check is automatically skipped with `--no-audio`.

## Running Evaluation

!!! note
Currently supports only Megatron server type (`--server_type=megatron`).

### Evaluation Example
### ASR Leaderboard

```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge

eval(
ctx=wrap_arguments("++prompt_suffix='/no_think'"),
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/mmau-pro-eval",
benchmarks="mmau-pro",
output_dir="/workspace/asr-leaderboard-eval",
benchmarks="asr-leaderboard",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu",
    installation_command="pip install sacrebleu jiwer openai-whisper",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```

??? note "Alternative: Command-line usage"
Evaluate a specific dataset:

If you prefer using the command-line interface, you can run:
```python
eval(benchmarks="asr-leaderboard", split="librispeech_clean", ...)
```

```bash
export HF_TOKEN=your_huggingface_token
export NVIDIA_API_KEY=your_nvidia_api_key
export MEGATRON_PATH=/workspace/path/to/megatron-lm
??? note "Alternative: Command-line usage"

```bash
ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/mmau-pro-eval \
--benchmarks=mmau-pro \
--output_dir=/workspace/path/to/asr-leaderboard-eval \
--benchmarks=asr-leaderboard \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint-tp1 \
--server_entrypoint=$MEGATRON_PATH/path/to/server.py \
--server_container=/path/to/server_container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu" \
++prompt_suffix='/no_think' \
--server_args="--inference-max-requests 1 \
--model-config /workspace/path/to/checkpoint-tp1/config.yaml \
--num-tokens-to-generate 256 \
--temperature 1.0 \
--top_p 1.0"
--model=/workspace/path/to/checkpoint \
--server_entrypoint=/workspace/megatron-lm/server.py \
--server_container=/path/to/container.sqsh \
        --data_dir=/dataset \
--installation_command="pip install sacrebleu jiwer openai-whisper"
```

## How Evaluation Works
### MMAU-Pro

Each category uses a different evaluation strategy:
```python
import os
from nemo_skills.pipeline.cli import wrap_arguments, eval

| Category | Evaluation Method | How It Works |
|----------|-------------------|--------------|
| **Closed-Form** | NVEmbed similarity matching | Model generates short answer; compared to expected answer using embeddings |
| **Open-Ended** | LLM-as-a-judge (Qwen 2.5 7B) | Model generates detailed response; Qwen 2.5 judges quality and correctness |
| **Instruction Following** | Custom evaluation logic | Model follows instructions; evaluator checks adherence |
os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge

### Sub-benchmarks
eval(
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/mmau-pro-eval",
benchmarks="mmau-pro",
server_type="megatron",
server_gpus=1,
model="/workspace/checkpoint",
server_entrypoint="/workspace/megatron-lm/server.py",
server_container="/path/to/container.sqsh",
data_dir="/dataset",
installation_command="pip install sacrebleu",
server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml",
)
```
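
For the closed-form split, the evaluator embeds the model's short free-form answer (with NVEmbed) and matches it to the closest answer choice by similarity. A schematic sketch of that idea — `pick_choice` and the toy 2-d vectors are illustrative stand-ins, not the evaluator's actual API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_choice(pred_embedding, choice_embeddings):
    # Return the index of the answer choice whose embedding is most
    # similar to the embedding of the model's generated answer.
    sims = [cosine(pred_embedding, c) for c in choice_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2-d "embeddings" stand in for real NVEmbed vectors.
choices = [[1.0, 0.1], [0.1, 1.0]]
print(pick_choice([0.9, 0.2], choices))  # 0
```

This is why closed-form scoring tolerates paraphrased answers: the match is by semantic similarity rather than exact string comparison.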

Evaluate individual categories:

@@ -130,6 +137,24 @@ Evaluate individual categories:

```python
eval(benchmarks="mmau-pro.closed_form", ...)
```

??? note "Alternative: Command-line usage"

```bash
export NVIDIA_API_KEY=your_nvidia_api_key

ns eval \
--cluster=oci_iad \
--output_dir=/workspace/path/to/mmau-pro-eval \
--benchmarks=mmau-pro \
--server_type=megatron \
--server_gpus=1 \
--model=/workspace/path/to/checkpoint \
--server_entrypoint=/workspace/megatron-lm/server.py \
--server_container=/path/to/container.sqsh \
--data_dir=/dataset \
--installation_command="pip install sacrebleu"
```

### Using Custom Judge Models

The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B via NVIDIA API) to evaluate responses. You can customize the judge model for this subset:
Expand All @@ -143,7 +168,7 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B
os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key"

eval(
ctx=wrap_arguments("++prompt_suffix='/no_think'"),
ctx=wrap_arguments(""),
cluster="oci_iad",
output_dir="/workspace/path/to/mmau-pro-eval",
benchmarks="mmau-pro.open_ended", # Only open-ended uses LLM judge
@@ -180,7 +205,58 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B

## Understanding Results

After evaluation completes, results are saved in your output directory under `eval-results/`:
After evaluation completes, results are saved in your output directory under `eval-results/`.

### ASR Leaderboard Results

```
<output_dir>/
└── eval-results/
    └── asr-leaderboard/
        └── metrics.json
```

Example output:

```
------------------------------------- asr-leaderboard --------------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 233522 | 86.70% | 0.00% | 7.82% | 143597

----------------------------------- asr-leaderboard-ami ------------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 732 | 3680 | 81.27% | 0.00% | 18.45% | 12620

-------------------------------- asr-leaderboard-earnings22 --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 3522 | 83.97% | 0.00% | 14.72% | 57390

-------------------------------- asr-leaderboard-gigaspeech --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 736 | 233469 | 71.86% | 0.00% | 12.34% | 25376

---------------------------- asr-leaderboard-librispeech_clean ----------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 735 | 3607 | 99.62% | 0.00% | 2.06% | 2620

---------------------------- asr-leaderboard-librispeech_other ----------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 733 | 3927 | 98.67% | 0.00% | 4.34% | 2939

-------------------------------- asr-leaderboard-spgispeech -------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 740 | 4510 | 99.99% | 0.00% | 3.81% | 39341

--------------------------------- asr-leaderboard-tedlium ----------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 732 | 3878 | 77.74% | 0.00% | 7.89% | 1469

-------------------------------- asr-leaderboard-voxpopuli --------------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries
pass@1 | 741 | 4007 | 99.51% | 0.00% | 6.47% | 1842
```

### MMAU-Pro Results

```
<output_dir>/
@@ -195,9 +271,7 @@
│ └── metrics.json
```

### Evaluation Output Format

When evaluation completes, results are displayed in formatted tables in the logs:
Example output:

**Open-Ended Questions:**

Expand All @@ -213,7 +287,6 @@ pass@1 | 82 | 196 | 14.88% | 0.00% | 625
-------------------------- mmau-pro.instruction_following -------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries
pass@1 | 0 | 102 | 21.84% | 0.00% | 87

```

**Closed-Form Questions (Main Category + Sub-categories):**
2 changes: 1 addition & 1 deletion docs/index.md
@@ -22,7 +22,7 @@ Here are some of the features we support:
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
- [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#FLORES-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
- [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [mmau-pro](./evaluation/speech-audio.md#mmau-pro)
- [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [asr-leaderboard](./evaluation/speech-audio.md#asr-leaderboard), [mmau-pro](./evaluation/speech-audio.md#mmau-pro)
- [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](pipelines/training.md): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl).
21 changes: 21 additions & 0 deletions nemo_skills/dataset/asr-leaderboard/__init__.py
@@ -0,0 +1,21 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Settings that define how evaluation should be done by default (all can be changed from cmdline)
# Uses the audio evaluator which computes WER with HuggingFace leaderboard preprocessing
# Data samples should have task_type="ASR_LEADERBOARD" for proper WER calculation

DATASET_GROUP = "speechlm"
METRICS_TYPE = "audio"
GENERATION_ARGS = "++prompt_format=openai ++eval_type=audio"
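
The comments above note that samples need `task_type="ASR_LEADERBOARD"` for proper WER calculation. A hypothetical JSONL sample shape for illustration — every field other than `task_type` is an assumption, not the dataset's documented schema:

```python
import json

# Hypothetical ASR sample; only task_type is documented above.
sample = {
    "task_type": "ASR_LEADERBOARD",            # required for WER-based scoring
    "audio_filepath": "audio/sample_0001.wav",  # assumed field name
    "expected_answer": "the quick brown fox",   # assumed reference transcript field
}
print(json.dumps(sample))
```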