Merged
75 changes: 75 additions & 0 deletions docs/evaluation/speech-audio.md
@@ -408,3 +408,78 @@ or
```bash
ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data
```

## Numb3rs

Numb3rs is a speech benchmark for evaluating text normalization (TN) and inverse text normalization (ITN) capabilities of audio-language models. It contains paired written/spoken forms with corresponding synthetic audio, allowing evaluation of whether a model transcribes numbers in written form (e.g., `$100`, `3.14`) or spoken form (e.g., `one hundred dollars`, `three point one four`).

**Dataset:** [nvidia/Numb3rs on HuggingFace](https://huggingface.co/datasets/nvidia/Numb3rs)

**Categories:** `ADDRESS`, `CARDINAL`, `DATE`, `DECIMAL`, `DIGIT`, `FRACTION`, `MEASURE`, `MONEY`, `ORDINAL`, `PLAIN`, `TELEPHONE`, `TIME`

**Size:** ~10K samples, ~4.89h total audio duration

### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/numb3rs/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/numb3rs/__init__.py)
- Original dataset is hosted on [HuggingFace](https://huggingface.co/datasets/nvidia/Numb3rs)

### Key Features

- **Dual reference evaluation**: Each sample has both a written form (`text_tn`) and a spoken form (`text_itn`). WER is computed against both references.
- **Three prompt variants** generated as separate split files:
    - `test_neutral`: Neutral transcription prompt ("Transcribe the audio file into English text.")
    - `test_tn`: Text normalization prompt; expects written form (e.g., `$100`)
    - `test_itn`: Inverse text normalization prompt; expects spoken form (e.g., `one hundred dollars`)
- **Special normalization mode** `no_tn_itn`: Applies only lowercasing and punctuation removal. Whisper normalization is deliberately skipped because it converts number words to digits, which would defeat the purpose of TN/ITN evaluation.
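
As a rough illustration of what a lowercase-plus-punctuation-removal mode does, the sketch below normalizes a transcript without touching number words. The helper name `normalize_no_tn_itn` and the choice of ASCII punctuation from `string.punctuation` are assumptions made for this example; the actual `no_tn_itn` implementation in NeMo-Skills may treat symbols such as `$` or `.` differently.

```python
import string

# Hypothetical helper for illustration only; not the NeMo-Skills implementation.
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_no_tn_itn(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


print(normalize_no_tn_itn("One hundred dollars."))  # one hundred dollars
print(normalize_no_tn_itn("It costs $100!"))        # it costs 100
```

Number words stay as words and digits stay as digits, so a spoken-form hypothesis still differs from a written-form reference after normalization.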

### Preparing Numb3rs Data

Numb3rs requires audio files for evaluation. **Audio files are downloaded by default** from HuggingFace.

```bash
ns prepare_data numb3rs --data_dir=/path/to/data --cluster=<cluster_name>
```

To prepare without saving audio files:

```bash
ns prepare_data numb3rs --no-audio --skip_data_dir_check
```

To prepare only specific categories:

```bash
ns prepare_data numb3rs --categories CARDINAL DATE MONEY --data_dir=/path/to/data
```

To set a custom audio path prefix (for non-standard mount points):

```bash
ns prepare_data numb3rs --audio-prefix /my/custom/path --data_dir=/path/to/data
```

### Running Numb3rs Evaluation

The `--split` flag selects the prompt variant:

```bash
# Neutral prompt (default)
ns eval --benchmarks=numb3rs:1 --split=test_neutral ...

# Text normalization prompt (expects written form, e.g. "$100")
ns eval --benchmarks=numb3rs:1 --split=test_tn ...

# Inverse text normalization prompt (expects spoken form, e.g. "one hundred dollars")
ns eval --benchmarks=numb3rs:1 --split=test_itn ...
```

### Understanding Numb3rs Results

Numb3rs reports the following metrics:

- **wer**: Word Error Rate against the expected answer (written form for TN, spoken form for ITN/neutral)
- **wer_tn**: WER against the written form reference (`text_tn`)
- **wer_itn**: WER against the spoken form reference (`text_itn`)
- **success_rate**: Percentage of samples with WER < 0.5
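
The metrics above can be sketched for a single sample as follows. This is a minimal illustration, not the evaluator's code: `word_error_rate`, `score_sample`, and the `EXPECTED_FIELD` mapping are hypothetical names, the empty-reference convention is an assumption, and the inputs are assumed to be already normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Assumed convention for an empty reference.
        return 0.0 if not hyp else 1.0
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical mapping from split name to the field used as the expected answer.
EXPECTED_FIELD = {
    "test_tn": "text_tn",
    "test_itn": "text_itn",
    "test_neutral": "text_itn",
}


def score_sample(sample: dict, hypothesis: str, split: str) -> dict:
    """Compute wer, wer_tn, wer_itn and a per-sample success flag (WER < 0.5)."""
    wer_tn = word_error_rate(sample["text_tn"], hypothesis)
    wer_itn = word_error_rate(sample["text_itn"], hypothesis)
    wer = wer_tn if EXPECTED_FIELD[split] == "text_tn" else wer_itn
    return {"wer": wer, "wer_tn": wer_tn, "wer_itn": wer_itn, "is_success": wer < 0.5}


sample = {"text_tn": "it costs 100", "text_itn": "it costs one hundred dollars"}
print(score_sample(sample, "it costs one hundred dollars", "test_neutral"))
# {'wer': 0.0, 'wer_tn': 1.0, 'wer_itn': 0.0, 'is_success': True}
```

A model that answers in spoken form scores perfectly on `wer_itn` but poorly on `wer_tn`, which is why reporting both references is informative.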

Per-category breakdowns (e.g., `numb3rs-numb3rs_CARDINAL`, `numb3rs-numb3rs_MONEY`) are included automatically.
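
One way to picture the per-category breakdown is a simple grouping over per-sample results. The sketch below is an assumption-laden illustration: the `category` field name, the `per_category_summary` helper, and the use of a mean of per-sample WERs are hypothetical, and the real aggregation may differ (for example, by computing a corpus-level WER).

```python
from collections import defaultdict


def per_category_summary(results: list[dict]) -> dict[str, dict[str, float]]:
    """Group per-sample WERs by category and summarize each group."""
    grouped = defaultdict(list)
    for result in results:
        # "category" is an assumed field name for this example.
        grouped[result["category"]].append(result["wer"])

    summary = {}
    for category, wers in sorted(grouped.items()):
        summary[f"numb3rs-numb3rs_{category}"] = {
            "wer": sum(wers) / len(wers),  # mean of per-sample WERs (assumption)
            "success_rate": 100.0 * sum(w < 0.5 for w in wers) / len(wers),
            "num_samples": len(wers),
        }
    return summary


results = [
    {"category": "CARDINAL", "wer": 0.0},
    {"category": "CARDINAL", "wer": 1.0},
    {"category": "MONEY", "wer": 0.25},
]
for name, metrics in per_category_summary(results).items():
    print(name, metrics)
```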
67 changes: 67 additions & 0 deletions nemo_skills/dataset/numb3rs/__init__.py
@@ -0,0 +1,67 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Numb3rs: Numbers Speech Benchmark for TN/ITN evaluation.

A speech dataset for text normalization (TN) and inverse text normalization (ITN) tasks,
containing paired written/spoken forms with corresponding synthetic audio.

Dataset: https://huggingface.co/datasets/nvidia/Numb3rs

Categories: ADDRESS, CARDINAL, DATE, DECIMAL, DIGIT, FRACTION, MEASURE, MONEY,
ORDINAL, PLAIN, TELEPHONE, TIME

Features:
- Dual reference evaluation: text_tn (written form) and text_itn (spoken form)
- Multiple prompt variants available as separate files:
    - test_neutral.jsonl: Neutral transcription prompt
    - test_tn.jsonl: Text normalization prompt (expects written form)
    - test_itn.jsonl: Inverse text normalization prompt (expects spoken form)
- ~10K samples, ~4.89h total duration

Usage:
    # Prepare dataset (generates all 3 prompt variants)
    ns prepare_data numb3rs

    # Generate with neutral prompt, evaluate against both references
    ns generate ... \
        ++input_file=/path/to/numb3rs/test_neutral.jsonl \
        ++eval_config.reference_fields='[text_tn,text_itn]'

    # Generate with TN prompt (expects written form as answer)
    ns generate ... \
        ++input_file=/path/to/numb3rs/test_tn.jsonl \
        ++eval_config.reference_fields='[text_tn,text_itn]'

    # Generate with ITN prompt (expects spoken form as answer)
    ns generate ... \
        ++input_file=/path/to/numb3rs/test_itn.jsonl \
        ++eval_config.reference_fields='[text_tn,text_itn]'
"""

# Dataset configuration
DATASET_GROUP = "speechlm"
METRICS_TYPE = "audio"
DEFAULT_SPLIT = "test_neutral" # Use neutral prompt variant by default

# Evaluation settings
EVAL_SPLIT = "test_neutral" # Use neutral prompt variant by default
EVAL_ARGS = (
    "++eval_type=audio "
    "++eval_config.reference_fields='[text_tn,text_itn]' "  # Evaluate against both references
    "++eval_config.normalization_mode=no_tn_itn "  # Lowercase + remove punct.
)

# Generation settings - OpenAI format for audio-language models
GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"