NVIDIA-NeMo · Jorjeous · Feb 23, 2026 · Feb 9, 2026
diff --git a/docs/evaluation/speech-audio.md b/docs/evaluation/speech-audio.md
@@ -408,3 +408,78 @@ or
 ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data
 ```
 
+## Numb3rs
+
+Numb3rs is a speech benchmark for evaluating text normalization (TN) and inverse text normalization (ITN) capabilities of audio-language models. It contains paired written/spoken forms with corresponding synthetic audio, allowing evaluation of whether a model transcribes numbers in written form (e.g., `$100`, `3.14`) or spoken form (e.g., `one hundred dollars`, `three point one four`).
+
+**Dataset:** [nvidia/Numb3rs on HuggingFace](https://huggingface.co/datasets/nvidia/Numb3rs)
+
+**Categories:** `ADDRESS`, `CARDINAL`, `DATE`, `DECIMAL`, `DIGIT`, `FRACTION`, `MEASURE`, `MONEY`, `ORDINAL`, `PLAIN`, `TELEPHONE`, `TIME`
+
+**Size:** ~10K samples, ~4.89h total audio duration
+
+### Dataset Location
+
+- Benchmark is defined in [`nemo_skills/dataset/numb3rs/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/numb3rs/__init__.py)
+- Original dataset is hosted on [HuggingFace](https://huggingface.co/datasets/nvidia/Numb3rs)
+
+### Key Features
+
+- **Dual reference evaluation**: Each sample has both a written form (`text_tn`) and a spoken form (`text_itn`). WER is computed against both references.
+- **Three prompt variants** generated as separate split files:
+    - `test_neutral`: Neutral transcription prompt ("Transcribe the audio file into English text.")
+    - `test_tn`: Text normalization prompt that expects the written form (e.g., `$100`)
+    - `test_itn`: Inverse text normalization prompt that expects the spoken form (e.g., `one hundred dollars`)
+- **Special normalization mode** `no_tn_itn`: Applies only lowercasing and punctuation removal. Whisper normalization is skipped because it converts number words to digits, which would defeat the purpose of TN/ITN evaluation.
+
+### Preparing Numb3rs Data
+
+Numb3rs requires audio files for evaluation. **Audio files are downloaded by default** from HuggingFace.
+
+```bash
+ns prepare_data numb3rs --data_dir=/path/to/data --cluster=<cluster_name>
+```
+
+To prepare without saving audio files:
+
+```bash
+ns prepare_data numb3rs --no-audio --skip_data_dir_check
+```
+
+Prepare specific categories only:
+
+```bash
+ns prepare_data numb3rs --categories CARDINAL DATE MONEY --data_dir=/path/to/data
+```
+
+Set a custom audio path prefix (for non-standard mount points):
+
+```bash
+ns prepare_data numb3rs --audio-prefix /my/custom/path --data_dir=/path/to/data
+```
+
+### Running Numb3rs Evaluation
+
+The `--split` flag selects the prompt variant:
+
+```bash
+# Neutral prompt (default)
+ns eval --benchmarks=numb3rs:1 --split=test_neutral ...
+
+# Text normalization prompt (expects written form, e.g. "$100")
+ns eval --benchmarks=numb3rs:1 --split=test_tn ...
+
+# Inverse text normalization prompt (expects spoken form, e.g. "one hundred dollars")
+ns eval --benchmarks=numb3rs:1 --split=test_itn ...
+```
+
+### Understanding Numb3rs Results
+
+Numb3rs reports the following metrics:
+
+- **wer**: Word Error Rate against the expected answer (written form for TN, spoken form for ITN/neutral)
+- **wer_tn**: WER against the written form reference (`text_tn`)
+- **wer_itn**: WER against the spoken form reference (`text_itn`)
+- **success_rate**: Percentage of samples with WER < 0.5
+
+Per-category breakdowns (e.g., `numb3rs-numb3rs_CARDINAL`, `numb3rs-numb3rs_MONEY`) are included automatically.
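
The metrics listed under "Understanding Numb3rs Results" can be illustrated with a minimal, self-contained sketch. It is not the NeMo-Skills implementation: the function names (`normalize_no_tn_itn`, `word_error_rate`, `score_sample`, `success_rate`) are hypothetical, the punctuation set is assumed to be Python's `string.punctuation`, and WER is computed per sample with a plain word-level edit distance. The actual evaluator may aggregate or normalize differently.

```python
import string


def normalize_no_tn_itn(text: str) -> str:
    """Lowercase and strip ASCII punctuation; digits and number words are left untouched."""
    # Assumption: "punctuation removal" means dropping every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def score_sample(hypothesis: str, text_tn: str, text_itn: str, expected: str = "text_itn") -> dict:
    """Score one transcription against both references.

    `expected` selects which reference backs the headline `wer`:
    "text_tn" for the TN prompt, "text_itn" for the ITN and neutral prompts.
    """
    hyp = normalize_no_tn_itn(hypothesis)
    wer_tn = word_error_rate(normalize_no_tn_itn(text_tn), hyp)
    wer_itn = word_error_rate(normalize_no_tn_itn(text_itn), hyp)
    wer = wer_tn if expected == "text_tn" else wer_itn
    return {"wer": wer, "wer_tn": wer_tn, "wer_itn": wer_itn, "is_success": wer < 0.5}


def success_rate(wers: list[float]) -> float:
    """Percentage of samples whose WER is below 0.5."""
    return 100.0 * sum(w < 0.5 for w in wers) / len(wers)


result = score_sample(
    hypothesis="It costs one hundred dollars.",
    text_tn="It costs $100.",
    text_itn="It costs one hundred dollars.",
)
print(result)
```

In this example the model answered in spoken form, so `wer_itn` is zero while `wer_tn` is high. Comparing the two values is what shows which form a model actually produced, and it only works because the normalization leaves digits and number words as they are.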
diff --git a/nemo_skills/dataset/numb3rs/__init__.py b/nemo_skills/dataset/numb3rs/__init__.py
@@ -0,0 +1,67 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Numb3rs: Numbers Speech Benchmark for TN/ITN evaluation.
+
+A speech dataset for text normalization (TN) and inverse text normalization (ITN) tasks,
+containing paired written/spoken forms with corresponding synthetic audio.
+
+Dataset: https://huggingface.co/datasets/nvidia/Numb3rs
+
+Categories: ADDRESS, CARDINAL, DATE, DECIMAL, DIGIT, FRACTION, MEASURE, MONEY,
+           ORDINAL, PLAIN, TELEPHONE, TIME
+
+Features:
+- Dual reference evaluation: text_tn (written form) and text_itn (spoken form)
+- Multiple prompt variants available as separate files:
+  - test_neutral.jsonl: Neutral transcription prompt
+  - test_tn.jsonl: Text normalization prompt (expects written form)
+  - test_itn.jsonl: Inverse text normalization prompt (expects spoken form)
+- ~10K samples, ~4.89h total duration
+
+Usage:
+    # Prepare dataset (generates all 3 prompt variants)
+    ns prepare_data numb3rs
+
+    # Generate with neutral prompt, evaluate against both references
+    ns generate ... \
+        ++input_file=/path/to/numb3rs/test_neutral.jsonl \
+        ++eval_config.reference_fields='[text_tn,text_itn]'
+
+    # Generate with TN prompt (expects written form as answer)
+    ns generate ... \
+        ++input_file=/path/to/numb3rs/test_tn.jsonl \
+        ++eval_config.reference_fields='[text_tn,text_itn]'
+
+    # Generate with ITN prompt (expects spoken form as answer)
+    ns generate ... \
+        ++input_file=/path/to/numb3rs/test_itn.jsonl \
+        ++eval_config.reference_fields='[text_tn,text_itn]'
+"""
+
+# Dataset configuration
+DATASET_GROUP = "speechlm"
+METRICS_TYPE = "audio"
+DEFAULT_SPLIT = "test_neutral"  # Use neutral prompt variant by default
+
+# Evaluation settings
+EVAL_SPLIT = "test_neutral"  # Use neutral prompt variant by default
+EVAL_ARGS = (
+    "++eval_type=audio "
+    "++eval_config.reference_fields='[text_tn,text_itn]' "  # Evaluate against both references
+    "++eval_config.normalization_mode=no_tn_itn "  # Lowercase + remove punct.
+)
+
+# Generation settings - OpenAI format for audio-language models
+GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"
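
To see what `EVAL_ARGS` resolves to, the sketch below tokenizes the string the way a POSIX shell would and collects the `++key=value` overrides into a dictionary. This is only an illustration: `parse_overrides` is a hypothetical helper, and the real pipeline hands these overrides to its own config system rather than to this function.

```python
import shlex

# Copied from the module above; adjacent string literals concatenate into one string.
EVAL_ARGS = (
    "++eval_type=audio "
    "++eval_config.reference_fields='[text_tn,text_itn]' "
    "++eval_config.normalization_mode=no_tn_itn "
)


def parse_overrides(args: str) -> dict[str, str]:
    """Split a string of `++key=value` overrides into a key -> value mapping."""
    overrides = {}
    for token in shlex.split(args):  # shlex strips the single quotes around the list
        key, _, value = token.lstrip("+").partition("=")
        overrides[key] = value
    return overrides


for key, value in parse_overrides(EVAL_ARGS).items():
    print(f"{key} = {value}")
```

The trailing space inside each literal is what keeps the overrides separated after concatenation, and the single quotes keep the bracketed list intact when the string passes through a shell.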