diff --git a/.gitignore b/.gitignore index d9478b2c61..09edc8b81c 100644 --- a/.gitignore +++ b/.gitignore @@ -44,4 +44,5 @@ nemo_skills/dataset/aalcr/lcr/ .idea/* CLAUDE.md -.idea +# AudioBench repository (auto-cloned during data preparation) +AudioBench/ diff --git a/docs/evaluation/speech-audio.md b/docs/evaluation/speech-audio.md index 1d7d439928..07bf32332d 100644 --- a/docs/evaluation/speech-audio.md +++ b/docs/evaluation/speech-audio.md @@ -2,6 +2,11 @@ This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription. +!!! warning "Running without audio files" + If you want to evaluate without audio files (not recommended), use the + `--no-audio` flag. In this case you can also set `--skip_data_dir_check` + as data is very lightweight when audio files aren't being used. + ## Supported benchmarks ### MMAU-Pro @@ -21,11 +26,6 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation. -!!! warning "Running without audio files" - If you want to evaluation without audio files (not recommended) use - `--no-audio` flag. In this case you can also set `--skip_data_dir_check` - as data is very lightweight when audio files aren't being used. - ### Data Preparation To prepare the dataset with audio files: @@ -46,7 +46,7 @@ ns prepare_data mmau-pro --data_dir=/path/to/data --cluster= If you need to prepare without audio files: ```bash -ns prepare_data mmau-pro --no-audio +ns prepare_data mmau-pro --no-audio --skip_data_dir_check ``` Note: The git repository check is automatically skipped with `--no-audio`.
@@ -100,12 +100,9 @@ eval( --server_container=/path/to/server_container.sqsh \ --data_dir=/dataset \ --installation_command="pip install sacrebleu" \ - ++prompt_suffix='/no_think' \ + ++max_concurrent_requests=1 \ --server_args="--inference-max-requests 1 \ - --model-config /workspace/path/to/checkpoint-tp1/config.yaml \ - --num-tokens-to-generate 256 \ - --temperature 1.0 \ - --top_p 1.0" + --model-config /workspace/path/to/checkpoint-tp1/config.yaml" ``` ## How Evaluation Works @@ -271,3 +268,133 @@ pass@1 | 0 | 6580 | 55.52% | 0.00% | 290 evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries pass@1 | 11 | 6879 | 31.44% | 0.00% | 5305 ``` + + +### LibriSpeech-PC + +LibriSpeech-PC is an Automatic Speech Recognition (ASR) benchmark that evaluates models' ability to transcribe speech with proper punctuation and capitalization. It builds upon the original LibriSpeech corpus with enhanced reference transcripts. + +#### Dataset Location + +- Benchmark is defined in [`nemo_skills/dataset/librispeech-pc/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeech-pc/__init__.py) +- Manifests (with punctuation/capitalization) from [OpenSLR-145](https://www.openslr.org/145/) +- Audio files from original [LibriSpeech OpenSLR-12](https://www.openslr.org/12/) + +#### Available Splits + +- `test-clean`: Clean speech recordings (easier subset) +- `test-other`: More challenging recordings with varied acoustic conditions + +## Preparing LibriSpeech-PC Data + +LibriSpeech-PC requires audio files for ASR evaluation. **Audio files are downloaded by default**.
+ +### Data Preparation + +To prepare the dataset with audio files: + +```bash +ns prepare_data librispeech-pc --data_dir=/path/to/data --cluster= +``` + +**What happens:** + +- Downloads manifests with punctuation/capitalization from OpenSLR-145 +- Downloads audio files from original LibriSpeech (OpenSLR-12) +- Prepares both `test-clean` and `test-other` splits + +### Preparing Specific Splits + +To prepare only one split: + +```bash +ns prepare_data librispeech-pc --split test-clean --data_dir=/path/to/data +``` + +or + +```bash +ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data +``` + +## Running LibriSpeech-PC Evaluation + +!!! note + Currently supports only Megatron server type (`--server_type=megatron`). + +### Evaluation Example + +```python +import os +from nemo_skills.pipeline.cli import wrap_arguments, eval + +eval( + ctx=wrap_arguments(""), + cluster="oci_iad", + output_dir="/workspace/librispeech-pc-eval", + benchmarks="librispeech-pc", + server_type="megatron", + server_gpus=1, + model="/workspace/checkpoint", + server_entrypoint="/workspace/megatron-lm/server.py", + server_container="/path/to/container.sqsh", + data_dir="/dataset", + installation_command="pip install sacrebleu whisper jiwer", + server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml", +) +``` + +??? 
note "Alternative: Command-line usage" + + If you prefer using the command-line interface, you can run: + + ```bash + export MEGATRON_PATH=/workspace/path/to/megatron-lm + + ns eval \ + --cluster=oci_iad \ + --output_dir=/workspace/path/to/librispeech-pc-eval \ + --benchmarks=librispeech-pc \ + --server_type=megatron \ + --server_gpus=1 \ + --model=/workspace/path/to/checkpoint-tp1 \ + --server_entrypoint=$MEGATRON_PATH/path/to/server.py \ + --server_container=/path/to/server_container.sqsh \ + --data_dir=/dataset \ + --installation_command="pip install sacrebleu whisper jiwer" \ + ++max_concurrent_requests=1 \ + --server_args="--inference-max-requests 1 \ + --model-config /workspace/path/to/checkpoint-tp1/config.yaml" + ``` + +## How LibriSpeech-PC Evaluation Works + +The evaluation measures ASR accuracy using multiple Word Error Rate (WER) metrics: + +| Metric | Description | +|--------|-------------| +| **WER** | Word Error Rate - measures transcription accuracy ignoring punctuation and capitalization | +| **WER_C** | Word Error Rate with Capitalization - measures accuracy including capitalization | +| **WER_PC** | Word Error Rate with Punctuation and Capitalization - measures full accuracy including both | +| **PER** | Punctuation Error Rate - measures how well the model predicts punctuation marks | + +### Sub-benchmarks + +Evaluate individual splits: + +- `librispeech-pc.test-clean` - Easier, clean speech subset +- `librispeech-pc.test-other` - More challenging subset with varied conditions + +```python +eval(benchmarks="librispeech-pc.test-clean", ...) 
+``` + +### Evaluation Output Format + +**test-clean Split:** + +``` +------------------------------- librispeech-pc.test-clean ----------------------------- +evaluation_mode | avg_tokens | gen_seconds | wer | wer_c | wer_pc | per | num_entries +pass@1 | 15 | 120 | 4.23% | 4.85% | 5.12% | 2.34% | 2620 +``` \ No newline at end of file diff --git a/nemo_skills/dataset/audiobench/__init__.py b/nemo_skills/dataset/audiobench/__init__.py new file mode 100644 index 0000000000..007c326511 --- /dev/null +++ b/nemo_skills/dataset/audiobench/__init__.py @@ -0,0 +1,36 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench: A comprehensive benchmark for speech and audio language models. + +AudioBench evaluates models across multiple tasks: +- ASR (Automatic Speech Recognition) +- Translation (speech-to-text translation) +- Speech QA (question answering based on audio) +- Audio understanding (emotion, gender, accent recognition, etc.) 
+ +The benchmark is organized into two main categories: +- nonjudge: Tasks evaluated with automatic metrics (WER, BLEU) +- judge: Tasks requiring LLM-as-a-judge evaluation +""" + +DATASET_GROUP = "speechlm" +IS_BENCHMARK_GROUP = True +SCORE_MODULE = "nemo_skills.evaluation.metrics.speechlm_metrics" + +# Top-level benchmarks: evaluate all judge or all nonjudge datasets +BENCHMARKS = { + "audiobench.nonjudge": {}, + "audiobench.judge": {}, +} diff --git a/nemo_skills/dataset/audiobench/judge/__init__.py b/nemo_skills/dataset/audiobench/judge/__init__.py new file mode 100644 index 0000000000..69b86c575d --- /dev/null +++ b/nemo_skills/dataset/audiobench/judge/__init__.py @@ -0,0 +1,39 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench judge tasks dataset configuration. + +This dataset includes tasks that require LLM-based evaluation such as: +- Audio captioning +- Spoken question answering +- Audio understanding and reasoning + +These tasks require an LLM judge for evaluation, matching MMAU-Pro evaluation setup. 
+""" + +# Dataset configuration - CRITICAL: needed for audio to work +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" +DEFAULT_SPLIT = "test" +GENERATION_ARGS = "++prompt_format=openai " + +# Judge configuration matching AudioBench official implementation +# Using Llama-3.1-70B with vllm (can be overridden in run scripts) +JUDGE_PIPELINE_ARGS = { + "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", + "server_type": "vllm", + "server_gpus": 8, + "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95", +} +JUDGE_ARGS = "++prompt_config=judge/audiobench ++generation_key=judgement" diff --git a/nemo_skills/dataset/audiobench/nonjudge/__init__.py b/nemo_skills/dataset/audiobench/nonjudge/__init__.py new file mode 100644 index 0000000000..fc4704d75c --- /dev/null +++ b/nemo_skills/dataset/audiobench/nonjudge/__init__.py @@ -0,0 +1,31 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench non-judge tasks dataset configuration. + +This dataset includes ASR, translation, and other tasks that use +automatic metrics (WER, BLEU, WER-PC) instead of judge evaluation. + +NO JUDGE REQUIRED - Metrics computed automatically from model outputs. 
+""" + +# Dataset configuration - CRITICAL: needed for audio to work +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" + +# Evaluation settings +EVAL_ARGS = "++eval_type=audiobench " + +# Generation settings - OpenAI format for audio-language models +GENERATION_ARGS = "++prompt_format=openai " diff --git a/nemo_skills/dataset/audiobench/prepare.py b/nemo_skills/dataset/audiobench/prepare.py new file mode 100644 index 0000000000..d27a174ebf --- /dev/null +++ b/nemo_skills/dataset/audiobench/prepare.py @@ -0,0 +1,546 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench Dataset Preparation for nemo-skills + +This script prepares AudioBench datasets for evaluation with nemo-skills. +AudioBench is a comprehensive benchmark for evaluating speech and audio models +across multiple tasks including ASR, translation, speech QA, and more. 
+ +Usage: + python -m nemo_skills.dataset.audiobench.prepare --split test + python -m nemo_skills.dataset.audiobench.prepare --datasets librispeech_test_clean earnings21_test + python -m nemo_skills.dataset.audiobench.prepare --category nonjudge +""" + +import argparse +import json +import os +import shutil +import subprocess +import sys +from pathlib import Path +from typing import Dict, List + +import numpy as np +import soundfile as sf +from tqdm import tqdm + +# AudioBench datasets categorized by evaluation type +JUDGE_DATASETS = [ + "alpaca_audio_test", + "audiocaps_qa_test", + "audiocaps_test", + "clotho_aqa_test", + "cn_college_listen_mcq_test", + "dream_tts_mcq_test", + "iemocap_emotion_test", + "iemocap_gender_test", + "imda_ar_dialogue", + "imda_ar_sentence", + "imda_gr_dialogue", + "imda_gr_sentence", + "imda_part3_30s_ds_human_test", + "imda_part4_30s_ds_human_test", + "imda_part5_30s_ds_human_test", + "imda_part6_30s_ds_human_test", + "imda_part3_30s_sqa_human_test", + "imda_part4_30s_sqa_human_test", + "imda_part5_30s_sqa_human_test", + "imda_part6_30s_sqa_human_test", + "meld_emotion_test", + "meld_sentiment_test", + "mmau_mini", + "muchomusic_test", + "openhermes_audio_test", + "public_sg_speech_qa_test", + "slue_p2_sqa5_test", + "spoken_squad_test", + "voxceleb_accent_test", + "voxceleb_gender_test", + "wavcaps_qa_test", + "wavcaps_test", +] + +NONJUDGE_DATASETS = [ + "aishell_asr_zh_test", + "common_voice_15_en_test", + "covost2_en_id_test", + "covost2_en_ta_test", + "covost2_en_zh_test", + "covost2_id_en_test", + "covost2_ta_en_test", + "covost2_zh_en_test", + "earnings21_test", + "earnings22_test", + "gigaspeech_test", + "gigaspeech2_indo", + "gigaspeech2_thai", + "gigaspeech2_viet", + "imda_part1_asr_test", + "imda_part2_asr_test", + "imda_part3_30s_asr_test", + "imda_part4_30s_asr_test", + "imda_part5_30s_asr_test", + "imda_part6_30s_asr_test", + "librispeech_test_clean", + "librispeech_test_other", + "peoples_speech_test", + "seame_dev_man", 
+ "seame_dev_sge", + "spoken-mqa_long_digit", + "spoken-mqa_multi_step_reasoning", + "spoken-mqa_short_digit", + "spoken-mqa_single_step_reasoning", + "tedlium3_test", + "tedlium3_long_form_test", +] + + +def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: + """Compute audio duration in seconds from array and sampling rate.""" + if audio_array is None or len(audio_array) == 0: + return 0.0 + return float(len(audio_array) / sampling_rate) + + +def save_audio_file(audio_array: np.ndarray, sampling_rate: int, output_path: str): + """Save audio array to WAV file.""" + os.makedirs(os.path.dirname(output_path), exist_ok=True) + sf.write(output_path, audio_array, sampling_rate) + + +def create_manifest_entry( + sample: Dict, + audio_filename: str, + duration: float, + dataset_name: str, + sample_id: int, + category: str, +) -> Dict: + """Create a nemo-skills compatible manifest entry. + + Args: + sample: Raw sample from AudioBench dataset + audio_filename: Audio filename (relative path within audiobench directory) + duration: Audio duration in seconds + dataset_name: Name of the dataset + sample_id: Sample index + category: Category (judge/nonjudge) + + Returns: + Manifest entry dict with proper format for nemo-skills + """ + instruction = sample.get("instruction", sample.get("text", "Process the audio")) + reference = sample.get("reference", sample.get("answer", "")) + task_type = sample.get("task_type", "unknown") + + # Create absolute audio path with /data/ prefix for cluster deployment + # Format: /data/audiobench/{category}/audio/{dataset_name}/{filename} + audio_rel_path = f"/data/audiobench/{category}/audio/{dataset_name}/{audio_filename}" + + # Create audio metadata (both singular and plural forms for compatibility) + audio_metadata = {"path": audio_rel_path, "duration": duration} + + entry = { + "expected_answer": reference, + "audio_path": [audio_rel_path], + "messages": [ + {"role": "system", "content": "You are a helpful assistant. 
/no_think"}, + { + "role": "user", + "content": instruction, + "audio": audio_metadata, + "audios": [audio_metadata], + }, + ], + "dataset": dataset_name, + "subset_for_metrics": dataset_name, + "sample_id": sample_id, + "task_type": task_type, + "question": instruction, + } + + for key in [ + "choices", + "options", + "audio_text_instruction", + "audio_gt", + "dimension", + "rule_type", + "rule_target", + "task", + ]: + if key in sample: + entry[key] = sample[key] + + return entry + + +def clone_audiobench_repo(target_dir: Path) -> bool: + """Clone AudioBench repository if it doesn't exist. + + Args: + target_dir: Directory where AudioBench should be cloned + + Returns: + True if successful, False otherwise + """ + audiobench_url = "https://github.com/AudioLLMs/AudioBench.git" + + if target_dir.exists(): + print(f"AudioBench already exists at {target_dir}") + return True + + print(f"\nCloning AudioBench repository to {target_dir}...") + print("This may take a few minutes...") + + try: + subprocess.run( + ["git", "clone", audiobench_url, str(target_dir)], + check=True, + capture_output=True, + text=True, + ) + print("✓ Successfully cloned AudioBench") + return True + except subprocess.CalledProcessError as e: + print(f"✗ Failed to clone AudioBench: {e.stderr}") + return False + except FileNotFoundError: + print("✗ git command not found. Please install git or manually clone AudioBench.") + return False + + +def process_dataset( + dataset_name: str, + output_dir: Path, + save_audio: bool = True, + split: str = "test", + audiobench_path: Path = None, + max_samples: int = -1, +) -> tuple[int, List[Dict]]: + """Process a single AudioBench dataset. 
+ + Args: + dataset_name: Name of the dataset to process + output_dir: Base output directory + save_audio: Whether to save audio files + split: Dataset split (default: "test") + audiobench_path: Path to AudioBench repository + + Returns: + Tuple of (num_samples, manifest_entries) + """ + print(f"\n{'=' * 60}") + print(f"Processing: {dataset_name}") + print(f"{'=' * 60}") + + # Import AudioBench Dataset class + sys.path.insert(0, str(audiobench_path / "src")) + try: + from dataset import Dataset + except ImportError as e: + raise ImportError( + f"Failed to import AudioBench Dataset class: {e}\n" + f"AudioBench path: {audiobench_path}\n" + f"Make sure AudioBench repository is properly set up." + ) + + # Load dataset + try: + dataset = Dataset(dataset_name, number_of_samples=max_samples) + data_samples = dataset.input_data + print(f"Loaded {len(data_samples)} samples via AudioBench") + except Exception as e: + raise Exception(f"Failed to load dataset {dataset_name}: {e}") + + # Determine category (handle _test suffix variants) + dataset_base = dataset_name.replace("_test", "") + if dataset_name in JUDGE_DATASETS or dataset_base in JUDGE_DATASETS: + category = "judge" + elif dataset_name in NONJUDGE_DATASETS or dataset_base in NONJUDGE_DATASETS: + category = "nonjudge" + else: + category = "unknown" + + # Create output directories + audio_dir = output_dir / category / "audio" / dataset_name + dataset_dir = output_dir / category / dataset_name + os.makedirs(audio_dir, exist_ok=True) + os.makedirs(dataset_dir, exist_ok=True) + + # Copy __init__.py from category folder to dataset folder + category_init = output_dir / category / "__init__.py" + dataset_init = dataset_dir / "__init__.py" + if category_init.exists() and not dataset_init.exists(): + shutil.copy2(category_init, dataset_init) + print(f"✓ Copied __init__.py to {dataset_dir}") + + manifest_entries = [] + successful = 0 + failed = 0 + + for idx, sample in enumerate(tqdm(data_samples, desc=f"Processing 
{dataset_name}")): + try: + # Get audio data + audio_dict = sample.get("audio") + if audio_dict is None: + print(f"Warning: Sample {idx} has no audio, skipping") + failed += 1 + continue + + # Extract audio array and sampling rate + if isinstance(audio_dict, dict): + audio_array = audio_dict.get("array") + sampling_rate = audio_dict.get("sampling_rate", 16000) + else: + print(f"Warning: Unexpected audio format at sample {idx}") + failed += 1 + continue + + if audio_array is None or len(audio_array) == 0: + print(f"Warning: Empty audio at sample {idx}, skipping") + failed += 1 + continue + + # Convert to numpy array if needed + if isinstance(audio_array, list): + audio_array = np.array(audio_array) + + # Compute duration + duration = get_audio_duration(audio_array, sampling_rate) + + # Define audio file paths + audio_filename = f"{dataset_name}_{idx:06d}.wav" + local_audio_path = audio_dir / audio_filename + + # Save audio file + if save_audio: + try: + save_audio_file(audio_array, sampling_rate, str(local_audio_path)) + except Exception as e: + print(f"Warning: Failed to save audio for sample {idx}: {e}") + failed += 1 + continue + + # Create manifest entry with relative path + entry = create_manifest_entry( + sample=sample, + audio_filename=audio_filename, + duration=duration, + dataset_name=dataset_name, + sample_id=idx, + category=category, + ) + + manifest_entries.append(entry) + successful += 1 + + except Exception as e: + print(f"Error processing sample {idx}: {e}") + failed += 1 + continue + + # Save dataset-specific manifest to dataset directory + manifest_path = dataset_dir / f"{split}.jsonl" + with open(manifest_path, "w", encoding="utf-8") as f: + for entry in manifest_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + + print(f"✓ Saved {successful} samples to {manifest_path}") + if failed > 0: + print(f"✗ Failed to process {failed} samples") + + return successful, manifest_entries + + +def main(): + parser = 
argparse.ArgumentParser(description="Prepare AudioBench datasets for nemo-skills evaluation") + parser.add_argument( + "--split", + default="test", + choices=["train", "validation", "test"], + help="Dataset split to prepare", + ) + parser.add_argument( + "--output_dir", + type=str, + default=None, + help="Output directory (defaults to the directory containing this script)", + ) + parser.add_argument( + "--datasets", + nargs="+", + help="Specific dataset(s) to process (e.g., librispeech_test_clean earnings21_test)", + ) + parser.add_argument( + "--category", + choices=["judge", "nonjudge", "all"], + default="all", + help="Process only judge, nonjudge, or all datasets", + ) + parser.add_argument( + "--no-audio", + dest="save_audio", + action="store_false", + help="Skip saving audio files (only create manifests)", + ) + parser.add_argument( + "--audiobench-path", + type=str, + default=None, + help="Path to AudioBench repository (will auto-clone if not found)", + ) + parser.add_argument( + "--max-samples", + type=int, + default=-1, + help="Maximum number of samples to process per dataset (-1 for all)", + ) + parser.set_defaults(save_audio=True) + + args = parser.parse_args() + + # Determine output directory + if args.output_dir: + output_dir = Path(args.output_dir) + else: + # Use dataset directory as output (files will be in nemo_skills/dataset/audiobench/) + output_dir = Path(__file__).parent + + output_dir.mkdir(parents=True, exist_ok=True) + + # Determine AudioBench repository path + if args.audiobench_path: + audiobench_path = Path(args.audiobench_path) + else: + audiobench_path = os.getenv("AUDIOBENCH_REPO_PATH") + if audiobench_path: + audiobench_path = Path(audiobench_path) + else: + # Default to AudioBench directory (same level as this script) + audiobench_path = Path(__file__).parent / "AudioBench" + + # Clone AudioBench if it doesn't exist + if not audiobench_path.exists(): + print(f"\nAudioBench not found at {audiobench_path}") + if not
clone_audiobench_repo(audiobench_path): + print("\nFailed to clone AudioBench. Please clone it manually:") + print(" git clone https://github.com/AudioLLMs/AudioBench.git") + sys.exit(1) + + # Verify AudioBench structure + if not (audiobench_path / "src" / "dataset.py").exists(): + print(f"\nError: AudioBench repository at {audiobench_path} is missing src/dataset.py") + print("Please ensure the repository is properly cloned.") + sys.exit(1) + + print("\n" + "=" * 60) + print("AudioBench Dataset Preparation") + print("=" * 60) + print(f"AudioBench path: {audiobench_path}") + print(f"Output directory: {output_dir}") + print(f"Save audio files: {args.save_audio}") + print(f"Split: {args.split}") + print("=" * 60 + "\n") + + # Determine which datasets to process + if args.datasets: + target_datasets = args.datasets + else: + all_datasets = JUDGE_DATASETS + NONJUDGE_DATASETS + if args.category == "judge": + target_datasets = JUDGE_DATASETS + elif args.category == "nonjudge": + target_datasets = NONJUDGE_DATASETS + else: # all + target_datasets = all_datasets + + print(f"Processing {len(target_datasets)} dataset(s)\n") + + # Process datasets + total_samples = 0 + successful_datasets = [] + failed_datasets = [] + judge_entries = [] + nonjudge_entries = [] + + for dataset_name in target_datasets: + try: + num_samples, entries = process_dataset( + dataset_name=dataset_name, + output_dir=output_dir, + save_audio=args.save_audio, + split=args.split, + audiobench_path=audiobench_path, + max_samples=args.max_samples, + ) + total_samples += num_samples + successful_datasets.append(dataset_name) + + if dataset_name in JUDGE_DATASETS: + judge_entries.extend(entries) + elif dataset_name in NONJUDGE_DATASETS: + nonjudge_entries.extend(entries) + + except Exception as e: + print(f"\n✗ FAILED: {dataset_name}") + print(f" Error: {e}\n") + failed_datasets.append((dataset_name, str(e))) + + # Create combined test.jsonl files + if judge_entries: + judge_test_path = output_dir / "judge" / 
f"{args.split}.jsonl" + judge_test_path.parent.mkdir(parents=True, exist_ok=True) + with open(judge_test_path, "w", encoding="utf-8") as f: + for entry in judge_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + print(f"\n✓ Combined judge {args.split}.jsonl: {judge_test_path}") + print(f" Total samples: {len(judge_entries)}") + + if nonjudge_entries: + nonjudge_test_path = output_dir / "nonjudge" / f"{args.split}.jsonl" + nonjudge_test_path.parent.mkdir(parents=True, exist_ok=True) + with open(nonjudge_test_path, "w", encoding="utf-8") as f: + for entry in nonjudge_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + print(f"\n✓ Combined nonjudge {args.split}.jsonl: {nonjudge_test_path}") + print(f" Total samples: {len(nonjudge_entries)}") + + # Print summary + print("\n" + "=" * 60) + print("SUMMARY") + print("=" * 60) + print(f"Datasets requested: {len(target_datasets)}") + print(f"Successfully processed: {len(successful_datasets)}") + print(f"Failed: {len(failed_datasets)}") + print(f"Total samples: {total_samples}") + + if successful_datasets: + print(f"\nSuccessful datasets ({len(successful_datasets)}):") + for name in successful_datasets: + category = "judge" if name in JUDGE_DATASETS else "nonjudge" + print(f" ✓ {name} ({category})") + + if failed_datasets: + print(f"\nFailed datasets ({len(failed_datasets)}):") + for name, error in failed_datasets: + print(f" ✗ {name}: {error}") + + print("=" * 60 + "\n") + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/dataset/librispeech-pc/__init__.py b/nemo_skills/dataset/librispeech-pc/__init__.py new file mode 100644 index 0000000000..5fbfe2b1cb --- /dev/null +++ b/nemo_skills/dataset/librispeech-pc/__init__.py @@ -0,0 +1,26 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""LibriSpeech-PC: ASR evaluation with Punctuation and Capitalization. + +Test sets (evaluation only): +- test-clean: Clean speech recordings (~2.6k samples) +- test-other: More challenging speech with various acoustic conditions (~2.9k samples) +""" + +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" +DEFAULT_SPLIT = "test-clean" +EVAL_ARGS = "++eval_type=audiobench " +GENERATION_ARGS = "++prompt_format=openai " diff --git a/nemo_skills/dataset/librispeech-pc/prepare.py b/nemo_skills/dataset/librispeech-pc/prepare.py new file mode 100644 index 0000000000..b2c6b6c87d --- /dev/null +++ b/nemo_skills/dataset/librispeech-pc/prepare.py @@ -0,0 +1,175 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Prepare LibriSpeech-PC for ASR evaluation with punctuation and capitalization. + +LibriSpeech-PC provides manifests with punctuation/capitalization from OpenSLR-145. +Audio files are downloaded from original LibriSpeech at OpenSLR-12. 
+ +Usage: + ns prepare_data librispeech-pc --data_dir + ns prepare_data librispeech-pc --split test-clean (or test-other) --data_dir +""" + +import argparse +import json +import os +import tarfile +import urllib.request +from pathlib import Path + +from tqdm import tqdm + + +def download_with_progress(url: str, output_path: Path, desc: str): + """Download file with tqdm progress bar.""" + with tqdm(unit="B", unit_scale=True, unit_divisor=1024, desc=desc) as pbar: + + def reporthook(blocknum, blocksize, totalsize): + if pbar.total != totalsize: + pbar.total = totalsize + pbar.update(blocksize) + + urllib.request.urlretrieve(url, output_path, reporthook) + + +# LibriSpeech-PC manifests (with punctuation and capitalization) +MANIFESTS_URL = "https://www.openslr.org/resources/145/manifests.tar.gz" + +# Original LibriSpeech audio files +AUDIO_URLS = { + "test-clean": "https://www.openslr.org/resources/12/test-clean.tar.gz", + "test-other": "https://www.openslr.org/resources/12/test-other.tar.gz", +} + + +def download_manifests(output_dir: Path) -> Path: + """Download LibriSpeech-PC manifests if not already present.""" + if (output_dir / "test-clean.json").exists(): + return output_dir + + tar_path = output_dir / "manifests.tar.gz" + download_with_progress(MANIFESTS_URL, tar_path, "Downloading manifests") + + with tarfile.open(tar_path, "r:gz") as tar: + for member in tar.getmembers(): + if member.name in ["test-clean.json", "test-other.json"]: + tar.extract(member, output_dir, filter="data") + os.remove(tar_path) + + print("✓ Manifests ready\n") + return output_dir + + +def download_audio(split: str, audio_dir: Path): + """Download LibriSpeech audio files if not already present.""" + split_dir = audio_dir / "LibriSpeech" / split + if split_dir.exists(): + return + + tar_path = audio_dir / f"{split}.tar.gz" + download_with_progress(AUDIO_URLS[split], tar_path, f"Downloading {split}") + + with tarfile.open(tar_path, "r:gz") as tar: +
tar.extractall(audio_dir, filter="data") + os.remove(tar_path) + + +def process_split(split: str, data_dir: Path, audio_dir: Path, with_audio: bool) -> int: + """Process one LibriSpeech-PC split into nemo-skills format.""" + + output_file = data_dir / f"{split}.jsonl" + manifest_file = data_dir / f"{split}.json" + if not manifest_file.exists(): + print(f"✗ Manifest not found: {manifest_file}") + return 0 + + if with_audio: + download_audio(split, audio_dir) + + with open(manifest_file, "r") as f: + entries = [json.loads(line) for line in f if line.strip()] + + processed = 0 + skipped = 0 + + with open(output_file, "w") as fout: + for entry in entries: + audio_filepath = entry.get("audio_filepath", "") + text = entry.get("text", "") + + if not audio_filepath or not text: + skipped += 1 + continue + + audio_id = Path(audio_filepath).stem + + container_path = f"/dataset/librispeech-pc/LibriSpeech/{audio_filepath}" + + user_message = { + "role": "user", + "content": "Transcribe the audio with proper punctuation and capitalization.", + "audio": {"path": container_path}, + } + + output_entry = { + "audio_filepath": container_path, + "text": text, + "expected_answer": text, + "task_type": "ASR-PC", + "sample_id": audio_id, + "split": split, + "messages": [{"role": "system", "content": "You are a helpful assistant. 
/no_think"}, user_message], + } + + fout.write(json.dumps(output_entry, ensure_ascii=False) + "\n") + processed += 1 + + print(f"✓ {split}: {processed} samples" + (f" ({skipped} skipped)" if skipped > 0 else "")) + + if processed > 0 and manifest_file.exists(): + os.remove(manifest_file) + + return processed + + +def main(): + parser = argparse.ArgumentParser(description="Prepare LibriSpeech-PC for ASR evaluation") + parser.add_argument( + "--split", + default="all", + choices=["all", "test-clean", "test-other"], + help="Which split to prepare (default: all)", + ) + parser.add_argument( + "--no-audio", + action="store_true", + help="Skip audio download", + ) + args = parser.parse_args() + + data_dir = Path(__file__).parent + audio_dir = data_dir + audio_dir.mkdir(exist_ok=True) + + download_manifests(data_dir) + + splits = ["test-clean", "test-other"] if args.split == "all" else [args.split] + total = sum(process_split(split, data_dir, audio_dir, not args.no_audio) for split in splits) + + print(f"\n✓ Complete: {total} samples") + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/dataset/mmau-pro/open_ended/__init__.py b/nemo_skills/dataset/mmau-pro/open_ended/__init__.py index 22773d6fed..c5f09272d2 100644 --- a/nemo_skills/dataset/mmau-pro/open_ended/__init__.py +++ b/nemo_skills/dataset/mmau-pro/open_ended/__init__.py @@ -23,4 +23,4 @@ "server_type": "openai", "server_address": "https://integrate.api.nvidia.com/v1", } -JUDGE_ARGS = "++prompt_config=judge/speechlm ++generation_key=judgement" +JUDGE_ARGS = "++prompt_config=judge/mmau-pro ++generation_key=judgement" diff --git a/nemo_skills/dataset/mmau-pro/prepare.py b/nemo_skills/dataset/mmau-pro/prepare.py index a6f04d621b..0ea66ec2b7 100644 --- a/nemo_skills/dataset/mmau-pro/prepare.py +++ b/nemo_skills/dataset/mmau-pro/prepare.py @@ -75,8 +75,8 @@ def format_entry(entry, with_audio=False): if category == "open": content = entry["question"] elif choices and len(choices) > 1: - options_text = 
"\n".join(f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)) - content = f"{entry['question']}\n\n{options_text}" + options_text = "\n".join(f"{chr(65 + i)}) {choice}" for i, choice in enumerate(choices)) + content = f"{entry['question']}\n\n{options_text}\n\nRespond with the complete text of the correct option, not just the letter." else: content = entry["question"] @@ -84,13 +84,18 @@ def format_entry(entry, with_audio=False): if entry.get("audio_path"): audio_path = entry["audio_path"] - - if isinstance(audio_path, list) and audio_path: - user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path] - elif isinstance(audio_path, str): - user_message["audio"] = {"path": audio_path, "duration": 10.0} - - formatted_entry["messages"] = [user_message] + # Prepend /dataset/mmau-pro/ to make paths absolute for cluster + if len(audio_path) == 1: + user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"} + else: + user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path] + + # Don't use /no_think for open-ended questions to allow reasoning + system_content = "You are a helpful assistant." 
+ if category != "open": + system_content += " /no_think" + + formatted_entry["messages"] = [{"role": "system", "content": system_content}, user_message] return formatted_entry diff --git a/nemo_skills/evaluation/evaluator/__init__.py b/nemo_skills/evaluation/evaluator/__init__.py index 21f8a0e3d2..cd36509ce4 100644 --- a/nemo_skills/evaluation/evaluator/__init__.py +++ b/nemo_skills/evaluation/evaluator/__init__.py @@ -15,6 +15,7 @@ import asyncio from typing import Any, Callable, Dict +from nemo_skills.evaluation.evaluator.audiobench import eval_audiobench from nemo_skills.evaluation.evaluator.base import BaseEvaluator from nemo_skills.evaluation.evaluator.bfcl import eval_bfcl from nemo_skills.evaluation.evaluator.code import ( @@ -56,6 +57,8 @@ "bigcodebench": eval_bigcodebench, "human_eval_infilling": eval_human_eval_infilling, "mmau-pro": eval_mmau_pro, + "audiobench": eval_audiobench, + "librispeech-pc": eval_audiobench, } # Evaluator class mapping, other evaluators can be added here as they're converted to classes diff --git a/nemo_skills/evaluation/evaluator/audiobench.py b/nemo_skills/evaluation/evaluator/audiobench.py new file mode 100644 index 0000000000..681dd97b49 --- /dev/null +++ b/nemo_skills/evaluation/evaluator/audiobench.py @@ -0,0 +1,283 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import logging +import re +from typing import Any + +import numpy as np +from tqdm import tqdm + +from nemo_skills.utils import get_logger_name, nested_dataclass + +LOG = logging.getLogger(get_logger_name(__file__)) + + +@nested_dataclass(kw_only=True) +class AudioBenchEvaluatorConfig: + """Configuration for AudioBench evaluation.""" + + # Prompt configuration for judge tasks + prompt_config: str = "eval/speechlm/audiobench" + + +# ============================================================================= +# ASR-PC Helper Functions (LibriSpeech-PC with Punctuation/Capitalization) +# ============================================================================= + + +def normalize_whitespace(text: str) -> str: + """Normalize multiple spaces to single space.""" + return re.sub(r"\s+", " ", text).strip() + + +def split_tokens(text: str) -> list[str]: + """Split text into words and punctuation as separate tokens.""" + return re.findall(r"\w+|[^\w\s]", text) + + +def extract_punctuation(text: str) -> list[str]: + """Extract only punctuation characters from text.""" + return [c for c in text if not c.isalnum() and not c.isspace()] + + +def calculate_per(reference: str, hypothesis: str) -> float: + """ + Calculate Punctuation Error Rate (PER) according to + arXiv:2310.02943 formula: + PER = (I + D + S) / (I + D + S + C) + """ + ref_punct = extract_punctuation(reference) + hyp_punct = extract_punctuation(hypothesis) + + len_r, len_h = len(ref_punct), len(hyp_punct) + + if len_r == 0 and len_h == 0: + return 0.0 + + # Dynamic programming: dp[i,j] = (C, S, D, I) + dp = np.zeros((len_r + 1, len_h + 1, 4), dtype=int) + + for i in range(1, len_r + 1): + dp[i, 0][2] = i # all deletions + for j in range(1, len_h + 1): + dp[0, j][3] = j # all insertions + + # Fill DP table + for i in range(1, len_r + 1): + for j in range(1, len_h + 1): + if ref_punct[i - 1] == hyp_punct[j - 1]: + dp[i, j] = dp[i - 1, j - 1].copy() + dp[i, j][0] += 1 # correct + else: + sub = dp[i 
- 1, j - 1].copy() + sub[1] += 1 + delete = dp[i - 1, j].copy() + delete[2] += 1 + insert = dp[i, j - 1].copy() + insert[3] += 1 + dp[i, j] = min([sub, delete, insert], key=lambda x: x[1] + x[2] + x[3]) + + correct, substitution, deletion, insertion = dp[len_r, len_h] + total = correct + substitution + deletion + insertion + per = (substitution + deletion + insertion) / total if total > 0 else 0.0 + return per + + +def evaluate_asr_pc(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate ASR with punctuation and capitalization (LibriSpeech-PC style).""" + import jiwer + + # Normalize whitespace + ref_pc = normalize_whitespace(reference) + hyp_pc = normalize_whitespace(hypothesis) + + # WER_PC: Full metric with punctuation and capitalization + ref_tokens = split_tokens(ref_pc) + hyp_tokens = split_tokens(hyp_pc) + wer_pc = jiwer.wer(" ".join(ref_tokens), " ".join(hyp_tokens)) + + # WER_C: Capitalization only + ref_c = normalize_whitespace(re.sub(r"[^\w\s]", "", reference)) + hyp_c = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis)) + wer_c = jiwer.wer(ref_c, hyp_c) + + # WER: Standard (lowercase, no punctuation) + ref_std = normalize_whitespace(re.sub(r"[^\w\s]", "", reference.lower())) + hyp_std = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis.lower())) + wer_std = jiwer.wer(ref_std, hyp_std) + + # PER: Punctuation Error Rate + per = calculate_per(reference, hypothesis) + + return { + "wer": wer_std, + "wer_c": wer_c, + "wer_pc": wer_pc, + "per": per, + "is_correct": wer_pc < 0.5, + } + + +# ============================================================================= +# Standard ASR Helper Functions +# ============================================================================= + + +def preprocess_asr_text(text: str) -> str: + """Preprocess text for standard ASR evaluation (Whisper-style normalization).""" + from whisper.normalizers import EnglishTextNormalizer + + text = text.lower() + normalizer = EnglishTextNormalizer() + text = 
normalizer(text) + # Remove bracketed content + text = re.sub(r"(\[|\(|\{|\<)[^\(\)\\n\[\]]*(\]|\)|\}|\>)", "", text) + text = re.sub(r"\s+", " ", text).strip() + return text + + +def evaluate_asr(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate standard ASR with normalization.""" + import jiwer + + ref = preprocess_asr_text(reference) + hyp = preprocess_asr_text(hypothesis) + + # Handle empty strings + if not ref: + ref = "empty" + if not hyp: + hyp = "empty" + + wer_score = jiwer.wer(ref, hyp) + + return { + "wer": wer_score, + "is_correct": wer_score < 0.5, + } + + +# ============================================================================= +# Translation Helper Functions +# ============================================================================= + + +def evaluate_translation(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate translation using BLEU score.""" + try: + import sacrebleu + + ref = [reference.strip()] + hyp = hypothesis.strip() + bleu = sacrebleu.sentence_bleu(hyp, ref) + bleu_score = bleu.score / 100.0 + + return { + "bleu": bleu_score, + "is_correct": bleu_score > 0.3, + } + except Exception as e: + return { + "bleu": 0.0, + "is_correct": False, + "error": str(e), + } + + +def eval_audiobench(cfg): + """Evaluate AudioBench and ASR datasets using nemo-skills framework. + + This evaluator processes JSONL files with speech model outputs + and evaluates them using automatic metrics: + - ASR tasks: Word Error Rate (WER) + * Standard ASR: Normalized WER (removes punctuation/capitalization) + * LibriSpeech-PC: Multiple metrics (WER, WER_C, WER_PC, PER) + - Translation tasks: BLEU score + - Other tasks: May require LLM-as-a-judge (handled separately) + + Separate datasets allow tracking performance across different tasks. 
+ """ + # Extract only the fields that belong to AudioBenchEvaluatorConfig + config_fields = {"prompt_config"} + config_kwargs = {k: v for k, v in cfg.items() if k in config_fields} + eval_config = AudioBenchEvaluatorConfig(**config_kwargs) + + jsonl_file = cfg["input_file"] + LOG.info(f"Evaluating {jsonl_file}") + + with open(jsonl_file, "rt", encoding="utf-8") as fin: + data = [json.loads(line) for line in fin] + + samples_already_evaluated = sum(1 for sample in data if "is_correct" in sample) + + if samples_already_evaluated > 0: + LOG.info(f"Resuming evaluation: {samples_already_evaluated}/{len(data)} samples already evaluated") + + for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")): + data[idx] = evaluate_sample(sample, eval_config) + + # Write all results at once + with open(jsonl_file, "wt", encoding="utf-8") as fout: + for sample in data: + fout.write(json.dumps(sample) + "\n") + + LOG.info(f"Evaluation completed for {jsonl_file}") + + +def evaluate_sample(sample: dict[str, Any], config: AudioBenchEvaluatorConfig) -> dict[str, Any]: + """Evaluate a single sample based on task type.""" + sample = sample.copy() + task_type = sample.get("task_type", "unknown") + generation = sample.get("generation", "").strip() + expected_answer = sample.get("expected_answer", "").strip() + + # Handle missing generation for automatic metrics + if task_type in ["ASR", "ASR-PC", "Translation"] and not generation: + sample.update( + { + "is_correct": False, + "wer": 1.0, + "error": "missing_generation", + "predicted_answer": "", + } + ) + return sample + + # Evaluate based on task type + if task_type == "ASR-PC": + metrics = evaluate_asr_pc(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + elif task_type == "ASR": + metrics = evaluate_asr(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + elif task_type == "Translation": + metrics = 
evaluate_translation(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + else: + # QA and other tasks require LLM judge evaluation + if "requires_judge" not in sample: + sample["requires_judge"] = True + sample["predicted_answer"] = generation + if "is_correct" not in sample: + sample["is_correct"] = False + + return sample diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py index 34dd0192e6..794b3b19a2 100644 --- a/nemo_skills/evaluation/metrics/map_metrics.py +++ b/nemo_skills/evaluation/metrics/map_metrics.py @@ -37,6 +37,7 @@ from nemo_skills.evaluation.metrics.mrcr_metrics import MRCRMetrics from nemo_skills.evaluation.metrics.ruler_metrics import RulerMetrics from nemo_skills.evaluation.metrics.simpleqa_metrics import SimpleQAMetrics +from nemo_skills.evaluation.metrics.speechlm_metrics import SpeechLMMetrics from nemo_skills.evaluation.metrics.translation_metrics import TranslationMetrics METRICS_MAP = { @@ -66,6 +67,7 @@ "mmau_pro_closed_form": MMAUProMetrics, "mmau_pro_open_ended": MMAUProMetrics, "mmau_pro_instruction_following": MMAUProMetrics, + "speechlm": SpeechLMMetrics, } diff --git a/nemo_skills/evaluation/metrics/mmau_pro_metrics.py b/nemo_skills/evaluation/metrics/mmau_pro_metrics.py index f079049cc1..dd80c204f8 100644 --- a/nemo_skills/evaluation/metrics/mmau_pro_metrics.py +++ b/nemo_skills/evaluation/metrics/mmau_pro_metrics.py @@ -13,20 +13,72 @@ # limitations under the License. import logging +import re from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage -from nemo_skills.evaluation.metrics.utils import is_correct_judgement from nemo_skills.utils import get_logger_name LOG = logging.getLogger(get_logger_name(__file__)) +def extract_multicriteria_scores(judgement_text: str) -> dict[str, float]: + """Extract multi-criteria scores (1-5 scale) from LLM judge evaluation. 
+ + Expected format: + CORRECTNESS: [score] - [justification] + RELEVANCE: [score] - [justification] + COMPLETENESS: [score] - [justification] + CLARITY: [score] - [justification] + OVERALL: [score] - [overall assessment] + + Args: + judgement_text: The raw judgement text from the LLM judge + + Returns: + Dictionary with keys: correctness, relevance, completeness, clarity, overall + Each containing a float score (1-5). Defaults to 3.0 if not found. + """ + scores = {} + + # Define patterns to extract scores + patterns = { + "correctness": r"CORRECTNESS:\s*(\d+(?:\.\d+)?)", + "relevance": r"RELEVANCE:\s*(\d+(?:\.\d+)?)", + "completeness": r"COMPLETENESS:\s*(\d+(?:\.\d+)?)", + "clarity": r"CLARITY:\s*(\d+(?:\.\d+)?)", + "overall": r"OVERALL:\s*(\d+(?:\.\d+)?)", + } + + for criterion, pattern in patterns.items(): + match = re.search(pattern, judgement_text, re.IGNORECASE) + if match: + scores[criterion] = float(match.group(1)) + else: + # Fallback: assign neutral score if not found + scores[criterion] = 3.0 + + # Calculate overall if not found or if it's still 3.0 (default) + if "overall" not in scores or scores["overall"] == 3.0: + criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]] + scores["overall"] = sum(criteria_scores) / len(criteria_scores) + + return scores + + class MMAUProMetrics(BaseMetrics): """Metrics class for MMAU-Pro benchmark (all subgroups).""" def __init__(self, compute_no_answer: bool = True, max_k: int = 1): super().__init__(compute_no_answer=compute_no_answer) self.max_k = max_k + # Track multi-criteria scores for open-ended questions + self.multicriteria_scores = { + "correctness": [], + "relevance": [], + "completeness": [], + "clarity": [], + "overall": [], + } def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: """Extract correctness scores from prediction.""" @@ -34,9 +86,10 @@ def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: # 
Open-ended: extract from judge result if "judgement" in prediction: - judge_result = is_correct_judgement(prediction["judgement"]) - score_dict["judge_correct"] = judge_result - score_dict["correct"] = judge_result + # Extract multi-criteria scores + multicriteria = extract_multicriteria_scores(prediction["judgement"]) + score_dict["correct"] = multicriteria.get("overall", 3.0) >= 3.0 + # Closed-form and instruction following: use is_correct elif "is_correct" in prediction: score_dict["correct"] = prediction["is_correct"] @@ -62,6 +115,13 @@ def update(self, predictions): self._compute_pass_at_k(predictions=predictions, predicted_answers=predicted_answers) self._compute_majority_at_k(predictions=predictions, predicted_answers=predicted_answers) + # Collect multi-criteria scores for open-ended questions + for pred in predictions: + if "judgement" in pred: + multicriteria = extract_multicriteria_scores(pred["judgement"]) + for criterion in self.multicriteria_scores: + self.multicriteria_scores[criterion].append(multicriteria.get(criterion, 3.0)) + def get_metrics(self): """Get computed metrics.""" metrics_dict = super().get_metrics() @@ -71,11 +131,36 @@ def get_metrics(self): agg_metrics["avg_tokens"] = 0 if "no_answer" in agg_metrics: agg_metrics["no_answer"] = agg_metrics["no_answer"] / 2.0 - # Set success_rate from correct or judge_correct - if "judge_correct" in agg_metrics: - agg_metrics["success_rate"] = agg_metrics["judge_correct"] + + # Add multi-criteria score averages for open-ended questions + # These are on 1-5 scale, normalize to percentages (0-100) + if self.multicriteria_scores["overall"]: + import numpy as np + + for criterion in self.multicriteria_scores: + scores = self.multicriteria_scores[criterion] + if scores: + # Normalize to 0-100 scale: (score/5.0) * 100 + agg_metrics[f"avg_{criterion}"] = round((np.mean(scores) / 5.0) * 100, 2) + agg_metrics[f"std_{criterion}"] = round((np.std(scores) / 5.0) * 100, 2) + + # For open-ended questions, use 
avg_overall as the success_rate + # This represents the average quality score from the multi-criteria judge + agg_metrics["success_rate"] = agg_metrics["avg_overall"] + + # Calculate good response rate (score >= 4.0) for additional insight + overall_scores = self.multicriteria_scores["overall"] + good_responses = sum(1 for score in overall_scores if score >= 4.0) + agg_metrics["good_response_rate"] = round((good_responses / len(overall_scores)) * 100, 2) + + # Calculate poor response rate (score <= 2.0) + poor_responses = sum(1 for score in overall_scores if score <= 2.0) + agg_metrics["poor_response_rate"] = round((poor_responses / len(overall_scores)) * 100, 2) + + # For closed-form and instruction following, use binary correctness elif "correct" in agg_metrics: agg_metrics["success_rate"] = agg_metrics["correct"] + return metrics_dict def metrics_to_print(self): @@ -87,5 +172,20 @@ def metrics_to_print(self): } if self.compute_no_answer: base_metrics["no_answer"] = as_percentage + + # Add multi-criteria metrics if available (for open-ended questions) + if self.multicriteria_scores["overall"]: + base_metrics.update( + { + "avg_overall": as_percentage, + "avg_correctness": as_percentage, + "avg_relevance": as_percentage, + "avg_completeness": as_percentage, + "avg_clarity": as_percentage, + "good_response_rate": as_percentage, + "poor_response_rate": as_percentage, + } + ) + base_metrics["num_entries"] = as_int return base_metrics diff --git a/nemo_skills/evaluation/metrics/speechlm_metrics.py b/nemo_skills/evaluation/metrics/speechlm_metrics.py new file mode 100644 index 0000000000..2ebec7ab5e --- /dev/null +++ b/nemo_skills/evaluation/metrics/speechlm_metrics.py @@ -0,0 +1,223 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage +from nemo_skills.utils import get_logger_name + +LOG = logging.getLogger(get_logger_name(__file__)) + + +class SpeechLMMetrics(BaseMetrics): + """Metrics class for speech/audio language model evaluation tasks.""" + + def __init__(self, compute_no_answer: bool = True, max_k: int = 1): + super().__init__(compute_no_answer=compute_no_answer) + self.max_k = max_k + self.wer_scores = [] + self.wer_c_scores = [] + self.wer_pc_scores = [] + self.per_scores = [] + self.bleu_scores = [] + + def _extract_judge_result(self, judgement_text: str) -> bool: + """Extract judge result from judgement text.""" + import re + + if re.search(r"\byes\b", judgement_text, re.IGNORECASE): + return True + elif re.search(r"\bno\b", judgement_text, re.IGNORECASE): + return False + else: + return False + + def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: + """Extract correctness scores from prediction.""" + score_dict = {} + + category = prediction.get("category", "unknown") + + if "judgement" in prediction and category == "open": + judge_result = self._extract_judge_result(prediction["judgement"]) + score_dict["judge_correct"] = judge_result + + if category == "open" and "judge_correct" in score_dict: + score_dict["correct"] = score_dict["judge_correct"] + elif "is_correct" in prediction: + score_dict["correct"] = prediction["is_correct"] + else: + score_dict["correct"] = False + + return score_dict + + def get_incorrect_sample(self, prediction: dict) -> 
dict: + """Return a sample marked as incorrect for all metrics.""" + prediction = prediction.copy() + prediction["is_correct"] = False + prediction["judge_correct"] = False + if not prediction.get("generation", "").strip(): + prediction["generation"] = None + return prediction + + def update_common_metrics(self, agg_dict): + """Override to always include avg_tokens even if 0 since it's in metrics_to_print.""" + agg_dict["num_entries"] = self.total + agg_dict["avg_tokens"] = int(self.avg_tokens / self.total) if self.total > 0 else 0 + if self.max_end_time > float("-inf") and self.min_start_time < float("inf"): + agg_dict["gen_seconds"] = int(self.max_end_time - self.min_start_time) + + def update(self, predictions): + """Update metrics with new predictions.""" + super().update(predictions) + + predicted_answers = [pred.get("generation", "").strip() or None for pred in predictions] + + # Collect WER, PnC, and BLEU scores + for pred in predictions: + if "wer" in pred and pred["wer"] is not None: + self.wer_scores.append(pred["wer"]) + if "wer_c" in pred and pred["wer_c"] is not None: + self.wer_c_scores.append(pred["wer_c"]) + if "wer_pc" in pred and pred["wer_pc"] is not None: + self.wer_pc_scores.append(pred["wer_pc"]) + if "per" in pred and pred["per"] is not None: + self.per_scores.append(pred["per"]) + if "bleu" in pred and pred["bleu"] is not None: + self.bleu_scores.append(pred["bleu"]) + + self._compute_pass_at_k(predictions=predictions, predicted_answers=predicted_answers) + self._compute_majority_at_k(predictions=predictions, predicted_answers=predicted_answers) + + def get_metrics(self): + """Get computed metrics.""" + metrics_dict = super().get_metrics() + + for agg_mode, agg_metrics in metrics_dict.items(): + if "no_answer" in agg_metrics: + agg_metrics["no_answer"] = agg_metrics["no_answer"] / 2.0 + + # Set success_rate based on correct field + if "correct" in agg_metrics: + agg_metrics["success_rate"] = agg_metrics["correct"] + elif "judge_correct" in 
agg_metrics: + agg_metrics["success_rate"] = agg_metrics["judge_correct"] + + # Add WER, PnC, and BLEU if available (convert to percentages and round to 2 decimals) + if self.wer_scores: + agg_metrics["wer"] = round(100.0 * sum(self.wer_scores) / len(self.wer_scores), 2) + if self.wer_c_scores: + agg_metrics["wer_c"] = round(100.0 * sum(self.wer_c_scores) / len(self.wer_c_scores), 2) + if self.wer_pc_scores: + agg_metrics["wer_pc"] = round(100.0 * sum(self.wer_pc_scores) / len(self.wer_pc_scores), 2) + if self.per_scores: + agg_metrics["per"] = round(100.0 * sum(self.per_scores) / len(self.per_scores), 2) + if self.bleu_scores: + agg_metrics["bleu"] = round(100.0 * sum(self.bleu_scores) / len(self.bleu_scores), 2) + + return metrics_dict + + def evaluations_to_print(self): + """Specify which evaluation modes to print.""" + evals = [f"pass@{self.max_k}"] + if self.max_k > 1: + evals.extend([f"majority@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]) + return evals + + def metrics_to_print(self): + """Specify which metrics to print.""" + base_metrics = { + "avg_tokens": as_int, + "gen_seconds": as_int, + "success_rate": as_percentage, + } + + if self.compute_no_answer: + base_metrics["no_answer"] = as_percentage + + # Add WER, PnC, and BLEU if they were computed + if self.wer_scores: + base_metrics["wer"] = as_percentage + if self.wer_c_scores: + base_metrics["wer_c"] = as_percentage + if self.wer_pc_scores: + base_metrics["wer_pc"] = as_percentage + if self.per_scores: + base_metrics["per"] = as_percentage + if self.bleu_scores: + base_metrics["bleu"] = as_percentage + + base_metrics["num_entries"] = as_int # Add at end for better display order + + return base_metrics + + +def compute_score(combined_metrics: dict) -> dict: + """ + Aggregate metrics from multiple sub-benchmarks into a single group score. + + Args: + combined_metrics: Dictionary with benchmark names as keys. + Each benchmark has eval modes (e.g., 'pass@1') as keys, + which contain the actual metrics. 
+ Format: {benchmark_name: {eval_mode: {metrics...}}} + + Returns: + Aggregated metrics dictionary in the same format. + """ + # Identify main benchmark categories (nonjudge, judge) + main_benchmark_names = ["nonjudge", "judge"] + benchmarks = {k: v for k, v in combined_metrics.items() if k.split(".")[-1] in main_benchmark_names} + + if not benchmarks: + return {} + + # Get all eval modes from first benchmark (they should all have the same modes) + first_benchmark = next(iter(benchmarks.values())) + eval_modes = list(first_benchmark.keys()) + + # Aggregate metrics for each evaluation mode + aggregated = {} + for eval_mode in eval_modes: + total_entries = 0 + weighted_success = 0.0 + total_gen_seconds = 0 + weighted_tokens = 0.0 + weighted_no_answer = 0.0 + + for benchmark_name, benchmark_data in benchmarks.items(): + if eval_mode not in benchmark_data: + continue + + metrics = benchmark_data[eval_mode] + num_entries = metrics.get("num_entries", 0) + total_entries += num_entries + + # Aggregate weighted by number of entries (metrics are already percentages) + if num_entries > 0: + weighted_success += metrics.get("success_rate", 0.0) * num_entries + total_gen_seconds += metrics.get("gen_seconds", 0) + weighted_tokens += metrics.get("avg_tokens", 0.0) * num_entries + weighted_no_answer += metrics.get("no_answer", 0.0) * num_entries + + # Compute aggregated metrics + aggregated[eval_mode] = { + "avg_tokens": int(weighted_tokens / total_entries) if total_entries > 0 else 0, + "gen_seconds": total_gen_seconds, + "success_rate": weighted_success / total_entries if total_entries > 0 else 0.0, + "no_answer": weighted_no_answer / total_entries if total_entries > 0 else 0.0, + "num_entries": total_entries, + } + + return aggregated diff --git a/nemo_skills/pipeline/prepare_data.py b/nemo_skills/pipeline/prepare_data.py index 49a401978d..6760eecd72 100644 --- a/nemo_skills/pipeline/prepare_data.py +++ b/nemo_skills/pipeline/prepare_data.py @@ -31,7 +31,7 @@ # TODO: read this 
from init.py -DATASETS_REQUIRE_DATA_DIR = ["ruler", "ioi24", "mmau-pro"] +DATASETS_REQUIRE_DATA_DIR = ["ruler", "ioi24", "mmau-pro", "librispeech-pc", "audiobench"] @app.command(context_settings={"allow_extra_args": True, "ignore_unknown_options": True}) diff --git a/nemo_skills/prompt/config/judge/audiobench.yaml b/nemo_skills/prompt/config/judge/audiobench.yaml new file mode 100644 index 0000000000..9e886ed1a5 --- /dev/null +++ b/nemo_skills/prompt/config/judge/audiobench.yaml @@ -0,0 +1,29 @@ +# Judge prompt configuration for AudioBench evaluation +# Based on AudioBench's official llama3_70b_as_judge_binary prompt +# Adapted to nemo-skills Yes/No format (instead of 0/1 Rating) + +user: |- + [Reference Answer] + {expected_answer} + + [Model Answer] + {generation} + + [Question] + {question} + + [Task] + Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details. + + Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance. + + The answer is INCORRECT if: + - The answer is refusing to give concrete results, providing something like 'cannot decide' + - The answer is wrong, providing incorrect or irrelevant information compared to the reference + + The answer is CORRECT if: + - The answer is correct, capturing or covering the meaning from the reference + + Your response should be formatted as follows: + Reasoning: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. 
I think ...") + Judgement: [Yes or No] diff --git a/nemo_skills/prompt/config/judge/mmau-pro.yaml b/nemo_skills/prompt/config/judge/mmau-pro.yaml new file mode 100644 index 0000000000..5339e4ab0d --- /dev/null +++ b/nemo_skills/prompt/config/judge/mmau-pro.yaml @@ -0,0 +1,30 @@ +# Judge prompt configuration for Speech/Audio Language Model evaluation +# Used for evaluating open-ended responses in MMAU-Pro benchmark +# Uses multi-criteria scoring on 1-5 scale + +user: |- + You are an expert evaluator for audio and speech-related questions. Please evaluate the quality of a model's response to a question. + + Question: {question} + + Reference Answer: {expected_answer} + + Model Response: {generation} + + Please evaluate the model response on the following criteria and provide scores from 1-5 (where 5 is best): + + 1. **Correctness**: How factually accurate is the response compared to the reference? + 2. **Relevance**: How well does the response address the specific question asked? + 3. **Completeness**: Does the response cover all important aspects mentioned in the reference? + 4. **Clarity**: How clear and well-structured is the response? 
+
+  For each criterion, provide:
+  - A score from 1-5
+  - A brief justification (1-2 sentences)
+
+  Format your response as:
+  CORRECTNESS: [score] - [justification]
+  RELEVANCE: [score] - [justification]
+  COMPLETENESS: [score] - [justification]
+  CLARITY: [score] - [justification]
+  OVERALL: [average score] - [overall assessment]
diff --git a/nemo_skills/prompt/config/judge/speechlm.yaml b/nemo_skills/prompt/config/judge/speechlm.yaml
deleted file mode 100644
index 4862558145..0000000000
--- a/nemo_skills/prompt/config/judge/speechlm.yaml
+++ /dev/null
@@ -1,28 +0,0 @@
-# Judge prompt configuration for Speech/Audio Language Model evaluation
-# Used for evaluating open-ended responses in MMAU-Pro benchmark
-# Follows nemo-skills standard Yes/No judgement pattern
-
-user: |-
-  You are an expert evaluator for audio and speech-related questions. Please evaluate whether the model's response correctly answers the question.
-
-  Question: {question}
-
-  Reference Answer: {expected_answer}
-
-  Model Response: {generation}
-
-  Your task is to determine if the model's response is correct based on the reference answer. Consider:
-
-  1. **Factual Accuracy**: Is the information in the response factually correct?
-  2. **Relevance**: Does the response address the specific question asked?
-  3. **Completeness**: Does the response cover the key points from the reference answer?
-
-  Please first explain your reasoning in 2-3 sentences, then provide your final judgement.
-
-  Your final judgement must be either "Yes" or "No":
-  - "Yes" if the model response is correct and adequately answers the question
-  - "No" if the model response is incorrect, irrelevant, or inadequate
-
-  Format your response as:
-  Reasoning: [Your explanation]
-  Judgement: [Yes or No]
diff --git a/tests/test_datasets.py b/tests/test_datasets.py
index 39d4b0398a..86fd152df2 100644
--- a/tests/test_datasets.py
+++ b/tests/test_datasets.py
@@ -57,6 +57,8 @@
     ("college_math", ["test"]),
     ("comp-math-24-25", ["test"]),
     ("mmau-pro", ["test"]),
+    ("audiobench", ["test"]),
+    ("librispeech-pc", ["test"]),
 ]
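
Reviewer note: the two judge prompts in this diff ask for different reply formats. `audiobench.yaml` ends with a `Judgement: [Yes or No]` line, while `mmau-pro.yaml` asks for one `NAME: [score] - [justification]` line per criterion. The sketch below shows one way such replies could be parsed. It is illustrative only: `parse_binary_judgement` and `parse_criteria_scores` are hypothetical helpers, not functions from nemo-skills, and the actual evaluator may parse judge output differently.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, written against the formats requested in the prompts.
# Brackets around the value are optional, since a judge may or may not echo them.
JUDGEMENT_RE = re.compile(r"Judgement:\s*\[?\s*(Yes|No)\b", re.IGNORECASE)
SCORE_RE = re.compile(
    r"^\s*(CORRECTNESS|RELEVANCE|COMPLETENESS|CLARITY|OVERALL):\s*\[?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_binary_judgement(text: str) -> Optional[bool]:
    """Return True for Yes, False for No, None if no judgement line is found."""
    matches = JUDGEMENT_RE.findall(text)
    if not matches:
        return None
    # Use the last match so any earlier echo of the template is ignored.
    return matches[-1].lower() == "yes"


def parse_criteria_scores(text: str) -> Dict[str, float]:
    """Map each criterion name found in the reply to its numeric score."""
    return {name.upper(): float(score) for name, score in SCORE_RE.findall(text)}
```

Returning `None` when no judgement line is present keeps a malformed judge reply distinguishable from an explicit "No", which matters if unparseable replies are to be counted separately.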