diff --git a/.gitignore b/.gitignore index ecb9012331..49bdeb2cf7 100644 --- a/.gitignore +++ b/.gitignore @@ -45,4 +45,5 @@ nemo_skills/dataset/aalcr/lcr/ .idea/* CLAUDE.md -.idea +# AudioBench repository (auto-cloned during data preparation) +AudioBench/ diff --git a/docs/evaluation/code.md b/docs/evaluation/code.md index 5a8d634bcc..4353bbb59d 100644 --- a/docs/evaluation/code.md +++ b/docs/evaluation/code.md @@ -82,13 +82,11 @@ There are a few parameters specific to SWE-bench. They have to be specified with - **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation. -- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit. - -- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes. +- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit. - **++swebench_tests_timeout:** The timeout for tests after applying the generated patch during evaluation, in seconds. Defaults to 1800, i.e. 30 minutes. -- **++max_retries:** How many times to try running setup, inference and evaluation until a valid output file is produced. Defaults to 3. +- **++max_retries:** How many times to try running inference and evaluation until a valid output file is produced. Defaults to 3. - **++min_retry_interval, ++max_retry_interval:** The interval between retries, in seconds. Selected randomly between min and max on each retry. Defaults to 60 and 180 respectively. 
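Note on the retry parameters documented in the docs/evaluation/code.md hunk above: a minimal sketch of how `++max_retries`, `++min_retry_interval` and `++max_retry_interval` could interact, assuming each attempt either yields a valid output or not. The helper `run_with_retries`, its `step` callback and the injectable `sleep` argument are hypothetical names used only for illustration, and drawing the wait with `random.uniform` is an assumption; the actual implementation may differ.

```python
import random
import time


def run_with_retries(step, max_retries=3, min_retry_interval=60,
                     max_retry_interval=180, sleep=time.sleep):
    """Call `step` until it returns a truthy result or attempts run out.

    Hypothetical helper: `step` stands in for "run inference and evaluation"
    and returns something truthy once a valid output file exists.
    """
    for attempt in range(1, max_retries + 1):
        result = step()
        if result:
            return result
        if attempt < max_retries:
            # Assumption: the wait is drawn uniformly between min and max
            # on every retry, as the parameter description suggests.
            sleep(random.uniform(min_retry_interval, max_retry_interval))
    return None
```

With the documented defaults this makes at most three attempts and waits between 60 and 180 seconds after each failed attempt except the last.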
diff --git a/docs/evaluation/speech-audio.md b/docs/evaluation/speech-audio.md index 1d7d439928..07bf32332d 100644 --- a/docs/evaluation/speech-audio.md +++ b/docs/evaluation/speech-audio.md @@ -2,6 +2,11 @@ This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription. +!!! warning "Running without audio files" + If you want to run evaluation without audio files (not recommended), use the + `--no-audio` flag. In this case you can also set `--skip_data_dir_check` + as the data is very lightweight when audio files aren't being used. + ## Supported benchmarks ### MMAU-Pro @@ -21,11 +26,6 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation. -!!! warning "Running without audio files" - If you want to evaluation without audio files (not recommended) use - `--no-audio` flag. In this case you can also set `--skip_data_dir_check` - as data is very lightweight when audio files aren't being used. - ### Data Preparation To prepare the dataset with audio files: @@ -46,7 +46,7 @@ ns prepare_data mmau-pro --data_dir=/path/to/data --cluster= If you need to prepare without audio files: ```bash -ns prepare_data mmau-pro --no-audio +ns prepare_data mmau-pro --no-audio --skip_data_dir_check ``` Note: The git repository check is automatically skipped with `--no-audio`. 
@@ -100,12 +100,9 @@ eval( --server_container=/path/to/server_container.sqsh \ --data_dir=/dataset \ --installation_command="pip install sacrebleu" \ - ++prompt_suffix='/no_think' \ + ++max_concurrent_requests=1 \ --server_args="--inference-max-requests 1 \ - --model-config /workspace/path/to/checkpoint-tp1/config.yaml \ - --num-tokens-to-generate 256 \ - --temperature 1.0 \ - --top_p 1.0" + --model-config /workspace/path/to/checkpoint-tp1/config.yaml ``` ## How Evaluation Works @@ -271,3 +268,133 @@ pass@1 | 0 | 6580 | 55.52% | 0.00% | 290 evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries pass@1 | 11 | 6879 | 31.44% | 0.00% | 5305 ``` + + +### LibriSpeech-PC + +LibriSpeech-PC is an Automatic Speech Recognition (ASR) benchmark that evaluates models' ability to transcribe speech with proper punctuation and capitalization. It builds upon the original LibriSpeech corpus with enhanced reference transcripts. + +#### Dataset Location + +- Benchmark is defined in [`nemo_skills/dataset/librispeech-pc/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeech-pc/__init__.py) +- Manifests (with punctuation/capitalization) from [OpenSLR-145](https://www.openslr.org/145/) +- Audio files from original [LibriSpeech OpenSLR-12](https://www.openslr.org/12/) + +#### Available Splits + +- `test-clean`: Clean speech recordings (easier subset) +- `test-other`: More challenging recordings with varied acoustic conditions + +## Preparing LibriSpeech-PC Data + +LibriSpeech-PC requires audio files for ASR evaluation. **Audio files are downloaded by default**. 
+ +### Data Preparation + +To prepare the dataset with audio files: + +```bash +ns prepare_data librispeech-pc --data_dir=/path/to/data --cluster= +``` + +**What happens:** + +- Downloads manifests with punctuation/capitalization from OpenSLR-145 +- Downloads audio files from original LibriSpeech (OpenSLR-12) +- Prepares both `test-clean` and `test-other` splits + +### Preparing Specific Splits + +To prepare only one split: + +```bash +ns prepare_data librispeech-pc --split test-clean --data_dir=/path/to/data +``` + +or + +```bash +ns prepare_data librispeech-pc --split test-other --data_dir=/path/to/data +``` + +## Running LibriSpeech-PC Evaluation + +!!! note + Currently supports only Megatron server type (`--server_type=megatron`). + +### Evaluation Example + +```python +import os +from nemo_skills.pipeline.cli import wrap_arguments, eval + +eval( + ctx=wrap_arguments(""), + cluster="oci_iad", + output_dir="/workspace/librispeech-pc-eval", + benchmarks="librispeech-pc", + server_type="megatron", + server_gpus=1, + model="/workspace/checkpoint", + server_entrypoint="/workspace/megatron-lm/server.py", + server_container="/path/to/container.sqsh", + data_dir="/dataset", + installation_command="pip install sacrebleu whisper jiwer", + server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml", +) +``` + +??? 
note "Alternative: Command-line usage" + + If you prefer using the command-line interface, you can run: + + ```bash + export MEGATRON_PATH=/workspace/path/to/megatron-lm + + ns eval \ + --cluster=oci_iad \ + --output_dir=/workspace/path/to/librispeech-pc-eval \ + --benchmarks=librispeech-pc \ + --server_type=megatron \ + --server_gpus=1 \ + --model=/workspace/path/to/checkpoint-tp1 \ + --server_entrypoint=$MEGATRON_PATH/path/to/server.py \ + --server_container=/path/to/server_container.sqsh \ + --data_dir=/dataset \ + --installation_command="pip install sacrebleu whisper jiwer" \ + ++max_concurrent_requests=1 \ + --server_args="--inference-max-requests 1 \ + --model-config /workspace/path/to/checkpoint-tp1/config.yaml" + ``` + +## How LibriSpeech-PC Evaluation Works + +The evaluation measures ASR accuracy using multiple Word Error Rate (WER) metrics: + +| Metric | Description | +|--------|-------------| +| **WER** | Word Error Rate - measures transcription accuracy ignoring punctuation and capitalization | +| **WER_C** | Word Error Rate with Capitalization - measures accuracy including capitalization | +| **WER_PC** | Word Error Rate with Punctuation and Capitalization - measures full accuracy including both | +| **PER** | Punctuation Error Rate - measures how well the model predicts punctuation marks | + +### Sub-benchmarks + +Evaluate individual splits: + +- `librispeech-pc.test-clean` - Easier, clean speech subset +- `librispeech-pc.test-other` - More challenging subset with varied conditions + +```python +eval(benchmarks="librispeech-pc.test-clean", ...) 
+``` + +### Evaluation Output Format + +**test-clean Split:** + +``` +------------------------------- librispeech-pc.test-clean ----------------------------- +evaluation_mode | avg_tokens | gen_seconds | wer | wer_c | wer_pc | per | num_entries +pass@1 | 15 | 120 | 4.23% | 4.85% | 5.12% | 2.34% | 2620 +``` \ No newline at end of file diff --git a/nemo_skills/dataset/audiobench/__init__.py b/nemo_skills/dataset/audiobench/__init__.py new file mode 100644 index 0000000000..007c326511 --- /dev/null +++ b/nemo_skills/dataset/audiobench/__init__.py @@ -0,0 +1,36 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench: A comprehensive benchmark for speech and audio language models. + +AudioBench evaluates models across multiple tasks: +- ASR (Automatic Speech Recognition) +- Translation (speech-to-text translation) +- Speech QA (question answering based on audio) +- Audio understanding (emotion, gender, accent recognition, etc.) 
+ +The benchmark is organized into two main categories: +- nonjudge: Tasks evaluated with automatic metrics (WER, BLEU) +- judge: Tasks requiring LLM-as-a-judge evaluation +""" + +DATASET_GROUP = "speechlm" +IS_BENCHMARK_GROUP = True +SCORE_MODULE = "nemo_skills.evaluation.metrics.speechlm_metrics" + +# Top-level benchmarks: evaluate all judge or all nonjudge datasets +BENCHMARKS = { + "audiobench.nonjudge": {}, + "audiobench.judge": {}, +} diff --git a/nemo_skills/dataset/audiobench/judge/__init__.py b/nemo_skills/dataset/audiobench/judge/__init__.py new file mode 100644 index 0000000000..69b86c575d --- /dev/null +++ b/nemo_skills/dataset/audiobench/judge/__init__.py @@ -0,0 +1,39 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench judge tasks dataset configuration. + +This dataset includes tasks that require LLM-based evaluation such as: +- Audio captioning +- Spoken question answering +- Audio understanding and reasoning + +These tasks require an LLM judge for evaluation, matching MMAU-Pro evaluation setup. 
+""" + +# Dataset configuration - CRITICAL: needed for audio to work +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" +DEFAULT_SPLIT = "test" +GENERATION_ARGS = "++prompt_format=openai " + +# Judge configuration matching AudioBench official implementation +# Using Llama-3.1-70B with vllm (can be overridden in run scripts) +JUDGE_PIPELINE_ARGS = { + "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", + "server_type": "vllm", + "server_gpus": 8, + "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95", +} +JUDGE_ARGS = "++prompt_config=judge/audiobench ++generation_key=judgement" diff --git a/nemo_skills/dataset/audiobench/nonjudge/__init__.py b/nemo_skills/dataset/audiobench/nonjudge/__init__.py new file mode 100644 index 0000000000..fc4704d75c --- /dev/null +++ b/nemo_skills/dataset/audiobench/nonjudge/__init__.py @@ -0,0 +1,31 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench non-judge tasks dataset configuration. + +This dataset includes ASR, translation, and other tasks that use +automatic metrics (WER, BLEU, WER-PC) instead of judge evaluation. + +NO JUDGE REQUIRED - Metrics computed automatically from model outputs. 
+""" + +# Dataset configuration - CRITICAL: needed for audio to work +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" + +# Evaluation settings +EVAL_ARGS = "++eval_type=audiobench " + +# Generation settings - OpenAI format for audio-language models +GENERATION_ARGS = "++prompt_format=openai " diff --git a/nemo_skills/dataset/audiobench/prepare.py b/nemo_skills/dataset/audiobench/prepare.py new file mode 100644 index 0000000000..d27a174ebf --- /dev/null +++ b/nemo_skills/dataset/audiobench/prepare.py @@ -0,0 +1,546 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""AudioBench Dataset Preparation for nemo-skills + +This script prepares AudioBench datasets for evaluation with nemo-skills. +AudioBench is a comprehensive benchmark for evaluating speech and audio models +across multiple tasks including ASR, translation, speech QA, and more. 
+ +Usage: + python -m nemo_skills.dataset.audiobench.prepare --split test + python -m nemo_skills.dataset.audiobench.prepare --datasets librispeech_test_clean earnings21_test + python -m nemo_skills.dataset.audiobench.prepare --category nonjudge +""" + +import argparse +import json +import os +import shutil +import subprocess +import sys +from pathlib import Path +from typing import Dict, List + +import numpy as np +import soundfile as sf +from tqdm import tqdm + +# AudioBench datasets categorized by evaluation type +JUDGE_DATASETS = [ + "alpaca_audio_test", + "audiocaps_qa_test", + "audiocaps_test", + "clotho_aqa_test", + "cn_college_listen_mcq_test", + "dream_tts_mcq_test", + "iemocap_emotion_test", + "iemocap_gender_test", + "imda_ar_dialogue", + "imda_ar_sentence", + "imda_gr_dialogue", + "imda_gr_sentence", + "imda_part3_30s_ds_human_test", + "imda_part4_30s_ds_human_test", + "imda_part5_30s_ds_human_test", + "imda_part6_30s_ds_human_test", + "imda_part3_30s_sqa_human_test", + "imda_part4_30s_sqa_human_test", + "imda_part5_30s_sqa_human_test", + "imda_part6_30s_sqa_human_test", + "meld_emotion_test", + "meld_sentiment_test", + "mmau_mini", + "muchomusic_test", + "openhermes_audio_test", + "public_sg_speech_qa_test", + "slue_p2_sqa5_test", + "spoken_squad_test", + "voxceleb_accent_test", + "voxceleb_gender_test", + "wavcaps_qa_test", + "wavcaps_test", +] + +NONJUDGE_DATASETS = [ + "aishell_asr_zh_test", + "common_voice_15_en_test", + "covost2_en_id_test", + "covost2_en_ta_test", + "covost2_en_zh_test", + "covost2_id_en_test", + "covost2_ta_en_test", + "covost2_zh_en_test", + "earnings21_test", + "earnings22_test", + "gigaspeech_test", + "gigaspeech2_indo", + "gigaspeech2_thai", + "gigaspeech2_viet", + "imda_part1_asr_test", + "imda_part2_asr_test", + "imda_part3_30s_asr_test", + "imda_part4_30s_asr_test", + "imda_part5_30s_asr_test", + "imda_part6_30s_asr_test", + "librispeech_test_clean", + "librispeech_test_other", + "peoples_speech_test", + "seame_dev_man", 
+ "seame_dev_sge", + "spoken-mqa_long_digit", + "spoken-mqa_multi_step_reasoning", + "spoken-mqa_short_digit", + "spoken-mqa_single_step_reasoning", + "tedlium3_test", + "tedlium3_long_form_test", +] + + +def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: + """Compute audio duration in seconds from array and sampling rate.""" + if audio_array is None or len(audio_array) == 0: + return 0.0 + return float(len(audio_array) / sampling_rate) + + +def save_audio_file(audio_array: np.ndarray, sampling_rate: int, output_path: str): + """Save audio array to WAV file.""" + os.makedirs(os.path.dirname(output_path), exist_ok=True) + sf.write(output_path, audio_array, sampling_rate) + + +def create_manifest_entry( + sample: Dict, + audio_filename: str, + duration: float, + dataset_name: str, + sample_id: int, + category: str, +) -> Dict: + """Create a nemo-skills compatible manifest entry. + + Args: + sample: Raw sample from AudioBench dataset + audio_filename: Audio filename (relative path within audiobench directory) + duration: Audio duration in seconds + dataset_name: Name of the dataset + sample_id: Sample index + category: Category (judge/nonjudge) + + Returns: + Manifest entry dict with proper format for nemo-skills + """ + instruction = sample.get("instruction", sample.get("text", "Process the audio")) + reference = sample.get("reference", sample.get("answer", "")) + task_type = sample.get("task_type", "unknown") + + # Create absolute audio path with /data/ prefix for cluster deployment + # Format: /data/audiobench/{category}/audio/{dataset_name}/{filename} + audio_rel_path = f"/data/audiobench/{category}/audio/{dataset_name}/{audio_filename}" + + # Create audio metadata (both singular and plural forms for compatibility) + audio_metadata = {"path": audio_rel_path, "duration": duration} + + entry = { + "expected_answer": reference, + "audio_path": [audio_rel_path], + "messages": [ + {"role": "system", "content": "You are a helpful assistant. 
/no_think"}, + { + "role": "user", + "content": instruction, + "audio": audio_metadata, + "audios": [audio_metadata], + }, + ], + "dataset": dataset_name, + "subset_for_metrics": dataset_name, + "sample_id": sample_id, + "task_type": task_type, + "question": instruction, + } + + for key in [ + "choices", + "options", + "audio_text_instruction", + "audio_gt", + "dimension", + "rule_type", + "rule_target", + "task", + ]: + if key in sample: + entry[key] = sample[key] + + return entry + + +def clone_audiobench_repo(target_dir: Path) -> bool: + """Clone AudioBench repository if it doesn't exist. + + Args: + target_dir: Directory where AudioBench should be cloned + + Returns: + True if successful, False otherwise + """ + audiobench_url = "https://github.com/AudioLLMs/AudioBench.git" + + if target_dir.exists(): + print(f"AudioBench already exists at {target_dir}") + return True + + print(f"\nCloning AudioBench repository to {target_dir}...") + print("This may take a few minutes...") + + try: + subprocess.run( + ["git", "clone", audiobench_url, str(target_dir)], + check=True, + capture_output=True, + text=True, + ) + print("✓ Successfully cloned AudioBench") + return True + except subprocess.CalledProcessError as e: + print(f"✗ Failed to clone AudioBench: {e.stderr}") + return False + except FileNotFoundError: + print("✗ git command not found. Please install git or manually clone AudioBench.") + return False + + +def process_dataset( + dataset_name: str, + output_dir: Path, + save_audio: bool = True, + split: str = "test", + audiobench_path: Path = None, + max_samples: int = -1, +) -> tuple[int, List[Dict]]: + """Process a single AudioBench dataset. 
+ + Args: + dataset_name: Name of the dataset to process + output_dir: Base output directory + save_audio: Whether to save audio files + split: Dataset split (default: "test") + audiobench_path: Path to AudioBench repository + + Returns: + Tuple of (num_samples, manifest_entries) + """ + print(f"\n{'=' * 60}") + print(f"Processing: {dataset_name}") + print(f"{'=' * 60}") + + # Import AudioBench Dataset class + sys.path.insert(0, str(audiobench_path / "src")) + try: + from dataset import Dataset + except ImportError as e: + raise ImportError( + f"Failed to import AudioBench Dataset class: {e}\n" + f"AudioBench path: {audiobench_path}\n" + f"Make sure AudioBench repository is properly set up." + ) + + # Load dataset + try: + dataset = Dataset(dataset_name, number_of_samples=max_samples) + data_samples = dataset.input_data + print(f"Loaded {len(data_samples)} samples via AudioBench") + except Exception as e: + raise Exception(f"Failed to load dataset {dataset_name}: {e}") + + # Determine category (handle _test suffix variants) + dataset_base = dataset_name.replace("_test", "") + if dataset_name in JUDGE_DATASETS or dataset_base in JUDGE_DATASETS: + category = "judge" + elif dataset_name in NONJUDGE_DATASETS or dataset_base in NONJUDGE_DATASETS: + category = "nonjudge" + else: + category = "unknown" + + # Create output directories + audio_dir = output_dir / category / "audio" / dataset_name + dataset_dir = output_dir / category / dataset_name + os.makedirs(audio_dir, exist_ok=True) + os.makedirs(dataset_dir, exist_ok=True) + + # Copy __init__.py from category folder to dataset folder + category_init = output_dir / category / "__init__.py" + dataset_init = dataset_dir / "__init__.py" + if category_init.exists() and not dataset_init.exists(): + shutil.copy2(category_init, dataset_init) + print(f"✓ Copied __init__.py to {dataset_dir}") + + manifest_entries = [] + successful = 0 + failed = 0 + + for idx, sample in enumerate(tqdm(data_samples, desc=f"Processing 
{dataset_name}")): + try: + # Get audio data + audio_dict = sample.get("audio") + if audio_dict is None: + print(f"Warning: Sample {idx} has no audio, skipping") + failed += 1 + continue + + # Extract audio array and sampling rate + if isinstance(audio_dict, dict): + audio_array = audio_dict.get("array") + sampling_rate = audio_dict.get("sampling_rate", 16000) + else: + print(f"Warning: Unexpected audio format at sample {idx}") + failed += 1 + continue + + if audio_array is None or len(audio_array) == 0: + print(f"Warning: Empty audio at sample {idx}, skipping") + failed += 1 + continue + + # Convert to numpy array if needed + if isinstance(audio_array, list): + audio_array = np.array(audio_array) + + # Compute duration + duration = get_audio_duration(audio_array, sampling_rate) + + # Define audio file paths + audio_filename = f"{dataset_name}_{idx:06d}.wav" + local_audio_path = audio_dir / audio_filename + + # Save audio file + if save_audio: + try: + save_audio_file(audio_array, sampling_rate, str(local_audio_path)) + except Exception as e: + print(f"Warning: Failed to save audio for sample {idx}: {e}") + failed += 1 + continue + + # Create manifest entry with relative path + entry = create_manifest_entry( + sample=sample, + audio_filename=audio_filename, + duration=duration, + dataset_name=dataset_name, + sample_id=idx, + category=category, + ) + + manifest_entries.append(entry) + successful += 1 + + except Exception as e: + print(f"Error processing sample {idx}: {e}") + failed += 1 + continue + + # Save dataset-specific manifest to dataset directory + manifest_path = dataset_dir / f"{split}.jsonl" + with open(manifest_path, "w", encoding="utf-8") as f: + for entry in manifest_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + + print(f"✓ Saved {successful} samples to {manifest_path}") + if failed > 0: + print(f"✗ Failed to process {failed} samples") + + return successful, manifest_entries + + +def main(): + parser = 
argparse.ArgumentParser(description="Prepare AudioBench datasets for nemo-skills evaluation") + parser.add_argument( + "--split", + default="test", + choices=["train", "validation", "test"], + help="Dataset split to prepare", + ) + parser.add_argument( + "--output_dir", + type=str, + default=None, + help="Output directory (defaults to $NEMO_SKILLS_DATA_DIR/audiobench)", + ) + parser.add_argument( + "--datasets", + nargs="+", + help="Specific dataset(s) to process (e.g., librispeech_test_clean earnings21)", + ) + parser.add_argument( + "--category", + choices=["judge", "nonjudge", "all"], + default="all", + help="Process only judge, nonjudge, or all datasets", + ) + parser.add_argument( + "--no-audio", + dest="save_audio", + action="store_false", + help="Skip saving audio files (only create manifests)", + ) + parser.add_argument( + "--audiobench-path", + type=str, + default=None, + help="Path to AudioBench repository (will auto-clone if not found)", + ) + parser.add_argument( + "--max-samples", + type=int, + default=-1, + help="Maximum number of samples to process per dataset (-1 for all)", + ) + parser.set_defaults(save_audio=True) + + args = parser.parse_args() + + # Determine output directory + if args.output_dir: + output_dir = Path(args.output_dir) + else: + # Use dataset directory as output (files will be in nemo_skills/dataset/audiobench/) + output_dir = Path(__file__).parent + + output_dir.mkdir(parents=True, exist_ok=True) + + # Determine AudioBench repository path + if args.audiobench_path: + audiobench_path = Path(args.audiobench_path) + else: + audiobench_path = os.getenv("AUDIOBENCH_REPO_PATH") + if audiobench_path: + audiobench_path = Path(audiobench_path) + else: + # Default to AudioBench directory (same level as this script) + audiobench_path = Path(__file__).parent / "AudioBench" + + # Clone AudioBench if it doesn't exist + if not audiobench_path.exists(): + print(f"\nAudioBench not found at {audiobench_path}") + if not 
clone_audiobench_repo(audiobench_path): + print("\nFailed to clone AudioBench. Please clone it manually:") + print(" git clone https://github.com/AudioLLMs/AudioBench.git") + sys.exit(1) + + # Verify AudioBench structure + if not (audiobench_path / "src" / "dataset.py").exists(): + print(f"\nError: AudioBench repository at {audiobench_path} is missing src/dataset.py") + print("Please ensure the repository is properly cloned.") + sys.exit(1) + + print("\n" + "=" * 60) + print("AudioBench Dataset Preparation") + print("=" * 60) + print(f"AudioBench path: {audiobench_path}") + print(f"Output directory: {output_dir}") + print(f"Save audio files: {args.save_audio}") + print(f"Split: {args.split}") + print("=" * 60 + "\n") + + # Determine which datasets to process + if args.datasets: + target_datasets = args.datasets + else: + all_datasets = JUDGE_DATASETS + NONJUDGE_DATASETS + if args.category == "judge": + target_datasets = JUDGE_DATASETS + elif args.category == "nonjudge": + target_datasets = NONJUDGE_DATASETS + else: # all + target_datasets = all_datasets + + print(f"Processing {len(target_datasets)} dataset(s)\n") + + # Process datasets + total_samples = 0 + successful_datasets = [] + failed_datasets = [] + judge_entries = [] + nonjudge_entries = [] + + for dataset_name in target_datasets: + try: + num_samples, entries = process_dataset( + dataset_name=dataset_name, + output_dir=output_dir, + save_audio=args.save_audio, + split=args.split, + audiobench_path=audiobench_path, + max_samples=args.max_samples, + ) + total_samples += num_samples + successful_datasets.append(dataset_name) + + if dataset_name in JUDGE_DATASETS: + judge_entries.extend(entries) + elif dataset_name in NONJUDGE_DATASETS: + nonjudge_entries.extend(entries) + + except Exception as e: + print(f"\n✗ FAILED: {dataset_name}") + print(f" Error: {e}\n") + failed_datasets.append((dataset_name, str(e))) + + # Create combined test.jsonl files + if judge_entries: + judge_test_path = output_dir / "judge" / 
f"{args.split}.jsonl" + judge_test_path.parent.mkdir(parents=True, exist_ok=True) + with open(judge_test_path, "w", encoding="utf-8") as f: + for entry in judge_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + print(f"\n✓ Combined judge {args.split}.jsonl: {judge_test_path}") + print(f" Total samples: {len(judge_entries)}") + + if nonjudge_entries: + nonjudge_test_path = output_dir / "nonjudge" / f"{args.split}.jsonl" + nonjudge_test_path.parent.mkdir(parents=True, exist_ok=True) + with open(nonjudge_test_path, "w", encoding="utf-8") as f: + for entry in nonjudge_entries: + f.write(json.dumps(entry, ensure_ascii=False) + "\n") + print(f"\n✓ Combined nonjudge {args.split}.jsonl: {nonjudge_test_path}") + print(f" Total samples: {len(nonjudge_entries)}") + + # Print summary + print("\n" + "=" * 60) + print("SUMMARY") + print("=" * 60) + print(f"Datasets requested: {len(target_datasets)}") + print(f"Successfully processed: {len(successful_datasets)}") + print(f"Failed: {len(failed_datasets)}") + print(f"Total samples: {total_samples}") + + if successful_datasets: + print(f"\nSuccessful datasets ({len(successful_datasets)}):") + for name in successful_datasets: + category = "judge" if name in JUDGE_DATASETS else "nonjudge" + print(f" ✓ {name} ({category})") + + if failed_datasets: + print(f"\nFailed datasets ({len(failed_datasets)}):") + for name, error in failed_datasets: + print(f" ✗ {name}: {error}") + + print("=" * 60 + "\n") + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/dataset/icpc/__init__.py b/nemo_skills/dataset/icpc25/__init__.py similarity index 97% rename from nemo_skills/dataset/icpc/__init__.py rename to nemo_skills/dataset/icpc25/__init__.py index 2742bad1e7..215753804c 100644 --- a/nemo_skills/dataset/icpc/__init__.py +++ b/nemo_skills/dataset/icpc25/__init__.py @@ -13,7 +13,7 @@ # limitations under the License. """ -todo: We are working on providing the data files that are necessary to run ICPC evaluation. 
+todo: We are working on providing the data files that are necessary to run ICPC25 evaluation. """ # settings that define how evaluation should be done by default (all can be changed from cmdline) diff --git a/nemo_skills/dataset/librispeech-pc/__init__.py b/nemo_skills/dataset/librispeech-pc/__init__.py new file mode 100644 index 0000000000..5fbfe2b1cb --- /dev/null +++ b/nemo_skills/dataset/librispeech-pc/__init__.py @@ -0,0 +1,26 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""LibriSpeech-PC: ASR evaluation with Punctuation and Capitalization. + +Test sets (evaluation only): +- test-clean: Clean speech recordings (~2.6k samples) +- test-other: More challenging speech with various acoustic conditions (~2.9k samples) +""" + +DATASET_GROUP = "speechlm" +METRICS_TYPE = "speechlm" +DEFAULT_SPLIT = "test-clean" +EVAL_ARGS = "++eval_type=audiobench " +GENERATION_ARGS = "++prompt_format=openai " diff --git a/nemo_skills/dataset/librispeech-pc/prepare.py b/nemo_skills/dataset/librispeech-pc/prepare.py new file mode 100644 index 0000000000..b2c6b6c87d --- /dev/null +++ b/nemo_skills/dataset/librispeech-pc/prepare.py @@ -0,0 +1,175 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Prepare LibriSpeech-PC for ASR evaluation with punctuation and capitalization. + +LibriSpeech-PC provides manifests with punctuation/capitalization from OpenSLR-145. +Audio files are downloaded from original LibriSpeech at OpenSLR-12. + +Usage: + ns prepare_data librispeech-pc --data_dir + ns prepare_data librispeech-pc --split test-clean (or test-other) --data_dir +""" + +import argparse +import json +import os +import tarfile +import urllib.request +from pathlib import Path + +from tqdm import tqdm + + +def download_with_progress(url: str, output_path: Path, desc: str): + """Download file with tqdm progress bar.""" + with tqdm(unit="B", unit_scale=True, unit_divisor=1024, desc=desc) as pbar: + + def reporthook(blocknum, blocksize, totalsize): + if pbar.total != totalsize: + pbar.total = totalsize + pbar.update(blocksize) + + urllib.request.urlretrieve(url, output_path, reporthook) + + +# LibriSpeech-PC manifests (with punctuation and capitalization) +MANIFESTS_URL = "https://www.openslr.org/resources/145/manifests.tar.gz" + +# Original LibriSpeech audio files +AUDIO_URLS = { + "test-clean": "https://www.openslr.org/resources/12/test-clean.tar.gz", + "test-other": "https://www.openslr.org/resources/12/test-other.tar.gz", +} + + +def download_manifests(output_dir: Path) -> Path: + """Download LibriSpeech-PC manifests if not already present.""" + if (output_dir / "test-clean.json").exists(): + return output_dir + + tar_path = output_dir / "manifests.tar.gz" + download_with_progress(MANIFESTS_URL, tar_path, "Downloading manifests") + + with 
tarfile.open(tar_path, "r:gz") as tar: + for member in tar.getmembers(): + if member.name in ["test-clean.json", "test-other.json"]: + tar.extract(member, output_dir, filter="data") + os.remove(tar_path) + + print("✓ Manifests ready\n") + return output_dir + + +def download_audio(split: str, audio_dir: Path): + """Download LibriSpeech audio files if not already present.""" + split_dir = audio_dir / "LibriSpeech" / split.replace("-", "_") + if split_dir.exists(): + return + + tar_path = audio_dir / f"{split}.tar.gz" + download_with_progress(AUDIO_URLS[split], tar_path, f"Downloading {split}") + + with tarfile.open(tar_path, "r:gz") as tar: + tar.extractall(audio_dir, filter="data") + os.remove(tar_path) + + +def process_split(split: str, data_dir: Path, audio_dir: Path, with_audio: bool) -> int: + """Process one LibriSpeech-PC split into nemo-skills format.""" + + output_file = data_dir / f"{split}.jsonl" + manifest_file = data_dir / f"{split}.json" + if not manifest_file.exists(): + print(f"✗ Manifest not found: {manifest_file}") + return 0 + + if with_audio: + download_audio(split, audio_dir) + + with open(manifest_file, "r") as f: + entries = [json.loads(line) for line in f if line.strip()] + + processed = 0 + skipped = 0 + + with open(output_file, "w") as fout: + for entry in entries: + audio_filepath = entry.get("audio_filepath", "") + text = entry.get("text", "") + + if not audio_filepath or not text: + skipped += 1 + continue + + audio_id = Path(audio_filepath).stem + + container_path = f"/dataset/librispeech-pc/LibriSpeech/{audio_filepath}" + + user_message = { + "role": "user", + "content": "Transcribe the audio with proper punctuation and capitalization.", + "audio": {"path": container_path}, + } + + output_entry = { + "audio_filepath": container_path, + "text": text, + "expected_answer": text, + "task_type": "ASR-PC", + "sample_id": audio_id, + "split": split, + "messages": [{"role": "system", "content": "You are a helpful assistant. 
/no_think"}, user_message], + } + + fout.write(json.dumps(output_entry, ensure_ascii=False) + "\n") + processed += 1 + + print(f"✓ {split}: {processed} samples" + (f" ({skipped} skipped)" if skipped > 0 else "")) + + if processed > 0 and manifest_file.exists(): + os.remove(manifest_file) + + return processed + + +def main(): + parser = argparse.ArgumentParser(description="Prepare LibriSpeech-PC for ASR evaluation") + parser.add_argument( + "--split", + default="all", + choices=["all", "test-clean", "test-other"], + help="Which split to prepare (default: all)", + ) + parser.add_argument( + "--no-audio", + action="store_true", + help="Skip audio download", + ) + args = parser.parse_args() + + data_dir = Path(__file__).parent + audio_dir = data_dir + audio_dir.mkdir(exist_ok=True) + + download_manifests(data_dir) + + splits = ["test-clean", "test-other"] if args.split == "all" else [args.split] + total = sum(process_split(split, data_dir, audio_dir, not args.no_audio) for split in splits) + + print(f"\n✓ Complete: {total} samples") + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/dataset/mmau-pro/prepare.py b/nemo_skills/dataset/mmau-pro/prepare.py index a6f04d621b..bb05083ee6 100644 --- a/nemo_skills/dataset/mmau-pro/prepare.py +++ b/nemo_skills/dataset/mmau-pro/prepare.py @@ -84,13 +84,16 @@ def format_entry(entry, with_audio=False): if entry.get("audio_path"): audio_path = entry["audio_path"] - - if isinstance(audio_path, list) and audio_path: - user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path] - elif isinstance(audio_path, str): - user_message["audio"] = {"path": audio_path, "duration": 10.0} - - formatted_entry["messages"] = [user_message] + # Prepend /dataset/mmau-pro/ to make paths absolute for cluster + if len(audio_path) == 1: + user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"} + else: + user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path] + + 
formatted_entry["messages"] = [ + {"role": "system", "content": "You are a helpful assistant. /no_think"}, + user_message, + ] return formatted_entry diff --git a/nemo_skills/evaluation/evaluator/__init__.py b/nemo_skills/evaluation/evaluator/__init__.py index 21f8a0e3d2..cd36509ce4 100644 --- a/nemo_skills/evaluation/evaluator/__init__.py +++ b/nemo_skills/evaluation/evaluator/__init__.py @@ -15,6 +15,7 @@ import asyncio from typing import Any, Callable, Dict +from nemo_skills.evaluation.evaluator.audiobench import eval_audiobench from nemo_skills.evaluation.evaluator.base import BaseEvaluator from nemo_skills.evaluation.evaluator.bfcl import eval_bfcl from nemo_skills.evaluation.evaluator.code import ( @@ -56,6 +57,8 @@ "bigcodebench": eval_bigcodebench, "human_eval_infilling": eval_human_eval_infilling, "mmau-pro": eval_mmau_pro, + "audiobench": eval_audiobench, + "librispeech-pc": eval_audiobench, } # Evaluator class mapping, other evaluators can be added here as they're converted to classes diff --git a/nemo_skills/evaluation/evaluator/audiobench.py b/nemo_skills/evaluation/evaluator/audiobench.py new file mode 100644 index 0000000000..681dd97b49 --- /dev/null +++ b/nemo_skills/evaluation/evaluator/audiobench.py @@ -0,0 +1,283 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import logging +import re +from typing import Any + +import numpy as np +from tqdm import tqdm + +from nemo_skills.utils import get_logger_name, nested_dataclass + +LOG = logging.getLogger(get_logger_name(__file__)) + + +@nested_dataclass(kw_only=True) +class AudioBenchEvaluatorConfig: + """Configuration for AudioBench evaluation.""" + + # Prompt configuration for judge tasks + prompt_config: str = "eval/speechlm/audiobench" + + +# ============================================================================= +# ASR-PC Helper Functions (LibriSpeech-PC with Punctuation/Capitalization) +# ============================================================================= + + +def normalize_whitespace(text: str) -> str: + """Normalize multiple spaces to single space.""" + return re.sub(r"\s+", " ", text).strip() + + +def split_tokens(text: str) -> list[str]: + """Split text into words and punctuation as separate tokens.""" + return re.findall(r"\w+|[^\w\s]", text) + + +def extract_punctuation(text: str) -> list[str]: + """Extract only punctuation characters from text.""" + return [c for c in text if not c.isalnum() and not c.isspace()] + + +def calculate_per(reference: str, hypothesis: str) -> float: + """ + Calculate Punctuation Error Rate (PER) according to + arXiv:2310.02943 formula: + PER = (I + D + S) / (I + D + S + C) + """ + ref_punct = extract_punctuation(reference) + hyp_punct = extract_punctuation(hypothesis) + + len_r, len_h = len(ref_punct), len(hyp_punct) + + if len_r == 0 and len_h == 0: + return 0.0 + + # Dynamic programming: dp[i,j] = (C, S, D, I) + dp = np.zeros((len_r + 1, len_h + 1, 4), dtype=int) + + for i in range(1, len_r + 1): + dp[i, 0][2] = i # all deletions + for j in range(1, len_h + 1): + dp[0, j][3] = j # all insertions + + # Fill DP table + for i in range(1, len_r + 1): + for j in range(1, len_h + 1): + if ref_punct[i - 1] == hyp_punct[j - 1]: + dp[i, j] = dp[i - 1, j - 1].copy() + dp[i, j][0] += 1 # correct + else: + sub = dp[i 
- 1, j - 1].copy() + sub[1] += 1 + delete = dp[i - 1, j].copy() + delete[2] += 1 + insert = dp[i, j - 1].copy() + insert[3] += 1 + dp[i, j] = min([sub, delete, insert], key=lambda x: x[1] + x[2] + x[3]) + + correct, substitution, deletion, insertion = dp[len_r, len_h] + total = correct + substitution + deletion + insertion + per = (substitution + deletion + insertion) / total if total > 0 else 0.0 + return per + + +def evaluate_asr_pc(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate ASR with punctuation and capitalization (LibriSpeech-PC style).""" + import jiwer + + # Normalize whitespace + ref_pc = normalize_whitespace(reference) + hyp_pc = normalize_whitespace(hypothesis) + + # WER_PC: Full metric with punctuation and capitalization + ref_tokens = split_tokens(ref_pc) + hyp_tokens = split_tokens(hyp_pc) + wer_pc = jiwer.wer(" ".join(ref_tokens), " ".join(hyp_tokens)) + + # WER_C: Capitalization only + ref_c = normalize_whitespace(re.sub(r"[^\w\s]", "", reference)) + hyp_c = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis)) + wer_c = jiwer.wer(ref_c, hyp_c) + + # WER: Standard (lowercase, no punctuation) + ref_std = normalize_whitespace(re.sub(r"[^\w\s]", "", reference.lower())) + hyp_std = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis.lower())) + wer_std = jiwer.wer(ref_std, hyp_std) + + # PER: Punctuation Error Rate + per = calculate_per(reference, hypothesis) + + return { + "wer": wer_std, + "wer_c": wer_c, + "wer_pc": wer_pc, + "per": per, + "is_correct": wer_pc < 0.5, + } + + +# ============================================================================= +# Standard ASR Helper Functions +# ============================================================================= + + +def preprocess_asr_text(text: str) -> str: + """Preprocess text for standard ASR evaluation (Whisper-style normalization).""" + from whisper.normalizers import EnglishTextNormalizer + + text = text.lower() + normalizer = EnglishTextNormalizer() + text = 
normalizer(text) + # Remove bracketed content + text = re.sub(r"(\[|\(|\{|\<)[^\(\)\\n\[\]]*(\]|\)|\}|\>)", "", text) + text = re.sub(r"\s+", " ", text).strip() + return text + + +def evaluate_asr(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate standard ASR with normalization.""" + import jiwer + + ref = preprocess_asr_text(reference) + hyp = preprocess_asr_text(hypothesis) + + # Handle empty strings + if not ref: + ref = "empty" + if not hyp: + hyp = "empty" + + wer_score = jiwer.wer(ref, hyp) + + return { + "wer": wer_score, + "is_correct": wer_score < 0.5, + } + + +# ============================================================================= +# Translation Helper Functions +# ============================================================================= + + +def evaluate_translation(reference: str, hypothesis: str) -> dict[str, Any]: + """Evaluate translation using BLEU score.""" + try: + import sacrebleu + + ref = [reference.strip()] + hyp = hypothesis.strip() + bleu = sacrebleu.sentence_bleu(hyp, ref) + bleu_score = bleu.score / 100.0 + + return { + "bleu": bleu_score, + "is_correct": bleu_score > 0.3, + } + except Exception as e: + return { + "bleu": 0.0, + "is_correct": False, + "error": str(e), + } + + +def eval_audiobench(cfg): + """Evaluate AudioBench and ASR datasets using nemo-skills framework. + + This evaluator processes JSONL files with speech model outputs + and evaluates them using automatic metrics: + - ASR tasks: Word Error Rate (WER) + * Standard ASR: Normalized WER (removes punctuation/capitalization) + * LibriSpeech-PC: Multiple metrics (WER, WER_C, WER_PC, PER) + - Translation tasks: BLEU score + - Other tasks: May require LLM-as-a-judge (handled separately) + + Separate datasets allow tracking performance across different tasks. 
+ """ + # Extract only the fields that belong to AudioBenchEvaluatorConfig + config_fields = {"prompt_config"} + config_kwargs = {k: v for k, v in cfg.items() if k in config_fields} + eval_config = AudioBenchEvaluatorConfig(**config_kwargs) + + jsonl_file = cfg["input_file"] + LOG.info(f"Evaluating {jsonl_file}") + + with open(jsonl_file, "rt", encoding="utf-8") as fin: + data = [json.loads(line) for line in fin] + + samples_already_evaluated = sum(1 for sample in data if "is_correct" in sample) + + if samples_already_evaluated > 0: + LOG.info(f"Resuming evaluation: {samples_already_evaluated}/{len(data)} samples already evaluated") + + for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")): + data[idx] = evaluate_sample(sample, eval_config) + + # Write all results at once + with open(jsonl_file, "wt", encoding="utf-8") as fout: + for sample in data: + fout.write(json.dumps(sample) + "\n") + + LOG.info(f"Evaluation completed for {jsonl_file}") + + +def evaluate_sample(sample: dict[str, Any], config: AudioBenchEvaluatorConfig) -> dict[str, Any]: + """Evaluate a single sample based on task type.""" + sample = sample.copy() + task_type = sample.get("task_type", "unknown") + generation = sample.get("generation", "").strip() + expected_answer = sample.get("expected_answer", "").strip() + + # Handle missing generation for automatic metrics + if task_type in ["ASR", "ASR-PC", "Translation"] and not generation: + sample.update( + { + "is_correct": False, + "wer": 1.0, + "error": "missing_generation", + "predicted_answer": "", + } + ) + return sample + + # Evaluate based on task type + if task_type == "ASR-PC": + metrics = evaluate_asr_pc(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + elif task_type == "ASR": + metrics = evaluate_asr(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + elif task_type == "Translation": + metrics = 
evaluate_translation(expected_answer, generation) + sample.update(metrics) + sample["predicted_answer"] = generation + + else: + # QA and other tasks require LLM judge evaluation + if "requires_judge" not in sample: + sample["requires_judge"] = True + sample["predicted_answer"] = generation + if "is_correct" not in sample: + sample["is_correct"] = False + + return sample diff --git a/nemo_skills/evaluation/metrics/icpc_metrics.py b/nemo_skills/evaluation/metrics/icpc_metrics.py index 298b210f16..1d5d05f6ab 100644 --- a/nemo_skills/evaluation/metrics/icpc_metrics.py +++ b/nemo_skills/evaluation/metrics/icpc_metrics.py @@ -59,7 +59,6 @@ def extract_info(self, submission) -> dict: return { "score": submission["test_case_results"]["score"], "sample_score": submission["test_case_results"]["sample_score"], - "tokens": submission["num_generated_tokens"], "code": extract_final_cpp_block(submission["generation"]), } diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py index 34dd0192e6..794b3b19a2 100644 --- a/nemo_skills/evaluation/metrics/map_metrics.py +++ b/nemo_skills/evaluation/metrics/map_metrics.py @@ -37,6 +37,7 @@ from nemo_skills.evaluation.metrics.mrcr_metrics import MRCRMetrics from nemo_skills.evaluation.metrics.ruler_metrics import RulerMetrics from nemo_skills.evaluation.metrics.simpleqa_metrics import SimpleQAMetrics +from nemo_skills.evaluation.metrics.speechlm_metrics import SpeechLMMetrics from nemo_skills.evaluation.metrics.translation_metrics import TranslationMetrics METRICS_MAP = { @@ -66,6 +67,7 @@ "mmau_pro_closed_form": MMAUProMetrics, "mmau_pro_open_ended": MMAUProMetrics, "mmau_pro_instruction_following": MMAUProMetrics, + "speechlm": SpeechLMMetrics, } diff --git a/nemo_skills/evaluation/metrics/speechlm_metrics.py b/nemo_skills/evaluation/metrics/speechlm_metrics.py new file mode 100644 index 0000000000..2ebec7ab5e --- /dev/null +++ b/nemo_skills/evaluation/metrics/speechlm_metrics.py @@ 
-0,0 +1,223 @@ +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage +from nemo_skills.utils import get_logger_name + +LOG = logging.getLogger(get_logger_name(__file__)) + + +class SpeechLMMetrics(BaseMetrics): + """Metrics class for speech/audio language model evaluation tasks.""" + + def __init__(self, compute_no_answer: bool = True, max_k: int = 1): + super().__init__(compute_no_answer=compute_no_answer) + self.max_k = max_k + self.wer_scores = [] + self.wer_c_scores = [] + self.wer_pc_scores = [] + self.per_scores = [] + self.bleu_scores = [] + + def _extract_judge_result(self, judgement_text: str) -> bool: + """Extract judge result from judgement text.""" + import re + + if re.search(r"\byes\b", judgement_text, re.IGNORECASE): + return True + elif re.search(r"\bno\b", judgement_text, re.IGNORECASE): + return False + else: + return False + + def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: + """Extract correctness scores from prediction.""" + score_dict = {} + + category = prediction.get("category", "unknown") + + if "judgement" in prediction and category == "open": + judge_result = self._extract_judge_result(prediction["judgement"]) + score_dict["judge_correct"] = judge_result + + if category == "open" and "judge_correct" in score_dict: + score_dict["correct"] = 
score_dict["judge_correct"] + elif "is_correct" in prediction: + score_dict["correct"] = prediction["is_correct"] + else: + score_dict["correct"] = False + + return score_dict + + def get_incorrect_sample(self, prediction: dict) -> dict: + """Return a sample marked as incorrect for all metrics.""" + prediction = prediction.copy() + prediction["is_correct"] = False + prediction["judge_correct"] = False + if not prediction.get("generation", "").strip(): + prediction["generation"] = None + return prediction + + def update_common_metrics(self, agg_dict): + """Override to always include avg_tokens even if 0 since it's in metrics_to_print.""" + agg_dict["num_entries"] = self.total + agg_dict["avg_tokens"] = int(self.avg_tokens / self.total) if self.total > 0 else 0 + if self.max_end_time > float("-inf") and self.min_start_time < float("inf"): + agg_dict["gen_seconds"] = int(self.max_end_time - self.min_start_time) + + def update(self, predictions): + """Update metrics with new predictions.""" + super().update(predictions) + + predicted_answers = [pred.get("generation", "").strip() or None for pred in predictions] + + # Collect WER, PnC, and BLEU scores + for pred in predictions: + if "wer" in pred and pred["wer"] is not None: + self.wer_scores.append(pred["wer"]) + if "wer_c" in pred and pred["wer_c"] is not None: + self.wer_c_scores.append(pred["wer_c"]) + if "wer_pc" in pred and pred["wer_pc"] is not None: + self.wer_pc_scores.append(pred["wer_pc"]) + if "per" in pred and pred["per"] is not None: + self.per_scores.append(pred["per"]) + if "bleu" in pred and pred["bleu"] is not None: + self.bleu_scores.append(pred["bleu"]) + + self._compute_pass_at_k(predictions=predictions, predicted_answers=predicted_answers) + self._compute_majority_at_k(predictions=predictions, predicted_answers=predicted_answers) + + def get_metrics(self): + """Get computed metrics.""" + metrics_dict = super().get_metrics() + + for agg_mode, agg_metrics in metrics_dict.items(): + if "no_answer" in 
agg_metrics: + agg_metrics["no_answer"] = agg_metrics["no_answer"] / 2.0 + + # Set success_rate based on correct field + if "correct" in agg_metrics: + agg_metrics["success_rate"] = agg_metrics["correct"] + elif "judge_correct" in agg_metrics: + agg_metrics["success_rate"] = agg_metrics["judge_correct"] + + # Add WER, PnC, and BLEU if available (convert to percentages and round to 2 decimals) + if self.wer_scores: + agg_metrics["wer"] = round(100.0 * sum(self.wer_scores) / len(self.wer_scores), 2) + if self.wer_c_scores: + agg_metrics["wer_c"] = round(100.0 * sum(self.wer_c_scores) / len(self.wer_c_scores), 2) + if self.wer_pc_scores: + agg_metrics["wer_pc"] = round(100.0 * sum(self.wer_pc_scores) / len(self.wer_pc_scores), 2) + if self.per_scores: + agg_metrics["per"] = round(100.0 * sum(self.per_scores) / len(self.per_scores), 2) + if self.bleu_scores: + agg_metrics["bleu"] = round(100.0 * sum(self.bleu_scores) / len(self.bleu_scores), 2) + + return metrics_dict + + def evaluations_to_print(self): + """Specify which evaluation modes to print.""" + evals = [f"pass@{self.max_k}"] + if self.max_k > 1: + evals.extend([f"majority@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]) + return evals + + def metrics_to_print(self): + """Specify which metrics to print.""" + base_metrics = { + "avg_tokens": as_int, + "gen_seconds": as_int, + "success_rate": as_percentage, + } + + if self.compute_no_answer: + base_metrics["no_answer"] = as_percentage + + # Add WER, PnC, and BLEU if they were computed + if self.wer_scores: + base_metrics["wer"] = as_percentage + if self.wer_c_scores: + base_metrics["wer_c"] = as_percentage + if self.wer_pc_scores: + base_metrics["wer_pc"] = as_percentage + if self.per_scores: + base_metrics["per"] = as_percentage + if self.bleu_scores: + base_metrics["bleu"] = as_percentage + + base_metrics["num_entries"] = as_int # Add at end for better display order + + return base_metrics + + +def compute_score(combined_metrics: dict) -> dict: + """ + Aggregate 
metrics from multiple sub-benchmarks into a single group score. + + Args: + combined_metrics: Dictionary with benchmark names as keys. + Each benchmark has eval modes (e.g., 'pass@1') as keys, + which contain the actual metrics. + Format: {benchmark_name: {eval_mode: {metrics...}}} + + Returns: + Aggregated metrics dictionary in the same format. + """ + # Identify main benchmark categories (nonjudge, judge) + main_benchmark_names = ["nonjudge", "judge"] + benchmarks = {k: v for k, v in combined_metrics.items() if k.split(".")[-1] in main_benchmark_names} + + if not benchmarks: + return {} + + # Get all eval modes from first benchmark (they should all have the same modes) + first_benchmark = next(iter(benchmarks.values())) + eval_modes = list(first_benchmark.keys()) + + # Aggregate metrics for each evaluation mode + aggregated = {} + for eval_mode in eval_modes: + total_entries = 0 + weighted_success = 0.0 + total_gen_seconds = 0 + weighted_tokens = 0.0 + weighted_no_answer = 0.0 + + for benchmark_name, benchmark_data in benchmarks.items(): + if eval_mode not in benchmark_data: + continue + + metrics = benchmark_data[eval_mode] + num_entries = metrics.get("num_entries", 0) + total_entries += num_entries + + # Aggregate weighted by number of entries (metrics are already percentages) + if num_entries > 0: + weighted_success += metrics.get("success_rate", 0.0) * num_entries + total_gen_seconds += metrics.get("gen_seconds", 0) + weighted_tokens += metrics.get("avg_tokens", 0.0) * num_entries + weighted_no_answer += metrics.get("no_answer", 0.0) * num_entries + + # Compute aggregated metrics + aggregated[eval_mode] = { + "avg_tokens": int(weighted_tokens / total_entries) if total_entries > 0 else 0, + "gen_seconds": total_gen_seconds, + "success_rate": weighted_success / total_entries if total_entries > 0 else 0.0, + "no_answer": weighted_no_answer / total_entries if total_entries > 0 else 0.0, + "num_entries": total_entries, + } + + return aggregated diff --git 
a/nemo_skills/inference/eval/swebench.py b/nemo_skills/inference/eval/swebench.py index f3610091e8..1c198522c3 100644 --- a/nemo_skills/inference/eval/swebench.py +++ b/nemo_skills/inference/eval/swebench.py @@ -111,7 +111,6 @@ class SweBenchGenerationConfig: eval_harness_repo: str = "https://github.com/Kipok/SWE-bench.git" eval_harness_commit: str = "HEAD" # Which commit to use when cloning the eval harness repo - setup_timeout: int = 60 * 20 # Timeout to download & install the agent framework and the eval harness, in seconds swebench_tests_timeout: int = 60 * 30 # Timeout for the tests after applying the patch, in seconds # How many times to try running inference & evaluation commands until they produce a valid output file @@ -306,40 +305,6 @@ async def evaluate_single_datapoint(self, data_point): # currently evaluation is done directly after generation already return data_point - async def _execute_local_command(self, command, timeout=None): - """Execute a command locally with retry logic.""" - for attempt in range(self.cfg.max_retries): - try: - # Create async subprocess - process = await asyncio.create_subprocess_shell(f"/bin/bash -c {shlex.quote(command)}") - - # Wait for completion - await asyncio.wait_for(process.communicate(), timeout=timeout) - - if process.returncode != 0: - raise ValueError(f"Command failed with return code {process.returncode}") - - except asyncio.TimeoutError: - raise ValueError(f"Command timed out after {timeout} seconds: '{command}'") - - except Exception: - if attempt < self.cfg.max_retries - 1: - retry_interval = random.randint(self.cfg.min_retry_interval, self.cfg.max_retry_interval) - LOG.warning( - "Attempt %d failed for command: '%s'. 
Retrying in %d seconds...", - attempt + 1, - command, - retry_interval, - ) - if retry_interval > 0: - await asyncio.sleep(retry_interval) - continue - else: - raise ValueError(f"All {self.cfg.max_retries} attempts failed for command: '{command}'") - - else: - return - async def _execute_container_command(self, data_point, command, expected_file_pattern, mode, timeout=100000): """Execute a command in an Apptainer container with retry logic.""" container_name = data_point["container_formatter"].format( @@ -357,7 +322,6 @@ async def _execute_container_command(self, data_point, command, expected_file_pa apptainer_cmd = ( f"apptainer exec --writable-tmpfs --no-mount home,tmp,bind-paths " f"--mount type=bind,src=/nemo_run/code,dst=/nemo_run/code " - f"--mount type=bind,src=/root,dst=/root_mount,ro " f"--mount type=bind,src={self.output_dir},dst=/trajectories_mount " f" {container_name} bash -c {shlex.quote(command)}" ) @@ -435,6 +399,8 @@ async def _run_swe_agent(self, data_point, api_base): """ if self.cfg.agent_config is None: self.cfg.agent_config = "eval/swe-bench/swe-agent/default" + if self.cfg.agent_framework_repo is None: + self.cfg.agent_framework_repo = "https://github.com/SWE-agent/SWE-agent.git" completion_kwargs = { openai_param: getattr(self.cfg.inference, ns_param) @@ -445,11 +411,18 @@ async def _run_swe_agent(self, data_point, api_base): completion_kwargs["logprobs"] = True swe_agent_cmd = ( - # copy installed repo & uv dir from /root_mount - "cp -r /root_mount/SWE-agent /root && " - "cp -r /root_mount/uv /root && " - "cd /root/SWE-agent && " - # run the agent + # first installing swe-agent repo + "curl -LsSf https://astral.sh/uv/install.sh | sh && " + "source /root/.local/bin/env && " + "cd /root && " + "mkdir SWE-agent && " + "cd SWE-agent && " + f"git clone {self.cfg.agent_framework_repo} . && " + f"git checkout {self.cfg.agent_framework_commit} && " + "uv venv --python 3.12 venv && " + "source venv/bin/activate && " + "uv pip install -e . 
&& " + # then running the agent f"/root/SWE-agent/venv/bin/python -m sweagent run " f" --config {get_config_path(self.cfg.agent_config)} " f" --agent.model.name hosted_vllm/{self.cfg.server.model} " @@ -495,6 +468,8 @@ async def _run_openhands(self, data_point, api_base): """ if self.cfg.agent_config is None: self.cfg.agent_config = "eval/swe-bench/openhands/default" + if self.cfg.agent_framework_repo is None: + self.cfg.agent_framework_repo = "https://github.com/All-Hands-AI/OpenHands.git" # Add parameters to config.toml @@ -535,22 +510,19 @@ async def _run_openhands(self, data_point, api_base): " echo 'This is because OpenHands DELETES EVERYTHING in the /workspace folder if it exists.' && " " exit 1; " "fi && " - # copy installed repo, uv, tmux & jq dirs from /root_mount - "cp -r /root_mount/OpenHands /root && " - "cp -r /root_mount/uv /root && " - "cp -r /root_mount/tmux /root && " - "cp -r /root_mount/jq /root && " - "cd /root/OpenHands && " - # make soft links to poetry, tmux & jq in /usr/local/bin, so OpenHands can run them from the command line - "ln -sf /root/uv/tool-bin/poetry /usr/local/bin/poetry && " - "ln -sf /root/tmux/tmux /usr/local/bin/tmux && " - "ln -sf /root/jq/jq /usr/local/bin/jq && " - # enable tmux appimage to run without fusermount - # https://docs.appimage.org/user-guide/troubleshooting/fuse.html#extract-and-run-type-2-appimages - "export APPIMAGE_EXTRACT_AND_RUN=1 && " - "export NO_CLEANUP=1 && " - # activate openhands venv - "source /root/OpenHands/.venv/bin/activate && " + # install openhands repo + dependencies + "cd /root && " + 'curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" && ' + "bash Miniforge3-$(uname)-$(uname -m).sh -b && " + 'eval "$(/root/miniforge3/bin/conda shell.bash hook)" && ' + "mamba install -y --override-channels conda-forge::python=3.12 conda-forge::nodejs conda-forge::poetry conda-forge::tmux && " + "mkdir OpenHands && " + "cd OpenHands && " + f"git 
clone {self.cfg.agent_framework_repo} . && "
+ f"git checkout {self.cfg.agent_framework_commit} && "
+ "export INSTALL_DOCKER=0 && "
+ "make build && "
+ "poetry run python -m pip install datasets && "
 # copy dataset
 f"mkdir {data_dir} && "
 f"cp {self.cfg.input_file} {data_dir} && "
@@ -564,7 +536,7 @@ async def _run_openhands(self, data_point, api_base):
 # run the agent
 f"./evaluation/benchmarks/swe_bench/scripts/run_infer.sh "
 f" llm.model " # name of llm config section in config.toml
- f" HEAD " # openhands commit (HEAD = stay in the currently checked out commit)
+ f" {self.cfg.agent_framework_commit} " # openhands commit
 f" CodeActAgent " # agent
 f" 1 " # number of instances
 f" {self.cfg.agent_max_turns} " # max agent iterations
@@ -605,6 +577,10 @@ async def _run_openhands(self, data_point, api_base):
 async def process_single_datapoint(self, data_point, data):
 """Will do all necessary generations to get a single answer for the data point."""
+ self.output_dir = Path(self.cfg.output_file).parent
+ if self.cfg.inference.random_seed is not None:
+ self.output_dir = self.output_dir / f"rs{self.cfg.inference.random_seed}"
+ self.output_dir.mkdir(exist_ok=True)
 # TODO: what's the right way to support api models, so that our standard parameters for that can be used?
 # TODO: use self.cfg.server.base_url, etc. Can we pass in API key?
@@ -642,11 +618,17 @@ async def process_single_datapoint(self, data_point, data):
 else:
 # Run full evaluation with streaming output
 swe_bench_cmd = (
- # copy installed repo & uv dir from /root_mount
- "cp -r /root_mount/SWE-bench /root && "
- "cp -r /root_mount/uv /root && "
+ # first installing SWE-bench repo
+ "curl -LsSf https://astral.sh/uv/install.sh | sh && "
+ "source /root/.local/bin/env && "
+ "mkdir /root/SWE-bench && "
 "cd /root/SWE-bench && "
- # run the evaluation with streaming output
+ f"git clone {self.cfg.eval_harness_repo} . && "
+ f"git checkout {self.cfg.eval_harness_commit} && "
+ "uv venv --python 3.12 venv && "
+ "source venv/bin/activate && "
+ "uv pip install -e . && "
+ # then running the evaluation with streaming output
 f"/root/SWE-bench/venv/bin/python -m swebench.harness.run_local_evaluation "
 f" --predictions_path {pred_mounted_path} "
 f" --instance_ids {data_point['instance_id']} "
diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py
index aae36c7351..b5c3b61b96 100644
--- a/nemo_skills/inference/generate.py
+++ b/nemo_skills/inference/generate.py
@@ -338,9 +338,6 @@ def __init__(self, cfg: GenerateSolutionsConfig):
 LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
 self.evaluator = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
- # Track whether we've shown the reasoning warning
- self._reasoning_warning_shown = False
-
 LOG.info(
 "Async loop is maintaining %d generations in parallel. "
 "Use max_concurrent_requests to control the number of concurrent requests.",
@@ -547,16 +544,6 @@ async def postprocess_single_output(self, output, original_data_point):
 self.cfg.end_reasoning_string,
 )
- # Warn once if reasoning detected but not being parsed
- if not self.cfg.parse_reasoning and not self._reasoning_warning_shown:
- gen = output.get(self.cfg.generation_key)
- if isinstance(gen, str) and self.cfg.end_reasoning_string in gen:
- LOG.warning(
- f"Detected '{self.cfg.end_reasoning_string}' in generation but parse_reasoning=False. "
- "For reasoning models, set ++parse_reasoning=True to avoid incorrect code extraction."
- )
- self._reasoning_warning_shown = True
-
 def prefill_generation(self, data_point) -> dict | None:
 """Prefill generation in case LLM is not required."""
 # Override this method to customize the prefilling behavior.
diff --git a/nemo_skills/pipeline/prepare_data.py b/nemo_skills/pipeline/prepare_data.py
index 49a401978d..6760eecd72 100644
--- a/nemo_skills/pipeline/prepare_data.py
+++ b/nemo_skills/pipeline/prepare_data.py
@@ -31,7 +31,7 @@
 # TODO: read this from init.py
-DATASETS_REQUIRE_DATA_DIR = ["ruler", "ioi24", "mmau-pro"]
+DATASETS_REQUIRE_DATA_DIR = ["ruler", "ioi24", "mmau-pro", "librispeech-pc", "audiobench"]
 @app.command(context_settings={"allow_extra_args": True, "ignore_unknown_options": True})
diff --git a/nemo_skills/prompt/config/judge/audiobench.yaml b/nemo_skills/prompt/config/judge/audiobench.yaml
new file mode 100644
index 0000000000..9e886ed1a5
--- /dev/null
+++ b/nemo_skills/prompt/config/judge/audiobench.yaml
@@ -0,0 +1,29 @@
+# Judge prompt configuration for AudioBench evaluation
+# Based on AudioBench's official llama3_70b_as_judge_binary prompt
+# Adapted to nemo-skills Yes/No format (instead of 0/1 Rating)
+
+user: |-
+ [Reference Answer]
+ {expected_answer}
+
+ [Model Answer]
+ {generation}
+
+ [Question]
+ {question}
+
+ [Task]
+ Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
+
+ Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
+
+ The answer is INCORRECT if:
+ - The answer is refusing to give concrete results, providing something like 'cannot decide'
+ - The answer is wrong, providing incorrect or irrelevant information compared to the reference
+
+ The answer is CORRECT if:
+ - The answer is correct, capturing or covering the meaning from the reference
+
+ Your response should be formatted as follows:
+ Reasoning: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
+ Judgement: [Yes or No]
diff --git a/tests/test_datasets.py b/tests/test_datasets.py
index 39d4b0398a..86fd152df2 100644
--- a/tests/test_datasets.py
+++ b/tests/test_datasets.py
@@ -57,6 +57,8 @@
 ("college_math", ["test"]),
 ("comp-math-24-25", ["test"]),
 ("mmau-pro", ["test"]),
+ ("audiobench", ["test"]),
+ ("librispeech-pc", ["test"]),
 ]