diff --git a/README.md b/README.md index e5231eb89b..0153961b34 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@ Here are some of the features we support: - [**Long-context**](https://nvidia-nemo.github.io/Skills/evaluation/long-context): e.g. [ruler](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia-nemo.github.io/Skills/evaluation/long-context/#aalcr) - [**Tool-calling**](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia-nemo.github.io/Skills/evaluation/tool-calling/#bfcl_v3) - [**Multilingual**](https://nvidia-nemo.github.io/Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#FLORES-200), [wmt24pp](https://nvidia-nemo.github.io/Skills/evaluation/multilingual/#wmt24pp) - - [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro) + - [**Speech & Audio**](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio): e.g. [asr-leaderboard](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#asr-leaderboard), [mmau-pro](https://nvidia-nemo.github.io/Skills/evaluation/speech-audio/#mmau-pro) - Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way. - [Model training](https://nvidia-nemo.github.io/Skills/pipelines/training): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl). 
diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md index 4153b7aabf..b63943e869 100644 --- a/docs/evaluation/index.md +++ b/docs/evaluation/index.md @@ -10,7 +10,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f - [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr) - [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3) - [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#FLORES-200), [wmt24pp](./multilingual.md#wmt24pp) -- [**Speech & Audio**](./speech-audio.md): e.g. [mmau-pro](./speech-audio.md#mmau-pro) +- [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro) See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support. diff --git a/docs/evaluation/speech-audio.md b/docs/evaluation/speech-audio.md index 1d7d439928..9a5f7c5251 100644 --- a/docs/evaluation/speech-audio.md +++ b/docs/evaluation/speech-audio.md @@ -2,8 +2,22 @@ This section details how to evaluate speech and audio benchmarks, including understanding tasks that test models' ability to reason about audio content (speech, music, environmental sounds) and ASR tasks for transcription. +!!! note + These benchmarks currently support only the Megatron server type (`--server_type=megatron`). + ## Supported benchmarks +### ASR Leaderboard + +An ASR benchmark based on the [HuggingFace Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). It evaluates transcription quality using Word Error Rate (WER). 
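WER counts the word-level substitutions, insertions, and deletions needed to turn the model's transcript into the reference, divided by the number of reference words. The actual evaluator uses `jiwer` together with the leaderboard's text normalization; the pure-Python sketch below only illustrates the metric itself and is not the code this benchmark runs.

```python
# Illustrative sketch of WER; the real evaluator uses jiwer plus the
# Open ASR Leaderboard's text normalization (jiwer, openai-whisper).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words, computed with one rolling DP row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (free when words match)
            )
    return d[len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 edit / 6 reference words
```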
+ +**Datasets:** `librispeech_clean`, `librispeech_other`, `voxpopuli`, `tedlium`, `gigaspeech`, `spgispeech`, `earnings22`, `ami` + +#### Dataset Location + +- Benchmark is defined in [`nemo_skills/dataset/asr-leaderboard/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/asr-leaderboard/__init__.py) +- Original datasets are hosted on HuggingFace (downloaded automatically during preparation) + ### MMAU-Pro MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for evaluating audio understanding capabilities across three different task categories: @@ -17,108 +31,101 @@ MMAU-Pro (Multimodal Audio Understanding - Pro) is a comprehensive benchmark for - Benchmark is defined in [`nemo_skills/dataset/mmau-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmau-pro/__init__.py) - Original benchmark source is hosted on [HuggingFace](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro) -## Preparing MMAU-Pro Data +## Preparing Data -MMAU-Pro requires audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation. +These benchmarks require audio files for meaningful evaluation. **Audio files are downloaded by default** to ensure proper evaluation. !!! warning "Running without audio files" - If you want to evaluation without audio files (not recommended) use + If you want to evaluate without audio files (not recommended) use `--no-audio` flag. In this case you can also set `--skip_data_dir_check` as data is very lightweight when audio files aren't being used. 
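For reference, each prepared sample is one JSON object per line in `{dataset}.jsonl` (and in the combined `test.jsonl`). The field layout below follows `prepare.py` from this change; the transcript, sample id, and duration are illustrative placeholders.

```json
{
  "task_type": "ASR_LEADERBOARD",
  "expected_answer": "mister quilter is the apostle of the middle classes",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {
      "role": "user",
      "content": "Transcribe the following audio.",
      "audio": {
        "path": "/dataset/asr-leaderboard/data/librispeech_clean/1089-134686-0000.flac",
        "duration": 10.4
      }
    }
  ],
  "subset_for_metrics": "librispeech_clean",
  "id": "1089-134686-0000"
}
```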
-### Data Preparation - -To prepare the dataset with audio files: +### ASR Leaderboard ```bash -export HF_TOKEN=your_huggingface_token -ns prepare_data mmau-pro --data_dir=/path/to/data --cluster= +ns prepare_data asr-leaderboard --data_dir=/path/to/data --cluster= ``` -**What happens:** - -- Requires authentication (HuggingFace token via `HF_TOKEN` environment variable) -- Downloads audio archive from HuggingFace and extracts -- Prepares the dataset files for evaluation +Prepare specific datasets only: -### Text-Only Mode (Not Recommended) +```bash +ns prepare_data asr-leaderboard --datasets librispeech_clean ami +``` -If you need to prepare without audio files: +### MMAU-Pro ```bash -ns prepare_data mmau-pro --no-audio +ns prepare_data mmau-pro --data_dir=/path/to/data --cluster= ``` -Note: The git repository check is automatically skipped with `--no-audio`. - ## Running Evaluation -!!! note - Currently supports only Megatron server type (`--server_type=megatron`). - -### Evaluation Example +### ASR Leaderboard ```python -import os from nemo_skills.pipeline.cli import wrap_arguments, eval -os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge - eval( - ctx=wrap_arguments("++prompt_suffix='/no_think'"), + ctx=wrap_arguments(""), cluster="oci_iad", - output_dir="/workspace/mmau-pro-eval", - benchmarks="mmau-pro", + output_dir="/workspace/asr-leaderboard-eval", + benchmarks="asr-leaderboard", server_type="megatron", server_gpus=1, model="/workspace/checkpoint", server_entrypoint="/workspace/megatron-lm/server.py", server_container="/path/to/container.sqsh", data_dir="/dataset", - installation_command="pip install sacrebleu", + installation_command="pip install sacrebleu jiwer openai-whisper", server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml", ) ```
note "Alternative: Command-line usage" +Evaluate a specific dataset: - If you prefer using the command-line interface, you can run: +```python +eval(benchmarks="asr-leaderboard", split="librispeech_clean", ...) +``` - ```bash - export HF_TOKEN=your_huggingface_token - export NVIDIA_API_KEY=your_nvidia_api_key - export MEGATRON_PATH=/workspace/path/to/megatron-lm +??? note "Alternative: Command-line usage" + ```bash ns eval \ --cluster=oci_iad \ - --output_dir=/workspace/path/to/mmau-pro-eval \ - --benchmarks=mmau-pro \ + --output_dir=/workspace/path/to/asr-leaderboard-eval \ + --benchmarks=asr-leaderboard \ --server_type=megatron \ --server_gpus=1 \ - --model=/workspace/path/to/checkpoint-tp1 \ - --server_entrypoint=$MEGATRON_PATH/path/to/server.py \ - --server_container=/path/to/server_container.sqsh \ - --data_dir=/dataset \ - --installation_command="pip install sacrebleu" \ - ++prompt_suffix='/no_think' \ - --server_args="--inference-max-requests 1 \ - --model-config /workspace/path/to/checkpoint-tp1/config.yaml \ - --num-tokens-to-generate 256 \ - --temperature 1.0 \ - --top_p 1.0" + --model=/workspace/path/to/checkpoint \ + --server_entrypoint=/workspace/megatron-lm/server.py \ + --server_container=/path/to/container.sqsh \ + --data_dir=/dataset \ + --installation_command="pip install sacrebleu jiwer openai-whisper" ``` -## How Evaluation Works +### MMAU-Pro -Each category uses a different evaluation strategy: +```python +import os +from nemo_skills.pipeline.cli import wrap_arguments, eval -| Category | Evaluation Method | How It Works | -|----------|-------------------|--------------| -| **Closed-Form** | NVEmbed similarity matching | Model generates short answer; compared to expected answer using embeddings | -| **Open-Ended** | LLM-as-a-judge (Qwen 2.5 7B) | Model generates detailed response; Qwen 2.5 judges quality and correctness | -| **Instruction Following** | Custom evaluation logic | Model follows instructions; evaluator checks adherence | 
+os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" # For LLM judge -### Sub-benchmarks +eval( + ctx=wrap_arguments(""), + cluster="oci_iad", + output_dir="/workspace/mmau-pro-eval", + benchmarks="mmau-pro", + server_type="megatron", + server_gpus=1, + model="/workspace/checkpoint", + server_entrypoint="/workspace/megatron-lm/server.py", + server_container="/path/to/container.sqsh", + data_dir="/dataset", + installation_command="pip install sacrebleu", + server_args="--inference-max-requests 1 --model-config /workspace/checkpoint/config.yaml", +) +``` Evaluate individual categories: @@ -130,6 +137,24 @@ Evaluate individual categories: eval(benchmarks="mmau-pro.closed_form", ...) ``` +??? note "Alternative: Command-line usage" + + ```bash + export NVIDIA_API_KEY=your_nvidia_api_key + + ns eval \ + --cluster=oci_iad \ + --output_dir=/workspace/path/to/mmau-pro-eval \ + --benchmarks=mmau-pro \ + --server_type=megatron \ + --server_gpus=1 \ + --model=/workspace/path/to/checkpoint \ + --server_entrypoint=/workspace/megatron-lm/server.py \ + --server_container=/path/to/container.sqsh \ + --data_dir=/dataset \ + --installation_command="pip install sacrebleu" + ``` + ### Using Custom Judge Models The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B via NVIDIA API) to evaluate responses. 
You can customize the judge model for this subset: @@ -143,7 +168,7 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B os.environ["NVIDIA_API_KEY"] = "your_nvidia_api_key" eval( - ctx=wrap_arguments("++prompt_suffix='/no_think'"), + ctx=wrap_arguments(""), cluster="oci_iad", output_dir="/workspace/path/to/mmau-pro-eval", benchmarks="mmau-pro.open_ended", # Only open-ended uses LLM judge @@ -180,7 +205,58 @@ The open-ended questions subset uses an LLM-as-a-judge (by default, Qwen 2.5 7B ## Understanding Results -After evaluation completes, results are saved in your output directory under `eval-results/`: +After evaluation completes, results are saved in your output directory under `eval-results/`. + +### ASR Leaderboard Results + +``` +/ +└── eval-results/ + └── asr-leaderboard/ + └──metrics.json +``` + +Example output: + +``` +------------------------------------- asr-leaderboard -------------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 736 | 233522 | 86.70% | 0.00% | 7.82% | 143597 + +----------------------------------- asr-leaderboard-ami ------------------------------------ +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 732 | 3680 | 81.27% | 0.00% | 18.45% | 12620 + +-------------------------------- asr-leaderboard-earnings22 -------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 736 | 3522 | 83.97% | 0.00% | 14.72% | 57390 + +-------------------------------- asr-leaderboard-gigaspeech -------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 736 | 233469 | 71.86% | 0.00% | 12.34% | 25376 + +---------------------------- asr-leaderboard-librispeech_clean ---------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | 
no_answer | wer | num_entries +pass@1 | 735 | 3607 | 99.62% | 0.00% | 2.06% | 2620 + +---------------------------- asr-leaderboard-librispeech_other ---------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 733 | 3927 | 98.67% | 0.00% | 4.34% | 2939 + +-------------------------------- asr-leaderboard-spgispeech ------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 740 | 4510 | 99.99% | 0.00% | 3.81% | 39341 + +--------------------------------- asr-leaderboard-tedlium ---------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 732 | 3878 | 77.74% | 0.00% | 7.89% | 1469 + +-------------------------------- asr-leaderboard-voxpopuli -------------------------------- +evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | wer | num_entries +pass@1 | 741 | 4007 | 99.51% | 0.00% | 6.47% | 1842 +``` + +### MMAU-Pro Results ``` / @@ -195,9 +271,7 @@ After evaluation completes, results are saved in your output directory under `ev │ └── metrics.json ``` -### Evaluation Output Format - -When evaluation completes, results are displayed in formatted tables in the logs: +Example output: **Open-Ended Questions:** @@ -213,7 +287,6 @@ pass@1 | 82 | 196 | 14.88% | 0.00% | 625 -------------------------- mmau-pro.instruction_following ------------------------- evaluation_mode | avg_tokens | gen_seconds | success_rate | no_answer | num_entries pass@1 | 0 | 102 | 21.84% | 0.00% | 87 - ``` **Closed-Form Questions (Main Category + Sub-categories):** diff --git a/docs/index.md b/docs/index.md index 01d63c1959..ce53e256c3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -22,7 +22,7 @@ Here are some of the features we support: - [**Long-context**](./evaluation/long-context.md): e.g. 
[ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr) - [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3) - [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#FLORES-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp) - - [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [mmau-pro](./evaluation/speech-audio.md#mmau-pro) + - [**Speech & Audio**](./evaluation/speech-audio.md): e.g. [asr-leaderboard](./evaluation/speech-audio.md#asr-leaderboard), [mmau-pro](./evaluation/speech-audio.md#mmau-pro) - [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitvity against changes in prompt. - Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way. - [Model training](pipelines/training.md): Train models using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) or [verl](https://github.com/volcengine/verl). diff --git a/nemo_skills/dataset/asr-leaderboard/__init__.py b/nemo_skills/dataset/asr-leaderboard/__init__.py new file mode 100644 index 0000000000..b81cace3bc --- /dev/null +++ b/nemo_skills/dataset/asr-leaderboard/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# Settings that define how evaluation should be done by default (all can be changed from cmdline) +# Uses the audio evaluator which computes WER with HuggingFace leaderboard preprocessing +# Data samples should have task_type="ASR_LEADERBOARD" for proper WER calculation + +DATASET_GROUP = "speechlm" +METRICS_TYPE = "audio" +GENERATION_ARGS = "++prompt_format=openai ++eval_type=audio" diff --git a/nemo_skills/dataset/asr-leaderboard/prepare.py b/nemo_skills/dataset/asr-leaderboard/prepare.py new file mode 100644 index 0000000000..25bbafd986 --- /dev/null +++ b/nemo_skills/dataset/asr-leaderboard/prepare.py @@ -0,0 +1,200 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Prepare ASR Leaderboard datasets for evaluation. + +Downloads and formats datasets from the HuggingFace Open ASR Leaderboard. +Audio paths in JSONL: /dataset/asr-leaderboard/data/{dataset}/{sample_id}.flac + +Usage: + ns prepare_data asr-leaderboard + ns prepare_data asr-leaderboard --datasets librispeech_clean ami + ns prepare_data asr-leaderboard --no-audio # skip saving audio files +""" + +import argparse +import json +from pathlib import Path + +import soundfile as sf +from datasets import load_dataset +from tqdm import tqdm + +SYSTEM_MESSAGE = "You are a helpful assistant. 
/no_think" +MIN_AUDIO_DURATION = 0.1 # Skip audio shorter than this (causes mel spectrogram errors) + +# (hf_dataset, hf_config, hf_split, streaming) +DATASET_CONFIGS = { + "librispeech_clean": ("librispeech_asr", "clean", "test", False), + "librispeech_other": ("librispeech_asr", "other", "test", False), + "voxpopuli": ("facebook/voxpopuli", "en", "test", False), + "tedlium": ("LIUM/tedlium", "release3", "test", False), + "gigaspeech": ("speechcolab/gigaspeech", "xs", "test", False), + "spgispeech": ("kensho/spgispeech", "test", "test", True), # streaming to avoid timeout due to large metadata + "earnings22": ("distil-whisper/earnings22", "chunked", "test", False), + "ami": ("edinburghcstr/ami", "ihm", "test", False), +} + + +def save_audio_and_format_entry(entry, dataset_name, audio_dir, sample_idx, with_audio=True): + """Format a dataset entry and optionally save audio file.""" + # Different datasets use different field names for transcription + text = ( + entry.get("text", "") # ami, LS, gigaspeech, tedlium + or entry.get("normalized_text", "") # voxpopuli + or entry.get("transcript", "") # spgispeech + or entry.get("transcription", "") # earnings22 + ) + text = text.strip() if text else "" + + system_message = {"role": "system", "content": SYSTEM_MESSAGE} + user_message = {"role": "user", "content": "Transcribe the following audio."} + + audio_info = entry.get("audio", {}) + if isinstance(audio_info, dict) and "array" in audio_info and "sampling_rate" in audio_info: + audio_array = audio_info["array"] + sampling_rate = audio_info["sampling_rate"] + duration = len(audio_array) / sampling_rate + + if duration < MIN_AUDIO_DURATION: + return None + + sample_id = entry.get("id", str(sample_idx)) + audio_filename = f"{sample_id}.flac" + + if with_audio: + sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate) + + user_message["audio"] = { + "path": f"/dataset/asr-leaderboard/data/{dataset_name}/{audio_filename}", + "duration": float(duration), + } + + 
formatted_entry = { + "task_type": "ASR_LEADERBOARD", + "expected_answer": text, + "messages": [system_message, user_message], + "subset_for_metrics": dataset_name, + } + + if "id" in entry: + formatted_entry["id"] = entry["id"] + if "speaker_id" in entry: + formatted_entry["speaker_id"] = entry["speaker_id"] + + return formatted_entry + + +def prepare_dataset(dataset_name, output_dir, with_audio=True): + """Prepare a single ASR dataset.""" + if dataset_name not in DATASET_CONFIGS: + raise ValueError(f"Unknown dataset: {dataset_name}. Available: {list(DATASET_CONFIGS.keys())}") + + hf_dataset, hf_config, hf_split, streaming = DATASET_CONFIGS[dataset_name] + + print(f"Loading {dataset_name} from {hf_dataset} (streaming={streaming})...") + try: + if hf_config: + dataset = load_dataset(hf_dataset, hf_config, split=hf_split, trust_remote_code=True, streaming=streaming) + else: + dataset = load_dataset(hf_dataset, split=hf_split, trust_remote_code=True, streaming=streaming) + except Exception as e: + print(f"Warning: Failed to load {dataset_name}: {e}") + return 0 + + output_file = output_dir / f"{dataset_name}.jsonl" + audio_dir = output_dir / "data" / dataset_name + + if with_audio: + audio_dir.mkdir(parents=True, exist_ok=True) + print(f"Saving audio files to {audio_dir}") + + if streaming: + print(f"Processing {dataset_name} (streaming)...") + else: + print(f"Processing {len(dataset)} samples from {dataset_name}...") + + count = 0 + skipped = 0 + with open(output_file, "w", encoding="utf-8") as fout: + for idx, entry in enumerate(tqdm(dataset, desc=dataset_name)): + formatted = save_audio_and_format_entry(entry, dataset_name, audio_dir, idx, with_audio=with_audio) + if formatted is None: + skipped += 1 + continue + if formatted["expected_answer"]: + fout.write(json.dumps(formatted) + "\n") + count += 1 + + if skipped > 0: + print(f"Skipped {skipped} samples with audio < {MIN_AUDIO_DURATION}s") + + print(f"Saved {count} samples to {output_file}") + return count + + 
+def main(): + parser = argparse.ArgumentParser(description="Prepare ASR Leaderboard datasets for evaluation") + parser.add_argument( + "--datasets", + nargs="+", + default=["all"], + choices=list(DATASET_CONFIGS.keys()) + ["all"], + help="Datasets to prepare (default: all)", + ) + parser.add_argument( + "--no-audio", + action="store_true", + help="Skip saving audio files (JSONL still includes audio paths)", + ) + args = parser.parse_args() + + output_dir = Path(__file__).parent + output_dir.mkdir(parents=True, exist_ok=True) + + with_audio = not args.no_audio + + if args.no_audio: + print("Running without saving audio files.") + else: + print("Running with audio. Saving to data/{dataset}/") + + datasets_to_prepare = list(DATASET_CONFIGS.keys()) if "all" in args.datasets else args.datasets + + total_samples = 0 + for dataset_name in datasets_to_prepare: + total_samples += prepare_dataset(dataset_name, output_dir, with_audio=with_audio) + + # Combine all dataset JSONLs into test.jsonl + combined_file = output_dir / "test.jsonl" + print(f"\nCreating combined file: {combined_file}") + + all_jsonl_files = sorted(output_dir.glob("*.jsonl")) + dataset_files = [f for f in all_jsonl_files if f.name != "test.jsonl"] + + combined_count = 0 + with open(combined_file, "w", encoding="utf-8") as fout: + for dataset_file in dataset_files: + with open(dataset_file, encoding="utf-8") as fin: + for line in fin: + fout.write(line) + combined_count += 1 + print(f" Added {dataset_file.name}") + + print(f"Combined {combined_count} samples from {len(dataset_files)} datasets into {combined_file}") + print(f"\nTotal: {total_samples} samples prepared") + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/pipeline/prepare_data.py b/nemo_skills/pipeline/prepare_data.py index 49a401978d..36820c7337 100644 --- a/nemo_skills/pipeline/prepare_data.py +++ b/nemo_skills/pipeline/prepare_data.py @@ -31,7 +31,7 @@ # TODO: read this from init.py -DATASETS_REQUIRE_DATA_DIR = ["ruler", 
"ioi24", "mmau-pro"] +DATASETS_REQUIRE_DATA_DIR = ["ruler", "ioi24", "mmau-pro", "asr-leaderboard"] @app.command(context_settings={"allow_extra_args": True, "ignore_unknown_options": True}) diff --git a/tests/gpu-tests/test_eval.py b/tests/gpu-tests/test_eval.py index c69c75be71..aa5df51035 100644 --- a/tests/gpu-tests/test_eval.py +++ b/tests/gpu-tests/test_eval.py @@ -43,6 +43,7 @@ "human-eval-infilling", "mbpp", "mmau-pro", + "asr-leaderboard", "aalcr", # Has tokenization mismatch issues }