Implement custom dataset class for ASR benchmarking #41576
Added audio processing functionality and a custom dataset class for ASR benchmarking. The new features support various audio input formats and allow for sampling from a JSONL dataset. Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Code Review
This pull request introduces support for custom audio datasets in the benchmarking suite by adding a CustomAudioDataset class and a process_audio utility. The review feedback highlights a potential module-level failure due to the top-level import of soundfile without an ImportError check. Additionally, the feedback identifies several inconsistencies in the CustomAudioDataset.sample method, including missing support for the skip_chat_template flag, incorrect handling of null tokenizers, and the omission of logic for the output_tokens field.
ymoslem
left a comment
Automatic suggestions reviewed
@ywang96 Would you please review? Thanks!
The soundfile library is imported at the top level without an ImportError check. This will cause the entire datasets module to fail to load if soundfile is not installed, even for users running non-audio benchmarks. Please follow the existing pattern in this file (e.g., for pandas or datasets) by using a try...except block or placeholder module. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
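For reference, the guarded-import pattern this comment asks for could look roughly like the sketch below. The helper name `require_soundfile` is hypothetical, and catching `OSError` in addition to `ImportError` is an extra assumption (to cover a missing underlying system library), not part of the suggestion itself.

```python
# Sketch only: guard an optional dependency so that importing this module
# never fails for users who do not run audio benchmarks.
try:
    import soundfile as sf  # optional; only needed for audio datasets
except (ImportError, OSError):
    sf = None  # placeholder, checked lazily in require_soundfile()


def require_soundfile():
    """Return the soundfile module, or raise a helpful error at call time."""
    if sf is None:
        raise ImportError(
            "soundfile is required for audio datasets; "
            "install it with `pip install soundfile`."
        )
    return sf
```

With this shape, the failure moves from import time to the first call that actually needs audio decoding, so non-audio benchmarks are unaffected.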
The sample call for custom_audio is missing the skip_chat_template argument. This prevents the --skip-chat-template CLI flag from working correctly for this dataset type, which is inconsistent with the custom dataset implementation. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
ymoslem
left a comment
Reviewing suggestions
Add try...except to the soundfile import, and add guards to the audio sample function. Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
ymoslem
left a comment
Fixed suggestions
Hi @ymoslem, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then commit the changes and push to your branch.
- Add "custom_audio" to the `--dataset-name` choices, along with the CustomAudioDataset class and the process_audio function:
  - Support ASR models (Whisper tested)
  - Support multimodal (text + audio) models requiring a chat template (Qwen2-Audio tested)
- Rename "custom_mm" to "custom_image" and CustomMMDataset to CustomImageDataset:
  - For now, both "custom_mm" and "custom_image" are accepted to keep backward compatibility.
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Match changes in datasets.py Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Added a deprecation warning for 'custom_mm' dataset. Use '--dataset-name custom_image' instead Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
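A minimal sketch of how such a deprecated alias could be handled is shown below. Only the two dataset names come from this PR; the alias table and the `resolve_dataset_name` helper are hypothetical and are not claimed to match the actual change.

```python
import warnings

# Hypothetical alias table; only the two names are taken from the PR.
_DEPRECATED_DATASET_NAMES = {"custom_mm": "custom_image"}


def resolve_dataset_name(name: str) -> str:
    """Map a deprecated dataset name to its replacement, emitting a warning."""
    new_name = _DEPRECATED_DATASET_NAMES.get(name)
    if new_name is None:
        return name  # not deprecated, use as-is
    warnings.warn(
        f"'--dataset-name {name}' is deprecated; "
        f"use '--dataset-name {new_name}' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_name
```

Resolving the name once, before dispatching on it, keeps both spellings working while routing them through a single code path.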
ymoslem
left a comment
Added a deprecation warning for custom_mm. Use --dataset-name custom_image instead.
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
ymoslem
left a comment
Updated deprecation warning for custom_mm
DarkLight1337
left a comment
Can you also update the docs? https://docs.vllm.ai/en/latest/benchmarking/cli/?h=custom_mm#custom-multimodal-dataset
Updated the documentation to reflect changes in dataset naming and usage for image datasets. Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Documentation preview: https://vllm--41576.org.readthedocs.build/en/41576/
Added instructions for benchmarking with CustomAudioDataset, including examples for Whisper and Qwen2-Audio models. Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com>
Thanks for your patience!
Signed-off-by: Yasmin Moslem <48152713+ymoslem@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Added a new function to process a custom audio dataset for ASR benchmarking.
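As a rough illustration of the input normalization this PR describes (a file path, a dict with "array" and "sampling_rate", or an `(array, sr)` tuple), a sketch might look like the following. The function name `normalize_audio` and the `(samples, sampling_rate)` return shape are assumptions for illustration; this is not the PR's `process_audio()` implementation.

```python
from pathlib import Path
from typing import Any, Tuple


def normalize_audio(audio: Any) -> Tuple[Any, int]:
    """Return (samples, sampling_rate) for a path, a dict, or a 2-tuple."""
    if isinstance(audio, (str, Path)):
        # Imported lazily so non-path inputs work without soundfile installed.
        import soundfile as sf

        samples, sampling_rate = sf.read(str(audio))
        return samples, int(sampling_rate)
    if isinstance(audio, dict):
        if "array" not in audio or "sampling_rate" not in audio:
            raise ValueError("audio dict needs 'array' and 'sampling_rate' keys")
        return audio["array"], int(audio["sampling_rate"])
    if isinstance(audio, (tuple, list)) and len(audio) == 2:
        samples, sampling_rate = audio
        return samples, int(sampling_rate)
    raise TypeError(f"unsupported audio input: {type(audio).__name__}")
```

Funnelling every accepted input into one shape means the dataset class only has to deal with a single representation when building requests.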
Purpose
This PR adds support for benchmarking ASR (Automatic Speech Recognition) models using a custom local dataset. It introduces:
- `process_audio()`: a utility function that normalizes audio inputs from: file paths (read via `soundfile`); dicts of the form `{"array": ..., "sampling_rate": ...}`; or `(array, sr)` tuples.
- `CustomAudioDataset`: a new dataset class extending `CustomDataset` that loads audio samples from a JSONL file (e.g., `{"prompt": "", "audio": "/path/to/audio.wav"}`), processes them via `process_audio()`, and constructs `SampleRequest` objects with the audio as `multi_modal_data`.
- CLI support: `custom_audio` added as a valid `--dataset-name` choice, with a corresponding `elif` branch in `get_samples()`.
Latest updates:
- Add `custom_audio` to the `--dataset-name` choices, along with the `CustomAudioDataset` class and the `process_audio` function:
  - Support ASR models (Whisper tested)
  - Support multimodal (text + audio) models requiring a chat template (Qwen2-Audio tested)
- Rename `custom_mm` to `custom_image` and `CustomMMDataset` to `CustomImageDataset`:
  - For now, both `custom_mm` and `custom_image` are accepted to keep backward compatibility.
Test Plan 1 (Whisper)
Create a sample JSONL dataset
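One way to produce such a file is sketched below, using the record shape given in the description (`{"prompt": "", "audio": "/path/to/audio.wav"}`). The helper name `write_audio_jsonl` and the second audio path are placeholders introduced here for illustration.

```python
import json
from pathlib import Path


def write_audio_jsonl(records, path):
    """Write one JSON object per line, as the JSONL format expects."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Placeholder paths; replace them with real audio files before benchmarking.
records = [
    {"prompt": "", "audio": "/path/to/audio.wav"},
    {"prompt": "", "audio": "/path/to/another_audio.wav"},
]
```

Calling `write_audio_jsonl(records, "audio_dataset.jsonl")` (the filename is arbitrary) writes the two records to disk, one per line.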
Start a server
Run the benchmark
Note: You might need to start another terminal window, unless you use `nohup` when starting the server.
Test Plan 2 (Qwen2-Audio)
Create a sample JSONL dataset.
It is better to have a "prompt" with the required instruction.
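Before running against a chat-template model, it may help to check which records lack an instruction. The sketch below assumes the same "prompt"/"audio" record shape as above; the helper name `find_records_without_prompt` is hypothetical.

```python
import json


def find_records_without_prompt(jsonl_path):
    """Return 1-based line numbers of records whose "prompt" is missing or blank."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate empty lines
            record = json.loads(line)
            if not str(record.get("prompt") or "").strip():
                missing.append(line_no)
    return missing
```

An empty return value means every record carries a non-blank prompt.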
Start a server
Run the benchmark
Note: You might need to start another terminal window, unless you use `nohup` when starting the server.
Test Result
By the end of the run:
- With the `--save-result` and `--save-detailed` options, the `whisper_bench.json` and `qwen_bench.json` files should include the results and outputs.
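To confirm that a saved result file such as `whisper_bench.json` was written and parses, a small check like the one below can be used. It makes no assumption about the result schema and only lists the top-level keys; the helper name `summarize_result_file` is hypothetical.

```python
import json


def summarize_result_file(path):
    """Load a saved benchmark result and return its sorted top-level keys."""
    with open(path, encoding="utf-8") as f:
        result = json.load(f)
    if not isinstance(result, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return sorted(result)
```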