model: support OpenAI Whisper #8064
MahmoudAshraf97 wants to merge 18 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @MahmoudAshraf97, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces the core infrastructure to support OpenAI's Whisper model for Speech-to-Text functionality. It encompasses the model's definition, custom adaptations for its encoder and weight loading, and a specialized processor to handle audio input and prepare it for the framework's multimodal capabilities.
Highlights
- Whisper Model Integration: I've added the foundational support for integrating OpenAI's Whisper model, specifically WhisperForConditionalGeneration, into the framework. This enables Speech-to-Text (STT) capabilities.
- Custom Encoder and Weight Loading: The integration includes a custom WhisperForConditionalGeneration class that utilizes a MiniCPMWhisperEncoder for its encoder component. I've also implemented specific weight loading logic to handle parameter mapping for efficient model initialization.
- Multimodal Audio Processing: A dedicated WhisperProcessor has been introduced to handle audio input. This processor is responsible for loading raw audio data, converting it into the necessary input features, and preparing it as multimodal data items for the SGLang framework.
- Model Registration: The new WhisperForConditionalGeneration model has been registered within the system's model_config.py to ensure it's recognized as a valid generation model.
Code Review
This PR introduces support for the OpenAI Whisper model. The changes look like a good start, but there are a few areas that need attention.
In python/sglang/srt/models/whisper.py, there are unused parameters and variables in __init__ and load_weights, which should be cleaned up.
In python/sglang/srt/multimodal/processors/whisper.py, the process_mm_data_async method contains blocking calls for I/O and CPU-intensive work, which will impact server performance. These should be made asynchronous using the provided executors.
Additionally, the is_audio_model function in python/sglang/srt/configs/model_config.py needs to be updated to correctly identify Whisper models.
Addressing these points will improve the correctness and performance of the implementation.
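As a rough illustration of the review point about blocking calls, here is a minimal sketch of how blocking audio I/O and CPU-bound feature extraction could be moved off the event loop with an executor. The helper names `load_audio` and `extract_features` are placeholders standing in for the real calls inside `process_mm_data_async`, not SGLang APIs.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical blocking helpers standing in for audio loading and
# feature extraction in the PR's processor.
def load_audio(path):
    return [0.0] * 16000  # pretend: one second of silence at 16 kHz

def extract_features(samples):
    return sum(samples) / len(samples)  # pretend: a cheap summary feature

_executor = ThreadPoolExecutor(max_workers=2)

async def process_mm_data_async(path):
    loop = asyncio.get_running_loop()
    # Run blocking I/O and CPU-bound work in the executor so the
    # server's event loop stays responsive.
    samples = await loop.run_in_executor(_executor, load_audio, path)
    features = await loop.run_in_executor(_executor, extract_features, samples)
    return features

features = asyncio.run(process_mm_data_async("dummy.wav"))
```

The same pattern works with a `ProcessPoolExecutor` for heavier CPU-bound work such as mel-spectrogram extraction.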
Thanks for the contribution. Could you please add some tests?
I will add them once I'm able to run the model. As I mentioned above, the script crashes with no error, so I can't debug it. I'm not very experienced with SGLang, so I'll need help from the team or other contributors; I can handle all the modeling and inference logic of Whisper, but not the SGL internals.
Got it, @byjiang1996, could you take a look? Thanks!
python/sglang/srt/models/whisper.py
Could you please follow this class to add a unit test for WhisperForConditionalGeneration to avoid future regressions? We can use a small OpenAI Whisper model in the unit test so it won't take too long to run.
I reimplemented the model using SGL modules to handle the KV caching, and I'm getting the following error. It seems to be a problem with the encoder KV cache: we don't need to store the KV values for the encoder in the forward_extend stage; only the final encoder hidden state should be stored for the decoding phase.
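To make the idea concrete, here is a minimal sketch of a per-request cache that keeps only the encoder's final hidden states across decode steps, rather than per-layer encoder KV tensors. The class and method names are illustrative, not the actual SGLang interface.

```python
import numpy as np

class EncoderOutputCache:
    """Per-request cache holding only the encoder's final hidden states.

    Sketch of the idea above: the encoder's per-layer KV tensors are not
    needed after prefill; every decode step only re-reads the final
    hidden states for cross-attention.
    """
    def __init__(self):
        self._store = {}

    def put(self, request_id, hidden_states):
        self._store[request_id] = hidden_states

    def get(self, request_id):
        return self._store[request_id]

    def free(self, request_id):
        # Release the entry when the request finishes.
        self._store.pop(request_id, None)

cache = EncoderOutputCache()
# Prefill: run the encoder once and keep only its output.
encoder_out = np.ones((1500, 384), dtype=np.float32)  # (audio frames, hidden)
cache.put("req-0", encoder_out)
# Decode steps: reuse the same hidden states every step.
step_input = cache.get("req-0")
```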
What should be done now is:
I'm blocked on two things:
```
input_ids=tensor([[50258, 50363, 50259, ... , 50257]], device='cuda:0'), shape=torch.Size([1, 27])
seq_lens=tensor([1], device='cuda:0')
out_cache_loc=tensor([1], device='cuda:0')
seq_lens_sum=1
positions=tensor([0], device='cuda:0')
```

EDIT: point 2 was solved by removing the batch dim from the multimodal processor outputs.
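For reference, the batch-dimension fix described in the edit amounts to something like the following sketch, with numpy standing in for the processor's tensors; the shape and helper name are illustrative.

```python
import numpy as np

# A processor that returns batched tensors of shape (1, ...) even for a
# single item, while the scheduler expects per-item tensors without
# that leading dimension.
input_features = np.zeros((1, 80, 3000), dtype=np.float32)  # (batch, mel bins, frames)

def debatch(t):
    # Drop the leading batch dimension the multimodal processor added.
    assert t.shape[0] == 1, "expected a single-item batch"
    return t[0]

per_item = debatch(input_features)
```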
Any update on this PR? We are very eager to try Whisper with SGLang.
Hi @MahmoudAshraf97, I tried to run Whisper on a 4090D using this PR branch. The example script in the description appears to be the client script, so I launched the server first. Could you please support the scenario where mm_inputs is None as well? If there is any mistake, please correct me. Thanks!
Right now, sglang doesn't have very good support for cross-attention, and I feel that this implementation has some issues.
Does SGLang have future plans to support this? Or is there a workable patch to enable Whisper?
I don’t think there are any official plans for that. Gracefully supporting cross-attention would require quite a bit of effort, and for now, it seems that not many models actually need it. |
@yanbing-j There is no case where the text input to Whisper can be None, so I don't understand why it should be supported. @yhyang201 This implementation is correct, and I've verified it using multiple intermediate tensors. I have not committed the workarounds needed to make the model work to this branch, because the problems need to be solved on the SGL side; Whisper uses very standard MHA, so I don't understand why it needs these workarounds to work.
Your implementation should be fine. However, SGLang's support for cross-attention is not very complete (it seems somewhat coupled with mllama). Specifically, you can see this part: . Given the current state of cross-attention support in SGLang, I suspect the model might not function properly.
I agree, since mllama was the first model implemented that uses cross-attention. Anyway, the workaround to make the model work is to override cross-attention with a manual computation of the attention.
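The manual cross-attention workaround mentioned above can be sketched as plain scaled dot-product attention, with decoder queries attending over the cached encoder hidden states. This is a numpy sketch of the math only, not the PR's actual code; shapes and names are assumptions.

```python
import numpy as np

def manual_cross_attention(q, k, v):
    """Plain scaled dot-product cross-attention.

    q: decoder queries, shape (heads, q_len, head_dim)
    k, v: projections of the cached encoder hidden states,
          shape (heads, enc_len, head_dim)
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k.transpose(0, 2, 1) * scale      # (heads, q_len, enc_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over encoder positions
    return weights @ v                             # (heads, q_len, head_dim)

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 1, 64))     # one decode step
k = rng.standard_normal((6, 1500, 64))  # 30 s of audio frames
v = rng.standard_normal((6, 1500, 64))
out = manual_cross_attention(q, k, v)
```

Since the encoder sequence length is fixed per request, no causal mask is needed, which is what makes bypassing the paged-attention backend feasible here.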
@MahmoudAshraf97 Could you please share how to run Whisper using this PR branch? I listed my cmd in the comments above; I tried to launch the server and run the client part using your example script from the description, but it fails when launching the server.
Just disable CUDA graphs and the server will run as expected. Also, I've updated the script above to be simpler, since the implementation is more mature now.
@MahmoudAshraf97 Thanks for the work. I tried it with a 3:36 audio file but I am getting the following error, so I guess this PR doesn't support longer audio files?
No, this error is unrelated to the length of the audio file; use this patch to fix it.
Motivation
This PR aims to implement the Whisper model for STT (Speech-to-Text).
Modifications
Audio modality is already supported by many models, but most of them use an encoder projection to convert the audio into audio tokens that the language model can consume. Whisper instead uses encoder-decoder cross-attention, and since I see there is a dispatch in the backends for cross-attention, I assume it's supported.
I still need help to get this completed, because the process gets killed without any error output, so it's hard to debug.
Edit: the script was being killed at the resampling stage; using a 16 kHz audio file avoids that for now.
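For context on the resampling step that was crashing, here is a minimal sketch of resampling arbitrary-rate audio to the 16 kHz Whisper expects, using naive linear interpolation with numpy. A production server would more likely use a proper polyphase resampler (e.g. soxr or librosa); this is only to illustrate the operation.

```python
import numpy as np

def resample_linear(samples, orig_sr, target_sr=16000):
    """Naive linear-interpolation resampler (sketch only; a real
    implementation would use a polyphase resampler such as soxr)."""
    if orig_sr == target_sr:
        return samples
    n_out = int(round(len(samples) * target_sr / orig_sr))
    x_old = np.arange(len(samples)) / orig_sr   # original sample times (s)
    x_new = np.arange(n_out) / target_sr        # target sample times (s)
    return np.interp(x_new, x_old, samples)

audio_44k = np.zeros(44100, dtype=np.float32)   # one second at 44.1 kHz
audio_16k = resample_linear(audio_44k, 44100)
```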
Checklist
To Do:
This function is hardcoded to support the current models, but it's not generalizable, as each model has its own quirks. I suggest offloading it to each model's implementation, as is the case with pad_input_ids (sglang/python/sglang/srt/managers/schedule_batch.py, line 1063 in 045ab92).
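The refactor suggested above can be sketched as a per-model hook: a default implementation on a base class, with each model overriding it to handle its own quirks. All names here are illustrative and not the actual SGLang interface.

```python
# Sketch of the suggested refactor: instead of one hardcoded function
# covering every model's quirks, each model class implements its own hook.
class BaseMultimodalModel:
    def pad_input_ids(self, input_ids, mm_info):
        # Sensible default: no padding needed.
        return input_ids

class WhisperLike(BaseMultimodalModel):
    AUDIO_TOKEN = -1  # placeholder id; real models define their own

    def pad_input_ids(self, input_ids, mm_info):
        # A Whisper-style model could expand one placeholder token into
        # as many slots as the audio clip produces.
        out = []
        for tok in input_ids:
            if tok == self.AUDIO_TOKEN:
                out.extend([self.AUDIO_TOKEN] * mm_info["num_audio_tokens"])
            else:
                out.append(tok)
        return out

model = WhisperLike()
padded = model.pad_input_ids([50258, -1, 50257], {"num_audio_tokens": 3})
```

The scheduler then just calls `model.pad_input_ids(...)` without knowing which architecture it is dealing with.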