[Model] Add LoRA support for Whisper models #29856
jeejeelee merged 6 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/
Code Review
This pull request introduces multi-LoRA support for Whisper models, which is a valuable addition. The implementation is robust and well-engineered. I appreciate that instead of a model-specific hack, the changes generalize the existing LoRA infrastructure to support Whisper's architecture, particularly the KV-only packed layers in cross-attention. The inclusion of comprehensive unit tests and a clear example script significantly enhances the quality and usability of this contribution. The code is clean, the logic is sound, and the changes are well-documented. Overall, this is an excellent pull request.
Force-pushed 93182eb to ba3826b
Will look at this PR ASAP, also cc @NickLucche
jeejeelee
left a comment
Thank you for your contribution. The main concern is that maybe we should use MergedColumnParallelLinear rather than QKVParallelLinear in the base model.
    # LoRA-specific attributes
    embedding_modules = {}
    embedding_padding_modules: list[str] = []
If the model inherits from SupportsLoRA, these two attributes are empty by default
Thank you, I'll remove these redundant attributes.
@@ -0,0 +1,136 @@
# SPDX-License-Identifier: Apache-2.0
It looks like this example is similar to multilora_inference.py, so do we need to add this example?
You're right - it's similar to the existing multilora_inference.py.
I'll remove whisper_multilora_inference.py from this PR.
@@ -398,7 +403,11 @@ def can_replace_layer(
    packed_modules_list: list,
    model_config: PretrainedConfig | None = None,
) -> bool:
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) == 3
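The PR initially relaxed this predicate to also accept KV-only (2-slice) packed layers, a change that was later reverted in favor of MergedColumnParallelLinear. A runnable stand-in sketch of that relaxed check (QKVParallelLinear here is a dummy class, not vLLM's real layer):

```python
class QKVParallelLinear:
    """Dummy stand-in for vllm.model_executor.layers.linear.QKVParallelLinear."""


def can_replace_layer(source_layer, packed_modules_list: list) -> bool:
    # Relaxed check: accept both KV-only (2 slices) and full QKV (3 slices).
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) in (2, 3)


layer = QKVParallelLinear()
print(can_replace_layer(layer, ["k_proj", "v_proj"]))            # True: KV-only
print(can_replace_layer(layer, ["q_proj", "k_proj", "v_proj"]))  # True: full QKV
print(can_replace_layer(layer, ["gate_proj"]))                   # False
```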
Can we use MergedColumnParallelLinear rather than QKVParallelLinear in the base model?
I will:
- Revert my changes to MergedQKVParallelLinearWithLoRA in column_parallel_linear.py
- Update whisper.py to use MergedColumnParallelLinear for the cross-attention's kv_proj layer
I'll update the PR with these changes shortly. Thanks again for the review!
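For intuition, the switch works because a KV-only projection is just the K and V projections stacked along the output dimension, which is exactly the shape MergedColumnParallelLinear models. A minimal NumPy sketch of that equivalence (pure illustration, not vLLM code):

```python
import numpy as np

embed_dim = 8
rng = np.random.default_rng(0)

# Two separate projections, as stored in the checkpoint...
w_k = rng.standard_normal((embed_dim, embed_dim))
w_v = rng.standard_normal((embed_dim, embed_dim))

# ...held as one merged weight: shard 0 is k_proj, shard 1 is v_proj.
w_kv = np.concatenate([w_k, w_v], axis=0)

x = rng.standard_normal(embed_dim)
k, v = np.split(w_kv @ x, 2)

# The merged forward pass reproduces the two separate projections.
assert np.allclose(k, w_k @ x)
assert np.allclose(v, w_v @ x)
```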
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 22c6415 to 1b48b46
Force-pushed 1b48b46 to e3250e7
@jeejeelee Would you please let me know if there's any additional work?
tests/lora/test_whisper_lora.py (outdated)
@@ -0,0 +1,144 @@
# SPDX-License-Identifier: Apache-2.0
Could you please delete this test script? I think this test is unnecessary.
@daje0601 I think we should delete this test script
I hadn't seen this comment before, but I see it now, so I'll delete the test and push again.
jeejeelee
left a comment
After removing the above test, LGTM. Thank you for your contribution.
Fantastic work :-) What is the timeline for merging this?
I've been waiting too~ If there's anything else I need to do on my end, could you please let me know?
Force-pushed 55b3c02 to cdd5a70
I deleted the test and pushed again, but CI is still stuck pending at the same step. Please take a look.
NickLucche
left a comment
@daje0601 Thanks for your work!
Given the popularity of the model, I think we should really add tests with Whisper plus some LoRA adapter.
vllm/lora/worker_manager.py (outdated)
self.max_position_embeddings = getattr(
    text_config,
    "max_position_embeddings",
    getattr(text_config, "max_target_positions", None),
)
you should probably check if is_encoder_decoder with vllm_config.model_config.is_encoder_decoder
and add a TODO to generalize for OOT enc-dec models
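A minimal sketch of the suggested check, using stand-in config objects rather than vLLM's real config classes (attribute names follow the snippet under review):

```python
from types import SimpleNamespace


def resolve_max_positions(model_config):
    """Pick the positional limit used by the LoRA worker manager."""
    text_config = model_config.text_config
    # TODO: generalize for out-of-tree encoder-decoder models.
    if model_config.is_encoder_decoder:
        # Whisper-style configs expose max_target_positions instead.
        return getattr(text_config, "max_target_positions", None)
    return getattr(text_config, "max_position_embeddings", None)


# Stand-in configs for illustration only:
whisper_cfg = SimpleNamespace(
    is_encoder_decoder=True,
    text_config=SimpleNamespace(max_target_positions=448),
)
decoder_cfg = SimpleNamespace(
    is_encoder_decoder=False,
    text_config=SimpleNamespace(max_position_embeddings=4096),
)

print(resolve_max_positions(whisper_cfg))  # 448
print(resolve_max_positions(decoder_cfg))  # 4096
```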
Thanks, I'll check it out tonight!
Could you please trigger the buildkite/ci/pr check when you have a chance? Thank you!
@NickLucche Thanks for the review! I've addressed your feedback.
Could you please trigger the buildkite/ci/pr check?
Hi @daje0601, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
NickLucche
left a comment
Thanks for your work @daje0601!
@NickLucche I pushed a fix for the CI failure.
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Changes:
- Add SupportsLoRA interface to WhisperForConditionalGeneration
- Add packed_modules_mapping for LoRA compatibility
- Use MergedColumnParallelLinear for kv_proj in cross-attention
- Add fallback to max_target_positions in WorkerLoRAManager
- Add unit tests for Whisper LoRA support

Signed-off-by: daje0601 <englishmt4118@gmail.com>
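The packed_modules_mapping mentioned in this commit can be sketched as follows; this is an assumed shape based on vLLM's usual convention (each packed LoRA module maps to the sub-projections it fuses), not a copy of the merged code:

```python
# Assumed mapping on WhisperForConditionalGeneration: packed module name
# -> list of checkpoint sub-projections fused into it.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],  # self-attention
    "kv_proj": ["k_proj", "v_proj"],             # cross-attention has no Q in kv_proj
}

for packed, parts in packed_modules_mapping.items():
    print(packed, "->", parts)
```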
…kv_proj
Address maintainer feedback:
- Replace QKVParallelLinear with MergedColumnParallelLinear for kv_proj
in WhisperCrossAttention, enabling LoRA support via existing
MergedColumnParallelLinearWithLoRA infrastructure
- Update weight loading to use integer shard indices (0, 1) instead of
string identifiers ("k", "v") for MergedColumnParallelLinear
- Remove redundant embedding_modules and embedding_padding_modules
attributes from WhisperForConditionalGeneration
- Remove example file (similar to existing multilora_inference.py)
- Rollback LoRA layer changes as they are no longer needed
- Update tests to reflect new architecture
Signed-off-by: daje0601 <englishmt4118@gmail.com>
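The shard-index change above can be illustrated with a toy weight loader; names and shapes are stand-ins, but the idea matches: MergedColumnParallelLinear addresses its slices by integer position (0, 1) where the QKV layer uses string identifiers ("q", "k", "v"):

```python
import numpy as np

embed_dim = 4
rng = np.random.default_rng(1)

# Merged kv_proj weight: shard 0 holds k_proj, shard 1 holds v_proj.
merged = np.zeros((2 * embed_dim, embed_dim))


def load_shard(merged_weight, loaded_weight, shard_id: int):
    """Copy one checkpoint tensor into its slice of the merged weight."""
    start = shard_id * embed_dim
    merged_weight[start : start + embed_dim] = loaded_weight


w_k = rng.standard_normal((embed_dim, embed_dim))
w_v = rng.standard_normal((embed_dim, embed_dim))
load_shard(merged, w_k, 0)  # previously identified as "k"
load_shard(merged, w_v, 1)  # previously identified as "v"

assert np.allclose(merged[:embed_dim], w_k)
assert np.allclose(merged[embed_dim:], w_v)
```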
Signed-off-by: daje0601 <englishmt4118@gmail.com>
1. Use is_encoder_decoder check for max_position_embeddings handling
   - Check vllm_config.model_config.is_encoder_decoder explicitly
   - Use max_target_positions for encoder-decoder models (e.g., Whisper)
   - Use max_position_embeddings for other models
2. Add TODO comment for OOT encoder-decoder model generalization
3. Add Whisper + LoRA integration tests
   - test_whisper_lora_inference: Basic LoRA inference test
   - test_whisper_multi_lora: Multiple LoRA ID test
   - test_whisper_with_and_without_lora: LoRA comparison test
   - Uses chengyili2005/whisper-small-mandarin-lora adapter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Whisper has known issues with forked workers in vllm's v1 engine. Add autouse fixture to set VLLM_WORKER_MULTIPROC_METHOD=spawn, matching the pattern used in tests/models/multimodal/generation/test_whisper.py. Fixes CUDA re-initialization error in forked subprocess. Signed-off-by: daje0601 <englishmt4118@gmail.com>
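The spawn-forcing fixture described in this commit boils down to temporarily overriding one environment variable. A stand-alone sketch of that override as a context manager (in the actual test file it is wrapped as an autouse pytest fixture):

```python
import os
from contextlib import contextmanager


@contextmanager
def spawn_workers():
    """Force vLLM to spawn (not fork) worker processes, so CUDA can be
    initialized safely in each subprocess."""
    key = "VLLM_WORKER_MULTIPROC_METHOD"
    old = os.environ.get(key)
    os.environ[key] = "spawn"
    try:
        yield
    finally:
        # Restore whatever was set before, including "not set".
        if old is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = old


with spawn_workers():
    assert os.environ["VLLM_WORKER_MULTIPROC_METHOD"] == "spawn"
```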
Rebased on latest main and all CI checks are passing. Ready for merge!
@NickLucche @jeejeelee Gentle ping: CI is all green and both approvals are in. Could this be merged when you get a chance? Thanks!
Purpose
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.
Background
Currently, vLLM's WhisperForConditionalGeneration does not implement the SupportsLoRA interface, preventing users from using LoRA adapters with Whisper models. This limitation requires users to deploy separate model instances for each fine-tuned variant, which is inefficient in terms of GPU memory usage.
Changes

1. vllm/model_executor/models/whisper.py
   - Add the SupportsLoRA interface to WhisperForConditionalGeneration
   - Add the embedding_modules and embedding_padding_modules attributes required by LoRA
   - Add packed_modules_mapping with simplified keys (qkv_proj, kv_proj) for LoRA compatibility
2. vllm/lora/layers/column_parallel_linear.py
   - Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations; Whisper cross-attention layers (encoder_attn.kv_proj) only have K and V projections, not Q
   - Update can_replace_layer() to accept both 2-module and 3-module configurations
   - Update slice_lora_a() to dynamically handle a variable number of slices
3. vllm/lora/worker_manager.py
   - Fall back to max_target_positions when max_position_embeddings is not available; Whisper's config uses max_target_positions instead of max_position_embeddings
4. examples/offline_inference/whisper_multilora_inference.py
5. tests/lora/test_whisper_lora.py

Test Plan
Test Result (Unit Tests)
Manual Testing
Tested with openai/whisper-large-v3-turbo base model and custom LoRA adapters:
Example Usage
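The snippets under Example Usage did not survive extraction. As a stand-in, the core of multi-LoRA serving is tagging each request with an adapter; a minimal sketch using a hypothetical LoRARequest stand-in and hypothetical adapter paths (vLLM's real entry point is LLM(..., enable_lora=True) plus vllm.lora.request.LoRARequest):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRARequest:
    """Stand-in mirroring the shape of vllm.lora.request.LoRARequest."""
    lora_name: str
    lora_int_id: int
    lora_path: str


# One adapter per fine-tuned Whisper variant; paths are hypothetical.
adapters = [
    LoRARequest("mandarin", 1, "/adapters/whisper-mandarin-lora"),
    LoRARequest("medical", 2, "/adapters/whisper-medical-lora"),
]

# Route audio requests round-robin across adapters served from one base model.
audio_files = ["a.wav", "b.wav", "c.wav", "d.wav"]
assignments = [
    (audio, adapters[i % len(adapters)].lora_name)
    for i, audio in enumerate(audio_files)
]
print(assignments)
```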