[Model] Add LoRA support for Whisper models #29856

Merged
jeejeelee merged 6 commits into vllm-project:main from daje0601:whisper-multi-lora-support
Mar 5, 2026

Conversation

Contributor

@daje0601 daje0601 commented Dec 2, 2025

Purpose

This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Background

Currently, vLLM's WhisperForConditionalGeneration does not implement the SupportsLoRA interface, preventing users from using LoRA adapters with Whisper models. This limitation requires users to deploy
separate model instances for each fine-tuned variant, which is inefficient in terms of GPU memory usage.

Changes

1. vllm/model_executor/models/whisper.py

  • Add SupportsLoRA interface to WhisperForConditionalGeneration
  • Add embedding_modules and embedding_padding_modules attributes required by LoRA
  • Update packed_modules_mapping to use simplified keys (qkv_proj, kv_proj) for LoRA compatibility

2. vllm/lora/layers/column_parallel_linear.py

  • Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations
  • This is necessary because Whisper's cross-attention layers (encoder_attn.kv_proj) only have K and V projections, not Q
  • Update can_replace_layer() to accept both 2-module and 3-module configurations
  • Refactor slice_lora_a() to dynamically handle variable number of slices

3. vllm/lora/worker_manager.py

  • Add fallback to max_target_positions when max_position_embeddings is not available
  • Whisper config uses max_target_positions instead of max_position_embeddings

4. examples/offline_inference/whisper_multilora_inference.py

  • Add example script demonstrating Whisper Multi-LoRA inference

5. tests/lora/test_whisper_lora.py

  • Add unit tests for Whisper LoRA interface compliance
  • Add tests for KV-only configuration support
  • Add tests for WorkerLoRAManager Whisper compatibility
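
The simplified packed-module mapping described in change 1 can be sketched in plain Python. The key and sub-module names below follow the PR description; `expand_target_modules` is a hypothetical helper added purely for illustration, not vLLM API:

```python
# Sketch of the simplified packed-module mapping (names from the PR text).
# "qkv_proj" packs the fused self-attention projections; "kv_proj" packs
# only K and V, matching Whisper's cross-attention which has no fused Q.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],  # self-attention
    "kv_proj": ["k_proj", "v_proj"],             # cross-attention, K/V only
}

def expand_target_modules(targets: list[str]) -> list[str]:
    """Expand packed keys into the per-projection module names that a LoRA
    adapter trained against the HF checkpoint would target."""
    expanded: list[str] = []
    for name in targets:
        expanded.extend(packed_modules_mapping.get(name, [name]))
    return expanded
```

This is why a 2-slice configuration has to be representable at the LoRA layer level: the adapter checkpoint addresses `k_proj`/`v_proj` separately, while the base model fuses them.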

Test Plan

# Run unit tests
pytest tests/lora/test_whisper_lora.py -v

# Run example (requires LoRA adapter)
python examples/offline_inference/whisper_multilora_inference.py

Test Results (Unit Tests)

tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_supports_lora_attribute PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_embedding_modules_defined PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_embedding_padding_modules_defined PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_packed_modules_mapping_format PASSED
tests/lora/test_whisper_lora.py::TestMergedQKVParallelLinearWithLoRAKVOnly::test_can_replace_layer_accepts_2_modules PASSED
tests/lora/test_whisper_lora.py::TestWorkerLoRAManagerWhisperCompat::test_max_position_embeddings_fallback PASSED
tests/lora/test_whisper_lora.py::TestWorkerLoRAManagerWhisperCompat::test_max_position_embeddings_priority PASSED

Manual Testing
Tested with openai/whisper-large-v3-turbo base model and custom LoRA adapters:

  • Server startup with --enable-lora flag: ✅
  • Single LoRA inference: ✅
  • Multi-LoRA switching between requests: ✅
  • Concurrent requests with different LoRAs: ✅

Example Usage

from vllm import LLM
from vllm.lora.request import LoRARequest

# Initialize with LoRA support
llm = LLM(
    model="openai/whisper-large-v3-turbo",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=64,
)

# Use different LoRA adapters per request
outputs = llm.generate(
    inputs,
    lora_request=LoRARequest("my_whisper_lora", 1, "/path/to/lora")
)

Or, equivalently, via the OpenAI-compatible server:

vllm serve yourname/yourmodel \
--host 0.0.0.0 \
--port 8181 \
--dtype bfloat16 \
--trust-remote-code \
--enable-lora \
--lora-modules \
lora1=lora_module_path \
lora2=lora_module_path \
--max-lora-rank 32 \
--gpu-memory-utilization 0.7 \
--tensor-parallel-size 1

@daje0601 daje0601 requested a review from jeejeelee as a code owner December 2, 2025 08:56
@mergify

mergify bot commented Dec 2, 2025

Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/

@mergify mergify bot added the documentation Improvements or additions to documentation label Dec 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces multi-LoRA support for Whisper models, which is a valuable addition. The implementation is robust and well-engineered. I appreciate that instead of a model-specific hack, the changes generalize the existing LoRA infrastructure to support Whisper's architecture, particularly the KV-only packed layers in cross-attention. The inclusion of comprehensive unit tests and a clear example script significantly enhances the quality and usability of this contribution. The code is clean, the logic is sound, and the changes are well-documented. Overall, this is an excellent pull request.

@daje0601 daje0601 closed this Dec 2, 2025
@daje0601 daje0601 reopened this Dec 2, 2025
@daje0601 daje0601 force-pushed the whisper-multi-lora-support branch 2 times, most recently from 93182eb to ba3826b Compare December 2, 2025 11:36
@jeejeelee
Collaborator

Will look at this PR ASAP, also cc @NickLucche

Collaborator

@jeejeelee jeejeelee left a comment

Thank you for your contribution. The main concern is that maybe we should use MergedColumnParallelLinear rather than QKVLinear in the base model

Comment on lines +783 to +785
# LoRA-specific attributes
embedding_modules = {}
embedding_padding_modules: list[str] = []
Collaborator

If the model inherits from SupportsLoRA, these two attributes are empty by default
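
The reviewer's point can be mimicked in plain Python. The class names match the ones in the conversation, but this is an illustrative stand-in for the real vLLM interface, not its actual definition:

```python
# Illustrative stand-in: class-level defaults declared once on the base
# interface are inherited by subclasses, so redeclaring empty values in the
# model class is redundant.
class SupportsLoRA:
    embedding_modules: dict = {}
    embedding_padding_modules: list = []

class WhisperForConditionalGeneration(SupportsLoRA):
    pass  # inherits the empty defaults; no need to redefine them
```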

Contributor Author

Thank you, I'll remove these redundant attributes.

@@ -0,0 +1,136 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

It looks like this example is similar to multilora_inference.py, so do we need to add this example?

Contributor Author

You're right - it's similar to the existing multilora_inference.py.
I'll remove whisper_multilora_inference.py from this PR.

@@ -398,7 +403,11 @@ def can_replace_layer(
packed_modules_list: list,
model_config: PretrainedConfig | None = None,
) -> bool:
return type(source_layer) is QKVParallelLinear and len(packed_modules_list) == 3
Collaborator

Can we use MergedColumnParallelLinear rather than QKVParallelLinear in base model?

Contributor Author

I will:

  • Revert my changes to MergedQKVParallelLinearWithLoRA in column_parallel_linear.py
  • Update whisper.py to use MergedColumnParallelLinear for the cross-attention's kv_proj layer

I'll update the PR with these changes shortly. Thanks again for the review!
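
The resulting design can be mimicked with plain lists (a hypothetical helper; the real vLLM layers operate on sharded tensors): the K and V checkpoint shards are written into one merged column-parallel weight, addressed by integer shard ids rather than the "q"/"k"/"v" string ids that QKVParallelLinear uses.

```python
def load_merged_kv(k_shard: list[float], v_shard: list[float]) -> list[float]:
    """Hypothetical mimic of merged K/V weight loading: integer shard ids
    (0 = K, 1 = V) index into one concatenated output dimension."""
    size = len(k_shard)
    assert len(v_shard) == size, "K and V shards must be the same size"
    merged = [0.0] * (2 * size)
    for shard_id, shard in ((0, k_shard), (1, v_shard)):
        offset = shard_id * size          # shard id picks the slice offset
        merged[offset:offset + size] = shard
    return merged
```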

@mergify

mergify bot commented Dec 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @daje0601.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 6, 2025
@daje0601 daje0601 force-pushed the whisper-multi-lora-support branch from 22c6415 to 1b48b46 Compare December 6, 2025 16:46
@mergify mergify bot removed the needs-rebase label Dec 6, 2025
@daje0601 daje0601 force-pushed the whisper-multi-lora-support branch from 1b48b46 to e3250e7 Compare December 7, 2025 14:03
@daje0601
Contributor Author

@jeejeelee Would you please let me know if there's any additional work?
If not, I'm planning to work on implementing multi-LoRA support for TTS as well.
Thank you!

@@ -0,0 +1,144 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

Could you please delete this test script, I think this test is unnecessary.

Collaborator

@daje0601 I think we should delete this test script

Contributor Author

I hadn't seen this comment before, but I see it now; I'll delete the script and push again.

Collaborator

@jeejeelee jeejeelee left a comment

After removing the above test, LGTM. Thank you for the contribution!

@joennlae
Contributor

Fantastic work :-) What is the timeline for merging that?

@daje0601
Contributor Author

Fantastic work :-) What is the timeline for merging that?

I've been waiting too! If there's anything else I need to do on my end, could you please let me know?

@mergify mergify bot added the rocm Related to AMD ROCm label Jan 22, 2026
@mergify mergify bot added the cpu Related to CPU backends label Jan 22, 2026
@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Jan 22, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 22, 2026
@mergify mergify bot added v1 tpu Related to Google TPUs tool-calling labels Jan 22, 2026
@daje0601
Contributor Author

After removing the above test, LGTM. Thank you for the contribution!

I deleted the test and pushed again, but the check is still stuck pending at the same place. Please take a look.

Collaborator

@NickLucche NickLucche left a comment

@daje0601 Thanks for your work!

Given the popularity of the model, I think we should really be adding tests with Whisper plus some LoRA adapter.

Comment on lines +53 to +57
self.max_position_embeddings = getattr(
text_config,
"max_position_embeddings",
getattr(text_config, "max_target_positions", None),
)
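
A stdlib-only mimic of this fallback (with `SimpleNamespace` standing in for the HF text config) illustrates the behavior under review:

```python
from types import SimpleNamespace

def resolve_max_positions(text_config):
    # Prefer max_position_embeddings (most decoder-only models); fall back
    # to max_target_positions, which is what Whisper's config exposes.
    return getattr(
        text_config,
        "max_position_embeddings",
        getattr(text_config, "max_target_positions", None),
    )

whisper_like = SimpleNamespace(max_target_positions=448)   # assumed value
llama_like = SimpleNamespace(max_position_embeddings=4096)  # assumed value
```

The reviewer's objection below is that this chained `getattr` hides the real condition (whether the model is encoder-decoder) behind an attribute-presence check.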
Collaborator

you should probably check if is_encoder_decoder with vllm_config.model_config.is_encoder_decoder

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and add a TODO to generalize for OOT enc-dec models

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll check it out tonight!

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please trigger the buildkite/ci/pr check when you have a chance? Thank you!

@daje0601
Contributor Author

@NickLucche Thanks for the review! I've addressed your feedback:

Changes Made

  1. is_encoder_decoder check - Now explicitly checking vllm_config.model_config.is_encoder_decoder instead of relying on getattr fallback chain

  2. TODO comment added - Added TODO for generalizing max_position_embeddings handling for out-of-tree encoder-decoder models

  3. Whisper + LoRA integration tests added - Created tests/lora/test_whisper.py with:

    • test_whisper_lora_inference: Basic LoRA inference test
    • test_whisper_multi_lora: Multiple LoRA adapter IDs test
    • test_whisper_with_and_without_lora: Comparison test with/without LoRA
    • Uses public adapter: chengyili2005/whisper-small-mandarin-lora

Could you please trigger the buildkite/ci/pr check when you have a chance? Thank you!
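
The `is_encoder_decoder` change described in point 1 can be sketched as follows (attribute names are taken from the discussion, not verified against the final diff):

```python
from types import SimpleNamespace

def resolve_max_positions(model_config, text_config):
    # Branch on the explicit encoder-decoder flag instead of a chained
    # getattr fallback.
    # TODO: generalize for out-of-tree encoder-decoder models.
    if model_config.is_encoder_decoder:
        return text_config.max_target_positions   # e.g. Whisper
    return text_config.max_position_embeddings    # decoder-only models
```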

@mergify

mergify bot commented Jan 28, 2026

Hi @daje0601, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Collaborator

@NickLucche NickLucche left a comment

Thanks for your work @daje0601 !

@daje0601
Contributor Author

daje0601 commented Feb 9, 2026

@NickLucche I pushed a fix for the CI failure (test_whisper_multi_lora).
Added VLLM_WORKER_MULTIPROC_METHOD=spawn autouse fixture — same pattern
as tests/models/multimodal/generation/test_whisper.py.
Could you trigger the Buildkite CI when you get a chance?

daje0601 and others added 6 commits March 2, 2026 23:39
This PR enables Multi-LoRA support for Whisper speech-to-text models,
allowing users to serve multiple fine-tuned Whisper adapters from a
single base model.

Changes:
- Add SupportsLoRA interface to WhisperForConditionalGeneration
- Add packed_modules_mapping for LoRA compatibility
- Use MergedColumnParallelLinear for kv_proj in cross-attention
- Add fallback to max_target_positions in WorkerLoRAManager
- Add unit tests for Whisper LoRA support

Signed-off-by: daje0601 <englishmt4118@gmail.com>
…kv_proj

Address maintainer feedback:
- Replace QKVParallelLinear with MergedColumnParallelLinear for kv_proj
  in WhisperCrossAttention, enabling LoRA support via existing
  MergedColumnParallelLinearWithLoRA infrastructure
- Update weight loading to use integer shard indices (0, 1) instead of
  string identifiers ("k", "v") for MergedColumnParallelLinear
- Remove redundant embedding_modules and embedding_padding_modules
  attributes from WhisperForConditionalGeneration
- Remove example file (similar to existing multilora_inference.py)
- Rollback LoRA layer changes as they are no longer needed
- Update tests to reflect new architecture

Signed-off-by: daje0601 <englishmt4118@gmail.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
1. Use is_encoder_decoder check for max_position_embeddings handling
   - Check vllm_config.model_config.is_encoder_decoder explicitly
   - Use max_target_positions for encoder-decoder models (e.g., Whisper)
   - Use max_position_embeddings for other models

2. Add TODO comment for OOT encoder-decoder model generalization

3. Add Whisper + LoRA integration tests
   - test_whisper_lora_inference: Basic LoRA inference test
   - test_whisper_multi_lora: Multiple LoRA ID test
   - test_whisper_with_and_without_lora: LoRA comparison test
   - Uses chengyili2005/whisper-small-mandarin-lora adapter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Whisper has known issues with forked workers in vllm's v1 engine.
Add autouse fixture to set VLLM_WORKER_MULTIPROC_METHOD=spawn,
matching the pattern used in tests/models/multimodal/generation/test_whisper.py.

Fixes CUDA re-initialization error in forked subprocess.

Signed-off-by: daje0601 <englishmt4118@gmail.com>
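
The effect of the spawn fixture described in the commit message above can be approximated with a stdlib-only context manager (the actual test file uses a pytest autouse fixture with monkeypatch.setenv; the env var name comes from the PR text):

```python
import os
from contextlib import contextmanager

@contextmanager
def spawn_workers():
    """Temporarily force vLLM workers to start via 'spawn' rather than
    'fork', avoiding CUDA re-initialization errors in forked subprocesses.
    Restores the previous value of the env var on exit."""
    old = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD")
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("VLLM_WORKER_MULTIPROC_METHOD", None)
        else:
            os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = old
```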
@daje0601
Contributor Author

daje0601 commented Mar 2, 2026

Rebased on latest main and all CI checks are passing. Ready for merge!

@daje0601
Contributor Author

daje0601 commented Mar 5, 2026

@NickLucche @jeejeelee Gentle ping — CI is all green and both approvals are in. Could this be merged when you get a chance? Thanks!
