[Model] Add LoRA support for Whisper models #29856
jeejeelee merged 6 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/
Code Review
This pull request introduces multi-LoRA support for Whisper models, which is a valuable addition. The implementation is robust and well-engineered. I appreciate that instead of a model-specific hack, the changes generalize the existing LoRA infrastructure to support Whisper's architecture, particularly the KV-only packed layers in cross-attention. The inclusion of comprehensive unit tests and a clear example script significantly enhances the quality and usability of this contribution. The code is clean, the logic is sound, and the changes are well-documented. Overall, this is an excellent pull request.
Force-pushed 93182eb to ba3826b
Will look at this PR ASAP, also cc @NickLucche
jeejeelee
left a comment
Thank you for your contribution. The main concern is that maybe we should use MergedColumnParallelLinear rather than QKVParallelLinear in the base model.
    # LoRA-specific attributes
    embedding_modules = {}
    embedding_padding_modules: list[str] = []
If the model inherits from SupportsLoRA, these two attributes are empty by default
Thank you, I'll remove these redundant attributes.
@@ -0,0 +1,136 @@
# SPDX-License-Identifier: Apache-2.0
It looks like this example is similar to multilora_inference.py, so do we need to add this example?
You're right - it's similar to the existing multilora_inference.py.
I'll remove whisper_multilora_inference.py from this PR.
@@ -398,7 +403,11 @@ def can_replace_layer(
    packed_modules_list: list,
    model_config: PretrainedConfig | None = None,
) -> bool:
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) == 3
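The PR initially relaxed this predicate to also accept KV-only (2-slice) packed layers, a change that was later reverted in favor of MergedColumnParallelLinear. A runnable stand-in sketch of that relaxed check (QKVParallelLinear here is a dummy class, not vLLM's real layer):

```python
class QKVParallelLinear:
    """Dummy stand-in for vllm.model_executor.layers.linear.QKVParallelLinear."""


def can_replace_layer(source_layer, packed_modules_list: list) -> bool:
    # Relaxed check: accept both KV-only (2 slices) and full QKV (3 slices).
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) in (2, 3)


layer = QKVParallelLinear()
print(can_replace_layer(layer, ["k_proj", "v_proj"]))            # True: KV-only
print(can_replace_layer(layer, ["q_proj", "k_proj", "v_proj"]))  # True: full QKV
print(can_replace_layer(layer, ["gate_proj"]))                   # False
```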
Can we use MergedColumnParallelLinear rather than QKVParallelLinear in the base model?
I will:
- Revert my changes to MergedQKVParallelLinearWithLoRA in column_parallel_linear.py
- Update whisper.py to use MergedColumnParallelLinear for the cross-attention's kv_proj layer
I'll update the PR with these changes shortly. Thanks again for the review!
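For intuition, the switch works because a KV-only projection is just the K and V projections stacked along the output dimension, which is exactly the shape MergedColumnParallelLinear models. A minimal NumPy sketch of that equivalence (pure illustration, not vLLM code):

```python
import numpy as np

embed_dim = 8
rng = np.random.default_rng(0)

# Two separate projections, as stored in the checkpoint...
w_k = rng.standard_normal((embed_dim, embed_dim))
w_v = rng.standard_normal((embed_dim, embed_dim))

# ...held as one merged weight: shard 0 is k_proj, shard 1 is v_proj.
w_kv = np.concatenate([w_k, w_v], axis=0)

x = rng.standard_normal(embed_dim)
k, v = np.split(w_kv @ x, 2)

# The merged forward pass reproduces the two separate projections.
assert np.allclose(k, w_k @ x)
assert np.allclose(v, w_v @ x)
```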
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 22c6415 to 1b48b46
Force-pushed 1b48b46 to e3250e7
@jeejeelee Would you please let me know if there's any additional work?
tests/lora/test_whisper_lora.py (outdated)
@@ -0,0 +1,144 @@
# SPDX-License-Identifier: Apache-2.0
Could you please delete this test script? I think this test is unnecessary.
@daje0601 I think we should delete this test script
I hadn't seen this comment before, but I see it now, so I'll delete the test and push again.
jeejeelee
left a comment
After removing the above test, LGTM. Thank you for your contribution.
Fantastic work :-) What is the timeline for merging this?
I've been waiting too~ If there's anything else I need to do on my end, could you please let me know?
Force-pushed 55b3c02 to cdd5a70
I deleted the test and pushed again, but CI is still stuck pending at the same step. Please take a look.
NickLucche
left a comment
@daje0601 Thanks for your work!
Given the popularity of the model, I think we should really add tests with Whisper plus some LoRA adapter.
vllm/lora/worker_manager.py (outdated)
self.max_position_embeddings = getattr(
    text_config,
    "max_position_embeddings",
    getattr(text_config, "max_target_positions", None),
)
you should probably check if is_encoder_decoder with vllm_config.model_config.is_encoder_decoder
and add a TODO to generalize for OOT enc-dec models
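A minimal sketch of the suggested check, using stand-in config objects rather than vLLM's real config classes (attribute names follow the snippet under review):

```python
from types import SimpleNamespace


def resolve_max_positions(model_config):
    """Pick the positional limit used by the LoRA worker manager."""
    text_config = model_config.text_config
    # TODO: generalize for out-of-tree encoder-decoder models.
    if model_config.is_encoder_decoder:
        # Whisper-style configs expose max_target_positions instead.
        return getattr(text_config, "max_target_positions", None)
    return getattr(text_config, "max_position_embeddings", None)


# Stand-in configs for illustration only:
whisper_cfg = SimpleNamespace(
    is_encoder_decoder=True,
    text_config=SimpleNamespace(max_target_positions=448),
)
decoder_cfg = SimpleNamespace(
    is_encoder_decoder=False,
    text_config=SimpleNamespace(max_position_embeddings=4096),
)

print(resolve_max_positions(whisper_cfg))  # 448
print(resolve_max_positions(decoder_cfg))  # 4096
```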
Thanks, I'll check it out tonight!
Could you please trigger the buildkite/ci/pr check when you have a chance? Thank you!
@NickLucche Thanks for the review! I've addressed your feedback.
Could you please trigger the buildkite/ci/pr check?
Hi @daje0601, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
NickLucche
left a comment
Thanks for your work @daje0601!
@NickLucche I pushed a fix for the CI failure.
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Changes:
- Add SupportsLoRA interface to WhisperForConditionalGeneration
- Add packed_modules_mapping for LoRA compatibility
- Use MergedColumnParallelLinear for kv_proj in cross-attention
- Add fallback to max_target_positions in WorkerLoRAManager
- Add unit tests for Whisper LoRA support

Signed-off-by: daje0601 <englishmt4118@gmail.com>
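The packed_modules_mapping mentioned in this commit can be sketched as follows; this is an assumed shape based on vLLM's usual convention (each packed LoRA module maps to the sub-projections it fuses), not a copy of the merged code:

```python
# Assumed mapping on WhisperForConditionalGeneration: packed module name
# -> list of checkpoint sub-projections fused into it.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],  # self-attention
    "kv_proj": ["k_proj", "v_proj"],             # cross-attention has no Q in kv_proj
}

for packed, parts in packed_modules_mapping.items():
    print(packed, "->", parts)
```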
…kv_proj
Address maintainer feedback:
- Replace QKVParallelLinear with MergedColumnParallelLinear for kv_proj
in WhisperCrossAttention, enabling LoRA support via existing
MergedColumnParallelLinearWithLoRA infrastructure
- Update weight loading to use integer shard indices (0, 1) instead of
string identifiers ("k", "v") for MergedColumnParallelLinear
- Remove redundant embedding_modules and embedding_padding_modules
attributes from WhisperForConditionalGeneration
- Remove example file (similar to existing multilora_inference.py)
- Rollback LoRA layer changes as they are no longer needed
- Update tests to reflect new architecture
Signed-off-by: daje0601 <englishmt4118@gmail.com>
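The shard-index change above can be illustrated with a toy weight loader; names and shapes are stand-ins, but the idea matches: MergedColumnParallelLinear addresses its slices by integer position (0, 1) where the QKV layer uses string identifiers ("q", "k", "v"):

```python
import numpy as np

embed_dim = 4
rng = np.random.default_rng(1)

# Merged kv_proj weight: shard 0 holds k_proj, shard 1 holds v_proj.
merged = np.zeros((2 * embed_dim, embed_dim))


def load_shard(merged_weight, loaded_weight, shard_id: int):
    """Copy one checkpoint tensor into its slice of the merged weight."""
    start = shard_id * embed_dim
    merged_weight[start : start + embed_dim] = loaded_weight


w_k = rng.standard_normal((embed_dim, embed_dim))
w_v = rng.standard_normal((embed_dim, embed_dim))
load_shard(merged, w_k, 0)  # previously identified as "k"
load_shard(merged, w_v, 1)  # previously identified as "v"

assert np.allclose(merged[:embed_dim], w_k)
assert np.allclose(merged[embed_dim:], w_v)
```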
Signed-off-by: daje0601 <englishmt4118@gmail.com>
1. Use is_encoder_decoder check for max_position_embeddings handling
   - Check vllm_config.model_config.is_encoder_decoder explicitly
   - Use max_target_positions for encoder-decoder models (e.g., Whisper)
   - Use max_position_embeddings for other models
2. Add TODO comment for OOT encoder-decoder model generalization
3. Add Whisper + LoRA integration tests
   - test_whisper_lora_inference: Basic LoRA inference test
   - test_whisper_multi_lora: Multiple LoRA ID test
   - test_whisper_with_and_without_lora: LoRA comparison test
   - Uses chengyili2005/whisper-small-mandarin-lora adapter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Signed-off-by: daje0601 <englishmt4118@gmail.com>
Whisper has known issues with forked workers in vllm's v1 engine. Add autouse fixture to set VLLM_WORKER_MULTIPROC_METHOD=spawn, matching the pattern used in tests/models/multimodal/generation/test_whisper.py. Fixes CUDA re-initialization error in forked subprocess. Signed-off-by: daje0601 <englishmt4118@gmail.com>
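The spawn-forcing fixture described in this commit boils down to temporarily overriding one environment variable. A stand-alone sketch of that override as a context manager (in the actual test file it is wrapped as an autouse pytest fixture):

```python
import os
from contextlib import contextmanager


@contextmanager
def spawn_workers():
    """Force vLLM to spawn (not fork) worker processes, so CUDA can be
    initialized safely in each subprocess."""
    key = "VLLM_WORKER_MULTIPROC_METHOD"
    old = os.environ.get(key)
    os.environ[key] = "spawn"
    try:
        yield
    finally:
        # Restore whatever was set before, including "not set".
        if old is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = old


with spawn_workers():
    assert os.environ["VLLM_WORKER_MULTIPROC_METHOD"] == "spawn"
```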
Rebased on latest main and all CI checks are passing. Ready for merge!
@NickLucche @jeejeelee Gentle ping: CI is all green and both approvals are in. Could this be merged when you get a chance? Thanks!
Purpose
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.
Background
Currently, vLLM's WhisperForConditionalGeneration does not implement the SupportsLoRA interface, preventing users from using LoRA adapters with Whisper models. This limitation requires users to deploy separate model instances for each fine-tuned variant, which is inefficient in terms of GPU memory usage.
Changes

1. vllm/model_executor/models/whisper.py
   - Add the SupportsLoRA interface to WhisperForConditionalGeneration
   - Add the embedding_modules and embedding_padding_modules attributes required by LoRA
   - Add packed_modules_mapping with simplified keys (qkv_proj, kv_proj) for LoRA compatibility
2. vllm/lora/layers/column_parallel_linear.py
   - Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations; Whisper cross-attention layers (encoder_attn.kv_proj) only have K and V projections, not Q
   - Update can_replace_layer() to accept both 2-module and 3-module configurations
   - Update slice_lora_a() to dynamically handle a variable number of slices
3. vllm/lora/worker_manager.py
   - Fall back to max_target_positions when max_position_embeddings is not available; Whisper's config uses max_target_positions instead of max_position_embeddings
4. examples/offline_inference/whisper_multilora_inference.py
5. tests/lora/test_whisper_lora.py

Test Plan
Test Result (Unit Tests)
Manual Testing
Tested with openai/whisper-large-v3-turbo base model and custom LoRA adapters:
Example Usage
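The snippets under Example Usage did not survive extraction. As a stand-in, the core of multi-LoRA serving is tagging each request with an adapter; a minimal sketch using a hypothetical LoRARequest stand-in and hypothetical adapter paths (vLLM's real entry point is LLM(..., enable_lora=True) plus vllm.lora.request.LoRARequest):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRARequest:
    """Stand-in mirroring the shape of vllm.lora.request.LoRARequest."""
    lora_name: str
    lora_int_id: int
    lora_path: str


# One adapter per fine-tuned Whisper variant; paths are hypothetical.
adapters = [
    LoRARequest("mandarin", 1, "/adapters/whisper-mandarin-lora"),
    LoRARequest("medical", 2, "/adapters/whisper-medical-lora"),
]

# Route audio requests round-robin across adapters served from one base model.
audio_files = ["a.wav", "b.wav", "c.wav", "d.wav"]
assignments = [
    (audio, adapters[i % len(adapters)].lora_name)
    for i, audio in enumerate(audio_files)
]
print(assignments)
```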