
[Misc] Clean up Kimi-audio whisper encoder loading #36903

Merged
Isotr0py merged 7 commits into vllm-project:main from Isotr0py:subfolder-loader
Mar 14, 2026

Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Mar 12, 2026

Purpose

Test Plan

python examples/offline_inference/audio_language.py -m kimi_audio

Test Result

Rendering prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 34.34it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.37s/it, est. speed input: 157.17 toks/s, output: 46.78 toks/s]
[EOS]
A man is speaking and a telephone is ringing.[EOS][EOS][EOS]A man speaking with a telephone ringing in the background[EOS][EOS][EOS][EOS][EOS][EOS]A man speaking with a telephone ringing in the background[EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py requested a review from 22quinn as a code owner March 12, 2026 17:03
@Isotr0py Isotr0py requested a review from DarkLight1337 March 12, 2026 17:04
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a subfolder option to the DefaultModelLoader and refactors the Kimi-audio model to use this new, cleaner mechanism for loading the Whisper encoder weights. This is a significant improvement in simplifying the weight loading logic. However, I've identified a critical issue with the new skip_prefixes configuration that would likely prevent any model weights from being loaded. Additionally, there's a high-severity maintainability concern due to a large block of duplicated code for weight loading.
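The subfolder-scoped weight discovery described above can be sketched as a simple file filter. This is illustrative only: `list_weight_files` is a hypothetical helper, not the actual `DefaultModelLoader` API, and the real loader also handles sharded index files and other formats.

```python
from pathlib import PurePosixPath


def list_weight_files(all_files: list[str], subfolder: str = "") -> list[str]:
    """Return safetensors files, optionally restricted to `subfolder`.

    Hypothetical sketch of how a `subfolder` option might scope weight
    discovery: with no subfolder, only top-level files are considered;
    with one, only files directly under that sub-directory are kept.
    """
    selected = []
    for f in all_files:
        if not f.endswith(".safetensors"):
            continue
        parts = PurePosixPath(f).parts
        if subfolder:
            # Keep only files under the requested subfolder.
            if parts[0] != subfolder:
                continue
        elif len(parts) > 1:
            # No subfolder requested: skip nested files.
            continue
        selected.append(f)
    return selected
```

For a Kimi-audio-style checkpoint layout, the same repo listing would yield the main LLM shards by default and the Whisper encoder weights when called with `subfolder="whisper-large-v3"`.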

Comment on lines +104 to +135
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    stacked_params_mapping = [
        # (param_name, shard_name, shard_id)
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
    ]
    params_dict = dict(self.named_parameters())
    loaded_params: set[str] = set()
    for name, loaded_weight in weights:
        for param_name, weight_name, shard_id in stacked_params_mapping:
            if weight_name not in name:
                continue
            name = name.replace(weight_name, param_name)
            # Skip loading extra bias for GPTQ models.
            if name.endswith(".bias") and name not in params_dict:
                continue

            param = params_dict[name]
            weight_loader = param.weight_loader
            weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Skip loading extra bias for GPTQ models.
            if name.endswith(".bias") and name not in params_dict:
                continue

            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, loaded_weight)
        loaded_params.add(name)
    return loaded_params
Contributor


Severity: high

This load_weights method is nearly identical to the implementation in WhisperModel.load_weights. This significant code duplication can create maintenance challenges, as updates to one implementation may not be reflected in the other. To improve maintainability and code reuse, consider refactoring this logic into a shared utility function or a mixin class that can be used by both KimiAudioWhisperEncoder and WhisperModel.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py
Member Author

Let me double check model loading from remote repo with weights downloading tomorrow.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py
Member Author

Model loading from remote repo should work well now:

(Worker_TP0 pid=18859) INFO 03-13 16:15:53 [gpu_model_runner.py:4501] Starting to load model moonshotai/Kimi-Audio-7B-Instruct...
config.json: 1.27kB [00:00, 4.87MB/s]
(Worker_TP1 pid=18860) ERROR 03-13 16:15:54 [fa_utils.py:145] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP1 pid=18860) INFO 03-13 16:15:54 [mm_encoder_attention.py:230] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(Worker_TP0 pid=18859) ERROR 03-13 16:15:54 [fa_utils.py:145] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=18859) INFO 03-13 16:15:54 [mm_encoder_attention.py:230] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(Worker_TP0 pid=18859) INFO 03-13 16:15:54 [cuda.py:317] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
model.safetensors.index.json: 34.7kB [00:00, 65.0MB/s]
model-1-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:07<00:00, 61.2MB/s]
model-15-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:07<00:00, 59.5MB/s]
model-13-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 58.1MB/s]
model-11-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 55.3MB/s]
model-16-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 55.3MB/s]
model-12-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 55.3MB/s]
model-14-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 52.7MB/s]
model-10-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:09<00:00, 50.6MB/s]
model-19-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 42.1MB/s]
model-18-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 40.8MB/s]
model-17-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 40.0MB/s]
model-21-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 40.8MB/s]
model-22-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 41.4MB/s]
model-2-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 39.4MB/s]
model-20-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:12<00:00, 38.5MB/s]
model-23-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:11<00:00, 41.5MB/s]
model-24-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:07<00:00, 61.1MB/s]
model-25-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 56.7MB/s]
model-28-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:07<00:00, 61.1MB/s]
model-26-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 54.1MB/s]
model-27-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:08<00:00, 52.8MB/s]
model-3-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:09<00:00, 47.4MB/s]
model-35-of-35.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████| 62.4M/62.4M [00:04<00:00, 15.5MB/s]
model-30-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:19<00:00, 24.0MB/s]
model-29-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:23<00:00, 20.2MB/s]
model-31-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:20<00:00, 22.4MB/s]
model-33-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:21<00:00, 22.2MB/s]
model-34-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:21<00:00, 21.8MB/s]
model-32-of-35.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:21<00:00, 21.6MB/s]
model-4-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:18<00:00, 24.8MB/s]
model-6-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:13<00:00, 34.2MB/s]
model-5-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:17<00:00, 27.4MB/s]
model-7-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:10<00:00, 43.1MB/s]
model-9-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:10<00:00, 46.5MB/s]
model-8-of-35.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 466M/466M [00:10<00:00, 43.8MB/s]
model-36-of-36.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3.62G/3.62G [00:39<00:00, 92.8MB/s]
(Worker_TP0 pid=18859) INFO 03-13 16:17:05 [weight_utils.py:565] Time spent downloading weights for moonshotai/Kimi-Audio-7B-Instruct: 70.285836 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/36 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   3% Completed | 1/36 [00:02<01:21,  2.32s/it]
Loading safetensors checkpoint shards:   6% Completed | 2/36 [00:04<01:10,  2.08s/it]
Loading safetensors checkpoint shards:   8% Completed | 3/36 [00:11<02:31,  4.58s/it]
Loading safetensors checkpoint shards:  11% Completed | 4/36 [00:12<01:35,  2.97s/it]
Loading safetensors checkpoint shards:  14% Completed | 5/36 [00:12<01:02,  2.02s/it]
Loading safetensors checkpoint shards:  17% Completed | 6/36 [00:12<00:43,  1.45s/it]
Loading safetensors checkpoint shards:  19% Completed | 7/36 [00:13<00:31,  1.09s/it]
Loading safetensors checkpoint shards:  22% Completed | 8/36 [00:13<00:23,  1.17it/s]
Loading safetensors checkpoint shards:  25% Completed | 9/36 [00:14<00:18,  1.45it/s]
Loading safetensors checkpoint shards:  28% Completed | 10/36 [00:15<00:27,  1.04s/it]
Loading safetensors checkpoint shards:  31% Completed | 11/36 [00:18<00:35,  1.40s/it]
Loading safetensors checkpoint shards:  33% Completed | 12/36 [00:20<00:39,  1.67s/it]
Loading safetensors checkpoint shards:  36% Completed | 13/36 [00:22<00:39,  1.70s/it]
Loading safetensors checkpoint shards:  39% Completed | 14/36 [00:23<00:38,  1.75s/it]
Loading safetensors checkpoint shards:  42% Completed | 15/36 [00:25<00:37,  1.78s/it]
Loading safetensors checkpoint shards:  44% Completed | 16/36 [00:27<00:35,  1.79s/it]
Loading safetensors checkpoint shards:  47% Completed | 17/36 [00:29<00:34,  1.81s/it]
Loading safetensors checkpoint shards:  50% Completed | 18/36 [00:31<00:32,  1.82s/it]
Loading safetensors checkpoint shards:  53% Completed | 19/36 [00:33<00:31,  1.83s/it]
Loading safetensors checkpoint shards:  56% Completed | 20/36 [00:35<00:29,  1.84s/it]
Loading safetensors checkpoint shards:  58% Completed | 21/36 [00:36<00:27,  1.84s/it]
Loading safetensors checkpoint shards:  61% Completed | 22/36 [00:38<00:25,  1.85s/it]
Loading safetensors checkpoint shards:  64% Completed | 23/36 [00:40<00:23,  1.84s/it]
Loading safetensors checkpoint shards:  67% Completed | 24/36 [00:42<00:22,  1.85s/it]
Loading safetensors checkpoint shards:  69% Completed | 25/36 [00:44<00:20,  1.85s/it]
Loading safetensors checkpoint shards:  72% Completed | 26/36 [00:46<00:18,  1.85s/it]
Loading safetensors checkpoint shards:  75% Completed | 27/36 [00:47<00:16,  1.85s/it]
Loading safetensors checkpoint shards:  78% Completed | 28/36 [00:49<00:14,  1.84s/it]
Loading safetensors checkpoint shards:  92% Completed | 33/36 [00:49<00:01,  1.66it/s]
Loading safetensors checkpoint shards:  97% Completed | 35/36 [00:50<00:00,  2.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 36/36 [00:54<00:00,  1.50s/it]
(Worker_TP0 pid=18859) 
whisper-large-v3/model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [00:22<00:00, 139MB/s]
(Worker_TP1 pid=18860) INFO 03-13 16:18:21 [weight_utils.py:565] Time spent downloading weights for moonshotai/Kimi-Audio-7B-Instruct: 22.538067 seconds
(Worker_TP1 pid=18860) INFO 03-13 16:18:22 [weight_utils.py:609] No model.safetensors.index.json found in remote.
(Worker_TP0 pid=18859) INFO 03-13 16:18:22 [weight_utils.py:609] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.37it/s]
(Worker_TP0 pid=18859) 
(Worker_TP0 pid=18859) INFO 03-13 16:18:23 [default_loader.py:304] Loading weights took 77.74 seconds
(Worker_TP0 pid=18859) INFO 03-13 16:18:23 [gpu_model_runner.py:4584] Model loading took 7.89 GiB memory and 149.276291 seconds
...
Rendering prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.93it/s]
Processed prompts: 100%|██████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.60s/it, est. speed input: 82.65 toks/s, output: 24.60 toks/s]
[EOS]
A man speaks and a telephone rings[EOS][EOS][EOS]A man speaks and a telephone rings[EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS][EOS]

@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 13, 2026
@Isotr0py Isotr0py merged commit a8e8d62 into vllm-project:main Mar 14, 2026
52 checks passed
@Isotr0py Isotr0py deleted the subfolder-loader branch March 14, 2026 15:37
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 15, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>