[Model] use AutoWeightsLoader for DeepSeekV2 by SoluMilken · Pull Request #41706 · vllm-project/vllm

SoluMilken · 2026-05-05T07:41:13Z

Purpose

Part of #15697.
Use AutoWeightsLoader for DeepSeekV2.

Code Change

Move load_weights from DeepseekV2ForCausalLM into DeepseekV2Model (no logic changes)
Replace DeepseekV2ForCausalLM.load_weights with AutoWeightsLoader(self)
Add use_mha and num_redundant_experts attributes to DeepseekV2Model.__init__ (required by the moved load_weights)
Update get_spec_layer_idx_from_weight_name to also match weight names without the model. prefix (AutoWeightsLoader strips it before delegating)
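
For illustration, the prefix handling mentioned in the last item can be sketched with a toy helper. This is not vLLM's AutoWeightsLoader: strip_top_level_prefix and the sample checkpoint entries are hypothetical, and the sketch only shows why a submodule's load_weights would see names without the leading model. prefix.

```python
from typing import Iterable, Iterator, Tuple


def strip_top_level_prefix(
    weights: Iterable[Tuple[str, object]], prefix: str = "model."
) -> Iterator[Tuple[str, object]]:
    """Yield the weights that belong to one child module, prefix removed.

    Hypothetical helper: a stand-in for prefix-based delegation,
    not the actual loader implementation.
    """
    for name, tensor in weights:
        if name.startswith(prefix):
            yield name[len(prefix):], tensor


# Placeholder strings stand in for real tensors.
checkpoint = [
    ("model.layers.61.mlp.experts.0.down_proj.weight", "tensor-a"),
    ("lm_head.weight", "tensor-b"),
]

child_weights = list(strip_top_level_prefix(checkpoint))
print(child_weights)
```

Any name-based check inside the child module therefore has to accept the stripped form of the name.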

Test Plan

from vllm import LLM, SamplingParams
llm = LLM(model='deepseek-ai/DeepSeek-V2-Lite',
           trust_remote_code=True,
           max_model_len=256,
           tensor_parallel_size=2,
           enforce_eager=True,
           gpu_memory_utilization=0.95)
output = llm.generate(
    ['The capital of France is'],
    SamplingParams(max_tokens=20, temperature=0))
print(output[0].outputs[0].text)

Test Result

 Paris.
The currency of France is the Euro.
The languages spoken in France are French and

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

gemini-code-assist

Code Review

This pull request refactors the DeepSeek-V2 model implementation by integrating the AutoWeightsLoader and reorganizing the model classes. It also updates the logic for identifying speculative layer indices and calculates model-specific parameters like use_mha and num_redundant_experts during initialization. Feedback indicates that the current method for determining num_redundant_experts is fragile in pipeline parallel environments and should be replaced with a direct reference to the global configuration.

jeejeelee · 2026-05-05T08:03:19Z

IIRC, we added AutoWeightsLoader support for DSv2, but it had a bug, so we reverted it.

SoluMilken · 2026-05-05T14:37:58Z

Hi @jeejeelee,

Thanks for the context! It looks like the previous bug you mentioned was tracked in #16450 (the KeyError during Pipeline Parallelism).

I will explicitly test for this issue (e.g., running TP=1, PP=2) and make sure it is fully resolved before marking this PR as "Ready for review".

Besides this PP issue, are there any other specific edge cases or areas you would recommend I pay attention to?

Thanks.

wenyili · 2026-05-07T04:41:52Z

I was digging into issue #16450 and traced the root cause, so sharing the analysis here in case it helps reviewers.

The bug is in get_spec_layer_idx_from_weight_name:

if weight_name.startswith(f"model.layers.{layer_idx + i}."):
    return layer_idx + i

This check hardcodes the model. prefix. When load_weights lived in DeepseekV2ForCausalLM it worked fine, because the raw checkpoint weights still carry the full prefix (e.g. model.layers.61.mlp.experts.0.down_proj.weight).

After this refactor, AutoWeightsLoader strips the top-level model. prefix before delegating to DeepseekV2Model.load_weights, so the incoming name becomes layers.61.mlp.experts.0.down_proj.weight. The spec-layer check never matches, the MTP weight is not skipped, and the following happens:

The weight falls through to expert_params_mapping, which transforms it to layers.61.mlp.experts.w2_weight.
is_pp_missing_parameter returns False because layer 61 is the MTP layer and is genuinely absent from DeepseekV2Model.layers (the ModuleList only covers 0..num_hidden_layers-1). is_pp_missing_parameter only detects PPMissingLayer instances, not layers that simply do not exist in the list.
params_dict["layers.61.mlp.experts.w2_weight"] raises KeyError.
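
The cascade above can be reproduced with a toy model of the lookup. Everything here is hypothetical (PlaceholderLayer, is_under_placeholder, load_one, and the two-layer setup); it mirrors the reasoning in the steps rather than the real vLLM helpers.

```python
class PlaceholderLayer:
    """Hypothetical stand-in for a layer owned by another pipeline stage."""


# Modules registered on this rank: layer 0 is real, layer 1 is a placeholder.
modules = {"layers.0": object(), "layers.1": PlaceholderLayer()}
# Only real layers contribute parameters.
params_dict = {"layers.0.mlp.experts.w2_weight": "tensor"}


def is_under_placeholder(name: str) -> bool:
    """True only when the name belongs to a registered placeholder module."""
    return any(
        isinstance(module, PlaceholderLayer) and name.startswith(prefix + ".")
        for prefix, module in modules.items()
    )


def load_one(name: str) -> str:
    if is_under_placeholder(name):
        return "skipped"
    # A name for a layer that was never registered reaches this lookup.
    return params_dict[name]


print(load_one("layers.1.mlp.experts.w2_weight"))  # placeholder layer: skipped
try:
    load_one("layers.2.mlp.experts.w2_weight")  # layer not in the list at all
except KeyError as exc:
    print("KeyError:", exc)
```

In this sketch the placeholder check only helps for layers that exist as placeholders; a name for a layer outside the list goes straight to the dictionary lookup.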

The bug isn't PP-specific — it would hit any configuration loading a model with num_nextn_predict_layers > 0 (V3/R1). PP just happened to be required in the original report due to memory constraints.

The fix in this PR (adding or weight_name.startswith(f"layers.{layer_idx + i}.")) is the right call for exactly this reason.
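
A minimal standalone sketch of the check with both prefixes accepted, assuming the function takes (config, weight_name) and that config exposes num_hidden_layers and num_nextn_predict_layers. The signature and the SimpleNamespace config are assumptions made for illustration; the merged code in deepseek_v2.py may differ in detail.

```python
from types import SimpleNamespace
from typing import Optional


def get_spec_layer_idx_from_weight_name(config, weight_name: str) -> Optional[int]:
    """Return the index of the MTP layer a weight belongs to, else None."""
    num_mtp_layers = getattr(config, "num_nextn_predict_layers", 0) or 0
    layer_idx = config.num_hidden_layers  # MTP layers follow the regular ones
    for i in range(num_mtp_layers):
        idx = layer_idx + i
        # Accept both the raw checkpoint name and the stripped name.
        if weight_name.startswith((f"model.layers.{idx}.", f"layers.{idx}.")):
            return idx
    return None


# Assumed example values: 61 regular layers plus one MTP layer.
config = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)

print(get_spec_layer_idx_from_weight_name(
    config, "model.layers.61.mlp.experts.0.down_proj.weight"))
print(get_spec_layer_idx_from_weight_name(
    config, "layers.61.mlp.experts.0.down_proj.weight"))
print(get_spec_layer_idx_from_weight_name(
    config, "layers.60.mlp.experts.0.down_proj.weight"))
```

The trailing dot in each prefix keeps a name such as layers.6. from matching layer 61.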

SoluMilken · 2026-05-08T13:43:49Z

Test Plan

1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)

from vllm import LLM, SamplingParams
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    max_model_len=256,
    tensor_parallel_size=1,
    pipeline_parallel_size=2,
    distributed_executor_backend="mp",
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)
output = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=20, temperature=0),
)
print(output[0].outputs[0].text)

2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)

from vllm import LLM, SamplingParams
llm = LLM(model='ZixiQi/DeepSeek-V3-4layers-MTP-FP8',
           trust_remote_code=True,
           max_model_len=256,
           enforce_eager=True)
output = llm.generate(
    ['The capital of France is'],
    SamplingParams(max_tokens=20, temperature=0))
print(output[0].outputs[0].text)

Testing Result

1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)

Paris.
The currency of France is the Euro.
The languages spoken in France are French and

✅ No crash, correct result.

2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)

random result

 Cisyo-Christianity collateral特别想知道 balancesheetottage plausibly Danaenkoissancecroftoksleta和社会保障 tact

✅ No KeyError during weight loading — confirms the fix for #16450.

SoluMilken · 2026-05-08T13:48:47Z

Thank you @wenyili for the thorough root cause analysis! Your explanation of how AutoWeightsLoader strips the "model." prefix before delegation, and how the mismatch cascades into a KeyError, really clarified the issue. I've updated the fix accordingly.

SoluMilken · 2026-05-08T13:58:07Z

Hi @DarkLight1337 @jeejeelee @aaron-ang @wenyili,

Tests are done — updating the results here before marking as ready for review.

The root cause was identified by @wenyili — get_spec_layer_idx_from_weight_name hardcodes the model. prefix, which AutoWeightsLoader strips before delegating. The fix adds or weight_name.startswith(f"layers.{layer_idx + i}.") to handle both cases.

Test results: #41706 (comment)

PTAL when you have time. Happy to address any feedback.

SoluMilken · 2026-05-08T16:14:15Z

Some CI checks failed. Let me investigate which failures are related to this PR.

SoluMilken · 2026-05-09T04:47:21Z

I've just rebased onto the latest main branch. Let's see whether all CI checks pass.

SoluMilken · 2026-05-09T07:53:47Z

Hi @DarkLight1337,

Thanks for adding the ready label. I checked the current CI failures and they look unrelated to this PR.

This PR only changes vllm/model_executor/models/deepseek_v2.py. The failures I checked are in unrelated areas: Qwen config loading in test_qk_norm_rope_fusion, Mamba prefix-cache CUDA capture, and Qwen/DFlash initialization tests. DeepSeekV2 initialization passed in the initialization shard.

It looks like the failing CI is already being addressed by other PRs, so I’ll leave this as a note for context. When you have time, could you also help take a look at this PR?

Thanks.

DarkLight1337 · 2026-05-09T08:12:47Z

Just to be sure, let's wait for the tests to pass on the main branch before rebasing this PR again.

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

SoluMilken · 2026-05-09T17:53:23Z

@DarkLight1337 CI is all green now! Ready for your review and merge.

SoluMilken · 2026-05-09T18:02:47Z

Thanks for the review and merge @DarkLight1337. Also big thanks to @jeejeelee and @wenyili for the historical context on the PP issue, that was super helpful to know beforehand!

### What this PR does / why we need it?

1. Fix vllm-project/vllm#33322: overwrite `gpu_modelrunner.sync_and_gather_intermediate_tensors`; for the `pp+sp+tp` scenario, skip scattering the residual on Ascend.
2. vllm-project/vllm#35520: adapt to the `ModelRunner v2` changes for hybrid attention at the interface level. Todo: add Mamba support to ModelRunner on Ascend; pull requests are welcome.
3. vllm-project/vllm#40711
4. vllm-project/vllm#42121
5. vllm-project/vllm#41706
6. vllm-project/vllm#39917: disable `async_schedule` when `enable_return_routed_experts=True`.
7. vllm-project/vllm#41046
8. vllm-project/vllm#41055
9. vllm-project/vllm#41035
10. vllm-project/vllm#42434

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.1
- vLLM main: vllm-project/vllm@c7aa186

---------

Signed-off-by: wangli <wangli858794774@gmail.com>

mergify Bot added the deepseek Related to DeepSeek models label May 5, 2026

gemini-code-assist Bot reviewed May 5, 2026

Comment thread vllm/model_executor/models/deepseek_v2.py Outdated

wenyili mentioned this pull request May 7, 2026

[Bug]: DeepSeek-R1 KeyError: 'layers.61.mlp.experts.w2_weight' #16450

Closed

1 task

SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from f532e80 to 6106d15 Compare May 8, 2026 13:27

SoluMilken marked this pull request as ready for review May 8, 2026 13:30

claude Bot reviewed May 8, 2026

SoluMilken marked this pull request as draft May 8, 2026 13:40

SoluMilken marked this pull request as ready for review May 8, 2026 13:46

DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026

SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from fe1b20e to 15acc09 Compare May 9, 2026 04:47

SoluMilken added 2 commits May 9, 2026 23:42

use AutoWeightsLoader for DeepSeekV2

ee411cd

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

update due to gemini's comments

53cd5b3

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from 15acc09 to 53cd5b3 Compare May 9, 2026 15:42

DarkLight1337 approved these changes May 9, 2026

DarkLight1337 merged commit cd74911 into vllm-project:main May 9, 2026
61 checks passed

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request May 11, 2026

[Model] use AutoWeightsLoader for DeepSeekV2 (vllm-project#41706)

f2416d5

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

Potabk mentioned this pull request May 13, 2026

[CI] Upgrade vllm commit to 0512 vllm-project/vllm-ascend#9054

Closed

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[Model] use AutoWeightsLoader for DeepSeekV2 (vllm-project#41706)

addff67

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

This was referenced May 14, 2026

[CI] Main2main 0513 vllm-project/vllm-ascend#9137

Closed

[CI] Main2main 0514 vllm-project/vllm-ascend#9155

Merged

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Model] use AutoWeightsLoader for DeepSeekV2 (vllm-project#41706)

6f6ea81

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Model] use AutoWeightsLoader for DeepSeekV2 (vllm-project#41706)

a460d6a

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

4 participants