
[Model] use AutoWeightsLoader for DeepSeekV2 #41706

Merged
DarkLight1337 merged 2 commits into vllm-project:main from SoluMilken:feature/use-autoweightloader-for-deepseek-2
May 9, 2026

Conversation

@SoluMilken
Contributor

@SoluMilken SoluMilken commented May 5, 2026

Purpose

Part of #15697.
Use AutoWeightsLoader for DeepSeekV2.

Code Change

  • Move load_weights from DeepseekV2ForCausalLM into DeepseekV2Model (no logic changes)
  • Replace DeepseekV2ForCausalLM.load_weights with AutoWeightsLoader(self)
  • Add use_mha and num_redundant_experts attributes to DeepseekV2Model.__init__ (required by the moved load_weights)
  • Update get_spec_layer_idx_from_weight_name to also match weight names without the model. prefix (AutoWeightsLoader strips it before delegating)
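
The prefix handling in the last bullet can be illustrated with a toy model of the delegation. This is only a sketch: `ToyLoader` and `ToyInnerModel` are hypothetical stand-ins, not vLLM's `AutoWeightsLoader` or `DeepseekV2Model`. The only behavior modeled is the one described above, namely that the top-level `model.` prefix is stripped before the child's `load_weights` sees the name.

```python
from typing import Iterable, Tuple

Weight = Tuple[str, object]  # (name, tensor); the tensor is opaque in this sketch


class ToyInnerModel:
    """Hypothetical stand-in for the inner model that owns load_weights."""

    def __init__(self) -> None:
        self.seen_names: list[str] = []

    def load_weights(self, weights: Iterable[Weight]) -> set[str]:
        loaded: set[str] = set()
        for name, _tensor in weights:
            # Names arrive here WITHOUT the "model." prefix.
            self.seen_names.append(name)
            loaded.add(name)
        return loaded


class ToyLoader:
    """Hypothetical loader that strips a child's prefix before delegating."""

    def __init__(self, children: dict[str, ToyInnerModel]) -> None:
        self.children = children

    def load_weights(self, weights: Iterable[Weight]) -> set[str]:
        weights = list(weights)
        loaded: set[str] = set()
        for prefix, child in self.children.items():
            dotted = prefix + "."
            stripped = [
                (name[len(dotted):], tensor)
                for name, tensor in weights
                if name.startswith(dotted)
            ]
            # Re-attach the prefix so the caller sees full checkpoint names.
            loaded |= {dotted + name for name in child.load_weights(stripped)}
        return loaded


inner = ToyInnerModel()
loader = ToyLoader({"model": inner})
loaded = loader.load_weights([("model.layers.0.mlp.gate.weight", None)])
print(inner.seen_names)
print(sorted(loaded))
```

Any name-based check that runs inside the inner model therefore has to accept names that start with `layers.` rather than `model.layers.`.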

Test Plan

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    max_model_len=256,
    tensor_parallel_size=2,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)
output = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=20, temperature=0),
)
print(output[0].outputs[0].text)

Test Result

 Paris.
The currency of France is the Euro.
The languages spoken in France are French and

@mergify mergify Bot added the deepseek Related to DeepSeek models label May 5, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the DeepSeek-V2 model implementation by integrating the AutoWeightsLoader and reorganizing the model classes. It also updates the logic for identifying speculative layer indices and calculates model-specific parameters like use_mha and num_redundant_experts during initialization. Feedback indicates that the current method for determining num_redundant_experts is fragile in pipeline parallel environments and should be replaced with a direct reference to the global configuration.

Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
@jeejeelee
Member

IIRC, we added AutoWeightsLoader support for DSv2, but it had a bug, so we reverted it.

@SoluMilken
Contributor Author

Hi @jeejeelee ,

Thanks for the context! It looks like the previous bug you mentioned was tracked in #16450 (the KeyError during Pipeline Parallelism).

I will explicitly test for this issue (e.g., running TP=1, PP=2) and make sure it is fully resolved before marking this PR as "Ready for review".

Besides this PP issue, are there any other edge cases or areas you would recommend I pay attention to?

Thanks.

@wenyili
Contributor

wenyili commented May 7, 2026

I was digging into issue #16450 and traced the root cause, so sharing the analysis here in case it helps reviewers.

The bug is in get_spec_layer_idx_from_weight_name:

if weight_name.startswith(f"model.layers.{layer_idx + i}."):
    return layer_idx + i

This check hardcodes the model. prefix. When load_weights lived in DeepseekV2ForCausalLM it worked fine, because the raw checkpoint weights still carry the full prefix (e.g. model.layers.61.mlp.experts.0.down_proj.weight).

After this refactor, AutoWeightsLoader strips the top-level model. prefix before delegating to DeepseekV2Model.load_weights, so the incoming name becomes layers.61.mlp.experts.0.down_proj.weight. The spec-layer check never matches, the MTP weight is not skipped, and the following happens:

  1. The weight falls through to expert_params_mapping, which transforms it to layers.61.mlp.experts.w2_weight.
  2. is_pp_missing_parameter returns False — layer 61 is the MTP layer and is genuinely absent from DeepseekV2Model.layers (the ModuleList only covers 0..num_hidden_layers-1). is_pp_missing_parameter can only detect PPMissingLayer instances, not layers that simply don't exist in the list.
  3. params_dict["layers.61.mlp.experts.w2_weight"] raises KeyError.
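
The three steps above can be reproduced in isolation. The following is a minimal sketch, not the vLLM code path: `old_spec_layer_idx` is a hypothetical re-statement of the pre-fix check, the config values (61 decoder layers plus one MTP layer) are assumed V3-like numbers chosen to match the layer index in the example, and the expert mapping is reduced to a single string replacement.

```python
from types import SimpleNamespace


def old_spec_layer_idx(config, weight_name):
    """Pre-fix check: only recognizes names that carry the "model." prefix."""
    n_mtp = getattr(config, "num_nextn_predict_layers", 0)
    for i in range(n_mtp):
        idx = config.num_hidden_layers + i
        if weight_name.startswith(f"model.layers.{idx}."):
            return idx
    return None


# Assumed V3-like shape: decoder layers 0..60 plus one MTP layer at index 61.
config = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)

# The inner model only registers parameters for layers 0..60.
params_dict = {"layers.60.mlp.experts.w2_weight": object()}

full_name = "model.layers.61.mlp.experts.0.down_proj.weight"
stripped_name = full_name[len("model."):]

# With the full prefix the MTP weight is recognized and would be skipped.
print(old_spec_layer_idx(config, full_name))
# With the prefix stripped the check misses, so the weight is not skipped.
print(old_spec_layer_idx(config, stripped_name))

# Step 1: the expert mapping rewrites the name (simplified here).
mapped = stripped_name.replace("experts.0.down_proj.weight", "experts.w2_weight")

# Step 3: the lookup fails because layer 61 was never part of the module list.
try:
    params_dict[mapped]
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```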

The bug isn't PP-specific — it would hit any configuration loading a model with num_nextn_predict_layers > 0 (V3/R1). PP just happened to be required in the original report due to memory constraints.

The fix in this PR (adding or weight_name.startswith(f"layers.{layer_idx + i}.")) is the right call for exactly this reason.
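
For reference, a check that accepts both name forms could look roughly like the sketch below. It is written from the snippets quoted in this thread, so the function name `spec_layer_idx`, its `(config, weight_name)` signature, and the `SimpleNamespace` config are assumptions made for illustration, not the merged code.

```python
from types import SimpleNamespace
from typing import Optional


def spec_layer_idx(config, weight_name: str) -> Optional[int]:
    """Return the MTP layer index a weight belongs to, or None.

    Accepts names with the "model." prefix (raw checkpoint names) and
    without it (names seen after the prefix has been stripped).
    """
    n_mtp = getattr(config, "num_nextn_predict_layers", 0) or 0
    layer_idx = config.num_hidden_layers
    for i in range(n_mtp):
        if weight_name.startswith(
            f"model.layers.{layer_idx + i}."
        ) or weight_name.startswith(f"layers.{layer_idx + i}."):
            return layer_idx + i
    return None


# Assumed V3-like shape: decoder layers 0..60 plus one MTP layer at index 61.
mtp_config = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)
# Assumed config without MTP layers (the attribute is simply absent).
plain_config = SimpleNamespace(num_hidden_layers=27)

print(spec_layer_idx(mtp_config, "model.layers.61.mlp.experts.0.down_proj.weight"))
print(spec_layer_idx(mtp_config, "layers.61.mlp.experts.0.down_proj.weight"))
print(spec_layer_idx(mtp_config, "layers.6.self_attn.o_proj.weight"))
print(spec_layer_idx(plain_config, "layers.26.mlp.gate.weight"))
```

The trailing dot in each prefix matters: without it, `layers.6.` style names for ordinary decoder layers could be confused with `layers.61.`.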

@SoluMilken SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from f532e80 to 6106d15 Compare May 8, 2026 13:27
@SoluMilken SoluMilken marked this pull request as ready for review May 8, 2026 13:30

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@SoluMilken SoluMilken marked this pull request as draft May 8, 2026 13:40
@SoluMilken
Contributor Author

SoluMilken commented May 8, 2026

Test Plan

1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)

from vllm import LLM, SamplingParams
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    max_model_len=256,
    tensor_parallel_size=1,
    pipeline_parallel_size=2,
    distributed_executor_backend="mp",
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)
output = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=20, temperature=0),
)
print(output[0].outputs[0].text)

2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)

from vllm import LLM, SamplingParams

llm = LLM(
    model="ZixiQi/DeepSeek-V3-4layers-MTP-FP8",
    trust_remote_code=True,
    max_model_len=256,
    enforce_eager=True,
)
output = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=20, temperature=0),
)
print(output[0].outputs[0].text)

Testing Result

1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)

Paris.
The currency of France is the Euro.
The languages spoken in France are French and

✅ No crash, correct result.


2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)

random result

 Cisyo-Christianity collateral特别想知道 balancesheetottage plausibly Danaenkoissancecroftoksleta和社会保障 tact

✅ No KeyError during weight loading — confirms the fix for #16450.


@SoluMilken SoluMilken marked this pull request as ready for review May 8, 2026 13:46
@SoluMilken
Contributor Author

Thank you @wenyili for the thorough root cause analysis! Your explanation of how AutoWeightsLoader strips the "model." prefix before delegation, and how the mismatch cascades into a KeyError, really clarified the issue. I've updated the fix accordingly.

@SoluMilken
Contributor Author

Hi @DarkLight1337 @jeejeelee @aaron-ang @wenyili,

Tests are done — updating the results here before marking as ready for review.

The root cause was identified by @wenyili: get_spec_layer_idx_from_weight_name hardcodes the model. prefix, which AutoWeightsLoader strips before delegating. The fix adds or weight_name.startswith(f"layers.{layer_idx + i}.") to handle both cases.

Test results: #41706 (comment)

PTAL when you have time. Happy to address any feedback.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026
@SoluMilken
Contributor Author

Some CI checks failed. Let me investigate which failures are related to this PR.

@SoluMilken SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from fe1b20e to 15acc09 Compare May 9, 2026 04:47
@SoluMilken
Contributor Author

I've just rebased onto the latest main branch. Let's see whether all CI checks pass.

@SoluMilken
Contributor Author

Hi @DarkLight1337,

Thanks for adding the ready label. I checked the current CI failures and they look unrelated to this PR.

This PR only changes vllm/model_executor/models/deepseek_v2.py. The failures I checked are in unrelated areas: Qwen config loading in test_qk_norm_rope_fusion, Mamba prefix-cache CUDA capture, and Qwen/DFlash initialization tests. DeepSeekV2 initialization passed in the initialization shard.

It looks like the failing CI is already being addressed by other PRs, so I’ll leave this as a note for context. When you have time, could you also help take a look at this PR?

Thanks.

@DarkLight1337
Member

Just to be sure, let's wait for the tests to pass on the main branch before rebasing this PR again.

SoluMilken added 2 commits May 9, 2026 23:42
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
@SoluMilken SoluMilken force-pushed the feature/use-autoweightloader-for-deepseek-2 branch from 15acc09 to 53cd5b3 Compare May 9, 2026 15:42
@SoluMilken
Contributor Author

@DarkLight1337 CI is all green now! Ready for your review and merge.

@DarkLight1337 DarkLight1337 merged commit cd74911 into vllm-project:main May 9, 2026
61 checks passed
@SoluMilken
Contributor Author

Thanks for the review and merge @DarkLight1337. Also big thanks to @jeejeelee and @wenyili for the historical context on the PP issue, that was super helpful to know beforehand!

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request May 11, 2026
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request May 14, 2026
### What this PR does / why we need it?
1. fix vllm-project/vllm#33322
overwrite `gpu_modelrunner.sync_and_gather_intermediate_tensors`; for
the `pp+sp+tp` scenario, skip scattering the residual on Ascend

2. vllm-project/vllm#35520
Adapted to the `ModelRunner v2` modifications for hybrid attention at
the interface level.
Todo: add support for Mamba in ModelRunner on Ascend. Pull requests
are welcome

3. vllm-project/vllm#40711

4. vllm-project/vllm#42121

5. vllm-project/vllm#41706

6. vllm-project/vllm#39917
Disable `async_schedule` when `enable_return_routed_experts=True`
7. vllm-project/vllm#41046
8. vllm-project/vllm#41055
9. vllm-project/vllm#41035
10. vllm-project/vllm#42434
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.1
- vLLM main:
vllm-project/vllm@c7aa186

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Tian-Fantasea pushed a commit to Tian-Fantasea/vllm-ascend that referenced this pull request May 19, 2026
shaopeng-666 pushed a commit to shaopeng-666/vllm-ascend that referenced this pull request May 19, 2026
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>

Labels

deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed


4 participants