[Model] use AutoWeightsLoader for DeepSeekV2#41706
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the DeepSeek-V2 model implementation by integrating the AutoWeightsLoader and reorganizing the model classes. It also updates the logic for identifying speculative layer indices and calculates model-specific parameters like use_mha and num_redundant_experts during initialization. Feedback indicates that the current method for determining num_redundant_experts is fragile in pipeline parallel environments and should be replaced with a direct reference to the global configuration.
|
IIRC, we added AutoWeightsLoader support for DSv2, but it had a bug, so we reverted it. |
|
Hi @jeejeelee , Thanks for the context! It looks like the previous bug you mentioned was tracked in #16450 (the I will explicitly test for this issue (e.g., running Besides this PP issue, are there any other specific edge cases or areas you would recommend me to pay attention to? Thanks. |
|
I was digging into issue #16450 and traced the root cause, so sharing the analysis here in case it helps reviewers. The bug is in if weight_name.startswith(f"model.layers.{layer_idx + i}."):
return layer_idx + iThis check hardcodes the After this refactor,
The bug isn't PP-specific — it would hit any configuration loading a model with The fix in this PR (adding |
f532e80 to
6106d15
Compare
Test Plan1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)from vllm import LLM, SamplingParams
llm = LLM(
model="deepseek-ai/DeepSeek-V2-Lite",
trust_remote_code=True,
max_model_len=256,
tensor_parallel_size=1,
pipeline_parallel_size=2,
distributed_executor_backend="mp",
enforce_eager=True,
gpu_memory_utilization=0.95,
)
output = llm.generate(
["The capital of France is"],
SamplingParams(max_tokens=20, temperature=0),
)
print(output[0].outputs[0].text)2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)from vllm import LLM, SamplingParams
llm = LLM(model='ZixiQi/DeepSeek-V3-4layers-MTP-FP8',
trust_remote_code=True,
max_model_len=256,
enforce_eager=True)
output = llm.generate(
['The capital of France is'],
SamplingParams(max_tokens=20, temperature=0))
print(output[0].outputs[0].text)Testing Result1. DeepSeek-V2-Lite, TP=1, PP=2 (no MTP)Paris.
The currency of France is the Euro.
The languages spoken in France are French and✅ No crash, correct result.
2. ZixiQi/DeepSeek-V3-4layers-MTP-FP8 (with MTP layers)random result Cisyo-Christianity collateral特别想知道 balancesheetottage plausibly Danaenkoissancecroftoksleta和社会保障 tact✅ No KeyError during weight loading — confirms the fix for #16450.
|
|
Thank you @wenyili for the thorough root cause analysis! Your explanation of how AutoWeightsLoader strips the "model." prefix before delegation, and how the mismatch cascades into a KeyError, really clarified the issue. I've updated the fix accordingly. |
|
Hi @DarkLight1337 @jeejeelee @aaron-ang @wenyili, Tests are done — updating the results here before marking as ready for review. The root cause was identified by @wenyili — Test results: #41706 (comment) PTAL when you have time. Happy to address any feedback. |
|
Some CI checks failed. Let me investigate which failures are related to this PR. |
fe1b20e to
15acc09
Compare
|
I've just rebased onto the latest main branch. Let's see whether all CI checks pass. |
|
Hi @DarkLight1337, Thanks for adding the ready label. I checked the current CI failures and they look unrelated to this PR. This PR only changes It looks like the failing CI is already being addressed by other PRs, so I’ll leave this as a note for context. When you have time, could you also help take a look at this PR? Thanks. |
|
Just to be sure, let's wait for the tests to pass on main branch before rebasing this PR again |
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
15acc09 to
53cd5b3
Compare
|
@DarkLight1337 CI is all green now! Ready for your review and merge. |
|
Thanks for the review and merge @DarkLight1337. Also big thanks to @jeejeelee and @wenyili for the historical context on the PP issue, that was super helpful to know beforehand! |
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
### What this PR does / why we need it? 1. fix vllm-project/vllm#33322 overwrite `gpu_modelrunner.sync_and_gather_intermediate_tensors`, for the sceniro `pp+sp+tp`, skip scatter the residual for ascend 2. vllm-project/vllm#35520 Adapted to the modifications of `ModelRunner v2` for hybrid attn in interface level, . Todo: Added support for Mamba in ModelRunner in Ascend. any pull_request is welcome 3. vllm-project/vllm#40711 4. vllm-project/vllm#42121 5. vllm-project/vllm#41706 6. vllm-project/vllm#39917 Disable `async_schedule` when `enable_return_routed_experts=True` 7. vllm-project/vllm#41046 8. vllm-project/vllm#41055 9. vllm-project/vllm#41035 10. vllm-project/vllm#42434 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.1 - vLLM main: vllm-project/vllm@c7aa186 --------- Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it? 1. fix vllm-project/vllm#33322 overwrite `gpu_modelrunner.sync_and_gather_intermediate_tensors`, for the sceniro `pp+sp+tp`, skip scatter the residual for ascend 2. vllm-project/vllm#35520 Adapted to the modifications of `ModelRunner v2` for hybrid attn in interface level, . Todo: Added support for Mamba in ModelRunner in Ascend. any pull_request is welcome 3. vllm-project/vllm#40711 4. vllm-project/vllm#42121 5. vllm-project/vllm#41706 6. vllm-project/vllm#39917 Disable `async_schedule` when `enable_return_routed_experts=True` 7. vllm-project/vllm#41046 8. vllm-project/vllm#41055 9. vllm-project/vllm#41035 10. vllm-project/vllm#42434 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.1 - vLLM main: vllm-project/vllm@c7aa186 --------- Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it? 1. fix vllm-project/vllm#33322 overwrite `gpu_modelrunner.sync_and_gather_intermediate_tensors`, for the sceniro `pp+sp+tp`, skip scatter the residual for ascend 2. vllm-project/vllm#35520 Adapted to the modifications of `ModelRunner v2` for hybrid attn in interface level, . Todo: Added support for Mamba in ModelRunner in Ascend. any pull_request is welcome 3. vllm-project/vllm#40711 4. vllm-project/vllm#42121 5. vllm-project/vllm#41706 6. vllm-project/vllm#39917 Disable `async_schedule` when `enable_return_routed_experts=True` 7. vllm-project/vllm#41046 8. vllm-project/vllm#41055 9. vllm-project/vllm#41035 10. vllm-project/vllm#42434 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.1 - vLLM main: vllm-project/vllm@c7aa186 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>



Purpose
Part of #15697.
Use AutoWeightsLoader for DeepSeekV2.
Code Change
DeepseekV2ForCausalLMintoDeepseekV2Model(no logic changes)DeepseekV2ForCausalLM.load_weightswithAutoWeightsLoader(self)use_mhaandnum_redundant_expertsattributes toDeepseekV2Model.__init__(required by the moved load_weights)get_spec_layer_idx_from_weight_nameto also match weight names without model. prefix (since AutoWeightsLoader strips it before delegating)Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.