[bugfix] Prevent overwriting drafter's lm-head and embed_tokens #4134
HF-001 wants to merge 6 commits into vllm-project:main
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to prevent overwriting the lm_head and embed_tokens of a drafter model if they are already initialized, which is particularly important for some EAGLE3 drafters. The changes correctly add checks for has_own_embed_tokens and has_own_lm_head flags before sharing these layers from the target model. However, I've identified a critical regression in the logic for handling lm_head for SpecDcodeType.EAGLE. The check that ensures the target model has an lm_head attribute before it's accessed has been removed, which will likely cause an AttributeError at runtime. My review includes a specific code suggestion to fix this issue.
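For illustration, here is a minimal sketch of the guard pattern the review describes. The `has_own_embed_tokens` / `has_own_lm_head` flag names come from the review above; the function name and exact attribute paths are assumptions for the sketch, not the PR's verbatim diff:

```python
def share_target_layers(draft_model, target_model):
    """Sketch: share target-model layers with the drafter only when the
    drafter did not ship its own trained weights (names illustrative)."""
    if not getattr(draft_model, "has_own_embed_tokens", False):
        # No trained embedding table in the drafter: reuse the target's.
        draft_model.model.embed_tokens = target_model.model.embed_tokens
    if not getattr(draft_model, "has_own_lm_head", False) and hasattr(
            target_model, "lm_head"):
        # No trained LM head in the drafter: reuse the target's.
        draft_model.lm_head = target_model.lm_head
```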
```python
if hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
    if supports_multimodal(model):
        self.model.lm_head = model.get_language_model().lm_head
    else:
        self.model.lm_head = model.lm_head
```
The logic for sharing lm_head for SpecDcodeType.EAGLE has been changed in a way that introduces a potential AttributeError. The assignments to self.model.lm_head are no longer guarded by hasattr(model, "lm_head"). If the target model does not have an lm_head attribute, this will cause a crash. The original logic was safer and should be restored.
Suggested change:

```python
if hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
    if supports_multimodal(model):
        self.model.lm_head = model.get_language_model().lm_head
    else:
        self.model.lm_head = model.lm_head
```
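Restoring the `hasattr` guard makes the sharing step a safe no-op for target models that expose no `lm_head` attribute, so the drafter falls back to its own head instead of crashing during weight loading.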
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Any progress? If this PR is still alive, please rebase onto main and make CI happy. Thanks!
@wangxiyuan I will handle this PR soon
Please rebase and fix the merge conflicts.
No update for a long time, closing this now. Feel free to reopen if it's still needed.
What this PR does / why we need it?
Some EAGLE3 drafters ship their own lm_head and/or embed_tokens layers, which the existing code ignores: it unconditionally overwrites them with the target model's weights. Preserving the drafter's own layers can greatly improve acceptance rates. See the corresponding vLLM PR: vllm-project/vllm#27737
Does this PR introduce any user-facing change?
How was this patch tested?
```bash
export CUDA_VISIBLE_DEVICES=0
export TP=1
export MODEL_PATH=/model/Llama-3.1-8B-Instruct
export MODEL_NAME=Llama-3.1-8B-Instruct
export PORT=10133
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port ${PORT} \
    --dtype bfloat16 \
    --model ${MODEL_PATH} \
    --served-model-name ${MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --trust-remote-code \
    --seed 42 \
    --speculative_config '{"method":"eagle3","model":"/model/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":5,"draft_tensor_parallel_size":1}'
```
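Once the server is up, a quick sanity request confirms that speculative decoding serves completions end to end. This is a minimal check, assuming the port (10133) and served model name from the launch command above, against vLLM's standard OpenAI-compatible completions endpoint:

```python
import requests

# Query the OpenAI-compatible completions endpoint started above.
resp = requests.post(
    "http://localhost:10133/v1/completions",
    json={
        "model": "Llama-3.1-8B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```

Acceptance-rate changes, the metric this PR targets, then have to be read from the server-side speculative-decoding statistics rather than from the response body.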