[main][Bugfix] Fixed a problem related to embeddings sharing #5967
yiz-liu merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses an issue with embedding sharing in speculative decoding. The change correctly prevents unconditional sharing for non-MTP methods. However, the new condition for sharing, `self.vllm_config.model_config.is_deepseek_mla and torch.equal(...)`, appears overly restrictive. It may prevent a valid memory optimization by not sharing embeddings on non-DeepSeek models even when their weights are identical. I have suggested removing the `is_deepseek_mla` check to align with the principle of sharing embeddings whenever they are equal, which seems to be the core intent of this bugfix.
```python
if self.vllm_config.model_config.is_deepseek_mla and \
        torch.equal(self.model.model.embed_tokens.weight,
                    model.model.embed_tokens.weight):
```
The condition to share embeddings seems too restrictive: by including `self.vllm_config.model_config.is_deepseek_mla`, you prevent embedding sharing for non-DeepSeek models even when their embedding weights are identical. This could be a missed memory optimization.
Based on the PR description, the goal is to share embeddings if and only if they are identical. To achieve that, the `is_deepseek_mla` check should probably be removed.
```python
if torch.equal(self.model.model.embed_tokens.weight,
               model.model.embed_tokens.weight):
```
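The reviewer's share-if-equal behavior can be sketched in isolation. This is a minimal illustration only: plain nested lists stand in for the `embed_tokens.weight` tensors, and `maybe_share_embeddings` is a hypothetical helper, not a vLLM API; the real code compares the tensors with `torch.equal`.

```python
def maybe_share_embeddings(main_weight, draft_weight):
    """Return the weight table the draft model should use, plus a flag.

    Shares the main model's table only when the two tables are
    element-wise identical (the list `==` here mirrors `torch.equal`).
    """
    if main_weight == draft_weight:
        # Identical contents: aliasing the main table is lossless and
        # saves one copy of the embedding weights.
        return main_weight, True
    # Different contents (e.g. a separately trained EAGLE head):
    # sharing would silently change the draft model's outputs.
    return draft_weight, False

main = [[0.1, 0.2], [0.3, 0.4]]
draft_same = [[0.1, 0.2], [0.3, 0.4]]
weight, shared = maybe_share_embeddings(main, draft_same)
assert shared and weight is main

draft_diff = [[0.9, 0.2], [0.3, 0.4]]
weight, shared = maybe_share_embeddings(main, draft_diff)
assert not shared and weight is draft_diff
```

Dropping the `is_deepseek_mla` clause, as suggested, makes the decision depend only on this equality check, so any model family benefits from the optimization when its draft and main embeddings match.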
This pull request has conflicts; please resolve them before we can evaluate it.
Signed-off-by: drslark <slarksblood@qq.com>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (24 commits)
  - add dispath_ffn_combine_bf16 (vllm-project#5866)
  - [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode [RFC: issue 5476] (vllm-project#5932)
  - [1/N][Feat] Xlite Qwen3 MoE Support (vllm-project#5951)
  - [Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5945)
  - [bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py (vllm-project#5132)
  - [Bugfix] fix pcp qwen full graph FIA bug (vllm-project#6037)
  - [Bugfix] Fixed precision issues caused by pooled request pooling (vllm-project#6049)
  - 【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (vllm-project#6045)
  - [main][Bugfix] Fixed a problem related to embeddings sharing (vllm-project#5967)
  - [Feature] refactor the npugraph_ex config, support online-infer with static kernel (vllm-project#5775)
  - [CI][Lint] Show lint diff on failure (vllm-project#5956)
  - [CI] Add wait logic for each individual case (vllm-project#6036)
  - [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (vllm-project#4633)
  - model runner v2 support triton of penalty (vllm-project#5854)
  - [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (vllm-project#6034)
  - [Tests] move qwen3 performance test from nightly to e2e (vllm-project#5980)
  - [Bugfix] fix bug of pcp+mtp+async scheduler (vllm-project#5994)
  - [Main2Main] Upgrade vllm commit to releases/v0.14.0 (vllm-project#5988)
  - [Ops] Add layernorm for qwen3Next (vllm-project#5765)
  - [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (vllm-project#5921)
  - ...
…3B-EAGLE3 (#6138)

### What this PR does / why we need it?
Due to the long-term lack of synchronization with the upstream code, a bug that reduced the acceptance rate of the Qwen3-30B-A3B-EAGLE3 draft model was introduced while fixing #5967. This PR synchronizes with upstream and fixes that bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "AngelSlim/Qwen3-a3B_eagle3",
            "num_speculative_tokens": 3,
        },
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@d682094

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>
…roject#5967)

### What this PR does / why we need it?
Cancel embedding sharing when the embeddings of the main model and the embeddings of the EAGLE model are different.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Because I don't have `Meta-Llama-3.1-8B-Instruct` locally, I commented it out and ran:
```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```
The output is fine:
```text
======================== warnings summary ========================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========= 3 passed, 1 skipped, 2 warnings in 196.19s (0:03:16) =========
```
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2c24bc6

Signed-off-by: drslark <slarksblood@qq.com>
What this PR does / why we need it?
Cancel embedding sharing when the embeddings of the main model and the embeddings of the EAGLE model are different.
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Because I don't have `Meta-Llama-3.1-8B-Instruct` locally, I commented it out and ran `pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance`. The output is fine.
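The fix hinges on the difference between element-wise equality (what `torch.equal` checks before sharing) and object identity (what sharing actually creates). A torch-free sketch, with plain lists standing in for the two `embed_tokens.weight` tensors:

```python
# Equality of contents (what torch.equal-style checks verify) versus
# object identity (what "sharing" creates). Lists stand in for tensors.
main = [0.1, 0.2, 0.3]
draft = [0.1, 0.2, 0.3]

assert main == draft      # element-wise equal -> sharing is lossless
assert main is not draft  # but still two separate objects in memory

draft = main              # "share": the draft now aliases the main table
assert draft is main      # one object backs both; one copy's memory saved
```

When the contents differ, aliasing would silently swap the draft model's embedding table, which is exactly the bug this PR cancels.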