fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions (#25738)
Merged: vllm-bot merged 1 commit into vllm-project:main on Sep 26, 2025
Conversation
…den performance regressions Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Contributor
Code Review
This pull request aims to prevent hidden performance regressions by explicitly moving tensors to the CPU before serialization, rather than relying on an implicit cast within the MsgpackEncoder. The change correctly adds a .cpu() call for prompt_embeds in the input preprocessing stage. However, reverting the .cpu() call in MsgpackEncoder._encode_tensor creates a risk for other types of tensor inputs, such as those from multi-modal data, which may not be on the CPU. This could lead to runtime errors. I've added a critical comment to address this by making the CPU requirement explicit with a check.
DarkLight1337 approved these changes on Sep 26, 2025
Commits referencing this pull request:
- pdasigi pushed a commit to pdasigi/vllm, Oct 2, 2025: …idden performance regressions (vllm-project#25738) Signed-off-by: Andrew Sansom <andrew@protopia.ai>
- yewentao256 pushed a commit, Oct 3, 2025 (additionally Signed-off-by: yewentao256 <zhyanwentao@126.com>)
- choprahetarth pushed a commit to Tandemn-Labs/vllm, Oct 11, 2025
- lywa1998 pushed a commit to lywa1998/vllm, Oct 20, 2025
- alhridoy pushed a commit to alhridoy/vllm, Oct 24, 2025
- rtourgeman pushed a commit to rtourgeman/vllm, Nov 10, 2025
Purpose
PR #24278 introduced casting tensors in MsgpackEncoder to the CPU before serializing. Although this did not introduce any performance regression at the time, casting between devices can be very expensive, and doing so every time a tensor is sent between processes has the potential to introduce major performance regressions. Tensors that will be serialized with Msgpack should be explicitly cast to CPU before any encoding, as recommended by @njhill. This is a spiritual successor to #22962, which does a similar cast to CPU in the OpenAI-compatible API. This PR ensures that ALL prompt embeds tensors are cast to CPU before being processed, even if they are submitted in offline mode.
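The idea of casting once at the preprocessing boundary can be sketched as follows. The helper name is hypothetical; it only illustrates where the PR moves the copy:

```python
import torch

def prepare_prompt_embeds(prompt_embeds: torch.Tensor) -> torch.Tensor:
    # Hypothetical preprocessing step: move embeddings to CPU once, before
    # they enter the serialization/IPC path. Tensor.cpu() returns the
    # original tensor unchanged when it is already on CPU, so offline
    # CPU-only callers pay nothing; GPU callers pay one explicit
    # device-to-host copy here instead of a hidden copy inside the
    # encoder on every message.
    return prompt_embeds.cpu()
```

This keeps the encoder free of implicit device transfers while still guaranteeing that everything downstream sees CPU tensors.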
Test Plan
No new tests are needed. I have a few local scripts that I use to hit vLLM with prompt embeds with thousands of requests from different devices. Those local scripts all pass.
Test Result
Local relevant tests are passing. Pending CI.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.