[CORE] Prompt Embeddings Support for v1 Engine #24278
DarkLight1337 merged 76 commits into vllm-project:main
Conversation
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Can we merge after #25025? It seems the tests are related.

@DarkLight1337 @WoosukKwon Looks like CI issue was resolved. Thanks!

Thanks for your patience!
Fix bug due to vllm-project/vllm#24278 Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
```python
self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)
```
Shouldn't we do these conditional on self.enable_prompt_embeds?
Probably! I'll investigate and open a follow-up PR. Thanks!
Thanks for the suggestion.
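To illustrate the suggestion above, here is a minimal, hypothetical sketch (not the actual vLLM implementation; `Runner`, `FakeBuffer`, and the field names are stand-ins mirroring the discussion) of gating the host-to-device copies behind the `enable_prompt_embeds` flag so the buffers are only touched when the feature is on:

```python
class FakeBuffer:
    """Stand-in for a pinned CPU/GPU staging buffer; counts copies."""

    def __init__(self):
        self.copies = 0

    def copy_to_gpu(self, num_tokens: int):
        self.copies += 1


class Runner:
    """Hypothetical model-runner fragment, not vLLM's GPUModelRunner."""

    def __init__(self, enable_prompt_embeds: bool):
        self.enable_prompt_embeds = enable_prompt_embeds
        self.inputs_embeds = FakeBuffer()
        self.is_token_ids = FakeBuffer()

    def prepare(self, total_num_scheduled_tokens: int):
        # Only pay for the copies when prompt embeds are enabled,
        # as the review comment suggests.
        if self.enable_prompt_embeds:
            self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
            self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)


r_off = Runner(enable_prompt_embeds=False)
r_off.prepare(16)
r_on = Runner(enable_prompt_embeds=True)
r_on.prepare(16)
```

With the flag off, neither buffer is copied; with it on, both copies happen once per `prepare` call.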
```diff
 assert self.aux_buffers is not None
 # view the tensor as a contiguous 1D array of bytes
-arr = obj.flatten().contiguous().view(torch.uint8).numpy()
+arr = obj.flatten().contiguous().cpu().view(torch.uint8).numpy()
```
What is the reason for adding this?
I think this has the potential to introduce unnoticed performance regressions. It's probably better to require the tensors to already be on the CPU.
This change was originally added before I wrote #22962, which puts all tensors from users onto the CPU, so I think it may be vestigial. I'll investigate and open a follow-up PR reverting this change if it is now unneeded. I 100% agree that it could introduce unnoticed performance regressions in the future.
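For context on what the serializer line being discussed does, here is a numpy analogue (an illustrative sketch, not vLLM code) of viewing a contiguous tensor as a flat array of bytes; torch's `tensor.view(torch.uint8)` similarly reinterprets the same memory, which is why the tensor must be contiguous and CPU-resident first:

```python
import numpy as np

# 4 float32 values occupy 16 bytes.
x = np.arange(4, dtype=np.float32)

# Zero-copy reinterpretation of the same buffer as raw bytes,
# analogous to obj.flatten().contiguous().view(torch.uint8).
as_bytes = x.reshape(-1).view(np.uint8)

# Round-trip: the bytes reinterpret back into the original floats.
restored = as_bytes.view(np.float32)
```

The round-trip works because the byte view shares memory with the original array rather than copying it.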
Purpose
Fixes #22124. Fixes #19746.
Prompt embedding inputs are a niche but frequently requested feature in vLLM. #15428 introduced them in the v0 engine, but they have not yet been ported to the v1 engine. Prompt embedding users will be stuck on older versions of vLLM unless the feature is also introduced into the v1 engine.
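Conceptually, prompt embeddings let the user bypass the model's token-ID-to-embedding lookup and supply the `(seq_len, hidden_size)` vectors directly. A toy numpy sketch (purely illustrative; the sizes and the `embedding_table` are made up, and this is not vLLM's code path):

```python
import numpy as np

# Toy embedding table: vocab of 32 tokens, hidden size 8.
vocab_size, hidden_size = 32, 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)

token_ids = np.array([3, 1, 4, 1, 5])

# Normal path: the engine looks embeddings up from token IDs.
embeds_from_ids = embedding_table[token_ids]

# Prompt-embeds path: the user hands the engine a ready-made
# (seq_len, hidden_size) tensor, e.g. precomputed or transformed offline.
prompt_embeds = embeds_from_ids.copy()
```

The point of the feature is that `prompt_embeds` need not come from any lookup table at all; it can be any user-produced float tensor of the right shape.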
The original RFC is #22124. The design differs from that RFC in three ways:
- `--enable-prompt-embeds` is on (it's off by default).
- The double compilation proposal would require significant work, and was large enough while I was prototyping that I figured it would be better to do just the v1 + prompt embeds pieces first, because this PR is already large enough.
- `--enable-prompt-embeds` is on.

Test Plan
There are several unit tests already extant that test prompt embeds, but they were previously disabled on the v1 engine. I enabled those. I also added some more scenarios to the basic correctness tests to catch regressions related to tensor_parallel + prompt_embeds.
I'm also locally running a script based on https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/prompt_embed_inference.py against a large variety of combinations of (prompts, prompt_embeds, prompt+prompt_embeds) on many different seq_lens (ranging from very short to very long) within the same batch, across a variety of settings (including eager mode on/off, chunked_prefill on/off, and various tensor parallel sizes).
Test Result
All the new tests are passing. My local script suite is also passing, and the generations look as expected on every configuration I've checked on my linux machine with two nvidia GPUs.
Pending CI test results. With any luck I didn't break anything else. 🤞
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.