[CORE] Prompt Embeddings Support for v1 Engine #24278
DarkLight1337 merged 76 commits into vllm-project:main
Conversation
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Can we merge after #25025? It seems the tests are related.

@DarkLight1337 @WoosukKwon Looks like CI issue was resolved. Thanks!

Thanks for your patience!
Fix bug due to vllm-project/vllm#24278 Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
```python
self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)
```
Shouldn't we do these conditional on self.enable_prompt_embeds?
Probably! I'll investigate and open a follow-up PR. Thanks!
Thanks for the suggestion.
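To illustrate the suggestion above, here is a minimal, hypothetical sketch (not the actual vLLM implementation; `Runner`, `FakeBuffer`, and the field names are stand-ins mirroring the discussion) of gating the host-to-device copies behind the `enable_prompt_embeds` flag so the buffers are only touched when the feature is on:

```python
class FakeBuffer:
    """Stand-in for a pinned CPU/GPU staging buffer; counts copies."""

    def __init__(self):
        self.copies = 0

    def copy_to_gpu(self, num_tokens: int):
        self.copies += 1


class Runner:
    """Hypothetical model-runner fragment, not vLLM's GPUModelRunner."""

    def __init__(self, enable_prompt_embeds: bool):
        self.enable_prompt_embeds = enable_prompt_embeds
        self.inputs_embeds = FakeBuffer()
        self.is_token_ids = FakeBuffer()

    def prepare(self, total_num_scheduled_tokens: int):
        # Only pay for the copies when prompt embeds are enabled,
        # as the review comment suggests.
        if self.enable_prompt_embeds:
            self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
            self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)


r_off = Runner(enable_prompt_embeds=False)
r_off.prepare(16)
r_on = Runner(enable_prompt_embeds=True)
r_on.prepare(16)
```

With the flag off, neither buffer is copied; with it on, both copies happen once per `prepare` call.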
```diff
 assert self.aux_buffers is not None
 # view the tensor as a contiguous 1D array of bytes
-arr = obj.flatten().contiguous().view(torch.uint8).numpy()
+arr = obj.flatten().contiguous().cpu().view(torch.uint8).numpy()
```
What is the reason for adding this?
I think this has the potential to introduce unnoticed performance regressions. It's probably better to require the tensors to already be on the CPU.
This change was originally added before I wrote #22962, which puts all tensors from users onto the CPU, so I think it may be vestigial. I'll investigate and open a follow-up PR reverting this change if it is now unneeded. I 100% agree that it could introduce unnoticed performance regressions in the future.
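For context on what the serializer line being discussed does, here is a numpy analogue (an illustrative sketch, not vLLM code) of viewing a contiguous tensor as a flat array of bytes; torch's `tensor.view(torch.uint8)` similarly reinterprets the same memory, which is why the tensor must be contiguous and CPU-resident first:

```python
import numpy as np

# 4 float32 values occupy 16 bytes.
x = np.arange(4, dtype=np.float32)

# Zero-copy reinterpretation of the same buffer as raw bytes,
# analogous to obj.flatten().contiguous().view(torch.uint8).
as_bytes = x.reshape(-1).view(np.uint8)

# Round-trip: the bytes reinterpret back into the original floats.
restored = as_bytes.view(np.float32)
```

The round-trip works because the byte view shares memory with the original array rather than copying it.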
Purpose
Fixes #22124. Fixes #19746.
Prompt embedding inputs are a niche but frequently requested feature in vLLM. #15428 introduced them in the v0 engine, but they have not yet been ported to the v1 engine. Prompt embedding users will be stuck on older versions of vLLM unless the feature is also introduced into the v1 engine.
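Conceptually, prompt embeddings let the user bypass the model's token-ID-to-embedding lookup and supply the `(seq_len, hidden_size)` vectors directly. A toy numpy sketch (purely illustrative; the sizes and the `embedding_table` are made up, and this is not vLLM's code path):

```python
import numpy as np

# Toy embedding table: vocab of 32 tokens, hidden size 8.
vocab_size, hidden_size = 32, 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)

token_ids = np.array([3, 1, 4, 1, 5])

# Normal path: the engine looks embeddings up from token IDs.
embeds_from_ids = embedding_table[token_ids]

# Prompt-embeds path: the user hands the engine a ready-made
# (seq_len, hidden_size) tensor, e.g. precomputed or transformed offline.
prompt_embeds = embeds_from_ids.copy()
```

The point of the feature is that `prompt_embeds` need not come from any lookup table at all; it can be any user-produced float tensor of the right shape.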
The original RFC is #22124. The design differs from that RFC in three ways:
- `--enable-prompt-embeds` is on (it's off by default).
- The double compilation proposal would require significant work, and was large enough while I was prototyping that I figured it would be better to do just the v1 + prompt embeds pieces first, because this PR is already large enough.
- `--enable-prompt-embeds` is on.

Test Plan
There are several unit tests already extant that test prompt embeds, but they were previously disabled on the v1 engine. I enabled those. I also added some more scenarios to the basic correctness tests to catch regressions related to tensor_parallel + prompt_embeds.
I'm also locally running a script based on https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/prompt_embed_inference.py against a large variety of combinations of (prompts, prompt_embeds, prompt+prompt_embeds) on many different seq_lens (ranging from very short to very long) within the same batch, across a variety of settings (including eager mode on/off, chunked_prefill on/off, and various tensor parallel sizes).
Test Result
All the new tests are passing. My local script suite is also passing, and the generations look as expected on every configuration I've checked on my linux machine with two nvidia GPUs.
Pending CI test results. With any luck I didn't break anything else. 🤞
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.