
[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement#38139

Merged
noooop merged 5 commits into main from wentao-remove-redundant-prompt-copy
Mar 29, 2026
Conversation

Member

@yewentao256 yewentao256 commented Mar 25, 2026

Purpose

This PR removes redundant device copies for CPU-only pooling token IDs.

Previously, the prompt token IDs made a "CPU -> GPU -> CPU" round trip twice; now they simply stay on the CPU.
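The shape of the change can be illustrated with a small, self-contained sketch. The function and buffer names here are hypothetical stand-ins, not vLLM's actual API:

```python
# Illustrative sketch of the redundant round trip this PR removes.
# Before: token IDs staged on CPU were copied to the device, then copied
# back to CPU for the pooler -- two transfers that carried no new data.
def get_prompt_token_ids_before(cpu_buffer, num_tokens):
    device_copy = list(cpu_buffer[:num_tokens])  # stand-in for CPU -> GPU copy
    return list(device_copy)                     # stand-in for GPU -> CPU copy

# After: the pooler reads directly from the CPU-side buffer.
def get_prompt_token_ids_after(cpu_buffer, num_tokens):
    return cpu_buffer[:num_tokens]

buf = [101, 2023, 2003, 102, 0, 0]
assert get_prompt_token_ids_before(buf, 4) == get_prompt_token_ids_after(buf, 4)
```

Both paths return the same token IDs; the second just skips the two device transfers.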

Test

Accuracy

Covered in unit tests

  • pytest tests/v1/worker/test_gpu_input_batch.py -q -k pooling_metadata_token_id_buffers
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_embedding
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_sparse_embedding
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_sparse_embedding_corner_case
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_multi_vector

Perf

  1. Generate data
python - <<'PY'
import json
instruction = "<|user|>\nGiven a scientific paper title, retrieve the paper's abstract\n<|embed|>\n"
titles = [
    "Bitcoin: A Peer-to-Peer Electronic Cash System",
    "Generative Representational Instruction Tuning",
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Language Models are Few-Shot Learners",
    "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks",
    "LoRA: Low-Rank Adaptation of Large Language Models",
    "FlashAttention: Fast and Memory-Efficient Exact Attention",
]
with open("gritlm_bench.jsonl", "w") as f:
    for i in range(2000):
        title = titles[i % len(titles)]
        prompt = instruction + f"{title} ({i})"
        f.write(json.dumps({"prompt": prompt}) + "\n")
PY
  2. Launch server
python -m vllm.entrypoints.cli.main serve parasail-ai/GritLM-7B-vllm \
  --runner pooling \
  --max-model-len 4000
  3. Benchmark
python -m vllm.entrypoints.cli.main bench serve \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --base-url http://127.0.0.1:8000 \
  --model parasail-ai/GritLM-7B-vllm \
  --dataset-name custom \
  --dataset-path /home/yewentao256/vllm_source/gritlm_bench.jsonl \
  --skip-chat-template \
  --custom-output-len 1 \
  --num-prompts 2000 \
  --num-warmups 100 \
  --request-rate inf
# This PR
============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  6.94      
Total input tokens:                      90390     
Request throughput (req/s):              287.99    
Total token throughput (tok/s):          13015.51  
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4132.05   
Median E2EL (ms):                        4110.98   
P99 E2EL (ms):                           6821.56   
==================================================
# Main
============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  10.35     
Total input tokens:                      90390     
Request throughput (req/s):              193.31    
Total token throughput (tok/s):          8736.74   
----------------End-to-end Latency----------------
Mean E2EL (ms):                          6720.99   
Median E2EL (ms):                        6655.87   
P99 E2EL (ms):                           9988.44   
==================================================
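As a sanity check, the headline number follows directly from the two request-throughput figures above:

```python
# Request throughput from the two benchmark tables above.
main_rps = 193.31  # main branch
pr_rps = 287.99    # this PR
improvement = pr_rps / main_rps - 1
assert 0.48 < improvement < 0.50  # ~48.9% E2E throughput improvement
```

The duration ratio (10.35 s vs 6.94 s) and the token throughput ratio (13015.51 vs 8736.74 tok/s) give the same ~1.49x speedup.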

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026
@mergify mergify bot added the v1 label Mar 25, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new mechanism to explicitly request and handle prompt token IDs on the CPU side for pooling operations. Previously, requires_token_ids implied device-side tokens. Now, requires_token_ids_cpu is added to allow poolers to specifically request CPU-side token IDs, which can be more efficient for certain operations (e.g., token-based trimming or instruction length calculation) that are performed on the CPU. The changes involve updating PoolingParams, PoolingParamsUpdate, and PoolingMetadata to include the new CPU-side token ID field, modifying gpu_input_batch.py to conditionally create these CPU tensors, and updating various pooler implementations (special, BERT, GRITLM) to utilize this new CPU-side token ID buffer. A new test case is added to validate this functionality. I have no feedback to provide.

@noooop noooop enabled auto-merge (squash) March 26, 2026 01:49
f"returned_token_ids={self.returned_token_ids}, "
f"requires_token_ids={self.requires_token_ids}, "
f"requires_token_ids_cpu={self.requires_token_ids_cpu}, "
f"skip_reading_prefix_cache={self.skip_reading_prefix_cache}, "
Collaborator


Do we really need to create a separate flag for requires_token_ids_cpu? Using returned_token_ids to control both CPU and GPU is already sufficient and adds almost no overhead.

Member Author


Removed. I tested it and it doesn't affect perf much, nice catch!

============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  7.01      
Total input tokens:                      90390     
Request throughput (req/s):              285.47    
Total token throughput (tok/s):          12901.97  
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4147.34   
Median E2EL (ms):                        4211.50   
P99 E2EL (ms):                           6830.78   
==================================================
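The follow-up run stays within about 1% of the original PR run, which backs up the claim that dropping the extra flag costs no measurable performance:

```python
flag_rps = 287.99     # original PR run, with the separate CPU flag
no_flag_rps = 285.47  # follow-up run, flag removed
delta = no_flag_rps / flag_rps - 1
assert abs(delta) < 0.02  # within ~1%, well inside run-to-run noise
```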

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@noooop noooop merged commit 995dea1 into main Mar 29, 2026
71 checks passed
@noooop noooop deleted the wentao-remove-redundant-prompt-copy branch March 29, 2026 18:12
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Mar 30, 2026
…ken IDs, 48.9% E2E throughput improvement (vllm-project#38139)"

This reverts commit 995dea1.
haosdent added a commit to haosdent/vllm that referenced this pull request Mar 30, 2026
Replace `types.SimpleNamespace` mock with real `PoolingMetadata` dataclass
in `test_splade_pooler_matches_reference_formula`. The test broke after
PR vllm-project#38139 added `get_prompt_token_ids_cpu()` to PoolingMetadata and
updated SPLADESparsePooler to call it — the SimpleNamespace mock lacked
this method.

Using the real dataclass makes the test resilient to future interface
changes and matches the pattern used in production warmup code.

Signed-off-by: vllm-contributor <contributor@vllm.ai>

Signed-off-by: haosdent <haosdent@gmail.com>
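The failure mode this commit message describes can be reproduced with a minimal sketch. The class and method names here are illustrative stand-ins, not vLLM's real definitions:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

# A SimpleNamespace mock only has the attributes it was constructed with,
# so code that later calls a newly added method (like
# get_prompt_token_ids_cpu()) fails with AttributeError. A real dataclass
# picks up new methods automatically.
@dataclass
class PoolingMetadataLike:  # hypothetical stand-in for PoolingMetadata
    prompt_token_ids_cpu: list = field(default_factory=list)

    def get_prompt_token_ids_cpu(self):
        return self.prompt_token_ids_cpu

mock = SimpleNamespace(prompt_token_ids_cpu=[1, 2, 3])
real = PoolingMetadataLike(prompt_token_ids_cpu=[1, 2, 3])

assert real.get_prompt_token_ids_cpu() == [1, 2, 3]
try:
    mock.get_prompt_token_ids_cpu()
except AttributeError:
    pass  # the mock lacks the method the real dataclass provides
```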
noooop pushed a commit that referenced this pull request Mar 30, 2026
Signed-off-by: haosdent <haosdent@gmail.com>
khluu pushed a commit that referenced this pull request Mar 31, 2026
Signed-off-by: haosdent <haosdent@gmail.com>
(cherry picked from commit a08b773)
tjohnson31415 pushed a commit to vllm-project/vllm-spyre that referenced this pull request Apr 8, 2026

## Description

Add support for vLLM v0.19.0

- bump vllm versions
- Inputs reorganization
([#35182](vllm-project/vllm#35182))
- `get_cross_encoder_act_fn` merged into `get_act_fn`
([#37537](vllm-project/vllm#37537))
- `RequestStatus.WAITING_FOR_FSM` renamed to
`WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR`
([#38048](vllm-project/vllm#38048))
- `prompt_token_ids_cpu` arg in PoolingMetadata
([#38139](vllm-project/vllm#38139))



## Checklist

- [x] I have read the [contributing
guidelines](https://docs.vllm.ai/projects/spyre/en/latest/contributing)
- [x] My code follows the project's code style (run `bash format.sh`)
- [x] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)
- [x] My commits include a `Signed-off-by:` line (DCO compliance)

---------

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>