
[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement#38139

Merged
noooop merged 5 commits into main from wentao-remove-redundant-prompt-copy
Mar 29, 2026
Conversation

Member

@yewentao256 yewentao256 commented Mar 25, 2026

Purpose

This PR removes redundant device copies for CPU-only pooling token IDs.

Previously, the prompt token IDs made a "CPU -> GPU -> CPU" round trip twice; now they simply stay on the CPU.
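The shape of the change can be illustrated with a small, self-contained sketch. The function and buffer names here are hypothetical stand-ins, not vLLM's actual API:

```python
# Illustrative sketch of the redundant round trip this PR removes.
# Before: token IDs staged on CPU were copied to the device, then copied
# back to CPU for the pooler -- two transfers that carried no new data.
def get_prompt_token_ids_before(cpu_buffer, num_tokens):
    device_copy = list(cpu_buffer[:num_tokens])  # stand-in for CPU -> GPU copy
    return list(device_copy)                     # stand-in for GPU -> CPU copy

# After: the pooler reads directly from the CPU-side buffer.
def get_prompt_token_ids_after(cpu_buffer, num_tokens):
    return cpu_buffer[:num_tokens]

buf = [101, 2023, 2003, 102, 0, 0]
assert get_prompt_token_ids_before(buf, 4) == get_prompt_token_ids_after(buf, 4)
```

Both paths return the same token IDs; the second just skips the two device transfers.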

Test

Accuracy

Covered in unit tests

  • pytest tests/v1/worker/test_gpu_input_batch.py -q -k pooling_metadata_token_id_buffers
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_embedding
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_sparse_embedding
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_sparse_embedding_corner_case
  • tests/models/language/pooling/test_bge_m3.py::test_bge_m3_api_server_multi_vector

Perf

  1. Generate data
python - <<'PY'
import json
instruction = "<|user|>\nGiven a scientific paper title, retrieve the paper's abstract\n<|embed|>\n"
titles = [
    "Bitcoin: A Peer-to-Peer Electronic Cash System",
    "Generative Representational Instruction Tuning",
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Language Models are Few-Shot Learners",
    "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks",
    "LoRA: Low-Rank Adaptation of Large Language Models",
    "FlashAttention: Fast and Memory-Efficient Exact Attention",
]
with open("gritlm_bench.jsonl", "w") as f:
    for i in range(2000):
        title = titles[i % len(titles)]
        prompt = instruction + f"{title} ({i})"
        f.write(json.dumps({"prompt": prompt}) + "\n")
PY
  2. Launch server
python -m vllm.entrypoints.cli.main serve parasail-ai/GritLM-7B-vllm \
  --runner pooling \
  --max-model-len 4000
  3. Benchmark
python -m vllm.entrypoints.cli.main bench serve \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --base-url http://127.0.0.1:8000 \
  --model parasail-ai/GritLM-7B-vllm \
  --dataset-name custom \
  --dataset-path /home/yewentao256/vllm_source/gritlm_bench.jsonl \
  --skip-chat-template \
  --custom-output-len 1 \
  --num-prompts 2000 \
  --num-warmups 100 \
  --request-rate inf
# This PR
============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  6.94      
Total input tokens:                      90390     
Request throughput (req/s):              287.99    
Total token throughput (tok/s):          13015.51  
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4132.05   
Median E2EL (ms):                        4110.98   
P99 E2EL (ms):                           6821.56   
==================================================
# Main
============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  10.35     
Total input tokens:                      90390     
Request throughput (req/s):              193.31    
Total token throughput (tok/s):          8736.74   
----------------End-to-end Latency----------------
Mean E2EL (ms):                          6720.99   
Median E2EL (ms):                        6655.87   
P99 E2EL (ms):                           9988.44   
==================================================
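As a sanity check, the headline number follows directly from the two request-throughput figures above:

```python
# Request throughput from the two benchmark tables above.
main_rps = 193.31  # main branch
pr_rps = 287.99    # this PR
improvement = pr_rps / main_rps - 1
assert 0.48 < improvement < 0.50  # ~48.9% E2E throughput improvement
```

The duration ratio (10.35 s vs 6.94 s) and the token throughput ratio (13015.51 vs 8736.74 tok/s) give the same ~1.49x speedup.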

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026
@mergify mergify bot added the v1 label Mar 25, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new mechanism to explicitly request and handle prompt token IDs on the CPU side for pooling operations. Previously, requires_token_ids implied device-side tokens. Now, requires_token_ids_cpu is added to allow poolers to specifically request CPU-side token IDs, which can be more efficient for certain operations (e.g., token-based trimming or instruction length calculation) that are performed on the CPU. The changes involve updating PoolingParams, PoolingParamsUpdate, and PoolingMetadata to include the new CPU-side token ID field, modifying gpu_input_batch.py to conditionally create these CPU tensors, and updating various pooler implementations (special, BERT, GRITLM) to utilize this new CPU-side token ID buffer. A new test case is added to validate this functionality. I have no feedback to provide.

@noooop noooop enabled auto-merge (squash) March 26, 2026 01:49
f"returned_token_ids={self.returned_token_ids}, "
f"requires_token_ids={self.requires_token_ids}, "
f"requires_token_ids_cpu={self.requires_token_ids_cpu}, "
f"skip_reading_prefix_cache={self.skip_reading_prefix_cache}, "
Collaborator


Do we really need to create a separate flag for requires_token_ids_cpu? Using returned_token_ids to control both CPU and GPU is already sufficient and adds almost no overhead.

Member Author


Removed. I tested it and it doesn't affect perf much, nice catch!

============ Serving Benchmark Result ============
Successful requests:                     2000      
Failed requests:                         0         
Benchmark duration (s):                  7.01      
Total input tokens:                      90390     
Request throughput (req/s):              285.47    
Total token throughput (tok/s):          12901.97  
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4147.34   
Median E2EL (ms):                        4211.50   
P99 E2EL (ms):                           6830.78   
==================================================
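The follow-up run stays within about 1% of the original PR run, which backs up the claim that dropping the extra flag costs no measurable performance:

```python
flag_rps = 287.99     # original PR run, with the separate CPU flag
no_flag_rps = 285.47  # follow-up run, flag removed
delta = no_flag_rps / flag_rps - 1
assert abs(delta) < 0.02  # within ~1%, well inside run-to-run noise
```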

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@noooop noooop merged commit 995dea1 into main Mar 29, 2026
71 checks passed
@noooop noooop deleted the wentao-remove-redundant-prompt-copy branch March 29, 2026 18:12
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Mar 30, 2026
…ken IDs, 48.9% E2E throughput improvement (vllm-project#38139)"

This reverts commit 995dea1.
haosdent added a commit to haosdent/vllm that referenced this pull request Mar 30, 2026
Replace `types.SimpleNamespace` mock with real `PoolingMetadata` dataclass
in `test_splade_pooler_matches_reference_formula`. The test broke after
PR vllm-project#38139 added `get_prompt_token_ids_cpu()` to PoolingMetadata and
updated SPLADESparsePooler to call it — the SimpleNamespace mock lacked
this method.

Using the real dataclass makes the test resilient to future interface
changes and matches the pattern used in production warmup code.

Signed-off-by: vllm-contributor <contributor@vllm.ai>

Signed-off-by: haosdent <haosdent@gmail.com>
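The failure mode this commit message describes can be reproduced with a minimal sketch. The class and method names here are illustrative stand-ins, not vLLM's real definitions:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

# A SimpleNamespace mock only has the attributes it was constructed with,
# so code that later calls a newly added method (like
# get_prompt_token_ids_cpu()) fails with AttributeError. A real dataclass
# picks up new methods automatically.
@dataclass
class PoolingMetadataLike:  # hypothetical stand-in for PoolingMetadata
    prompt_token_ids_cpu: list = field(default_factory=list)

    def get_prompt_token_ids_cpu(self):
        return self.prompt_token_ids_cpu

mock = SimpleNamespace(prompt_token_ids_cpu=[1, 2, 3])
real = PoolingMetadataLike(prompt_token_ids_cpu=[1, 2, 3])

assert real.get_prompt_token_ids_cpu() == [1, 2, 3]
try:
    mock.get_prompt_token_ids_cpu()
except AttributeError:
    pass  # the mock lacks the method the real dataclass provides
```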
noooop pushed a commit that referenced this pull request Mar 30, 2026
Signed-off-by: haosdent <haosdent@gmail.com>
khluu pushed a commit that referenced this pull request Mar 31, 2026
Signed-off-by: haosdent <haosdent@gmail.com>
(cherry picked from commit a08b773)
tjohnson31415 pushed a commit to vllm-project/vllm-spyre that referenced this pull request Apr 8, 2026

## Description

Add support for vLLM v0.19.0

- bump vllm versions
- Inputs reorganization
([#35182](vllm-project/vllm#35182))
- `get_cross_encoder_act_fn` merged into `get_act_fn`
([#37537](vllm-project/vllm#37537))
- `RequestStatus.WAITING_FOR_FSM` renamed to
`WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR`
([#38048](vllm-project/vllm#38048))
- `prompt_token_ids_cpu` arg in PoolingMetadata
([#38139](vllm-project/vllm#38139))



## Checklist

- [x] I have read the [contributing
guidelines](https://docs.vllm.ai/projects/spyre/en/latest/contributing)
- [x] My code follows the project's code style (run `bash format.sh`)
- [x] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)
- [x] My commits include a `Signed-off-by:` line (DCO compliance)

---------

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>