Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request introduces a new mechanism to explicitly request and handle prompt token IDs on the CPU side for pooling operations. Previously, requires_token_ids implied device-side tokens. Now, requires_token_ids_cpu is added to allow poolers to specifically request CPU-side token IDs, which can be more efficient for certain operations (e.g., token-based trimming or instruction length calculation) that are performed on the CPU. The changes involve updating PoolingParams, PoolingParamsUpdate, and PoolingMetadata to include the new CPU-side token ID field, modifying gpu_input_batch.py to conditionally create these CPU tensors, and updating various pooler implementations (special, BERT, GRITLM) to utilize this new CPU-side token ID buffer. A new test case is added to validate this functionality. I have no feedback to provide.
f"returned_token_ids={self.returned_token_ids}, "
f"requires_token_ids={self.requires_token_ids}, "
f"requires_token_ids_cpu={self.requires_token_ids_cpu}, "
f"skip_reading_prefix_cache={self.skip_reading_prefix_cache}, "
Do we really need to create a separate flag for requires_token_ids_cpu? Using returned_token_ids to control both CPU and GPU is already sufficient and adds almost no overhead.
Removed. I tested it and it doesn't affect perf much, nice catch!
============ Serving Benchmark Result ============
Successful requests: 2000
Failed requests: 0
Benchmark duration (s): 7.01
Total input tokens: 90390
Request throughput (req/s): 285.47
Total token throughput (tok/s): 12901.97
----------------End-to-end Latency----------------
Mean E2EL (ms): 4147.34
Median E2EL (ms): 4211.50
P99 E2EL (ms): 6830.78
==================================================
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ken IDs, 48.9% E2E throughput improvement (vllm-project#38139)" This reverts commit 995dea1.
Replace `types.SimpleNamespace` mock with real `PoolingMetadata` dataclass in `test_splade_pooler_matches_reference_formula`.

The test broke after PR vllm-project#38139 added `get_prompt_token_ids_cpu()` to PoolingMetadata and updated SPLADESparsePooler to call it; the SimpleNamespace mock lacked this method. Using the real dataclass makes the test resilient to future interface changes and matches the pattern used in production warmup code.

Signed-off-by: vllm-contributor <contributor@vllm.ai>
Signed-off-by: haosdent <haosdent@gmail.com>
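The failure mode described in this commit message can be reproduced in miniature. The sketch below is illustrative only: `PoolingMetadataSketch` and `count_tokens` are hypothetical stand-ins rather than vLLM code, and plain Python lists take the place of tensors. Only the names `prompt_token_ids_cpu` and `get_prompt_token_ids_cpu()` are taken from this thread.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class PoolingMetadataSketch:
    """Hypothetical stand-in for vLLM's PoolingMetadata (not the real class)."""

    prompt_lens: list[int]
    prompt_token_ids_cpu: Optional[list[list[int]]] = None

    def get_prompt_token_ids_cpu(self) -> list[list[int]]:
        # Trim each padded row down to its true prompt length.
        assert self.prompt_token_ids_cpu is not None
        return [
            row[:n] for row, n in zip(self.prompt_token_ids_cpu, self.prompt_lens)
        ]


def count_tokens(metadata) -> int:
    """Toy pooler step (hypothetical) that relies on the accessor method."""
    return sum(len(ids) for ids in metadata.get_prompt_token_ids_cpu())


padded = [[1, 2, 0], [3, 4, 5]]

# An attribute-only mock has the data but not the method, so the call fails.
mock = SimpleNamespace(prompt_lens=[2, 3], prompt_token_ids_cpu=padded)
try:
    count_tokens(mock)
    mock_failed = False
except AttributeError:
    mock_failed = True

# A dataclass instance carries the method along with the fields.
real = PoolingMetadataSketch(prompt_lens=[2, 3], prompt_token_ids_cpu=padded)
total = count_tokens(real)
```

Because the dataclass instance carries its methods with it, a test built on the real class picks up new accessors automatically, whereas an attribute-only mock has to be updated by hand each time the interface grows.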
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: neweyes <328719365@qq.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: neweyes <328719365@qq.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
… 48.9% E2E throughput improvement (vllm-project#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
…t#38495) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
<!-- markdownlint-disable -->
## Description

Add support for vLLM v0.19.0

- bump vllm versions
- Inputs reorganization ([#35182](vllm-project/vllm#35182))
- `get_cross_encoder_act_fn` merged into `get_act_fn` ([#37537](vllm-project/vllm#37537))
- `RequestStatus.WAITING_FOR_FSM` renamed to `WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR` ([#38048](vllm-project/vllm#38048))
- `prompt_token_ids_cpu` arg in PoolingMetadata ([#38139](vllm-project/vllm#38139))

## Related Issues

<!-- Link related issues, e.g., `Fixes #` or `Relates to #456` -->

## Checklist

- [x] I have read the [contributing guidelines](https://docs.vllm.ai/projects/spyre/en/latest/contributing)
- [x] My code follows the project's code style (run `bash format.sh`)
- [x] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)
- [x] My commits include a `Signed-off-by:` line (DCO compliance)

---------

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Purpose
This PR removes redundant device copies for pooling token IDs that are only used on the CPU.
Previously, these token IDs went through a "CPU -> GPU -> CPU" copy twice; now they simply stay on the CPU.
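As a rough illustration of the copy path described above, the toy model below logs each device hop. It is a sketch under loud assumptions: `Buffer`, `instruction_len_old`, and `instruction_len_new` are hypothetical names, a Python list stands in for a tensor, and the "instruction length" logic (position of a separator token) is invented for the example rather than taken from vLLM.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    """Toy tensor stand-in: token IDs, the device they live on, a transfer log."""

    data: list
    device: str = "cpu"
    log: list = field(default_factory=list)

    def to(self, device: str) -> "Buffer":
        # Record a transfer only when the device actually changes.
        if device == self.device:
            return self
        self.log.append(f"{self.device}->{device}")
        # The copy shares the log so every hop of one buffer is visible together.
        return Buffer(list(self.data), device, self.log)


def instruction_len_old(token_ids: Buffer, sep_id: int) -> int:
    """Old-style path: stage on the device, then copy back for CPU-side logic."""
    on_gpu = token_ids.to("gpu")
    back_on_cpu = on_gpu.to("cpu")
    return back_on_cpu.data.index(sep_id) + 1


def instruction_len_new(token_ids: Buffer, sep_id: int) -> int:
    """New-style path: the CPU-side buffer is read directly, with no transfers."""
    return token_ids.data.index(sep_id) + 1


old_buf = Buffer([5, 6, 9, 7])
new_buf = Buffer([5, 6, 9, 7])
old_len = instruction_len_old(old_buf, sep_id=9)
new_len = instruction_len_new(new_buf, sep_id=9)
```

Both paths compute the same length; the difference is that `old_buf.log` records two hops while `new_buf.log` stays empty, which is the saving this PR targets for CPU-only consumers.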
Test
Acc
Covered in unit tests
Perf