Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request optimizes the maxsim score computation by moving it to the GPU and processing in batches, which is a significant performance improvement. The implementation of the new batched function compute_maxsim_scores is mostly solid. However, I've identified a performance issue within the batching logic itself. The method for determining the batch size to avoid oversized memory allocations is inefficient and can be improved. I've provided a suggestion to refactor this part for better performance.
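For illustration only, here is a minimal sketch of batched MaxSim scoring where the batch size is derived arithmetically from an element budget instead of being searched for iteratively. The function name, the `max_elements` budget, and the padding/masking details are assumptions for this sketch and not the PR's actual `compute_maxsim_scores`; it also assumes all document embeddings are already on the same device and dtype as the query embeddings.

```python
import torch

def maxsim_scores_batched_sketch(
    q_emb: torch.Tensor,               # (q_len, dim) query token embeddings
    doc_embs: list[torch.Tensor],      # per-document (d_len_i, dim) token embeddings
    max_elements: int = 1 << 26,       # rough cap on similarity-tensor elements (assumption)
) -> torch.Tensor:
    """MaxSim per document: sum over query tokens of the max dot product
    against that document's tokens, computed in bounded-size batches."""
    q_len = q_emb.shape[0]
    max_d_len = max(d.shape[0] for d in doc_embs)
    # Derive the batch size in O(1) from the budget instead of growing or
    # shrinking it in a loop, so no oversized tensor is ever allocated.
    batch_size = max(1, max_elements // (q_len * max_d_len))

    scores = []
    for start in range(0, len(doc_embs), batch_size):
        chunk = doc_embs[start:start + batch_size]
        padded = torch.nn.utils.rnn.pad_sequence(chunk, batch_first=True)   # (B, L, dim)
        lengths = torch.tensor([d.shape[0] for d in chunk], device=padded.device)
        pad_mask = torch.arange(padded.shape[1], device=padded.device) >= lengths[:, None]
        sim = torch.einsum("qd,bld->bql", q_emb, padded)                    # (B, q_len, L)
        sim.masked_fill_(pad_mask[:, None, :], float("-inf"))               # ignore padding
        scores.append(sim.max(dim=-1).values.sum(dim=-1))                   # (B,)
    return torch.cat(scores)
```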
Signed-off-by: yewentao256 <zhyanwentao@126.com>
(I’m not sure if it will work when the API server count is greater than 1. I have reservations about using the GPU in the API server (or during pre-processing and post-processing stages). There might be a risk of OOM or other weird CUDA errors. However, computing maxsim scores on the GPU is indeed better.)
@yewentao256 can you test this with api-server-count>1 without DP? I have concerns about the API server using GPU resources.
No worries, we can fix it accordingly if any issue is raised. With the current config
if q_emb.shape[1] != d_emb.shape[1]:
    raise ValueError("Query and document embeddings must have same dim")
compute_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
I don't think you should be using `torch.cuda.is_available()` directly. You should use `current_platform.is_cuda()` at least.
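For reference, a sketch of the suggested change; `current_platform` is vLLM's platform abstraction, and the surrounding variable is taken from the snippet above:

```python
import torch
from vllm.platforms import current_platform

# Use vLLM's platform abstraction rather than torch.cuda.is_available(),
# so non-CUDA backends (ROCm, TPU, CPU, ...) are not routed onto "cuda".
compute_device = torch.device("cuda") if current_platform.is_cuda() else torch.device("cpu")
```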
Waiting for issues to be reported is not a good testing strategy. We should have raised this PR with other contributors before merging.
…2E throughput improvement (vllm-project#35330) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: stakeswky <stakeswky@gmail.com>
…2E throughput improvement (vllm-project#35330) Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ring

Replaces the vanilla padded-bmm MaxSim (PR vllm-project#35330, vllm-project#38620) with vendored flash-maxsim Triton kernels for ColBERT/ColPali document scoring.

Three scoring paths, with automatic fallback:
1. Zero-copy (default): project hidden_states once, rerank kernel reads doc slices directly from projected_batch. No torch.cat, no score-matrix materialization.
2. Flash-packed (when zerocopy disabled or params not compatible): torch.cat + single fused kernel call, no padding.
3. Vanilla (CPU, d<16, or no Triton): original padded bmm.

Key results on A100 80GB with ColBERT:
- Kernel speedup on varlen docs: 100-3000x vs vanilla padded bmm
- E2E throughput: +15-23% at 500+ docs/req (reranking workloads)
- P95 latency: 13-24% lower
- Score parity: max_abs_diff < 0.001 on 5K real docs, top-3 rankings identical

Kernel correctness: max_err=4e-6 vs fp32 reference. Falls back to vanilla for CPU tensors, embedding dim < 16, chunked-prefill, or when pooling params request matryoshka truncation / activation off.

Addresses: vllm-project#38282

Signed-off-by: roi.pony <roi.pony@ibm.com>
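To make the fallback order concrete, here is a small illustrative sketch of the path selection described above. The guard conditions mirror the commit text (CUDA tensor, embedding dim >= 16, Triton available); the function name is hypothetical, and the additional checks for chunked prefill and pooling params are omitted.

```python
import torch

def choose_maxsim_path(q_emb: torch.Tensor, zerocopy_enabled: bool, has_triton: bool) -> str:
    """Return which of the three scoring paths would handle this request."""
    if not q_emb.is_cuda or q_emb.shape[-1] < 16 or not has_triton:
        return "vanilla"        # path 3: original padded-bmm scoring
    if zerocopy_enabled:
        return "zerocopy"       # path 1: kernel reads doc slices from projected_batch
    return "flash_packed"       # path 2: torch.cat + single fused kernel call
```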
Purpose
Optimize maxsim scores computation for pooling models
Originally it was calculated on the CPU; now we calculate it on the GPU in a batched fashion, which gives a significant performance improvement.
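For reference, the per-pair score being moved to the GPU is the standard ColBERT-style MaxSim; a minimal sketch (illustrative, not the PR's exact code):

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (q_len, dim), d_emb: (d_len, dim)
    # score = sum over query tokens of the max dot product with any doc token
    return (q_emb @ d_emb.T).max(dim=-1).values.sum()
```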
Test
Acc
Covered in unit tests
Perf
vllm serve --model jinaai/jina-colbert-v2 --runner pooling --port 9256 --enforce-eager --max-model-len 4096 --max-num-batched-tokens 4096 --disable-log-stats --hf-overrides '{"architectures": ["ColBERTJinaRobertaModel"]}' --trust-remote-code

vllm bench serve --model jinaai/jina-colbert-v2 --backend vllm-rerank --endpoint /v1/rerank --host 127.0.0.1 --port 9256 --dataset-name random-rerank --num-prompts 2000 --request-rate inf --max-concurrency 64 --seed 0 --random-input-len 2048 --random-range-ratio 0.5 --percentile-metrics e2el --metric-percentiles 50,95,99