[Frontend] Re-enable running MaxSim on GPU #38620
Conversation
Code Review
This pull request introduces an option to perform late-interaction scoring (MaxSim) on the GPU within the API server process to enhance performance. Key changes include adding a use_gpu_for_late_interaction_scoring CLI argument and implementing a two-stage execution flow in the ServingScores class. The review identified several critical issues in the new implementation: a missing return statement in the call method when GPU scoring is disabled, incorrect indexing and naming for document keys, appending pooling parameters to the wrong list, and failing to propagate the final result batch to the response context.
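The missing-return issue the review flags is a common failure mode in two-stage dispatch code: the fallback branch computes a result but never returns it, so callers receive None. A minimal, hypothetical sketch of the pattern (the function and variable names here are illustrative, not vLLM's actual identifiers):

```python
def maxsim_scores(queries, docs):
    # Pure-Python MaxSim: for each (query, doc) pair, sum over query
    # tokens of the max dot product against any doc token.
    return [
        sum(max(sum(qv * dv for qv, dv in zip(qt, dt)) for dt in doc)
            for qt in query)
        for query, doc in zip(queries, docs)
    ]

def route_scoring(queries, docs, use_gpu_scoring: bool):
    if use_gpu_scoring:
        # Stand-in for the GPU path; a real server would run this
        # in-process on the accelerator.
        return maxsim_scores(queries, docs)
    # The buggy variant called the fallback here without `return`,
    # so callers saw None whenever GPU scoring was disabled.
    # The fix is to propagate the result explicitly:
    return maxsim_scores(queries, docs)
```

With the return in place, both branches yield the same scores for the same inputs, which is the invariant the review is asking for.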
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
cc @roipony, ready to accept flash-maxsim for late-interaction scoring. (This PR has not yet integrated flash-maxsim; it only cleans up the API to make integration easier.)
yewentao256
left a comment
LGTM, thanks for the work!
Signed-off-by: Rishi Puri <riship@nvidia.com>
…ring

Replaces the vanilla padded-bmm MaxSim (PR vllm-project#35330, vllm-project#38620) with vendored flash-maxsim Triton kernels for ColBERT/ColPali document scoring.

Three scoring paths, with automatic fallback:
1. Zero-copy (default): project hidden_states once; the rerank kernel reads doc slices directly from projected_batch. No torch.cat, no score-matrix materialization.
2. Flash-packed (when zero-copy is disabled or params are not compatible): torch.cat + a single fused kernel call, no padding.
3. Vanilla (CPU, d < 16, or no Triton): the original padded bmm.

Key results on A100 80GB with ColBERT:
- Kernel speedup on varlen docs: 100-3000x vs vanilla padded bmm
- E2E throughput: +15-23% at 500+ docs/req (reranking workloads)
- P95 latency: 13-24% lower
- Score parity: max_abs_diff < 0.001 on 5K real docs, top-3 rankings identical

Kernel correctness: max_err=4e-6 vs the fp32 reference. Falls back to vanilla for CPU tensors, embedding dim < 16, chunked prefill, or when pooling params request matryoshka truncation / activation off.

Addresses: vllm-project#38282

Signed-off-by: roi.pony <roi.pony@ibm.com>
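For context, the vanilla padded-bmm path that flash-maxsim replaces can be sketched as follows. This is a hypothetical NumPy illustration of the MaxSim math, not vLLM's actual implementation: variable-length docs are padded to a common length, one batched matmul produces all token-level similarities, padding is masked out, and the score is the max over doc tokens summed over query tokens.

```python
import numpy as np

def maxsim_padded(q, docs):
    """Vanilla padded MaxSim sketch.

    q:    (n_q, dim) query token embeddings
    docs: list of (n_i, dim) doc token embeddings (variable n_i)
    Returns one score per doc: sum_i max_j (q_i . d_j).
    """
    max_len = max(d.shape[0] for d in docs)
    dim = q.shape[1]
    padded = np.zeros((len(docs), max_len, dim))
    mask = np.full((len(docs), max_len), -np.inf)
    for i, d in enumerate(docs):
        padded[i, : d.shape[0]] = d
        mask[i, : d.shape[0]] = 0.0  # real tokens contribute 0, padding -inf
    # Batched matmul: (B, max_len, dim) @ (dim, n_q) -> (B, max_len, n_q)
    sim = padded @ q.T + mask[:, :, None]  # padded positions become -inf
    # Max over doc tokens, then sum over query tokens.
    return sim.max(axis=1).sum(axis=1)
```

The cost of this path is the zero padding (wasted FLOPs on short docs) and the materialized (B, max_len, n_q) similarity tensor, which is exactly what the zero-copy and flash-packed kernels above avoid.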
Purpose
Because the logic for MaxSim on GPU is too complex, this path was temporarily disabled in the score-pooling entrypoints refactoring PR #28631 (to unblock subsequent PRs as soon as possible).
Let's re-enable running MaxSim on GPU.
After some attempts, enabling flash_late_interaction for the offline API leads to unnecessary complexity, so let's enable flash_late_interaction only for the API server.
Test Plan
tests/entrypoints/pooling/scoring/test_late_interaction_online.py
Test Result
pass
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.