
[Frontend] Re-enable running MaxSim on GPU #38620

Merged
noooop merged 16 commits into vllm-project:main from noooop:maxsim_in_worker_side
Apr 2, 2026

Conversation

@noooop (Collaborator) commented Mar 31, 2026

Purpose

Because the logic for running MaxSim on GPU is complex, this path was temporarily disabled in the score-pooling entrypoints refactoring PR #28631 (to unblock subsequent PRs as soon as possible).

Let's re-enable running MaxSim on GPU.

After some attempts, it turned out that enabling flash_late_interaction for the offline API adds unnecessary complexity, so let's enable flash_late_interaction only for the API server.
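For context, MaxSim is the ColBERT-style late-interaction score: each query token is matched to its most similar document token, and those per-token maxima are summed. A minimal PyTorch sketch of the computation (illustrative only, not the vLLM implementation):

```python
# Minimal MaxSim sketch (illustrative, not vLLM code): for each query
# token, take the max similarity over all document tokens, then sum.
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()  # max over doc tokens, sum over query

q = torch.randn(32, 128, device="cuda")   # e.g. ColBERT query token embeddings
d = torch.randn(180, 128, device="cuda")  # one variable-length document
score = maxsim_score(q, d)
```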

Test Plan

tests/entrypoints/pooling/scoring/test_late_interaction_online.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces an option to perform late-interaction scoring (MaxSim) on the GPU within the API server process to enhance performance. Key changes include adding a use_gpu_for_late_interaction_scoring CLI argument and implementing a two-stage execution flow in the ServingScores class. The review identified several critical issues in the new implementation: a missing return statement in the call method when GPU scoring is disabled, incorrect indexing and naming for document keys, appending pooling parameters to the wrong list, and failing to propagate the final result batch to the response context.
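The two-stage flow the bot describes can be pictured roughly as follows. This is an illustrative sketch only; `encode_tokens` and the control flow are hypothetical stand-ins, not the actual ServingScores code:

```python
# Illustrative sketch of the two-stage flow; `engine.encode_tokens` and
# these control-flow details are hypothetical, not vLLM's real API.
import torch

def maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    return (q @ d.T).max(dim=-1).values.sum()

def score_late_interaction(engine, query: str, docs: list[str],
                           use_gpu_for_late_interaction_scoring: bool):
    # Stage 1: the engine workers produce per-token embeddings.
    q_emb = engine.encode_tokens(query)               # (Tq, dim)
    d_embs = [engine.encode_tokens(d) for d in docs]  # [(Td_i, dim), ...]

    if not use_gpu_for_late_interaction_scoring:
        # CPU path; the review found the real code was missing a
        # `return` on one such branch, silently dropping the scores.
        return [maxsim(q_emb.cpu(), d.cpu()) for d in d_embs]

    # Stage 2: run MaxSim on GPU inside the API-server process.
    q_gpu = q_emb.cuda()
    return [maxsim(q_gpu, d.cuda()) for d in d_embs]
```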

Four comment threads on vllm/entrypoints/pooling/scoring/serving.py (outdated)
noooop added 13 commits March 31, 2026 17:24
@noooop noooop requested a review from yewentao256 April 1, 2026 09:48
@noooop noooop marked this pull request as ready for review April 1, 2026 09:48
@noooop (Collaborator, Author) commented Apr 1, 2026

cc @roipony

Ready to accept flash-maxsim for late-interaction scoring.

(This PR has not yet integrated flash-maxsim; it only cleans up the API to make integration easier.)

Comment thread on vllm/entrypoints/openai/cli_args.py
@noooop noooop added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 1, 2026
noooop added 2 commits April 2, 2026 11:18
@yewentao256 (Member) left a comment


LGTM, thanks for the work!

@noooop noooop merged commit a9b4f07 into vllm-project:main Apr 2, 2026
61 checks passed
@noooop noooop deleted the maxsim_in_worker_side branch April 2, 2026 16:03
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
roipony pushed a commit to roipony/vllm that referenced this pull request Apr 20, 2026
…ring

Replaces the vanilla padded-bmm MaxSim (PR vllm-project#35330, vllm-project#38620) with vendored
flash-maxsim Triton kernels for ColBERT/ColPali document scoring.

Three scoring paths, with automatic fallback:
  1. Zero-copy (default): project hidden_states once, rerank kernel
     reads doc slices directly from projected_batch. No torch.cat,
     no score-matrix materialization.
  2. Flash-packed (when zerocopy disabled or params not compatible):
     torch.cat + single fused kernel call, no padding.
  3. Vanilla (CPU, d<16, or no Triton): original padded bmm.

Key results on A100 80GB with ColBERT:
  - Kernel speedup on varlen docs: 100-3000x vs vanilla padded bmm
  - E2E throughput: +15-23% at 500+ docs/req (reranking workloads)
  - P95 latency: 13-24% lower
  - Score parity: max_abs_diff < 0.001 on 5K real docs,
    top-3 rankings identical

Kernel correctness: max_err=4e-6 vs fp32 reference. Falls back to
vanilla for CPU tensors, embedding dim < 16, chunked-prefill, or
when pooling params request matryoshka truncation / activation off.

Addresses: vllm-project#38282
Signed-off-by: roi.pony <roi.pony@ibm.com>
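The "vanilla padded bmm" baseline referenced above corresponds roughly to the following pattern (an illustrative sketch, not the vendored kernels): pad variable-length documents to a common length, score them with one batched matmul, and mask the padding before taking the max. The wasted compute on padding grows with the spread of document lengths, which is what the fused Triton kernels avoid.

```python
# Sketch of a padded-bmm MaxSim baseline (illustrative only).
import torch

def padded_bmm_maxsim(query_emb: torch.Tensor,
                      doc_embs: list[torch.Tensor]) -> torch.Tensor:
    """query_emb: (Tq, dim); doc_embs: list of (Td_i, dim) tensors."""
    lens = [d.shape[0] for d in doc_embs]
    max_len = max(lens)
    n, dim = len(doc_embs), query_emb.shape[1]

    # Pad every document to the longest one and record a validity mask.
    padded = query_emb.new_zeros(n, max_len, dim)
    mask = torch.zeros(n, max_len, dtype=torch.bool, device=query_emb.device)
    for i, d in enumerate(doc_embs):
        padded[i, : lens[i]] = d
        mask[i, : lens[i]] = True

    # One batched matmul: (n, Tq, max_len) token-similarity matrices.
    sim = torch.einsum("qd,ntd->nqt", query_emb, padded)
    sim.masked_fill_(~mask[:, None, :], float("-inf"))
    # Max over doc tokens, sum over query tokens -> (n,) document scores.
    return sim.max(dim=-1).values.sum(dim=-1)
```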

Labels

frontend · ready (ONLY add when PR is ready to merge/full CI is needed) · v1
