[Frontend] Re-enable running MaxSim on GPU #38620
Conversation
Code Review
This pull request introduces an option to perform late-interaction scoring (MaxSim) on the GPU within the API server process to enhance performance. Key changes include adding a use_gpu_for_late_interaction_scoring CLI argument and implementing a two-stage execution flow in the ServingScores class. The review identified several critical issues in the new implementation: a missing return statement in the call method when GPU scoring is disabled, incorrect indexing and naming for document keys, appending pooling parameters to the wrong list, and failing to propagate the final result batch to the response context.
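The missing-return issue the review flags is a common failure mode in two-stage dispatch code: the fallback branch computes a result but never returns it, so callers receive None. A minimal, hypothetical sketch of the pattern (the function and variable names here are illustrative, not vLLM's actual identifiers):

```python
def maxsim_scores(queries, docs):
    # Pure-Python MaxSim: for each (query, doc) pair, sum over query
    # tokens of the max dot product against any doc token.
    return [
        sum(max(sum(qv * dv for qv, dv in zip(qt, dt)) for dt in doc)
            for qt in query)
        for query, doc in zip(queries, docs)
    ]

def route_scoring(queries, docs, use_gpu_scoring: bool):
    if use_gpu_scoring:
        # Stand-in for the GPU path; a real server would run this
        # in-process on the accelerator.
        return maxsim_scores(queries, docs)
    # The buggy variant called the fallback here without `return`,
    # so callers saw None whenever GPU scoring was disabled.
    # The fix is to propagate the result explicitly:
    return maxsim_scores(queries, docs)
```

With the return in place, both branches yield the same scores for the same inputs, which is the invariant the review is asking for.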
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
cc @roipony, ready to accept flash-maxsim for late-interaction scoring. (This PR has not yet integrated flash-maxsim; it only cleans up the API to make integration easier.)
yewentao256
left a comment
LGTM, thanks for the work!
Signed-off-by: Rishi Puri <riship@nvidia.com>
…ring

Replaces the vanilla padded-bmm MaxSim (PR vllm-project#35330, vllm-project#38620) with vendored flash-maxsim Triton kernels for ColBERT/ColPali document scoring.

Three scoring paths, with automatic fallback:
1. Zero-copy (default): project hidden_states once; the rerank kernel reads doc slices directly from projected_batch. No torch.cat, no score-matrix materialization.
2. Flash-packed (when zero-copy is disabled or params are not compatible): torch.cat + a single fused kernel call, no padding.
3. Vanilla (CPU, d < 16, or no Triton): the original padded bmm.

Key results on A100 80GB with ColBERT:
- Kernel speedup on varlen docs: 100-3000x vs vanilla padded bmm
- E2E throughput: +15-23% at 500+ docs/req (reranking workloads)
- P95 latency: 13-24% lower
- Score parity: max_abs_diff < 0.001 on 5K real docs, top-3 rankings identical

Kernel correctness: max_err=4e-6 vs the fp32 reference. Falls back to vanilla for CPU tensors, embedding dim < 16, chunked prefill, or when pooling params request matryoshka truncation / activation off.

Addresses: vllm-project#38282

Signed-off-by: roi.pony <roi.pony@ibm.com>
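For context, the vanilla padded-bmm path that flash-maxsim replaces can be sketched as follows. This is a hypothetical NumPy illustration of the MaxSim math, not vLLM's actual implementation: variable-length docs are padded to a common length, one batched matmul produces all token-level similarities, padding is masked out, and the score is the max over doc tokens summed over query tokens.

```python
import numpy as np

def maxsim_padded(q, docs):
    """Vanilla padded MaxSim sketch.

    q:    (n_q, dim) query token embeddings
    docs: list of (n_i, dim) doc token embeddings (variable n_i)
    Returns one score per doc: sum_i max_j (q_i . d_j).
    """
    max_len = max(d.shape[0] for d in docs)
    dim = q.shape[1]
    padded = np.zeros((len(docs), max_len, dim))
    mask = np.full((len(docs), max_len), -np.inf)
    for i, d in enumerate(docs):
        padded[i, : d.shape[0]] = d
        mask[i, : d.shape[0]] = 0.0  # real tokens contribute 0, padding -inf
    # Batched matmul: (B, max_len, dim) @ (dim, n_q) -> (B, max_len, n_q)
    sim = padded @ q.T + mask[:, :, None]  # padded positions become -inf
    # Max over doc tokens, then sum over query tokens.
    return sim.max(axis=1).sum(axis=1)
```

The cost of this path is the zero padding (wasted FLOPs on short docs) and the materialized (B, max_len, n_q) similarity tensor, which is exactly what the zero-copy and flash-packed kernels above avoid.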
Purpose
Because the logic for MaxSim on GPU is too complex, this path was temporarily disabled in the score-pooling entrypoints refactoring PR #28631 (to unblock subsequent PRs as soon as possible).
Let's re-enable running MaxSim on GPU.
After some attempts, enabling flash_late_interaction for the offline API leads to unnecessary complexity, so let's enable flash_late_interaction only for the API server.
Test Plan
tests/entrypoints/pooling/scoring/test_late_interaction_online.py
Test Result
pass
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.