
[Refactor] Fix maxsim cuda platform and add cli to control it #35427

Merged: noooop merged 6 commits into main from wentao-fix-maxsim-scores-cuda on Mar 3, 2026
Conversation

@yewentao256 (Member) commented Feb 26, 2026

Purpose

Fixes #35330 (comment)

Also adds an env var to control it, as discussed with @mgoin, @NickLucche, @noooop, and @DarkLight1337.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@gemini-code-assist (Bot) left a comment


Code Review

This pull request aims to refactor the device selection in compute_maxsim_scores to use the vLLM platform abstraction. While this is a good direction, the current change from torch.cuda.is_available() to current_platform.is_cuda() introduces a regression for ROCm-based systems, as it would incorrectly default to using the CPU. My review provides a critical fix to use current_platform.is_cuda_alike() instead, which correctly handles both CUDA and ROCm platforms, thus preserving the original behavior in a platform-agnostic way.
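The distinction the review draws can be sketched as follows. `Platform` here is a simplified stand-in for vLLM's `current_platform` abstraction, written for illustration only (the real class lives in `vllm.platforms` and looks different):

```python
# Illustrative sketch of the is_cuda() vs. is_cuda_alike() regression the
# review flags. The Platform class below is an assumption for this example,
# not vLLM's actual platform implementation.
from dataclasses import dataclass


@dataclass
class Platform:
    device_type: str  # e.g. "cuda", "rocm", "cpu"

    def is_cuda(self) -> bool:
        return self.device_type == "cuda"

    def is_cuda_alike(self) -> bool:
        # ROCm exposes itself through the same torch.cuda interface, so
        # both CUDA and ROCm should take the GPU path.
        return self.device_type in ("cuda", "rocm")


def select_scoring_device(platform: Platform) -> str:
    # Using is_cuda() alone would send ROCm systems down the CPU path
    # (the regression); is_cuda_alike() preserves the original behavior
    # of torch.cuda.is_available() in a platform-agnostic way.
    return "cuda" if platform.is_cuda_alike() else "cpu"
```

With this check, a ROCm platform still selects `"cuda"` while a pure CPU platform falls back to `"cpu"`.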

Comment thread vllm/entrypoints/pooling/score/utils.py Outdated
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 changed the title from "[Refactor] Fix maxsim cuda platform" to "[Refactor] Fix maxsim cuda platform and add env to control it" on Feb 26, 2026
@yewentao256 added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Feb 26, 2026
Comment thread vllm/entrypoints/pooling/score/utils.py Outdated
Comment thread vllm/entrypoints/pooling/score/utils.py Outdated
Comment thread vllm/envs.py Outdated
VLLM_ENGINE_READY_TIMEOUT_S: int = 600
VLLM_API_KEY: str | None = None
VLLM_DEBUG_LOG_API_SERVER_RESPONSE: bool = False
VLLM_USE_GPU_FOR_POOLING_SCORE: bool = False
A Member commented:

Make this a config variable in FrontendArgs, not an env variable.

The Member Author replied:

Solved, thanks!
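A minimal sketch of what moving the switch from an env var to CLI-level frontend config implies; the flag name mirrors the `VLLM_USE_GPU_FOR_POOLING_SCORE` setting, but the argparse wiring here is an assumption, not the PR's actual FrontendArgs code:

```python
# Hypothetical sketch: exposing the GPU-pooling-score switch as a CLI flag
# rather than an env var. Flag name and wiring are assumptions for
# illustration; vLLM's FrontendArgs plumbing is more involved.
import argparse

parser = argparse.ArgumentParser(description="frontend args sketch")
parser.add_argument(
    "--use-gpu-for-pooling-score",
    action="store_true",  # defaults to False, matching the env var default
    help="Run maxsim pooling scores on the GPU instead of the CPU.",
)

# A CLI flag like this surfaces in --help and in served config dumps,
# which is the usual argument for preferring it over a hidden env var.
args = parser.parse_args(["--use-gpu-for-pooling-score"])
```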

@NickLucche (Collaborator) left a comment:

Shouldn't we disable it for num_api_servers > 1 ?

@noooop (Collaborator) commented Feb 27, 2026

> Shouldn't we disable it for num_api_servers > 1?

I am concerned that all the GPU workloads from the api_servers will land on GPU:0. That greatly increases the risk of OOM if num_api_servers > 1. As I mentioned in #35330 (comment):

> I have reservations about using the GPU in the API server (or during pre-processing and post-processing stages). There might be a risk of OOM or other weird CUDA errors.

Using the GPU outside the engine core may require further discussion. However, I think it's okay to experiment with the pooling model to see what the pros and cons are, especially for maxsim_scores.

yewentao256 and others added 3 commits February 27, 2026 10:46
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 changed the title from "[Refactor] Fix maxsim cuda platform and add env to control it" to "[Refactor] Fix maxsim cuda platform and add cli to control it" on Feb 27, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 (Member Author) commented:

Thanks @NickLucche @noooop, I have added an assertion to make sure api-server == 1; let's be conservative first and remove the assertion later.
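The conservative guard described here could look roughly like the sketch below; the function name, parameters, and error message are assumptions for illustration, not the PR's actual assertion:

```python
# Hypothetical sketch of the "api-server == 1" guard: refuse GPU-side
# pooling scores when multiple API servers are configured, since every
# server would pile its scoring work onto GPU:0 and risk OOM.
def validate_pooling_score_device(use_gpu: bool, num_api_servers: int) -> None:
    if use_gpu and num_api_servers != 1:
        raise ValueError(
            "GPU pooling scores currently require a single API server; "
            f"got num_api_servers={num_api_servers}. Multiple servers "
            "would all contend for GPU:0 and risk OOM."
        )
```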

@yewentao256 (Member Author) commented:

Also tested the CPU bmm performance; it is similar to the scalar version, so let's keep it as is.
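For reference, the MaxSim score that the batched bmm path computes reduces, in scalar form, to summing each query token's best dot product against the document tokens. A minimal pure-Python reconstruction (illustrative only, not vLLM's `compute_maxsim_scores` implementation):

```python
# Scalar-form MaxSim sketch: for each query token embedding, take the
# maximum dot product against all document token embeddings, then sum
# over query tokens. The batched version expresses this as a bmm
# followed by a max-reduce and a sum.
def maxsim_score(query_embs: list[list[float]],
                 doc_embs: list[list[float]]) -> float:
    total = 0.0
    for q in query_embs:
        total += max(
            sum(qi * di for qi, di in zip(q, d)) for d in doc_embs
        )
    return total
```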

@noooop (Collaborator) commented Mar 1, 2026

cc @mgoin

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Mar 3, 2026
@noooop noooop merged commit c21d003 into main Mar 3, 2026
54 checks passed
@noooop noooop deleted the wentao-fix-maxsim-scores-cuda branch March 3, 2026 04:48
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Mar 3, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…roject#35427)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
…roject#35427)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…roject#35427)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Labels

frontend, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants