[Refactor] Fix maxsim cuda platform and add cli to control it#35427
[Refactor] Fix maxsim cuda platform and add cli to control it#35427
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
There was a problem hiding this comment.
Code Review
This pull request aims to refactor the device selection in compute_maxsim_scores to use the vLLM platform abstraction. While this is a good direction, the current change from torch.cuda.is_available() to current_platform.is_cuda() introduces a regression for ROCm-based systems, as it would incorrectly default to using the CPU. My review provides a critical fix to use current_platform.is_cuda_alike() instead, which correctly handles both CUDA and ROCm platforms, thus preserving the original behavior in a platform-agnostic way.
| VLLM_ENGINE_READY_TIMEOUT_S: int = 600 | ||
| VLLM_API_KEY: str | None = None | ||
| VLLM_DEBUG_LOG_API_SERVER_RESPONSE: bool = False | ||
| VLLM_USE_GPU_FOR_POOLING_SCORE: bool = False |
There was a problem hiding this comment.
Make this a config variable in FrontendArgs, not env variable
NickLucche
left a comment
There was a problem hiding this comment.
Shouldn't we disable it for num_api_servers > 1 ?
I am concerned that all GPU workload corresponding to api_servers will use GPU:0. It will greatly increase the risk of OOM if num_api_servers > 1. As I mentioned in #35330 (comment):
Using GPU outside the engine core may require further discussion. However, I think it's okay to experiment with the pooling model to see what the pros and cons are. Especially the maxsim_scores. |
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
Thanks @NickLucche @noooop, I have added the assertion to make sure api-server == 1, let's be conservative first then remove the assertion |
|
Also tested with the cpu bmm perf, similar to scalar version so let's keep as it is |
|
cc @mgoin |
…roject#35427) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…roject#35427) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…roject#35427) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Purpose
Fixes #35330 (comment)
And add env to control it as discussed with @mgoin and @NickLucche, @noooop , @DarkLight1337