[Misc] Enable V1 FP16 inference on pre-Ampere GPUs #24022
DarkLight1337 merged 1 commit into vllm-project:main
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request removes a workaround that disabled V1 engine support for pre-Ampere GPUs (Volta, Turing) using FP16 precision. The change is justified by an upgrade to Triton v3.4.0, which fixes the underlying bug. By deleting this conditional check in vllm/engine/arg_utils.py, the PR correctly re-enables V1 FP16 inference on these GPU architectures, broadening hardware compatibility. The change is clear, concise, and supported by test results on a T4 machine. I find the change to be correct and have no further suggestions.
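For context, here is a minimal sketch of the kind of device-capability guard this PR deletes. This is a hypothetical shape, not the exact diff from vllm/engine/arg_utils.py: the idea is that V1 was refused whenever FP16 was requested on a GPU with compute capability below 8.0.

```python
# Hypothetical sketch of the removed pre-Ampere FP16 guard
# (illustrative only; not the exact code from vllm/engine/arg_utils.py).
import torch

def v1_fp16_supported() -> bool:
    """Return False for FP16 on pre-Ampere GPUs (compute capability < 8.0)."""
    if not torch.cuda.is_available():
        return True
    major, _minor = torch.cuda.get_device_capability()
    # Volta is 7.0 and Turing (e.g., T4) is 7.5; Ampere starts at 8.0.
    # Before Triton v3.4.0, a Triton bug broke V1 FP16 on these GPUs,
    # so vLLM fell back to the V0 engine here.
    return major >= 8
```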
Hi team, I'm facing an error when I run `vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 2 --max-num-batched-tokens 16384`. I built vLLM wheels from source (main branch) to support an NVIDIA T4 GPU on an arm64 host. A basic inference test works, but when I try to run the server I get the following error:
Output of the `vllm serve` command:
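One detail relevant to this thread: the T4 is a Turing GPU (compute capability 7.5), i.e. exactly the pre-Ampere hardware this PR re-enables for V1 FP16. A quick, generic PyTorch check (not vLLM-specific) to confirm what capability the local GPUs report:

```python
# Generic PyTorch check (not vLLM-specific): print the compute
# capability of each visible GPU; a T4 reports (7, 5), i.e. pre-Ampere.
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {cap}")
```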
Purpose
Remove the workaround that disabled V1 engine FP16 inference on pre-Ampere GPUs (Volta, Turing), since the underlying bug is fixed by the upgrade to Triton v3.4.0.
Test Plan
Run V1 FP16 inference on a pre-Ampere GPU.
Test Result
Have confirmed on a T4 machine that V1 FP16 inference works.
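A minimal offline-inference sketch of how such a confirmation might be run (hypothetical, not necessarily the author's exact test; the model choice is an assumption):

```python
# Hypothetical verification sketch: force the V1 engine and FP16 on a
# pre-Ampere GPU such as a T4 (the model name here is an assumption).
import os
os.environ["VLLM_USE_V1"] = "1"  # force the V1 engine

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", dtype="float16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```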
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.