[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) by robertgshaw2-redhat · Pull Request #34758 · vllm-project/vllm

robertgshaw2-redhat · 2026-02-17T23:16:34Z

Purpose

add a min-latency BF16 qkv_a_gemm kernel
adapted from sgl: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_fused_a_gemm.cu (which was itself adapted from trtllm)
disappointing E2E win, but sets up using PDL (programmatic dependent launch) to overlap with AR (all-reduce) in a future PR

Test Plan

lm eval

local-completions ({'model': 'nvidia/DeepSeek-R1-NVFP4', 'base_url': 'http://localhost:7000/v1/completions', 'num_concurrent': 1000, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9553|±  |0.0057|

benchmark

sweep:
	just benchmark 1 10 && \
	just benchmark 4 40 && \
	just benchmark 7 70 && \
	just benchmark 8 80 && \
	just benchmark 12 120 && \
	just benchmark 16 160

Test Result

Signed-off-by: Robert Shaw <robshaw@redhat.com>

gemini-code-assist

Code Review

This pull request introduces a highly optimized fused GEMM kernel for DeepSeek V2/V3 models on Hopper architecture GPUs. The changes include the CUDA kernel itself, build system modifications in CMake, and integration into the Python codebase. The integration logic correctly checks for all conditions required to use this specialized kernel. My main feedback is to add error handling for CUDA API calls in the new kernel file to make it more robust.

gemini-code-assist · 2026-02-17T23:18:58Z

csrc/gemm/dsv3_fused_a_gemm.cu

+  cudaGetDevice(&device);
+  int sm_major = 0;
+  int sm_minor = 0;
+  cudaDeviceGetAttribute(&sm_major, cudaDevAttrComputeCapabilityMajor, device);
+  cudaDeviceGetAttribute(&sm_minor, cudaDevAttrComputeCapabilityMinor, device);


The CUDA runtime API calls cudaGetDevice and cudaDeviceGetAttribute are not checked for errors. If these calls fail, the function might return an incorrect SM version (e.g., 0), which could lead to silent failures or incorrect behavior downstream (e.g., dsv3_fused_a_gemm failing with a generic "required CUDA ARCH >= SM_90" message, or optimizations being silently disabled). It is recommended to add error checking for these CUDA calls, for example by using a macro that checks the cudaError_t return value and throws an exception on failure.

vllm/model_executor/models/deepseek_v2.py

CMakeLists.txt

mgoin

LGTM just need to fix sm120 restriction

eugr · 2026-02-18T17:42:36Z

@robertgshaw2-redhat - this commit causes vLLM to fail on startup on DGX Spark (sm121), vLLM built with TORCH_CUDA_ARCH_LIST=12.1a:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py", line 16, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/config/__init__.py", line 5, in <module>
    from vllm.config.cache import CacheConfig
  File "/usr/local/lib/python3.12/dist-packages/vllm/config/cache.py", line 13, in <module>
    from vllm.utils.mem_utils import format_gib, get_cpu_memory
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/mem_utils.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/__init__.py", line 252, in __getattr__
    _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 16, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _Z17dsv3_fused_a_gemmRN2at6TensorERKS0_S3_

@mgoin, @johnnynunez - FYI.
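
For readers unfamiliar with mangled names: the missing symbol in the ImportError above follows the Itanium C++ ABI scheme, in which a free function is encoded as `_Z`, then the decimal length of its name, then the name, then its parameter types. A minimal Python sketch recovers the function name the loader could not resolve. The helper `mangled_function_name` is hypothetical and handles only this simple, non-namespaced form.

```python
import re


def mangled_function_name(symbol: str) -> str:
    """Extract the function name from a simple Itanium-ABI mangled symbol.

    Hypothetical helper for illustration: it only understands the
    `_Z<length><name>...` form used by free functions outside a namespace.
    """
    match = re.match(r"_Z(\d+)", symbol)
    if match is None:
        raise ValueError(f"not a simple mangled function symbol: {symbol!r}")
    length = int(match.group(1))
    start = match.end()
    name = symbol[start:start + length]
    if len(name) != length:
        raise ValueError(f"symbol is shorter than its declared name length: {symbol!r}")
    return name


# Symbol copied verbatim from the ImportError in the report above.
symbol = "_Z17dsv3_fused_a_gemmRN2at6TensorERKS0_S3_"
print(mangled_function_name(symbol))  # dsv3_fused_a_gemm
```

The recovered name matches the kernel file reviewed in this PR (csrc/gemm/dsv3_fused_a_gemm.cu), which suggests that `_C` references the op while its definition was not compiled for this build's architecture list.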

SurealCereal · 2026-02-19T22:29:59Z

> @robertgshaw2-redhat - this commit causes vLLM to fail on startup on DGX Spark (sm121), vLLM built with TORCH_CUDA_ARCH_LIST=12.1a:

It also fails for me on startup with TORCH_CUDA_ARCH_LIST=12.0a, running on an RTX PRO 6000. Here is the stack trace:

vllm  | Traceback (most recent call last):
vllm  |   File "/usr/local/bin/vllm", line 4, in <module>
vllm  |     from vllm.entrypoints.cli.main import main
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
vllm  |     from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
vllm  |     from vllm.benchmarks.latency import add_cli_args, main
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py", line 16, in <module>
vllm  |     from vllm.engine.arg_utils import EngineArgs
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 35, in <module>
vllm  |     from vllm.config import (
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/config/__init__.py", line 5, in <module>
vllm  |     from vllm.config.cache import CacheConfig
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/config/cache.py", line 13, in <module>
vllm  |     from vllm.utils.mem_utils import format_gib, get_cpu_memory
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/mem_utils.py", line 14, in <module>
vllm  |     from vllm.platforms import current_platform
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/__init__.py", line 252, in __getattr__
vllm  |     _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
vllm  |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
vllm  |     module = importlib.import_module(module_name)
vllm  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
vllm  |     return _bootstrap._gcd_import(name[level:], package, level)
vllm  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 16, in <module>
vllm  |     import vllm._C  # noqa
vllm  |     ^^^^^^^^^^^^^^
vllm  | ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _Z17dsv3_fused_a_gemmRN2at6TensorERKS0_S3_
vllm exited with code 1 (restarting)

mgoin · 2026-02-23T18:02:31Z

Thank you for reporting, @eugr and @SurealCereal, and sorry for the disruption. I should have a fix in #35123.
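
Both reports involve builds whose TORCH_CUDA_ARCH_LIST contains only a 12.x target. As a rough sketch of the kind of guard that avoids this class of failure, the Python below decides from the arch list whether an architecture-specific kernel would be compiled at all, so that its symbol is only referenced when it exists. The set `ASSUMED_KERNEL_ARCHS` and both helper names are hypothetical, chosen for illustration; the architectures actually supported and the actual fix are defined by the build files and #35123, not by this sketch.

```python
# Hypothetical: architectures the kernel is assumed to be compiled for.
# This is an illustration only, not taken from the vLLM build files.
ASSUMED_KERNEL_ARCHS = {(9, 0), (10, 0)}


def parse_arch_list(arch_list: str) -> set[tuple[int, int]]:
    """Parse a TORCH_CUDA_ARCH_LIST-style string into (major, minor) pairs.

    Handles numeric entries only, e.g. "9.0a;12.1a+PTX" or "9.0 12.0".
    """
    archs = set()
    for entry in arch_list.replace(";", " ").split():
        # Drop an optional "+PTX" suffix and a trailing arch-specific letter.
        version = entry.split("+")[0].rstrip("af")
        major, minor = version.split(".")
        archs.add((int(major), int(minor)))
    return archs


def kernel_is_built(arch_list: str) -> bool:
    """True if at least one requested arch is in the assumed supported set."""
    return bool(parse_arch_list(arch_list) & ASSUMED_KERNEL_ARCHS)


for arch_list in ("9.0a", "12.0a", "12.1a", "9.0a;12.1a+PTX"):
    status = "kernel built" if kernel_is_built(arch_list) else "kernel skipped"
    print(f"{arch_list} -> {status}")
```

Under these assumptions, the two reported configurations (12.0a and 12.1a) fall into the "kernel skipped" case, which is where an unconditional reference to the op would leave an undefined symbol in `_C`.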

initial commit (9557cfc)

robertgshaw2-redhat requested review from LucasWilkinson and tlrmchlsmth as code owners February 17, 2026 23:16

mergify bot added the ci/build label Feb 17, 2026

robertgshaw2-redhat changed the title ~~initial commit~~ [Model Bash] DeepSeek R1 KV A GEMM Feb 17, 2026

github-project-automation bot added this to DeepSeek V3/R1 Feb 17, 2026

github-project-automation bot moved this to Backlog in DeepSeek V3/R1 Feb 17, 2026

mergify bot added the deepseek label (Related to DeepSeek models) Feb 17, 2026

gemini-code-assist bot reviewed Feb 17, 2026

Robert Shaw added 2 commits February 17, 2026 23:56

update to make the changes deepseek specific (c3d3de5)
update build (9788498)

mgoin reviewed Feb 18, 2026

vllm/model_executor/models/deepseek_v2.py (outdated, resolved)

vllm/model_executor/models/deepseek_v2.py (outdated, resolved)

CMakeLists.txt (outdated, resolved)

Swich which layer (3b525e4)

robertgshaw2-redhat changed the title ~~[Model Bash] DeepSeek R1 KV A GEMM~~ [Model Bash] DeepSeek R1 BF16 KV A GEMM Feb 18, 2026

Robert Shaw added 7 commits February 17, 2026 20:36

update cmaklists (29127d4)
fix build (6b7048b)
fix missing symbol (5b24998)
thanks claude! (72fc063)
is new sonnet the best model? (41f9f7d)
remove duplicate (c497839)
remove debug cruft (61edbde)

robertgshaw2-redhat changed the title ~~[Model Bash] DeepSeek R1 BF16 KV A GEMM~~ [Model Bash] DeepSeek R1 BF16 Min Latency KV A GEMM (0.5% E2E Speedup) Feb 18, 2026

mgoin approved these changes Feb 18, 2026

github-project-automation bot moved this from Backlog to In progress in DeepSeek V3/R1 Feb 18, 2026

address mgoin comments (c354d75)

robertgshaw2-redhat enabled auto-merge (squash) February 18, 2026 03:09

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 18, 2026

mgoin changed the title ~~[Model Bash] DeepSeek R1 BF16 Min Latency KV A GEMM (0.5% E2E Speedup)~~ [Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) Feb 18, 2026

vllm-bot merged commit 6874638 into main Feb 18, 2026
111 of 119 checks passed

vllm-bot deleted the add-sgl-a-gemm branch February 18, 2026 15:42

github-project-automation bot moved this from In progress to Done in DeepSeek V3/R1 Feb 18, 2026

eugr mentioned this pull request Feb 18, 2026

[Tracking upstream] Qwen3-Coder-Next-FP8 is broken since 2/11/2026 eugr/spark-vllm-docker#41

Closed

wzhao18 mentioned this pull request Feb 19, 2026

[Bug]: Deepseek V3.1 NVFP4 Weight Loading Fails #34869

Closed

1 task

jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026

[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedu…

abce8e3

…p) (vllm-project#34758) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>

eugr mentioned this pull request Feb 23, 2026

[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) #34302

Merged

5 tasks

mgoin mentioned this pull request Feb 23, 2026

[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches #35123

Merged

5 tasks

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026

[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedu…

bec40a6

V2arK pushed a commit to V2arK/vllm that referenced this pull request Mar 9, 2026

[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedu…

1af48bc

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedu…

798bc3b

vllm-bot merged 12 commits into main from add-sgl-a-gemm

5 participants
