
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup)#34758

Merged
vllm-bot merged 12 commits into main from add-sgl-a-gemm on Feb 18, 2026
Conversation

Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Feb 17, 2026

Purpose

Test Plan

  • lm eval
local-completions ({'model': 'nvidia/DeepSeek-R1-NVFP4', 'base_url': 'http://localhost:7000/v1/completions', 'num_concurrent': 1000, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|   |0.9553|±  |0.0057|
  • benchmark
sweep:
	just benchmark 1 10 && \
	just benchmark 4 40 && \
	just benchmark 7 70 && \
	just benchmark 8 80 && \
	just benchmark 12 120 && \
	just benchmark 16 160
[benchmark sweep results image]

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify mergify bot added the ci/build label Feb 17, 2026
@robertgshaw2-redhat changed the title from "initial commit" to "[Model Bash] DeepSeek R1 KV A GEMM" on Feb 17, 2026
@mergify mergify bot added the deepseek Related to DeepSeek models label Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a highly optimized fused GEMM kernel for DeepSeek V2/V3 models on Hopper architecture GPUs. The changes include the CUDA kernel itself, build system modifications in CMake, and integration into the Python codebase. The integration logic correctly checks for all conditions required to use this specialized kernel. My main feedback is to add error handling for CUDA API calls in the new kernel file to make it more robust.

Comment on lines +38 to +42
cudaGetDevice(&device);
int sm_major = 0;
int sm_minor = 0;
cudaDeviceGetAttribute(&sm_major, cudaDevAttrComputeCapabilityMajor, device);
cudaDeviceGetAttribute(&sm_minor, cudaDevAttrComputeCapabilityMinor, device);

Severity: high

The CUDA runtime API calls cudaGetDevice and cudaDeviceGetAttribute are not checked for errors. If these calls fail, the function might return an incorrect SM version (e.g., 0), which could lead to silent failures or incorrect behavior downstream (e.g., dsv3_fused_a_gemm failing with a generic "required CUDA ARCH >= SM_90" message, or optimizations being silently disabled). It is recommended to add error checking for these CUDA calls, for example by using a macro that checks the cudaError_t return value and throws an exception on failure.
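The check-and-throw pattern the review recommends can be sketched as below. This is a minimal illustration, not the vLLM sources: the `cudaError_t` enum and `cudaGetErrorString` here are tiny stand-ins so the sketch compiles without the CUDA toolkit (in the real kernel file they come from `<cuda_runtime.h>`), and `check_throws_on_failure` is a hypothetical probe.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins for the CUDA runtime API so this sketch is self-contained;
// the real definitions live in <cuda_runtime.h>.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidDevice = 101 };
inline const char* cudaGetErrorString(cudaError_t e) {
  return e == cudaSuccess ? "no error" : "invalid device ordinal";
}

// The suggested pattern: wrap every runtime call in a macro that inspects
// the returned cudaError_t and throws instead of silently continuing.
#define CUDA_CHECK(call)                                                    \
  do {                                                                      \
    cudaError_t err__ = (call);                                             \
    if (err__ != cudaSuccess) {                                             \
      throw std::runtime_error(std::string("CUDA error at " __FILE__ ":") + \
                               std::to_string(__LINE__) + ": " +            \
                               cudaGetErrorString(err__));                  \
    }                                                                       \
  } while (0)

// Hypothetical probe showing a failing call surfaces as an exception
// rather than an SM version of 0 flowing downstream.
inline bool check_throws_on_failure() {
  try {
    CUDA_CHECK(cudaErrorInvalidDevice);  // simulate a failing runtime call
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

With this macro in place, `cudaGetDevice` and `cudaDeviceGetAttribute` failures fail loudly at the query site instead of silently disabling the optimization.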

Robert Shaw added 2 commits February 17, 2026 23:56
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat changed the title from "[Model Bash] DeepSeek R1 KV A GEMM" to "[Model Bash] DeepSeek R1 BF16 KV A GEMM" on Feb 18, 2026
Robert Shaw added 7 commits February 17, 2026 20:36
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat changed the title from "[Model Bash] DeepSeek R1 BF16 KV A GEMM" to "[Model Bash] DeepSeek R1 BF16 Min Latency KV A GEMM (0.5% E2E Speedup)" on Feb 18, 2026
Member

@mgoin mgoin left a comment


LGTM, just need to fix the sm120 restriction.

@github-project-automation github-project-automation bot moved this from Backlog to In progress in DeepSeek V3/R1 Feb 18, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) February 18, 2026 03:09
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 18, 2026
@mgoin changed the title from "[Model Bash] DeepSeek R1 BF16 Min Latency KV A GEMM (0.5% E2E Speedup)" to "[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup)" on Feb 18, 2026
@vllm-bot vllm-bot merged commit 6874638 into main Feb 18, 2026
111 of 119 checks passed
@vllm-bot vllm-bot deleted the add-sgl-a-gemm branch February 18, 2026 15:42
@github-project-automation github-project-automation bot moved this from In progress to Done in DeepSeek V3/R1 Feb 18, 2026
@eugr

eugr commented Feb 18, 2026

@robertgshaw2-redhat - this commit causes vLLM to fail on startup on DGX Spark (sm121), vLLM built with TORCH_CUDA_ARCH_LIST=12.1a:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py", line 16, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/config/__init__.py", line 5, in <module>
    from vllm.config.cache import CacheConfig
  File "/usr/local/lib/python3.12/dist-packages/vllm/config/cache.py", line 13, in <module>
    from vllm.utils.mem_utils import format_gib, get_cpu_memory
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/mem_utils.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/__init__.py", line 252, in __getattr__
    _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 16, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _Z17dsv3_fused_a_gemmRN2at6TensorERKS0_S3_
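The failure mode in this traceback can be sketched as follows. This is an illustration under stated assumptions, not the actual vLLM sources: the bindings reference `dsv3_fused_a_gemm` unconditionally, while the `.cu` file defining it is only compiled when the arch list includes 9.0a, so a 12.1a-only build ships an unresolved symbol and the dynamic loader rejects `vllm._C` at import. One common mitigation is to always compile some definition, stubbing it out when the kernels are excluded; the `VLLM_HAS_SM90A_KERNELS` flag and `dsv3_fused_a_gemm_stub` below are hypothetical names for that pattern.

```cpp
#include <stdexcept>

#ifndef VLLM_HAS_SM90A_KERNELS  // hypothetical flag a build system would set
// When the SM 9.0a kernels are not compiled, provide a stub so the symbol
// still exists: the shared object then loads on every arch, and the error
// is deferred from module import to the first actual call of the op.
void dsv3_fused_a_gemm_stub() {
  throw std::runtime_error(
      "dsv3_fused_a_gemm requires kernels compiled for SM 9.0a");
}
#endif

// Small probe showing the stub raises a clear error instead of leaving an
// undefined symbol that breaks `import vllm._C`.
bool stub_raises() {
  try {
    dsv3_fused_a_gemm_stub();
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

The equivalent effect can also be had in the bindings layer by registering the op conditionally, which appears to be the direction of the follow-up fix.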

@mgoin, @johnnynunez - FYI.

jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
@SurealCereal

@robertgshaw2-redhat - this commit causes vLLM to fail on startup on DGX Spark (sm121), vLLM built with TORCH_CUDA_ARCH_LIST=12.1a:

It also fails for me on startup with TORCH_CUDA_ARCH_LIST=12.0a on an RTX PRO 6000. Here is the stack trace:

vllm  | Traceback (most recent call last):
vllm  |   File "/usr/local/bin/vllm", line 4, in <module>
vllm  |     from vllm.entrypoints.cli.main import main
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
vllm  |     from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
vllm  |     from vllm.benchmarks.latency import add_cli_args, main
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py", line 16, in <module>
vllm  |     from vllm.engine.arg_utils import EngineArgs
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 35, in <module>
vllm  |     from vllm.config import (
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/config/__init__.py", line 5, in <module>
vllm  |     from vllm.config.cache import CacheConfig
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/config/cache.py", line 13, in <module>
vllm  |     from vllm.utils.mem_utils import format_gib, get_cpu_memory
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/mem_utils.py", line 14, in <module>
vllm  |     from vllm.platforms import current_platform
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/__init__.py", line 252, in __getattr__
vllm  |     _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
vllm  |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
vllm  |     module = importlib.import_module(module_name)
vllm  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
vllm  |     return _bootstrap._gcd_import(name[level:], package, level)
vllm  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm  |   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 16, in <module>
vllm  |     import vllm._C  # noqa
vllm  |     ^^^^^^^^^^^^^^
vllm  | ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _Z17dsv3_fused_a_gemmRN2at6TensorERKS0_S3_
vllm exited with code 1 (restarting)

jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@mgoin
Member

mgoin commented Feb 23, 2026

Thank you for reporting, @eugr @SurealCereal, and sorry for the disruption. I should have a fix here: #35123

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
V2arK pushed a commit to V2arK/vllm that referenced this pull request Mar 9, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…p) (vllm-project#34758)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

ci/build · deepseek (Related to DeepSeek models) · ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


5 participants