[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup)#34758
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request introduces a highly optimized fused GEMM kernel for DeepSeek V2/V3 models on Hopper architecture GPUs. The changes include the CUDA kernel itself, build system modifications in CMake, and integration into the Python codebase. The integration logic correctly checks for all conditions required to use this specialized kernel. My main feedback is to add error handling for CUDA API calls in the new kernel file to make it more robust.
csrc/gemm/dsv3_fused_a_gemm.cu
cudaGetDevice(&device);
int sm_major = 0;
int sm_minor = 0;
cudaDeviceGetAttribute(&sm_major, cudaDevAttrComputeCapabilityMajor, device);
cudaDeviceGetAttribute(&sm_minor, cudaDevAttrComputeCapabilityMinor, device);
The CUDA runtime API calls cudaGetDevice and cudaDeviceGetAttribute are not checked for errors. If these calls fail, the function might return an incorrect SM version (e.g., 0), which could lead to silent failures or incorrect behavior downstream (e.g., dsv3_fused_a_gemm failing with a generic "required CUDA ARCH >= SM_90" message, or optimizations being silently disabled). It is recommended to add error checking for these CUDA calls, for example by using a macro that checks the cudaError_t return value and throws an exception on failure.
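A minimal sketch of the kind of checking macro the review suggests. The names `CUDA_CHECK` and `get_sm_version` are hypothetical and not taken from the PR. So that the sketch builds without the CUDA toolkit, the first section defines small stand-ins for the runtime types and calls; in the real `.cu` file that section would be dropped in favor of `#include <cuda_runtime.h>`.

```cpp
#include <stdexcept>
#include <string>

// ---- Stand-ins (assumption): fake CUDA runtime so this compiles anywhere. ----
// In real code, delete this section and #include <cuda_runtime.h> instead.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidDevice = 101 };
enum cudaDeviceAttr {
  cudaDevAttrComputeCapabilityMajor = 75,
  cudaDevAttrComputeCapabilityMinor = 76
};
static int g_fake_device = 0;  // set to -1 to simulate a failing runtime

inline const char* cudaGetErrorString(cudaError_t e) {
  return e == cudaSuccess ? "no error" : "invalid device ordinal";
}
inline cudaError_t cudaGetDevice(int* device) {
  if (g_fake_device < 0) return cudaErrorInvalidDevice;
  *device = g_fake_device;
  return cudaSuccess;
}
inline cudaError_t cudaDeviceGetAttribute(int* value, cudaDeviceAttr attr,
                                          int device) {
  if (device < 0) return cudaErrorInvalidDevice;
  // The fake device pretends to be SM 9.0.
  *value = (attr == cudaDevAttrComputeCapabilityMajor) ? 9 : 0;
  return cudaSuccess;
}
// ---- End of stand-ins. ----

// Hypothetical macro: evaluate a CUDA runtime call once and throw if it
// did not return cudaSuccess, naming the failing expression in the message.
#define CUDA_CHECK(expr)                                             \
  do {                                                               \
    cudaError_t err_ = (expr);                                       \
    if (err_ != cudaSuccess) {                                       \
      throw std::runtime_error(std::string(#expr) + " failed: " +    \
                               cudaGetErrorString(err_));            \
    }                                                                \
  } while (0)

// Hypothetical helper: returns major * 10 + minor (e.g. 90 for SM 9.0)
// and throws instead of silently returning 0 when a query fails.
inline int get_sm_version() {
  int device = -1;
  CUDA_CHECK(cudaGetDevice(&device));
  int sm_major = 0;
  int sm_minor = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&sm_major,
                                    cudaDevAttrComputeCapabilityMajor, device));
  CUDA_CHECK(cudaDeviceGetAttribute(&sm_minor,
                                    cudaDevAttrComputeCapabilityMinor, device));
  return sm_major * 10 + sm_minor;
}
```

With this pattern a failed query surfaces as an exception that names the failing call, rather than as an SM version of 0 that later trips the generic architecture check.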
mgoin left a comment:
LGTM, just need to fix the sm120 restriction.
@robertgshaw2-redhat - this commit causes vLLM to fail on startup on DGX Spark (sm121), vLLM built with TORCH_CUDA_ARCH_LIST=12.1a: @mgoin, @johnnynunez - FYI.
It also fails for me on startup with TORCH_CUDA_ARCH_LIST=12.0a, running on an RTX PRO 6000. Here is the stack trace:
Thank you for reporting, @eugr @SurealCereal, and sorry for the disruption. I should have a fix here: #35123
Purpose
Test Plan
local-completions ({'model': 'nvidia/DeepSeek-R1-NVFP4', 'base_url': 'http://localhost:7000/v1/completions', 'num_concurrent': 1000, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9553|±  |0.0057|

Test Result