[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function #37968

Merged
chaunceyjiang merged 3 commits into vllm-project:main from chaunceyjiang:remove_fp8_paged_mqa_logits_torch
Mar 25, 2026

Conversation

@chaunceyjiang
Collaborator

@chaunceyjiang chaunceyjiang commented Mar 24, 2026

Purpose

Revert #35271
The original PR #35271 was intended to allow dsv3.2 to run even when deep_gemm is not installed or on lower-end GPUs such as A800.

@youkaichao believes that if the model vendor itself does not support the hardware, we should clearly state that it is not supported. Simply making it run doesn’t really add value.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


…orch function

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the addition of PyTorch fallbacks for fp8_mqa_logits and fp8_paged_mqa_logits, making deep_gemm a hard requirement for this functionality. The changes correctly remove the fallback functions and conditional logic. However, the checks for deep_gemm availability have been changed from is_deep_gemm_supported to has_deep_gemm, which only verifies package installation and not hardware compatibility. To better align with the goal of failing clearly on unsupported hardware, I've suggested restoring the use of is_deep_gemm_supported and making the check a hard failure instead of a warning.
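The distinction the review draws can be sketched as follows. `has_deep_gemm` and `is_deep_gemm_supported` are the names used in the review, but the bodies below are hypothetical stand-ins for illustration, not vLLM's actual implementation; the SM90 threshold is an assumption about DeepGEMM's hardware requirement.

```python
import importlib.util


def has_deep_gemm() -> bool:
    """Package-only check: True if deep_gemm is importable."""
    return importlib.util.find_spec("deep_gemm") is not None


def is_deep_gemm_supported(compute_capability: tuple[int, int]) -> bool:
    """Package + hardware check (hypothetical): assumes DeepGEMM's FP8
    kernels need Hopper-class (SM90+) GPUs, so an A800 (SM80) fails
    here even when the package is installed."""
    return has_deep_gemm() and compute_capability >= (9, 0)


def require_deep_gemm(compute_capability: tuple[int, int]) -> None:
    """Hard failure instead of a warning, per the review suggestion."""
    if not is_deep_gemm_supported(compute_capability):
        raise RuntimeError(
            "Sparse Attention Indexer CUDA op requires DeepGEMM and a "
            "supported GPU (SM90+); got compute capability "
            f"{compute_capability}."
        )
```

The point of the two-level check is that a package-only test passes on an A800 with deep_gemm installed and then fails later at kernel launch, whereas the combined check fails immediately with a clear message.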

…orch function

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

Test

...
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                                                     ^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/utils.py", line 645, in make_layers
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     + get_offloader().wrap_modules(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     return list(modules_generator)
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/utils.py", line 646, in <genexpr>
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 1178, in <lambda>
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     lambda prefix: DeepseekV2DecoderLayer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                    ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 1050, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.self_attn = attn_cls(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                      ^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 950, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.indexer = Indexer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                    ^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 677, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.indexer_op = SparseAttnIndexer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                       ^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/layers/sparse_attn_indexer.py", line 309, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     raise RuntimeError(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857] RuntimeError: Sparse Attention Indexer CUDA op requires DeepGEMM to be installed.
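The traceback above ends at a guard raised in the indexer's constructor. A minimal, hypothetical sketch of that fail-fast pattern (the real check lives in vllm's `sparse_attn_indexer.py`; this stand-in only illustrates failing at model construction rather than at the first forward pass):

```python
import importlib.util


class SparseAttnIndexerSketch:
    """Hypothetical stand-in for the guard shown in the traceback:
    raise at construction time when the required kernel library is
    missing, so the worker fails with one clear error instead of a
    later kernel-launch failure."""

    def __init__(self) -> None:
        if importlib.util.find_spec("deep_gemm") is None:
            raise RuntimeError(
                "Sparse Attention Indexer CUDA op requires DeepGEMM "
                "to be installed."
            )
```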

Member

@youkaichao youkaichao left a comment


thanks

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 24, 2026
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 24, 2026
@ehfd
Contributor

ehfd commented Mar 24, 2026

I believe the opposite: we're supposed to implement TRITON_MLA_SPARSE instead... #38006.

One backwards-compatibility precedent was the Marlin FP8 E4M3 fallback for sm80, which allows FP8 models to run on Ampere (#17579, #19990). This is not supported in SGLang (sgl-project/sglang#12887, sgl-project/sglang#9754), leaving vLLM as the only option for Ampere users of FP8 W8A8 MoE.

Thus, TRITON_MLA_SPARSE should also be implemented, as this will redirect all Ampere users of Sparse MLA models to vLLM and nowhere else. If vLLM doesn't do this, nobody can do it.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Thanks!

Member

@ZJY0516 ZJY0516 left a comment


Could you also revert this? #36519

@chaunceyjiang chaunceyjiang enabled auto-merge (squash) March 25, 2026 03:15
@chaunceyjiang chaunceyjiang merged commit 09c3dc9 into vllm-project:main Mar 25, 2026
63 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 25, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
malaiwah pushed a commit to malaiwah/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants