[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function #37968

Merged
chaunceyjiang merged 3 commits into vllm-project:main from chaunceyjiang:remove_fp8_paged_mqa_logits_torch
Mar 25, 2026

Conversation

@chaunceyjiang
Collaborator

@chaunceyjiang chaunceyjiang commented Mar 24, 2026

Purpose

Revert #35271
The original PR #35271 was intended to allow dsv3.2 to run even when deep_gemm is not installed or on lower-end GPUs such as A800.

@youkaichao believes that if the model vendor itself does not support the hardware, we should clearly state that it is not supported. Simply making it run doesn’t really add value.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


…orch function

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the addition of PyTorch fallbacks for fp8_mqa_logits and fp8_paged_mqa_logits, making deep_gemm a hard requirement for this functionality. The changes correctly remove the fallback functions and conditional logic. However, the checks for deep_gemm availability have been changed from is_deep_gemm_supported to has_deep_gemm, which only verifies package installation and not hardware compatibility. To better align with the goal of failing clearly on unsupported hardware, I've suggested restoring the use of is_deep_gemm_supported and making the check a hard failure instead of a warning.
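The distinction the review draws can be sketched as follows. `has_deep_gemm` and `is_deep_gemm_supported` are the names used in the review, but the bodies below are hypothetical stand-ins for illustration, not vLLM's actual implementation; the SM90 threshold is an assumption about DeepGEMM's hardware requirement.

```python
import importlib.util


def has_deep_gemm() -> bool:
    """Package-only check: True if deep_gemm is importable."""
    return importlib.util.find_spec("deep_gemm") is not None


def is_deep_gemm_supported(compute_capability: tuple[int, int]) -> bool:
    """Package + hardware check (hypothetical): assumes DeepGEMM's FP8
    kernels need Hopper-class (SM90+) GPUs, so an A800 (SM80) fails
    here even when the package is installed."""
    return has_deep_gemm() and compute_capability >= (9, 0)


def require_deep_gemm(compute_capability: tuple[int, int]) -> None:
    """Hard failure instead of a warning, per the review suggestion."""
    if not is_deep_gemm_supported(compute_capability):
        raise RuntimeError(
            "Sparse Attention Indexer CUDA op requires DeepGEMM and a "
            "supported GPU (SM90+); got compute capability "
            f"{compute_capability}."
        )
```

The point of the two-level check is that a package-only test passes on an A800 with deep_gemm installed and then fails later at kernel launch, whereas the combined check fails immediately with a clear message.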

…orch function

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

Test

...
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                                                     ^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/utils.py", line 645, in make_layers
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     + get_offloader().wrap_modules(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     return list(modules_generator)
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/utils.py", line 646, in <genexpr>
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 1178, in <lambda>
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     lambda prefix: DeepseekV2DecoderLayer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                    ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 1050, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.self_attn = attn_cls(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                      ^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 950, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.indexer = Indexer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                    ^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/models/deepseek_v2.py", line 677, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     self.indexer_op = SparseAttnIndexer(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]                       ^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]   File "/mnt/data4/jxy/vllm/vllm/model_executor/layers/sparse_attn_indexer.py", line 309, in __init__
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857]     raise RuntimeError(
(Worker_TP1 pid=2571334) ERROR 03-24 14:48:25 [multiproc_executor.py:857] RuntimeError: Sparse Attention Indexer CUDA op requires DeepGEMM to be installed.
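The traceback above ends at a guard raised in the indexer's constructor. A minimal, hypothetical sketch of that fail-fast pattern (the real check lives in vllm's `sparse_attn_indexer.py`; this stand-in only illustrates failing at model construction rather than at the first forward pass):

```python
import importlib.util


class SparseAttnIndexerSketch:
    """Hypothetical stand-in for the guard shown in the traceback:
    raise at construction time when the required kernel library is
    missing, so the worker fails with one clear error instead of a
    later kernel-launch failure."""

    def __init__(self) -> None:
        if importlib.util.find_spec("deep_gemm") is None:
            raise RuntimeError(
                "Sparse Attention Indexer CUDA op requires DeepGEMM "
                "to be installed."
            )
```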

Member

@youkaichao youkaichao left a comment


thanks

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 24, 2026
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 24, 2026
@ehfd
Contributor

ehfd commented Mar 24, 2026

I believe the opposite: we're supposed to implement TRITON_MLA_SPARSE instead... #38006.

One backwards-compatibility precedent was the Marlin FP8 E4M3 fallback for sm80, which allows FP8 models to run on Ampere (#17579, #19990). This is not supported in SGLang (sgl-project/sglang#12887, sgl-project/sglang#9754), leaving vLLM as the only option for Ampere users of FP8 W8A8 MoE.

Thus, TRITON_MLA_SPARSE should also be implemented, as this will redirect all Ampere users of Sparse MLA models to vLLM and nowhere else. If vLLM doesn't do this, nobody can do it.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Thanks!

Member

@ZJY0516 ZJY0516 left a comment


Could you also revert this? #36519

@chaunceyjiang chaunceyjiang enabled auto-merge (squash) March 25, 2026 03:15
@chaunceyjiang chaunceyjiang merged commit 09c3dc9 into vllm-project:main Mar 25, 2026
63 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 25, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
malaiwah pushed a commit to malaiwah/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…_logits_torch function (vllm-project#37968)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants