
[Bugfix][TPU] Return a Default fp8 MoE Backend #32908

Merged
robertgshaw2-redhat merged 4 commits into vllm-project:main from vanbasten23:xiowei/fix_select_fp8_moe_backend
Jan 26, 2026

Conversation

@vanbasten23
Collaborator

@vanbasten23 vanbasten23 commented Jan 23, 2026

Purpose

A recent commit caused TPU to fail inside select_fp8_moe_backend (error). This PR fixes that failure for TPU.

Test Plan

TPU CI:

USE_MOE_EP_KERNEL=1 MODEL_IMPL_TYPE=vllm vllm serve --seed=42 --model=BCCard/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic --max-model-len=10240 --max-num-batched-tokens=8192 --max-num-seqs=512 --no-enable-prefix-caching --disable-log-requests --tensor-parallel-size=8 --kv-cache-dtype=fp8 --gpu-memory-utilization=0.95 --async-scheduling --enable-expert-parallel

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a regression on TPUs by changing the behavior of select_fp8_moe_backend when no suitable FP8 MoE backend is found. Instead of raising a NotImplementedError, the function now returns (Fp8MoeBackend.NONE, None). This allows the system to gracefully handle cases where no specialized backend is available, which is the expected scenario on platforms like TPUs. The change is a targeted fix that restores the previous, correct behavior.
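A minimal sketch of the described behavior change (the enum members, the selection predicate, and the experts-class placeholder are simplified stand-ins, not vLLM's actual implementation):

```python
from enum import Enum


class Fp8MoeBackend(Enum):
    # Simplified stand-in for vLLM's fp8 MoE backend enum.
    NONE = "none"
    CUTLASS = "cutlass"


def select_fp8_moe_backend(has_supported_backend: bool):
    """Sketch: fall back to (NONE, None) instead of raising."""
    if has_supported_backend:
        # The real code would pick a concrete experts class here.
        return Fp8MoeBackend.CUTLASS, object
    # Pre-fix behavior raised NotImplementedError at this point, which broke
    # TPU; the fix returns a harmless default so callers can proceed.
    return Fp8MoeBackend.NONE, None
```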


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@robertgshaw2-redhat
Collaborator

Thanks for the fix!

However, this changes the behavior on CUDA + ROCm, where we want to raise a clear error that no backend supports the model.

How does the TPU backend use this function?

@robertgshaw2-redhat robertgshaw2-redhat changed the title Return a default fp8 moe backend as before. [Bugfix][TPU] Return a Default fp8 MoE Backend Jan 23, 2026
@mergify mergify bot added the bug Something isn't working label Jan 23, 2026
@vanbasten23
Collaborator Author

vanbasten23 commented Jan 23, 2026

How does the TPU backend use this function?

Thanks @robertgshaw2-redhat for the review. On TPU, we define a VllmCompressedTensorsW8A8Fp8MoEMethod class that inherits vLLM's CompressedTensorsW8A8Fp8MoEMethod. When TPU creates an instance of VllmCompressedTensorsW8A8Fp8MoEMethod, it invokes CompressedTensorsW8A8Fp8MoEMethod.__init__ here. During that __init__, it fails at select_fp8_moe_backend.

I don't see TPU using the self.fp8_backend or self.experts_cls returned from select_fp8_moe_backend. It seems that as long as select_fp8_moe_backend doesn't raise an exception, TPU should work.
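A minimal sketch of the inheritance chain described above (class bodies are simplified placeholders; only the construction path matters here):

```python
class CompressedTensorsW8A8Fp8MoEMethod:
    """Stand-in for vLLM's base class, whose __init__ runs backend selection."""

    def __init__(self):
        # In vLLM, select_fp8_moe_backend() is called here. If it raises,
        # every subclass constructor fails before any TPU code can run.
        self.fp8_backend, self.experts_cls = None, None  # fallback values


class VllmCompressedTensorsW8A8Fp8MoEMethod(CompressedTensorsW8A8Fp8MoEMethod):
    """TPU-side subclass: reuses the base __init__ and overrides only hooks."""

    def apply(self, hidden_states):
        # A TPU-specific forward pass would go here.
        return hidden_states
```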

If the current fix is not ideal, do you have other suggestions?

cc: @kyuyeunk

@kyuyeunk
Contributor

How does the TPU backend use this function?

@robertgshaw2-redhat, previously we just inherited the class Fp8MoEMethod but overrode the following functions:

  • process_weights_after_loading: so we can apply TPU-specific weight transformations at weight-loading time
  • apply: so we can invoke our own forward function

Before #32414, the constructor of Fp8MoEMethod would work fine even if no backend was found, but now it errors out. For now, we are working around the issue by overriding the constructor as well, but ideally we want to reuse vLLM components as much as possible, which will also let us align with vLLM's overall direction on how things are designed. The alternative is overriding so many functions that the plugin reaches the point where it's not really vLLM anymore.

As we discussed offline, the proper way to resolve this issue is to allow plugins to register their own backends/kernels so that Fp8MoEMethod can correctly select that backend/kernel. Is there an ETA on when that feature will land?
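A hypothetical sketch of what such a plugin-registration hook might look like (no such API exists in vLLM at the time of this discussion; every name below is invented for illustration):

```python
# Invented registry; not a real vLLM API.
_PLUGIN_MOE_BACKENDS: dict[str, type] = {}


def register_fp8_moe_backend(name: str, experts_cls: type) -> None:
    """Hypothetical hook: out-of-tree plugins register their own kernels."""
    _PLUGIN_MOE_BACKENDS[name] = experts_cls


def select_fp8_moe_backend():
    """Selection would then consider plugin backends before failing."""
    for name, experts_cls in _PLUGIN_MOE_BACKENDS.items():
        return name, experts_cls
    raise NotImplementedError("no fp8 MoE backend supports this model")


class TpuFusedMoEExperts:
    """Invented placeholder for a TPU kernel class."""
```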

@vanbasten23
Collaborator Author

Thanks Kyuyeun for the input. I also tried to unblock myself (and tpu-inference) with something like vllm-project/tpu-inference#1512, another fix necessitated by the same issue. But the fix this time is not that straightforward, so I'm leaning towards getting it fixed in vLLM.

@robertgshaw2-redhat
Collaborator

Thanks Kyuyeun for the input. I also tried to unblock myself (and tpu-inference) by doing something like vllm-project/tpu-inference#1512, another fix caused by the same issue. But the fix this time is not that straightforward. So I'm leaning towards getting it fixed in vllm.

What I’m going to do is:

  • validate in ModelOpt, Fp8, and CT that the backend returned is not None (i.e., move the validation from this function into the quant methods). This will unblock TPU.
  • next week I’ll work on the register API we discussed
  • once TPU migrates to the register API, we can move the validation back into this function

I’ll put up the PR tomorrow morning.
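The first bullet of the plan above could look roughly like this (names and signatures simplified; a sketch of the stated plan, not the change that ultimately merged):

```python
def select_fp8_moe_backend(supported: bool):
    # Returns a default instead of raising, so TPU construction succeeds.
    return ("cutlass", object) if supported else (None, None)


class Fp8MoEMethod:
    """Sketch of a quant method that validates the selected backend itself."""

    def __init__(self, requires_backend: bool, supported: bool):
        self.fp8_backend, self.experts_cls = select_fp8_moe_backend(supported)
        # Validation moved here from select_fp8_moe_backend: CUDA/ROCm paths
        # still get a clear error, while TPU (requires_backend=False) proceeds.
        if requires_backend and self.fp8_backend is None:
            raise NotImplementedError(
                "no fp8 MoE backend supports this model")
```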

@vanbasten23
Collaborator Author

Hi @robertgshaw2-redhat, is there any update? This is currently blocking tpu-inference.

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat
Collaborator

Hi @robertgshaw2-redhat , is there any update? This is currently blocking the TPU-inference.

I thought a bit more about it. I think it’s better to have the check here rather than in the quant methods.

Does the change I pushed to the branch look okay to you?

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 26, 2026
@vanbasten23
Collaborator Author

Hi @robertgshaw2-redhat , is there any update? This is currently blocking the TPU-inference.

I thought a bit more about it. I think it’s better to have the check here rather than in the quant methods.

Does the change I pushed to the branch look okay to you?

Yes. It looks good to me. Thanks @robertgshaw2-redhat .

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 26, 2026 21:05
@robertgshaw2-redhat
Collaborator

Hi @robertgshaw2-redhat , is there any update? This is currently blocking the TPU-inference.

I thought a bit more about it. I think it’s better to have the check here rather than in the quant methods.
Does the change I pushed to the branch look okay to you?

Yes. It looks good to me. Thanks @robertgshaw2-redhat .

great! merging.

@vanbasten23
Collaborator Author

@robertgshaw2-redhat, it looks like the merge is blocked: "At least 1 approving review is required by reviewers with write access."

@robertgshaw2-redhat robertgshaw2-redhat merged commit 510ed1e into vllm-project:main Jan 26, 2026
54 checks passed
@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat, it looks like the merge is blocked: "At least 1 approving review is required by reviewers with write access."

My bad

apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed


3 participants