
[MoE Refactor][15/N] Apply Refactor to Fp8#31415

Merged

mgoin merged 213 commits into main from apply-refactor-to-ct on Jan 8, 2026
Conversation

@robertgshaw2-redhat (Collaborator) commented Dec 27, 2025

Purpose

SUMMARY:

  • continue refactoring the fp8 MoE path
  • CUTLASS MoE: make CutlassExpertsFp8 manage the strides data
  • CUTLASS MoE: remove the cutlass_moe_fp8 entrypoint
  • factor MoE kernel selection out into a new module
  • factor out process_weights_after_loading utilities so they are shared by fp8 and compressed-tensors
  • apply the modular-kernel (mk) refactor to compressed-tensors
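For context, the modular-kernel ("mk") pattern referenced above composes a prepare/finalize stage with an experts compute stage. The sketch below is illustrative only: the class names PrepareAndFinalize, Experts, and ModularKernel are stand-ins for the corresponding vLLM abstractions, and the compute is a dummy.

```python
# Illustrative sketch of the modular-kernel composition pattern:
# a prepare/finalize stage wraps an experts compute stage.
from dataclasses import dataclass


class PrepareAndFinalize:
    """Handles dispatch/quantization before the experts run and
    un-permutation/reduction afterwards (illustrative no-ops here)."""

    def prepare(self, hidden_states):
        return hidden_states  # e.g. quantize + permute tokens per expert

    def finalize(self, expert_out):
        return expert_out  # e.g. unpermute + weighted reduce


class Experts:
    """Stand-in for the grouped expert compute (e.g. CUTLASS fp8 GEMMs)."""

    def apply(self, x):
        return [v * 2 for v in x]  # dummy compute for illustration


@dataclass
class ModularKernel:
    prepare_finalize: PrepareAndFinalize
    experts: Experts

    def forward(self, hidden_states):
        x = self.prepare_finalize.prepare(hidden_states)
        y = self.experts.apply(x)
        return self.prepare_finalize.finalize(y)


mk = ModularKernel(PrepareAndFinalize(), Experts())
print(mk.forward([1, 2, 3]))  # [2, 4, 6]
```

The value of the split is that one experts implementation (Triton, CUTLASS, Marlin, ...) can be paired with different prepare/finalize strategies (no-EP, all-to-all, etc.) without duplicating either side.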

Test Plan

  • unit tests

```shell
pytest -v -x tests/kernels/moe/test_cutlass_moe.py
```

  • benchmark scripts (note: these don't work on main)

```shell
python3 ./benchmark_grouped_gemm_cutlass.py --model "deepseek-ai/DeepSeek-V2-Lite" --tp-sizes 4 --batch-sizes 2 4

python benchmark_cutlass_moe_fp8.py \
    --model "Llama-4-Maverick-17B-128E-Instruct-FP8" \
    --tp-sizes 8 \
    --batch-size 2 4 8 \
    --per-act-token-opts false \
    --per-out-ch-opts false
```

  • moe refactor integration tests

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Robert Shaw and others added 30 commits December 22, 2025 18:03
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Robert Shaw added 3 commits January 7, 2026 18:02
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Robert Shaw and others added 2 commits January 7, 2026 19:35
Signed-off-by: Robert Shaw <robshaw@redhat.com>
```diff
 num_questions: 1319
 num_fewshot: 5
-server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
+server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2 --disable-uvicorn-access-log"
```
Member
What is this? Do you just want to add it to the eval script here?

```python
# Add standard server arguments
server_args.extend(
    [
        "--trust-remote-code",
    ]
)
```

Collaborator (Author)

Yeah, that's a good idea. It makes the logs much easier to read by not logging /completions on every request.

Comment on lines +273 to +294
```python
# Delayed import is required since the oracle is imported
# by CPU backends which cannot import all of these experts.
# TODO: update the experts to make this not happen.
from vllm.model_executor.layers.fused_moe import (
    TritonExperts,
    TritonOrDeepGemmExperts,
)
from vllm.model_executor.layers.fused_moe.cutlass_moe import (
    CutlassExpertsFp8,
)
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
    FlashInferExperts,
)
from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
    MarlinExperts,
)
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
    MoEPrepareAndFinalizeNoEP,
)
from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import (
    AiterExperts,
)
```
Member
I think we should just put each import within each conditional, rather than importing all
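The shape of that suggestion — pushing each delayed import into the branch that actually needs it, so a backend never imports experts it cannot load — might look like the sketch below. It is illustrative, not vLLM code: the function name, the backend strings, and the stdlib stand-in modules are all placeholders (real code would import the experts classes shown in the snippet above).

```python
# Hedged sketch: per-branch delayed imports instead of importing all
# experts implementations up front. The stdlib modules stand in for
# the real experts classes so this sketch runs anywhere.
def select_experts(backend: str):
    if backend == "cutlass":
        # In vLLM this branch would do:
        # from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
        import math as impl  # stand-in module for illustration
        return impl
    if backend == "marlin":
        # from vllm.model_executor.layers.fused_moe.fused_marlin_moe import MarlinExperts
        import json as impl  # stand-in module for illustration
        return impl
    # Default branch, e.g. TritonExperts.
    import re as impl  # stand-in module for illustration
    return impl


print(select_experts("cutlass").__name__)  # math
```

With this layout, a CPU backend that only ever hits the default branch never executes the CUTLASS or Marlin imports at all, which is the property the delayed-import comment is after.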

Collaborator

@bnellnm left a comment

Nice refactor!

Robert Shaw added 4 commits January 7, 2026 20:49
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Member

@zyongye left a comment

The overall structure LGTM. But we will need to revisit the fallback experts abstraction in the future as discussed.

```python
assert w2_input_scale is not None

rotate_weights_for_fi_trtllm_fp8_per_tensor_moe(w13, w2)
register_scales_for_trtllm_fp8_per_tensor_moe(
```
Member

Should we register this inside the FusedMoEMethod instead of in the utils?

Collaborator (Author)

There is some work underway to make TRTLLM a modular kernel. Once that is done, we can revisit this.

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 8, 2026
@mgoin (Member) commented Jan 8, 2026

Great work!

@mgoin mgoin merged commit 5dcd7ef into main Jan 8, 2026
65 checks passed
@mgoin mgoin deleted the apply-refactor-to-ct branch January 8, 2026 00:42
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 8, 2026
@github-project-automation github-project-automation bot moved this from In review to Done in MoE Refactor Jan 8, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

  • ci/build
  • documentation: Improvements or additions to documentation
  • llama: Related to Llama models
  • nvidia
  • performance: Performance-related issues
  • ready: ONLY add when PR is ready to merge / full CI is needed

Projects

Status: Done

Development


4 participants