[Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp#34664

Open
mgoin wants to merge 15 commits into vllm-project:main from neuralmagic:mxfp8-marlin

Conversation

@mgoin (Member) commented Feb 17, 2026

Purpose

The Marlin kernel already supports FP8 (per-channel/group scales) and MXFP4 (per-32-element e8m0 scales). MXFP8 is a natural combination: FP8 weights (like existing FP8 Marlin) with e8m0 microscaling block scales (like existing MXFP4 Marlin). We just have to wire the kernel building blocks together.

This PR also consolidates GEMM-kernel backend-specific logic into the Mxfp8LinearOp class, which is shared by modelopt.py and mxfp8.py.

Test Plan

The existing online-quantization test for MXFP8 will now run on L4 GPUs in CI: tests/models/quantization/test_mxfp8.py

Test Result

```
vllm serve mgoin/Qwen3-0.6B-MXFP8
vllm serve Qwen/Qwen3-0.6B --quantization mxfp8
vllm serve Qwen/Qwen3-30B-A3B --quantization mxfp8
```

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for MXFP8 quantization in the Marlin kernel, providing a faster alternative to the existing emulation path. The changes span across kernel generation, C++ dispatch logic, and Python-level integration. The implementation introduces new utility functions for handling MXFP8-specific weight and scale preparation for Marlin. My review identifies a critical issue in the hardware capability check that could lead to runtime errors on unsupported GPUs.

@mgoin mgoin added performance Performance-related issues quantization ready ONLY add when PR is ready to merge/full CI is needed labels Feb 20, 2026
@danisereb (Contributor) commented Feb 22, 2026

Hey @mgoin,
please also see my PR for the Flashinfer cutlass MXFP8 GEMM:
#35053

The GEMM is available in flashinfer 0.6.4 (recently bumped in vLLM).


mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
mgoin added 4 commits March 19, 2026 17:49
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Move backend selection, weight processing, and apply logic from
ModelOptMxFp8LinearMethod into Mxfp8LinearOp so all MXFP8 linear
backends (emulation, flashinfer CUTLASS, Marlin) are managed in
one place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: mgoin <mgoin64@gmail.com>
Use select_mxfp8_linear_backend() and delegate weight processing
to Mxfp8LinearOp.process_weights(), enabling Marlin backend support
for online MXFP8 quantization on SM80+ GPUs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: mgoin <mgoin64@gmail.com>
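
The backend consolidation described in the two commit messages above can be sketched as a single selection function. This is an illustrative assumption of the dispatch structure, not vLLM's actual API: the enum, the function name, and the FlashInfer SM100 threshold are invented for this example; only Marlin's SM80+ floor is stated in this PR.

```python
# Hedged sketch of one-place MXFP8 backend selection (names illustrative).
from enum import Enum

class Mxfp8Backend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"  # native CUTLASS MXFP8 GEMM
    MARLIN = "marlin"                          # FP8 weights + e8m0 block scales
    EMULATION = "emulation"                    # dequantize, then plain GEMM

def select_mxfp8_backend(sm: int, has_flashinfer: bool) -> Mxfp8Backend:
    """Pick one backend per layer so callers never branch themselves."""
    if sm >= 100 and has_flashinfer:           # assumed FlashInfer requirement
        return Mxfp8Backend.FLASHINFER_CUTLASS
    if sm >= 80:                               # Marlin MXFP8 path in this PR
        return Mxfp8Backend.MARLIN
    return Mxfp8Backend.EMULATION
```

Centralizing the choice like this is what lets both modelopt.py and mxfp8.py delegate weight processing and the apply path to Mxfp8LinearOp instead of duplicating per-backend logic.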
mgoin added 5 commits March 19, 2026 18:07
Removed comment about backend-specific weight processing.
Use torch.get_default_dtype() and layer.output_size_per_partition /
layer.input_size_per_partition directly instead of stashing copies
as layer.orig_dtype, layer.marlin_size_n, layer.marlin_size_k.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin mgoin changed the title Add MXFP8 to Marlin dense kernel [Kernel] Add MXFP8 to Marlin dense kernel Mar 19, 2026
@mgoin mgoin added the nvidia label Mar 19, 2026
@mgoin mgoin changed the title [Kernel] Add MXFP8 to Marlin dense kernel [Kernel] Add MXFP8 to Marlin dense kernel and refactor Mxfp8LinearOp Mar 19, 2026
mgoin added 2 commits March 30, 2026 16:13
@mgoin mgoin requested a review from WoosukKwon as a code owner March 30, 2026 21:15
@mgoin mgoin changed the title [Kernel] Add MXFP8 to Marlin dense kernel and refactor Mxfp8LinearOp [Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp Mar 30, 2026
Signed-off-by: mgoin <mgoin64@gmail.com>

mergify bot commented Mar 31, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 31, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: mgoin <mgoin64@gmail.com>
@mergify mergify bot removed the needs-rebase label Mar 31, 2026
Signed-off-by: mgoin <mgoin64@gmail.com>