Add support for ModelOpt MXFP8 dense models #33786
vllm-bot merged 9 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--33786.org.readthedocs.build/en/33786/
Code Review
This pull request adds support for ModelOpt MXFP8 models by introducing a new quantization configuration and associated linear method. The changes are well-structured and add valuable new functionality. My review includes a few points of feedback regarding documentation accuracy, consistency in MoE support, and an opportunity to refactor for code clarity and reuse.
| "`pip install flashinfer`" | ||
| ) from err | ||
| class Mxfp8Backend(Enum): | ||
| TORCH = "torch" |
The "torch" backend is temporary (can be used for debug in the future).
Backend FLASHINFER_CUTLASS will be added in the future:
flashinfer-ai/flashinfer#2464
Can be added after the flashinfer PR is merged, and flashinfer version is bumped in vLLM.
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Do we have a plan to support an MXFP8 GEMM kernel?

Yes, I mentioned that in a previous comment; see the FlashInfer PR flashinfer-ai/flashinfer#2464.
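Until a fused kernel lands, a torch-style fallback can simply dequantize the block-scaled weights and run a dense matmul. A NumPy sketch of that reference path (hypothetical helper names; the PR's actual torch backend is not shown here):

```python
import numpy as np


def mxfp8_ref_gemm(a: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray,
                   block: int = 32) -> np.ndarray:
    """Reference GEMM for block-scaled (MX-style) weights.

    a:       [M, K] activations (kept in full precision here)
    w_q:     [N, K] quantized weight values (stored as floats here)
    w_scale: [N, K // block] one shared scale per 32-wide block
    """
    n, k = w_q.shape
    # Expand each block scale across its `block` columns, then dequantize.
    w = w_q * np.repeat(w_scale, block, axis=1)[:, :k]
    # Dense matmul on the dequantized weights: [M, K] @ [K, N] -> [M, N].
    return a @ w.T
```

A fused kernel would do the same math without materializing the dequantized weight matrix, which is where the FlashInfer CUTLASS backend comes in.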
Purpose
Add support for ModelOpt MXFP8 dense models.
No support for MoE yet.
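For context, MXFP8 (from the OCP Microscaling formats) groups values into 32-element blocks, each sharing one power-of-two (E8M0) scale, with elements stored in FP8 (E4M3). A minimal NumPy sketch simulating the quantize/dequantize round trip — element rounding is approximated by truncating to E4M3's 4 significant mantissa bits, and E4M3's exponent range and subnormals are not modeled:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def mxfp8_quant_dequant(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate MXFP8 quantize -> dequantize on a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    pad = (-n) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)

    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        if amax == 0:
            out[i] = 0.0
            continue
        # E8M0 shared scale: power of two so amax / scale fits in E4M3 range.
        scale = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX))
        scaled = blk / scale
        # Crude E4M3 rounding: keep 1 implicit + 3 explicit mantissa bits.
        mant, exp = np.frexp(scaled)
        q = np.ldexp(np.round(mant * 16) / 16, exp)
        out[i] = q * scale
    return out.reshape(-1)[:n]
```

The per-block relative error stays within roughly 1/16, which is why MXFP8 can track BF16 accuracy closely on benchmarks like GSM8K.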
Related PRs
NVIDIA/Model-Optimizer#736
Test Plan
Use this LLM model (BF16):
https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
Convert the model to MXFP8 using ModelOpt:
The command above will generate a checkpoint nvidia/OpenMath2-Llama3.1-8B-MXFP8.
Compare performance (tokens/sec) and accuracy (gsm8k) of the BF16 and MXFP8 models.
Test Result
Performance (tokens/sec):
Measured on B200:
vllm bench throughput --model $MODEL_PATH \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --async-scheduling \
  --backend vllm \
  --dataset-name random \
  --random-prefix-len 0 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-num-seqs 128 \
  --num-prompts 512

BF16
MXFP8
Accuracy (GSM8K):
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_PATH,max_model_len=4096,enforce_eager=True,attention_backend=TRITON_ATTN \
  --tasks gsm8k \
  --batch_size auto \
  --limit 300

BF16
MXFP8
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.