
Add support for ModelOpt MXFP8 dense models#33786

Merged
vllm-bot merged 9 commits into vllm-project:main from de-inf:support_modelopt_mxfp8
Feb 8, 2026

Conversation

@danisereb
Contributor

@danisereb danisereb commented Feb 4, 2026

Purpose

Add support for ModelOpt MXFP8 dense models.

No support for MoE yet.

Related PRs

NVIDIA/Model-Optimizer#736

Test Plan

Use this model (BF16):
https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B

Convert the model to MXFP8 using ModelOpt:

```shell
export MODEL_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B
export OUTPUT_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B-MXFP8

rm -rv $OUTPUT_PATH
mkdir -p $OUTPUT_PATH

python examples/llm_ptq/hf_ptq.py \
    --export_fmt hf \
    --dataset cnn_dailymail \
    --pyt_ckpt_path $MODEL_PATH \
    --export_path $OUTPUT_PATH \
    --qformat mxfp8
```

The command above writes an MXFP8 checkpoint (OpenMath2-Llama3.1-8B-MXFP8) to $OUTPUT_PATH.
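Before benchmarking, it can help to sanity-check that the export actually declares MXFP8. The file name `hf_quant_config.json` and the `quant_algo` schema below are assumptions based on typical ModelOpt HF exports, not something this PR specifies; verify them against your checkpoint:

```python
import json
from pathlib import Path

def declares_mxfp8(config: dict) -> bool:
    """Check an (assumed) ModelOpt quant-config dict for the MXFP8 algo."""
    return config.get("quantization", {}).get("quant_algo") == "MXFP8"

def check_checkpoint(checkpoint_dir: str) -> bool:
    # hf_quant_config.json is the assumed metadata file name in the export
    path = Path(checkpoint_dir) / "hf_quant_config.json"
    return declares_mxfp8(json.loads(path.read_text()))
```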

Compare performance (tokens/sec) and accuracy (gsm8k) of the BF16 and MXFP8 models.

Test Result

Performance (tokens/sec):

Measured on B200:

```shell
vllm bench throughput --model $MODEL_PATH \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --async-scheduling \
    --backend vllm \
    --dataset-name random \
    --random-prefix-len 0 \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --max-num-seqs 128 \
    --num-prompts 512
```

BF16

Throughput: 12.91 requests/s, 26447.02 total tokens/s, 13223.51 output tokens/s
Total num prompt tokens:  524288
Total num output tokens:  524288

MXFP8

Throughput: 9.95 requests/s, 20374.04 total tokens/s, 10187.02 output tokens/s
Total num prompt tokens:  524288
Total num output tokens:  524288
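Back-of-the-envelope on the numbers above: MXFP8 currently reaches about 77% of BF16 throughput, presumably because the temporary torch backend has no fused MXFP8 GEMM yet (a FlashInfer CUTLASS kernel is planned, per the review thread below). A small check of the ratio:

```python
# Throughput figures copied from the benchmark output above (requests/s)
bf16_rps = 12.91
mxfp8_rps = 9.95

ratio = mxfp8_rps / bf16_rps
print(f"MXFP8 throughput is {ratio:.0%} of BF16")  # ~77%
```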

Accuracy (GSM8K):

```shell
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_PATH,max_model_len=4096,enforce_eager=True,attention_backend=TRITON_ATTN \
  --tasks gsm8k \
  --batch_size auto --limit 300
```

BF16

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8600|±  |0.0201|
|     |       |strict-match    |     5|exact_match|↑  |0.2333|±  |0.0245|

MXFP8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8400|±  |0.0212|
|     |       |strict-match    |     5|exact_match|↑  |0.2467|±  |0.0249|
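The 0.02 drop on flexible-extract is within one combined standard error of the two runs, so at --limit 300 the MXFP8 score is statistically indistinguishable from BF16. The arithmetic:

```python
import math

# flexible-extract exact_match scores and stderrs from the tables above
bf16, bf16_se = 0.8600, 0.0201
mxfp8, mxfp8_se = 0.8400, 0.0212

delta = abs(bf16 - mxfp8)
combined_se = math.sqrt(bf16_se**2 + mxfp8_se**2)
print(f"delta={delta:.4f}, combined stderr={combined_se:.4f}")
```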

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Feb 4, 2026

Documentation preview: https://vllm--33786.org.readthedocs.build/en/33786/

@mergify mergify bot added documentation Improvements or additions to documentation nvidia labels Feb 4, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for ModelOpt MXFP8 models by introducing a new quantization configuration and associated linear method. The changes are well-structured and add valuable new functionality. My review includes a few points of feedback regarding documentation accuracy, consistency in MoE support, and an opportunity to refactor for code clarity and reuse.

@danisereb danisereb force-pushed the support_modelopt_mxfp8 branch 3 times, most recently from 6d66823 to 0e1bb9f Compare February 5, 2026 13:25
```python
        "`pip install flashinfer`"
    ) from err


class Mxfp8Backend(Enum):
    TORCH = "torch"
```
Contributor Author

@danisereb danisereb Feb 5, 2026


The "torch" backend is temporary (can be used for debug in the future).

Backend FLASHINFER_CUTLASS will be added in the future:
flashinfer-ai/flashinfer#2464

Can be added after the flashinfer PR is merged, and flashinfer version is bumped in vLLM.
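For intuition on what the torch backend has to compute: MXFP8 (per the OCP Microscaling spec) stores FP8 E4M3 elements in blocks of 32 that share one power-of-two E8M0 scale. A pure-Python sketch of that block quantization, skipping the actual FP8 rounding of the elements, so it is illustrative rather than bit-accurate and is not the vLLM implementation:

```python
import math

BLOCK = 32           # MX block size per the OCP MX spec
E4M3_MAX = 448.0     # largest finite FP8 E4M3 magnitude

def quantize_block(block):
    """Quantize one block to MXFP8-style storage: a shared power-of-two
    (E8M0) scale plus per-element values clamped to the E4M3 range.
    Real kernels also round each element to an FP8 grid; omitted here."""
    amax = max(abs(x) for x in block) or 1.0
    # choose a power-of-two scale so amax / scale fits in E4M3
    exp = math.floor(math.log2(amax)) - math.floor(math.log2(E4M3_MAX))
    scale = 2.0 ** exp
    q = [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) for x in block]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]
```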

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@danisereb danisereb force-pushed the support_modelopt_mxfp8 branch from a42165d to 99c68e2 Compare February 5, 2026 15:02
Collaborator

@ProExpertProg ProExpertProg left a comment


@mgoin mgoin changed the title Add support for ModelOpt MXFP8 models Add support for ModelOpt MXFP8 dense models Feb 8, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@danisereb danisereb force-pushed the support_modelopt_mxfp8 branch from 459ff5a to 8267a99 Compare February 8, 2026 16:33
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 8, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Member

@mgoin mgoin left a comment


Looks good to me, thanks!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 8, 2026
@mgoin mgoin enabled auto-merge (squash) February 8, 2026 16:48
auto-merge was automatically disabled February 8, 2026 16:51

Head branch was pushed to by a user without write access

@mgoin mgoin enabled auto-merge (squash) February 8, 2026 16:59
@vllm-bot vllm-bot merged commit 084aa19 into vllm-project:main Feb 8, 2026
58 of 61 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 8, 2026
@jikunshang
Collaborator

do we have plan to support mxfp8 gemm kernel?

@danisereb
Contributor Author

@jikunshang

do we have plan to support mxfp8 gemm kernel?

Yes, I mentioned that in a previous comment; see this FlashInfer PR:
flashinfer-ai/flashinfer#2464

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

Labels

documentation Improvements or additions to documentation nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


5 participants