
Add support for ModelOpt MXFP8 models #31603

Closed

danisereb wants to merge 2 commits into vllm-project:main from danisereb:support_mxfp8_basic


Conversation

@danisereb (Contributor) commented Jan 1, 2026

Purpose

Add support for ModelOpt MXFP8 models.

Test Plan

Test a model that was converted to MXFP8 using ModelOpt.
https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B

Test Result

Eval command:

export MODEL_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B-MXFP8

lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_PATH,max_model_len=4096,enforce_eager=True \
  --tasks gsm8k \
  --batch_size auto

Benchmark command:

export MODEL_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B-MXFP8

vllm bench throughput --model $MODEL_PATH \
--tensor-parallel-size 1 \
--load-format dummy \
--enforce-eager \
--trust-remote-code \
--async-scheduling \
--backend vllm \
--dataset-name random \
--num-prompts 16 \
--input-len 1000 \
--output-len 1000

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot left a comment


Code Review

This pull request adds basic support for MXFP8 quantized models. The changes include adding mxfp8 to quantization configurations, implementing Mxfp8Config for linear layers and MoE layers, and adding utility functions for MXFP8 operations.

The implementation for linear layers uses torch._scaled_mm for performance. The MoE implementation currently falls back to dequantizing weights to BF16, as noted in the PR description.
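As background for the BF16 fallback path mentioned above: the MX format stores FP8 (E4M3) elements in 32-element blocks that share a power-of-two E8M0 scale, so dequantization multiplies each block by its decoded scale. A rough sketch of that operation in numpy (an illustration of the format, not the PR's code):

```python
import numpy as np

BLOCK = 32  # MXFP8 shares one E8M0 scale per 32-element block

def dequantize_mxfp8(elems, scale_exp):
    """Dequantize MXFP8 blocks to float (a BF16 stand-in).

    elems:     decoded FP8 (E4M3) element values, shape (n_blocks, BLOCK)
    scale_exp: E8M0 block scale exponents (uint8), shape (n_blocks,)
    """
    # E8M0 encodes 2**(exp - 127); exponent 127 means a scale of 1.0
    scales = np.exp2(scale_exp.astype(np.float32) - 127.0)
    return elems * scales[:, None]

# One block of ones with scale exponent 128 -> every element becomes 2.0
deq = dequantize_mxfp8(np.ones((1, BLOCK), dtype=np.float32),
                       np.array([128], dtype=np.uint8))
```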

I've found two critical issues:

  1. In the MoE implementation, there's incorrect slicing logic for weight scales when expert parallelism is used, which would lead to errors.
  2. The MXFP8 linear layer implementation is missing the bias addition.
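The second issue is mechanical: in a quantized linear forward pass, the bias must be added after the scaled matmul. A minimal sketch of the expected shape of the fix, using plain float math as a stand-in for torch._scaled_mm (names and shapes are illustrative, not the PR's code):

```python
import numpy as np

def mxfp8_linear(x, w_deq, bias=None):
    """Sketch of a quantized linear forward pass.

    x:      activations, shape (batch, in_dim)
    w_deq:  dequantized weight, shape (out_dim, in_dim)
    bias:   optional bias, shape (out_dim,)
    """
    y = x @ w_deq.T  # stand-in for the scaled FP8 matmul
    if bias is not None:
        y = y + bias  # the step the review flagged as missing
    return y

x = np.ones((2, 3), dtype=np.float32)
w = np.ones((4, 3), dtype=np.float32)
b = np.full(4, 0.5, dtype=np.float32)
out = mxfp8_linear(x, w, b)
```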

Please address these issues. Otherwise, the changes look good and are a good step towards full MXFP8 support.
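The first issue can be illustrated in generic terms: under expert parallelism each rank owns a contiguous slice of experts, so per-expert weight scales must be sliced along the expert axis together with the weights themselves. A hypothetical sketch (shapes and helper names invented for illustration, not the PR's tensors):

```python
import numpy as np

# Hypothetical MoE shapes: 8 experts, each with a (rows, groups) scale tensor
num_experts, rows, groups = 8, 4, 2
w_scale = np.arange(num_experts * rows * groups,
                    dtype=np.float32).reshape(num_experts, rows, groups)

def local_expert_scales(w_scale, ep_rank, ep_size):
    """Slice per-expert scales along the expert axis for this EP rank."""
    per_rank = w_scale.shape[0] // ep_size
    start = ep_rank * per_rank
    return w_scale[start:start + per_rank]

# Rank 1 of 2 owns experts 4..7, so its scales are w_scale[4:8]
local = local_expert_scales(w_scale, ep_rank=1, ep_size=2)
```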

@robertgshaw2-redhat
Copy link
Copy Markdown
Collaborator

The PR generally looks good.

However, we are actively trying to deprecate the long tail of quantization integrations to focus on our core integrations.

We support MXFP8 in llm-compressor/compressed-tensors. Would you be open to adding this as a compressed-tensors backend rather than as a new discrete quantization integration?

@danisereb danisereb force-pushed the support_mxfp8_basic branch from 3ff19c5 to fa4ac0c Compare January 15, 2026 11:52

mergify bot commented Jan 15, 2026

Documentation preview: https://vllm--31603.org.readthedocs.build/en/31603/

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 15, 2026
@danisereb danisereb changed the title Add basic support for mxfp8 quantized models Add support for ModelOpt MXFP8 models Jan 15, 2026
@danisereb danisereb force-pushed the support_mxfp8_basic branch 3 times, most recently from caeae4d to 4bf4d13 Compare January 19, 2026 15:35
@danisereb danisereb force-pushed the support_mxfp8_basic branch from 4bf4d13 to 054c113 Compare January 28, 2026 17:44

mergify bot commented Jan 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @danisereb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 28, 2026
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@danisereb danisereb force-pushed the support_mxfp8_basic branch from 054c113 to d2f5a05 Compare January 29, 2026 14:54
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@mergify mergify bot removed the needs-rebase label Jan 29, 2026

mergify bot commented Feb 3, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @danisereb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 3, 2026
@danisereb (Contributor, Author) commented

No longer relevant; a new PR will be opened if required.

@danisereb danisereb closed this Feb 3, 2026

Labels

documentation (Improvements or additions to documentation), needs-rebase
