Add support for ModelOpt MXFP8 dense models #33786
vllm-bot merged 9 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--33786.org.readthedocs.build/en/33786/
Code Review
This pull request adds support for ModelOpt MXFP8 models by introducing a new quantization configuration and associated linear method. The changes are well-structured and add valuable new functionality. My review includes a few points of feedback regarding documentation accuracy, consistency in MoE support, and an opportunity to refactor for code clarity and reuse.
| "`pip install flashinfer`" | ||
| ) from err | ||
| class Mxfp8Backend(Enum): | ||
| TORCH = "torch" |
The "torch" backend is temporary (can be used for debug in the future).
Backend FLASHINFER_CUTLASS will be added in the future:
flashinfer-ai/flashinfer#2464
Can be added after the flashinfer PR is merged, and flashinfer version is bumped in vLLM.
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Do we have a plan to support an MXFP8 GEMM kernel?

Yes, I mentioned that in a previous comment; see the FlashInfer PR flashinfer-ai/flashinfer#2464.
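Until a fused kernel lands, a torch-style fallback can simply dequantize the block-scaled weights and run a dense matmul. A NumPy sketch of that reference path (hypothetical helper names; the PR's actual torch backend is not shown here):

```python
import numpy as np


def mxfp8_ref_gemm(a: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray,
                   block: int = 32) -> np.ndarray:
    """Reference GEMM for block-scaled (MX-style) weights.

    a:       [M, K] activations (kept in full precision here)
    w_q:     [N, K] quantized weight values (stored as floats here)
    w_scale: [N, K // block] one shared scale per 32-wide block
    """
    n, k = w_q.shape
    # Expand each block scale across its `block` columns, then dequantize.
    w = w_q * np.repeat(w_scale, block, axis=1)[:, :k]
    # Dense matmul on the dequantized weights: [M, K] @ [K, N] -> [M, N].
    return a @ w.T
```

A fused kernel would do the same math without materializing the dequantized weight matrix, which is where the FlashInfer CUTLASS backend comes in.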
Purpose
Add support for ModelOpt MXFP8 dense models.
No support for MoE yet.
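For context, MXFP8 (from the OCP Microscaling formats) groups values into 32-element blocks, each sharing one power-of-two (E8M0) scale, with elements stored in FP8 (E4M3). A minimal NumPy sketch simulating the quantize/dequantize round trip — element rounding is approximated by truncating to E4M3's 4 significant mantissa bits, and E4M3's exponent range and subnormals are not modeled:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def mxfp8_quant_dequant(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate MXFP8 quantize -> dequantize on a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    pad = (-n) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)

    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        if amax == 0:
            out[i] = 0.0
            continue
        # E8M0 shared scale: power of two so amax / scale fits in E4M3 range.
        scale = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX))
        scaled = blk / scale
        # Crude E4M3 rounding: keep 1 implicit + 3 explicit mantissa bits.
        mant, exp = np.frexp(scaled)
        q = np.ldexp(np.round(mant * 16) / 16, exp)
        out[i] = q * scale
    return out.reshape(-1)[:n]
```

The per-block relative error stays within roughly 1/16, which is why MXFP8 can track BF16 accuracy closely on benchmarks like GSM8K.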
Related PRs
NVIDIA/Model-Optimizer#736
Test Plan
Use this LLM model (BF16):
https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
Convert the model to MXFP8 using ModelOpt:
The command above will generate a checkpoint nvidia/OpenMath2-Llama3.1-8B-MXFP8.
Compare performance (tokens/sec) and accuracy (gsm8k) of the BF16 and MXFP8 models.
Test Result
Performance (tokens/sec):
Measured on B200:
vllm bench throughput --model $MODEL_PATH \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --async-scheduling \
  --backend vllm \
  --dataset-name random \
  --random-prefix-len 0 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-num-seqs 128 \
  --num-prompts 512

BF16
MXFP8
Accuracy (GSM8K):
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_PATH,max_model_len=4096,enforce_eager=True,attention_backend=TRITON_ATTN \
  --tasks gsm8k \
  --batch_size auto \
  --limit 300

BF16
MXFP8
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.