[Quant][Feature] Support online MXFP8 quantization for MoE and dense models #35448
mgoin merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for online MXFP8 MoE quantization. The changes are comprehensive, adding a new Mxfp8Config and Mxfp8OnlineMoEMethod, updating backend selection logic, and modifying the flashinfer_fused_moe_blockscale_fp8 custom op. The implementation correctly handles the specifics of MXFP8, such as block shapes and scale types. I've identified a critical bug in the Mxfp8OnlineMoEMethod.create_weights method related to parameter naming that would cause a runtime error. My review includes a suggested fix for this issue. Overall, the changes are well-structured to integrate the new quantization mode.
This pull request has merge conflicts that must be resolved before it can be merged.
    from flashinfer.fused_moe import Fp8QuantizationType

    assert not apply_router_weight_on_input
    assert activation == MoEActivation.SILU
It seems like you assert SILU but don't restrict selection in _supports_activation; is this necessary?
Yes, it's necessary — the monolithic class serves both block-scale (SILU only, hardcoded in flashinfer) and per-tensor (SILU + RELU2) paths, and _supports_activation can't distinguish them. Adding quant context to _supports_activation would require changing the abstract interface and every implementation across the codebase.
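As a rough illustration of the constraint described above (all names here are schematic stand-ins, not the PR's actual classes): a single method object serving both kernel paths can only enforce the SILU restriction at call time, because the capability query carries no quantization context.

```python
from enum import Enum

class MoEActivation(Enum):
    SILU = "silu"
    RELU2 = "relu2"

class MonolithicMoEMethod:
    """Schematic stand-in for one method object serving two kernel paths."""

    def __init__(self, block_scale: bool):
        self.block_scale = block_scale

    def _supports_activation(self, act: MoEActivation) -> bool:
        # The capability query has no quantization context, so it must
        # advertise the union of what the two paths support.
        return act in (MoEActivation.SILU, MoEActivation.RELU2)

    def apply(self, act: MoEActivation) -> str:
        if self.block_scale:
            # The blockscale kernel hardcodes SILU, so the restriction can
            # only be enforced here, at call time.
            assert act == MoEActivation.SILU
            return "blockscale-kernel"
        return f"per-tensor-kernel:{act.value}"
```

Pushing the check into `_supports_activation` would require threading quantization context through the abstract interface and every implementation, which is the cost the reply above is weighing.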
    # For Blackwell block-FP8 (used by online MXFP8), prefer FlashInfer TRTLLM
    # so execution goes through the monolithic blockscale kernel path.
    if (
        current_platform.is_cuda()
        and current_platform.is_device_capability_family(100)
        and weight_key == kMxfp8Static
        and activation_key == kMxfp8Dynamic
        and Fp8MoeBackend.FLASHINFER_TRTLLM in AVAILABLE_BACKENDS
    ):
        AVAILABLE_BACKENDS.remove(Fp8MoeBackend.FLASHINFER_TRTLLM)
        AVAILABLE_BACKENDS.insert(0, Fp8MoeBackend.FLASHINFER_TRTLLM)
We already have an mxfp8 oracle at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py, could we use that rather than overloading fp8?
Sure, I changed the code to use https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py. I had to make a few other minor changes to correctly use it.
mgoin left a comment:
Okay, seems reasonable to me to accept. I'm not sure how much we should truly reuse from the fp8 methods, but it is fair enough to follow that pattern for now.
    _SUPPORTED_BACKENDS: frozenset[Fp8MoeBackend] = frozenset(
        {
            Fp8MoeBackend.FLASHINFER_TRTLLM,
        }
    )

    class MxFp8MoeBackend(Enum):
        FLASHINFER_TRTLLM = "FLASHINFER_TRTLLM"

    _BACKEND_NAME_MAP: dict[str, Fp8MoeBackend] = {
        "flashinfer_trtllm": Fp8MoeBackend.FLASHINFER_TRTLLM,
    }
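To see how a name map like the one above might be consumed, here is a self-contained sketch; the `resolve_backend` helper is hypothetical and not part of the PR.

```python
from enum import Enum

class Fp8MoeBackend(Enum):
    FLASHINFER_TRTLLM = "FLASHINFER_TRTLLM"

# Mirrors the PR's reuse of Fp8MoeBackend for the MXFP8 path.
_SUPPORTED_BACKENDS = frozenset({Fp8MoeBackend.FLASHINFER_TRTLLM})
_BACKEND_NAME_MAP = {"flashinfer_trtllm": Fp8MoeBackend.FLASHINFER_TRTLLM}

def resolve_backend(name: str) -> Fp8MoeBackend:
    """Hypothetical helper: map a user-facing string to a supported backend."""
    backend = _BACKEND_NAME_MAP.get(name.lower())
    if backend is None:
        raise ValueError(f"unknown MXFP8 MoE backend: {name!r}")
    if backend not in _SUPPORTED_BACKENDS:
        raise ValueError(f"{backend.value} is not supported for MXFP8 MoE")
    return backend
```

Keeping the lookup and the supported-set check separate means new backends can be registered in the map before they are enabled for the MXFP8 path.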
It is a bit confusing to use Fp8MoeBackend here and elsewhere for mxfp8, but I guess it is needed to reuse the moe utils
Yes, using Fp8MoeBackend was needed to reuse the rest of the FP8 MoE utils.
We can follow a better approach when more MXFP8 backends are available in the future.
Thank you for the feedback!
…models (vllm-project#35448) Signed-off-by: EdalatiAli <aliedalati@cohere.com>
Purpose
Add support for online MXFP8 quantization (--quantization mxfp8), enabling BF16/FP16 models to be dynamically quantized to MXFP8 (microscaling FP8 with block-32 scales) at load time, for both linear layers and MoE expert layers.

This is powered by the FlashInfer kernels:
- trtllm_fp8_block_scale_moe for MoE layers (see "Add support for ModelOpt MXFP8 MoE models" #35986)
- mm_mxfp8 for linear layers (see "Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8" #35053)

This PR implements part of the online quantization support proposed in #32029 and #32412.
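As background, the block-32 microscaling scheme can be sketched in a few lines of NumPy. This is a simplified model of the scaling behavior only: e4m3 mantissa rounding is omitted (so the round trip below is lossless), and the per-block scales follow the power-of-two E8M0 convention.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3
BLOCK = 32            # MX block size: one shared scale per 32 elements

def mxfp8_scales(x: np.ndarray) -> np.ndarray:
    """Per-block power-of-two (E8M0-style) scales for a 1-D tensor whose
    length is a multiple of BLOCK."""
    amax = np.abs(x.reshape(-1, BLOCK)).max(axis=1)
    # Choose 2**k so that amax / 2**k fits inside the e4m3 range.
    return np.exp2(np.ceil(np.log2(np.maximum(amax, 2.0**-126) / FP8_E4M3_MAX)))

def mxfp8_quant_dequant(x: np.ndarray) -> np.ndarray:
    """Quantize-dequantize round trip. Simplified: e4m3 mantissa rounding is
    omitted, so only the scaling/clipping behavior is modeled."""
    scales = mxfp8_scales(x)[:, None]
    q = np.clip(x.reshape(-1, BLOCK) / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return (q * scales).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=128).astype(np.float32)
y = mxfp8_quant_dequant(x)
```

Because each group of 32 elements carries its own scale, a single outlier only degrades its own block rather than the whole tensor, which is the motivation for microscaling over per-tensor FP8.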
Usage
Requires SM 100+ (Blackwell) GPU.
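For example, serving a model with online MXFP8 might look like the following; this is a sketch based on the flag named above, and the exact invocation is illustrative.

```shell
# Online MXFP8: weights are quantized at load time, no pre-quantized
# checkpoint needed. Requires an SM 100+ (Blackwell) GPU.
vllm serve Qwen/Qwen3-30B-A3B --quantization mxfp8
```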
Test Plan
E2E tests for both a dense model (Qwen/Qwen3-0.6B) and a MoE model (Qwen/Qwen3-30B-A3B): logprobs comparison against a BF16 baseline plus a generation smoke test.

In addition, we report accuracy on MMLU-Pro and GSM8K using lm-eval-harness, as well as performance benchmarks for Qwen/Qwen3-30B-A3B.

Test Result
Accuracy
Performance
    vllm bench throughput --model Qwen/Qwen3-30B-A3B \
        --tensor-parallel-size 1 \
        --trust-remote-code \
        --async-scheduling \
        --backend vllm \
        --dataset-name random \
        --random-prefix-len 0 \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --max-num-seqs 128 \
        --num-prompts 512 \
        --quantization mxfp8  # Remove for the BF16 model

BF16 performance:
Throughput: 7.73 requests/s, 15833.85 total tokens/s, 7916.92 output tokens/s

MXFP8 performance:
Throughput: 10.34 requests/s, 21179.37 total tokens/s, 10589.68 output tokens/s

Online MXFP8 improves throughput by roughly 1.34x over the BF16 baseline on this benchmark.

Essential Elements of an Effective PR Description Checklist

supported_models.md and examples for a new model.