[Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp#34664
Open
mgoin wants to merge 15 commits into vllm-project:main from
Conversation
Contributor
Code Review
This pull request adds support for MXFP8 quantization in the Marlin kernel, providing a faster alternative to the existing emulation path. The changes span kernel generation, C++ dispatch logic, and Python-level integration, and introduce new utility functions for MXFP8-specific weight and scale preparation for Marlin. My review identifies a critical issue in the hardware capability check that could lead to runtime errors on unsupported GPUs.
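The capability concern raised above can be illustrated with a minimal guard. This is a hypothetical sketch, not the PR's actual dispatch code; the helper name and the SM80 threshold are assumptions taken from the surrounding discussion:

```python
import torch


def marlin_mxfp8_supported(min_capability: int = 80) -> bool:
    """Illustrative guard: return False so callers can fall back to the
    emulation path, instead of raising at runtime on unsupported GPUs.
    Name and threshold are assumptions, not the PR's real code."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return major * 10 + minor >= min_capability
```

Checking the capability once at backend-selection time, rather than inside the kernel call, keeps the failure mode a graceful fallback instead of a runtime error.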
Contributor
danisereb
reviewed
Feb 22, 2026
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
Move backend selection, weight processing, and apply logic from ModelOptMxFp8LinearMethod into Mxfp8LinearOp so all MXFP8 linear backends (emulation, flashinfer CUTLASS, Marlin) are managed in one place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: mgoin <mgoin64@gmail.com>
Use select_mxfp8_linear_backend() and delegate weight processing to Mxfp8LinearOp.process_weights(), enabling Marlin backend support for online MXFP8 quantization on SM80+ GPUs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: mgoin <mgoin64@gmail.com>
Removed comment about backend-specific weight processing.
Use torch.get_default_dtype() and layer.output_size_per_partition / layer.input_size_per_partition directly instead of stashing copies as layer.orig_dtype, layer.marlin_size_n, layer.marlin_size_k. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: mgoin <mgoin64@gmail.com>
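The consolidation described in the commits above can be sketched as a single selection point. This is a hypothetical reconstruction: `select_mxfp8_linear_backend` is named in the commit messages, but the enum values, capability thresholds, and preference order here are assumptions, not the PR's actual logic:

```python
from enum import Enum


class Mxfp8Backend(Enum):
    EMULATION = "emulation"                    # reference path, any GPU
    FLASHINFER_CUTLASS = "flashinfer_cutlass"  # assumed Blackwell path
    MARLIN = "marlin"                          # fast path on SM80+ per this PR


def select_mxfp8_linear_backend(capability: int,
                                has_flashinfer: bool) -> Mxfp8Backend:
    # Assumed preference order: FlashInfer CUTLASS when available,
    # then Marlin on SM80+ GPUs, otherwise fall back to emulation.
    if has_flashinfer and capability >= 100:
        return Mxfp8Backend.FLASHINFER_CUTLASS
    if capability >= 80:
        return Mxfp8Backend.MARLIN
    return Mxfp8Backend.EMULATION
```

Centralizing the choice in one function is what lets `Mxfp8LinearOp` own backend selection, weight processing, and apply logic for all three paths, as the refactor commits describe.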
Purpose
The Marlin kernel already supports FP8 (per-channel/group scales) and MXFP4 (per-32-element e8m0 scales). MXFP8 is a natural combination: FP8 weights (like existing FP8 Marlin) with e8m0 microscaling block scales (like existing MXFP4 Marlin). We just have to wire the kernel building blocks together.
This PR also consolidates GEMM-kernel backend-specific logic into the Mxfp8LinearOp class, shared by modelopt.py and mxfp8.py.
Test Plan
The existing online-quantization test for MXFP8 will now run on L4 GPUs in CI: tests/models/quantization/test_mxfp8.py

Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.