Add support for ModelOpt MXFP8 MoE models #35986
Changes from all commits: 402da2b, 2329237, 922d36e, e1f3b2d, 69bf7b0, 19bc0cb
Changes to the existing `weight_loader`:

```diff
@@ -1204,17 +1204,26 @@ def weight_loader(
         # Determine per-tensor weight scale patterns based on variant
         # Use the dedicated method instead of brittle string matching
         uses_weight_scale_2 = self.quant_method.uses_weight_scale_2_pattern()
+        quant_method = getattr(param, "quant_method", None)

         # Call _load_per_tensor_weight_scale() to load per-tensor (scalar)
         # weight scales.
         # Input scales are always per-tensor.
         # Weight scales: FP4 uses "weight_scale_2" and FP8 uses
         # "weight_scale" for per-tensor scales.
+        # NOTE: ModelOpt MXFP8 MoE uses block scales in weight_scale
+        # tensors (quant_method=BLOCK), so those must not be treated
+        # as per-tensor scalars here.
+        is_block_weight_scale = (
+            "weight_scale" in weight_name
+            and quant_method == FusedMoeWeightScaleSupported.BLOCK.value
+        )
         is_per_tensor = (
             "weight_scale_2" in weight_name
             if uses_weight_scale_2
             else "weight_scale" in weight_name
         ) or "input_scale" in weight_name
+        is_per_tensor = is_per_tensor and not is_block_weight_scale
         if is_per_tensor:
             self._load_per_tensor_weight_scale(
                 shard_id=shard_id,
```

Member:
Technically, I think MXFP8 should be a GROUP scale rather than BLOCK, since it is (1, 32).

Contributor (Author):
I think I followed what ModelOpt NVFP4 […]
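The BLOCK-vs-GROUP distinction in the thread above comes down to the shape of the scale tensor: MXFP8 carries one E8M0 scale per (1, 32) group of weight elements, so the checkpoint's `weight_scale` is a full tensor rather than a scalar. A minimal shape sketch, assuming illustrative sizes (these names and dimensions are not the actual vLLM/ModelOpt checkpoint layout):

```python
# Illustrative only -- not the actual vLLM/ModelOpt checkpoint layout.
E, N, K = 8, 256, 512      # experts, output features, input features
GROUP = 32                 # MXFP8: one E8M0 scale per (1, 32) element group

weight_shape = (E, N, K)                 # FP8 (e4m3) weight elements
weight_scale_shape = (E, N, K // GROUP)  # block/group scales -- a tensor
input_scale_shape = (E,)                 # per-tensor: one scalar per expert

# This is why "weight_scale" must be excluded from the per-tensor path:
assert weight_scale_shape == (8, 256, 16)
assert input_scale_shape == (8,)
```

Loading such a scale through `_load_per_tensor_weight_scale` would collapse it to a scalar, which is exactly what the `is_block_weight_scale` guard in the diff prevents.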
New file (MXFP8 MoE backend selection):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from enum import Enum

from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe.config import FusedMoEConfig

logger = init_logger(__name__)


class MxFp8MoeBackend(Enum):
    FLASHINFER_TRTLLM = "FLASHINFER_TRTLLM"


def select_mxfp8_moe_backend(
    config: FusedMoEConfig,
) -> MxFp8MoeBackend:
    if config.is_lora_enabled:
        raise NotImplementedError("LoRA is not supported for MXFP8 MoE.")

    AVAILABLE_BACKENDS = [
        MxFp8MoeBackend.FLASHINFER_TRTLLM,
    ]

    runner_backend = config.moe_backend
    if runner_backend != "auto":
        mapping = {
            "flashinfer_trtllm": MxFp8MoeBackend.FLASHINFER_TRTLLM,
        }
        if backend := mapping.get(runner_backend):
            logger.info_once(
                "Using '%s' MxFp8 MoE backend (user-requested).",
                backend.value,
            )
            return backend
        raise ValueError(
            f"moe_backend='{runner_backend}' is not supported for MXFP8 MoE. "
            f"Expected one of {list(mapping.keys())}."
        )

    # Auto-select: only one backend available for now.
    backend = AVAILABLE_BACKENDS[0]
    logger.info_once("Using '%s' MxFp8 MoE backend.", backend.value)
    return backend
```
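For illustration, the selection pattern in the new file can be exercised standalone. The sketch below uses a stub config; `StubConfig` and `select_backend` are stand-ins for vLLM's `FusedMoEConfig` and `select_mxfp8_moe_backend`, not the real API:

```python
from dataclasses import dataclass
from enum import Enum


class MxFp8MoeBackend(Enum):
    FLASHINFER_TRTLLM = "FLASHINFER_TRTLLM"


@dataclass
class StubConfig:
    # Stand-in for vLLM's FusedMoEConfig; fields chosen for illustration only.
    moe_backend: str = "auto"
    is_lora_enabled: bool = False


def select_backend(config: StubConfig) -> MxFp8MoeBackend:
    if config.is_lora_enabled:
        raise NotImplementedError("LoRA is not supported for MXFP8 MoE.")
    mapping = {"flashinfer_trtllm": MxFp8MoeBackend.FLASHINFER_TRTLLM}
    if config.moe_backend != "auto":
        # Honor an explicit user request, or fail loudly on an unknown name.
        if backend := mapping.get(config.moe_backend):
            return backend
        raise ValueError(
            f"moe_backend='{config.moe_backend}' is not supported for MXFP8 MoE."
        )
    # Auto-select: only one backend available for now.
    return MxFp8MoeBackend.FLASHINFER_TRTLLM


assert select_backend(StubConfig()) is MxFp8MoeBackend.FLASHINFER_TRTLLM
assert (
    select_backend(StubConfig(moe_backend="flashinfer_trtllm"))
    is MxFp8MoeBackend.FLASHINFER_TRTLLM
)
```

Raising on an unknown explicit `moe_backend` (rather than silently falling back to auto-selection) keeps user intent and the chosen kernel path in sync.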
Reviewer:
Is non-gated not supported?

Contributor (Author):
non-gated MoE is not supported yet (in flashinfer 0.6.4), working on it: flashinfer-ai/flashinfer#2707