[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations #29008
Conversation
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Code Review
This pull request adds weight mappings for the gpt-oss model to support the quark quantization format. The changes are in vllm/model_executor/models/gpt_oss.py.
My review identifies a critical issue in the new mappings. They are missing the .experts submodule path, which will likely cause weight loading to fail. I've provided a suggestion to correct this. This issue might also be present in the existing MoE weight mappings in the same file, which you may want to investigate as well.
| ".gate_up_proj.weight": ".w13_weight", | ||
| ".gate_up_proj.weight_scale": ".w13_weight_scale", | ||
| ".gate_up_proj.bias": ".w13_bias", | ||
| ".gate_up_proj.input_scale": ".w13_input_scale", | ||
| ".down_proj.weight": ".w2_weight", | ||
| ".down_proj.weight_scale": ".w2_weight_scale", | ||
| ".down_proj.bias": ".w2_bias", | ||
| ".down_proj.input_scale": ".w2_input_scale" |
The weight mappings for the MoE layers appear to be missing the .experts submodule in the target path. The MoE parameters are located within the experts submodule of the MLPBlock, so the vLLM parameter names will be of the form ...mlp.experts.w13_weight, etc. The current mappings would incorrectly resolve to ...mlp.w13_weight, which would cause weight loading to fail.
To ensure the weights are loaded correctly, the .experts part should be included in the mapping.
| ".gate_up_proj.weight": ".w13_weight", | |
| ".gate_up_proj.weight_scale": ".w13_weight_scale", | |
| ".gate_up_proj.bias": ".w13_bias", | |
| ".gate_up_proj.input_scale": ".w13_input_scale", | |
| ".down_proj.weight": ".w2_weight", | |
| ".down_proj.weight_scale": ".w2_weight_scale", | |
| ".down_proj.bias": ".w2_bias", | |
| ".down_proj.input_scale": ".w2_input_scale" | |
| ".gate_up_proj.weight": ".experts.w13_weight", | |
| ".gate_up_proj.weight_scale": ".experts.w13_weight_scale", | |
| ".gate_up_proj.bias": ".experts.w13_bias", | |
| ".gate_up_proj.input_scale": ".experts.w13_input_scale", | |
| ".down_proj.weight": ".experts.w2_weight", | |
| ".down_proj.weight_scale": ".experts.w2_weight_scale", | |
| ".down_proj.bias": ".experts.w2_bias", | |
| ".down_proj.input_scale": ".experts.w2_input_scale" |
@xuebwang-amd are there model weights available to test this feature? And can you share an lm_eval score so we know it runs? I have been seeing many quark patches for the same GPT-OSS model.
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review
vllm/model_executor/layers/fused_moe/layer.py, lines 1162 to 1172 (commit 4a1c93a)
The `_load_per_tensor_weight_scale` signature now requires `combined_w13`, but the ModelOpt per-tensor path still calls it without that argument. Hitting this branch will raise `TypeError: _load_per_tensor_weight_scale() missing 1 required positional argument: 'combined_w13'`, preventing ModelOpt MoE checkpoints from loading.
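A hypothetical minimal reproduction of the reported breakage: a helper gains a new required positional argument while one call site is never updated. The function and argument names mirror the review comment; the bodies are illustrative, not vLLM's actual code.

```python
def _load_per_tensor_weight_scale(shard_id, param, loaded_weight, combined_w13):
    # New signature: every caller must now pass combined_w13.
    return shard_id, combined_w13

def modelopt_per_tensor_path():
    # Old-style call site (e.g. the ModelOpt path) missing the new argument.
    return _load_per_tensor_weight_scale("w13", None, None)

try:
    modelopt_per_tensor_path()
except TypeError as exc:
    # "...missing 1 required positional argument: 'combined_w13'"
    print(exc)
```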
```python
# Round up hidden size if needed.
hidden_size, is_rounded_hidden_size = maybe_roundup_hidden_size(
    hidden_size,
    moe_in_dtype,
    self.moe_parallel_config,
    self.model_type,
    self.is_mxfp4_quant,
    self.emulate_quant,
    is_lora_enabled=self.vllm_config.lora_config is not None,
)
print(f"is_rounded_hidden_size is {is_rounded_hidden_size}")

if is_rounded_hidden_size:
    self.hidden_size = hidden_size
self.moe_config: FusedMoEConfig = FusedMoEConfig(
```
Honor hidden_size padding for non-gpt_oss MoE
The padding result from maybe_roundup_hidden_size is only applied when is_rounded_hidden_size is true, yet the helper sets that flag only for gpt_oss + mxfp4 (lines 250–269). When other backends such as DeepEP round the hidden size inside maybe_roundup_layer_hidden_size, is_rounded_hidden_size stays false, so self.hidden_size/self.moe_config remain at the unpadded size while the weights are built later in __init__ with the padded hidden_size. DeepEP runs will then have layer metadata smaller than the actual weight shapes, leading to buffer shape mismatches at runtime.
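The inconsistency described above can be sketched as follows. This is a hypothetical simplification: the helper's signature, the rounding granularities, and the DeepEP condition are illustrative stand-ins, not the real vLLM implementation.

```python
def maybe_roundup_hidden_size(hidden_size, is_gpt_oss_mxfp4, deepep_pads):
    """Round hidden_size up, but only flag the gpt_oss+mxfp4 case."""
    padded = hidden_size
    rounded_flag = False
    if is_gpt_oss_mxfp4:
        padded = (hidden_size + 255) // 256 * 256
        rounded_flag = True  # only this path sets the flag
    elif deepep_pads:
        # DeepEP-style rounding: the size changes but the flag stays False.
        padded = (hidden_size + 127) // 128 * 128
    return padded, rounded_flag

layer_hidden = 2880
padded, flag = maybe_roundup_hidden_size(layer_hidden, False, True)
if flag:  # never taken on the DeepEP path
    layer_hidden = padded
# Layer metadata keeps 2880 while weights would be built with 2944.
print(layer_hidden, padded)
```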
Thanks @tjtanaa. Yes, there have been several recent works on the GPT-OSS model from the amd-quark side. Please see more info, including lm_eval accuracy results, at the top of the PR descriptions.
Hi @robertgshaw2-redhat, following up on our discussion, the requested changes have now been implemented. Could you please take another look and confirm this is ready to go? Thank you!
…ulations (vllm-project#29008) Signed-off-by: xuebwang-amd <xuebwang@amd.com> Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
```python
if ocp_mx_scheme in {"w_mxfp4", "w_mxfp4_a_mxfp4"}:
    pass  # No QDQ needed for these schemes
```
@xuebwang-amd this looks unnecessary. quant_dtype should already be properly set.
```python
self._emulate = (
    not current_platform.supports_mx()
    or not self.ocp_mx_scheme.startswith("w_mxfp4")
) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe)
```
@xuebwang-amd this is not correct: w_mxfp4_a_mxfp6 models cannot run through the aiter backend.
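One way to see the pitfall the reviewer points at: `str.startswith("w_mxfp4")` also matches the mixed-activation schemes, so a scheme like `w_mxfp4_a_mxfp6` would be treated as a native-capable `w_mxfp4` variant even though, per the review, it cannot run through the aiter backend. The scheme names come from the quoted code; the check itself is just standard Python:

```python
# The prefix test matches more schemes than intended: every scheme below
# starts with "w_mxfp4", including the mxfp6-activation variant.
schemes = ["w_mxfp4", "w_mxfp4_a_mxfp4", "w_mxfp4_a_mxfp6"]
for scheme in schemes:
    print(scheme, scheme.startswith("w_mxfp4"))  # True for all three
```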
Purpose

This PR aims for:

- `mxfp4` loading function for the original openai/gpt-oss-20b & openai/gpt-oss-120b
- `QuarkOCP_MX_MoEMethod`

Test Plan

See results below.
Test Result
(Sub)-tasks

- `mxfp4` format model loading
- `quark` format model loading, specifically the `FIXME` in `quark_moe.py`: modify `QuarkOCP_MX_MoEMethod` to be compatible with W4A16
- `Quark_OCPMX_W4A16_MoEMethod` and `Quark_OCPMX_FP8_MoEMethod`, as @fxmarty-amd also pointed out
- W-MXFP4-A-MXFP8: not in the PR scope, as @BowenBao had clarified in #29008 (comment)
- W-FP8-A-FP8 quantization for gpt_oss

TODO
Note

Adds end-to-end support for GPT-OSS models in amd-quark format and extends MoE quantization.

- New quant configs: `mxfp4_w4a16_moe_quant_config`, `mxfp4_fp8_moe_quant_config`; `fp8_w8a8_moe_quant_config`/`int8_w8a8_moe_quant_config` accept bias; expanded OCP-MX schemes (incl. weight-only `w_mxfp4` and `*_a_fp8`)
- `*_a_fp8` and MXFP4/MXFP6 dequant paths generalized; GPT-OSS Triton MoE path wired with PrecisionConfig handling for `model_type == gpt_oss` and MXFP4
- Weight loader extended for GPT-OSS fused/bias paths; RMS KV-scale loader helper added
- `QuarkOCP_MX_MoEMethod` (W4A16/W4Afp8): bias + static FP8 input scale handling, backend gating, and routing to ROCm AIter/Triton/native

Written by Cursor Bugbot for commit e23834d.
mxfp4andquark): fused experts/bias handling, EP/TP‑aware slicing, expert mapping, and KV‑cache scale loading; enables bias inqkv_proj/o_projmxfp4_w4a16_moe_quant_configandmxfp4_w4a8_moe_quant_config; allows bias infp8_w8a8_moe_quant_config,int8_w8a8_moe_quant_config,nvfp4_moe_quant_config; introducesuse_mxfp4_w4a8; expandsOCP_MX_Scheme(e.g.,w_mxfp4,*_a_fp8)*_a_fp8; GPT‑OSS Triton fused MoE supports MXFP4 W4A16 withPrecisionConfigmodel_type==gpt_osswith MXFP4; enhances expert weight loader for GPT‑OSS fused/bias cases; addskv_cache_scale_loaderWritten by Cursor Bugbot for commit 04bec4c. This will update automatically on new commits. Configure here.