[Quantization] [Refactor] Create special "GptOssMxfp4MoeMethod" #39604
mgoin merged 10 commits into vllm-project:main
Conversation
Rename the GPT-OSS-specific MXFP4 quantization files and classes to make ownership explicit: - oracle/mxfp4.py → oracle/gpt_oss_mxfp4.py; Mxfp4MoeBackend → GptOssMxfp4MoeBackend - quantization/mxfp4.py → quantization/gpt_oss_mxfp4.py; Mxfp4Config → GptOssMxfp4Config, Mxfp4MoEMethod → GptOssMxfp4MoEMethod - get_name() now returns "gpt_oss_mxfp4"; registry keeps "mxfp4" alias for backward compat with existing model JSON configs - Update all importers (compressed_tensors_moe_w4a4_mxfp4, quark_moe, __init__, layer.py, gpt_oss.py) Kernel class names are unchanged. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Yongye Zhu <yongyezhu@meta.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… method - Rename quantization/gpt_oss_mxfp4.py back to mxfp4.py; split into a canonical Mxfp4Config base (get_name()="mxfp4", get_quant_method raises NotImplementedError) and GptOssMxfp4Config subclass (get_name()= "gpt_oss_mxfp4") following the ModelOptQuantConfigBase pattern - Register "gpt_oss_mxfp4" in QuantizationMethods and method_to_config; "mxfp4" maps to the base class, "gpt_oss_mxfp4" to the GPT-OSS impl - Add GptOssMxfp4Config.override_quantization_method: redirects checkpoints with quant_method="mxfp4" + model_type="gpt_oss" to "gpt_oss_mxfp4" - Extend QuantizationConfig.override_quantization_method signature with hf_config=None; update all subclasses and the call-site in model.py to pass hf_config so model-type can inform override decisions - Add GptOssForCausalLMConfig.verify_and_update_model_config to normalize hf_config.quantization_config quant_method "mxfp4"->"gpt_oss_mxfp4" - Normalize quant_method in gpt_oss.py load_weights for direct reads from checkpoint JSON - Guard transformers/base.py NotImplementedError for both "mxfp4" and "gpt_oss_mxfp4" Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… oracle Rename make_mxfp4_moe_quant_config and make_mxfp4_moe_kernel to make_gpt_oss_mxfp4_moe_quant_config and make_gpt_oss_mxfp4_moe_kernel in oracle/gpt_oss_mxfp4.py; update all callers in mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format - select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend Update all callers in mxfp4.py and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The backend enum is a generic MXFP4 concept, not GPT-OSS specific. Revert to the original Mxfp4MoeBackend name across oracle/gpt_oss_mxfp4.py and all callers (mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, quark_moe.py). Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The oracle file and its functions (make_mxfp4_moe_kernel, make_mxfp4_moe_quant_config, select_mxfp4_moe_backend, convert_to_mxfp4_moe_kernel_format) are generic MXFP4 utilities, not GPT-OSS specific. Revert to oracle/mxfp4.py and restore the original function names; update all imports accordingly. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Keep make_mxfp4_moe_kernel and make_mxfp4_moe_quant_config as generic names; restore gpt_oss prefix on the GPT-OSS-specific functions: - convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format - select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend Update all callers in mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Code Review
This pull request introduces a model-specific quantization configuration for GPT-OSS models using MXFP4, renaming internal methods and adding normalization logic to distinguish it from generic MXFP4. Feedback indicates that the changes to the base Mxfp4Config class cause a regression for non-GPT-OSS models by raising a NotImplementedError. Furthermore, the override logic for GPT-OSS is identified as being too aggressive and potentially incompatible with normalized configuration strings. Finally, hardcoded checks for the new gpt_oss_mxfp4 name in the model executor may break compatibility with other MXFP4 variants like Quark.
```python
def get_quant_method(
    self, layer: torch.nn.Module, prefix: str
) -> "QuantizeMethodBase | None":
    raise NotImplementedError(
        f"{type(self).__name__} does not implement get_quant_method. "
        "Use a model-specific subclass (e.g. GptOssMxfp4Config)."
    )
```
The Mxfp4Config.get_quant_method implementation now raises NotImplementedError, which is a regression for generic MXFP4 support. Previously, this method provided a functional implementation for FusedMoE layers. If the intention is to make Mxfp4Config a base class, it should still provide a default implementation or be marked as abstract in a way that doesn't break the user-facing "mxfp4" quantization flag for non-GPT-OSS models.
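A minimal self-contained sketch of the fallback this review asks for; the stub classes below stand in for vLLM's `QuantizationConfig`, `FusedMoE`, and the generic MoE method class, and the constructor signature is an assumption, not the actual vLLM code:

```python
import torch

# Stand-ins for vLLM internals; real signatures differ.
class QuantizationConfig: ...
class FusedMoE(torch.nn.Module): ...

class Mxfp4MoEMethod:
    def __init__(self, quant_config: "Mxfp4Config") -> None:
        self.quant_config = quant_config

class Mxfp4Config(QuantizationConfig):
    def get_quant_method(
        self, layer: torch.nn.Module, prefix: str
    ) -> "Mxfp4MoEMethod | None":
        # Functional fallback: serve FusedMoE layers instead of raising,
        # so the user-facing "mxfp4" flag keeps working for non-GPT-OSS models.
        if isinstance(layer, FusedMoE):
            return Mxfp4MoEMethod(self)
        return None
```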
```python
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not (
        isinstance(hf_quant_cfg, dict)
        and hf_quant_cfg.get("quant_method") == "mxfp4"
    ):
        return None
    model_type = getattr(hf_config, "model_type", None)
    if model_type is not None and model_type != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
```
The override_quantization_method logic in GptOssMxfp4Config has two critical issues:
- It returns `"gpt_oss_mxfp4"` if `hf_config` is `None` (since `model_type` will be `None`), which is too aggressive for a model-specific config. It should only return the override if it can confirm the model is `"gpt_oss"`.
- It only checks for `"quant_method": "mxfp4"`. However, `GptOssForCausalLMConfig.verify_and_update_model_config` (in `models/config.py`) normalizes this to `"gpt_oss_mxfp4"` before this check runs. If a user explicitly passes `--quantization mxfp4`, this method will return `None`, leading to a mismatch error in `vllm/config/model.py` because `self.quantization` (`"mxfp4"`) won't match the normalized `quant_method` (`"gpt_oss_mxfp4"`).
Suggested change:

```python
@classmethod
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not isinstance(hf_quant_cfg, dict):
        return None
    quant_method = hf_quant_cfg.get("quant_method")
    if quant_method not in ("mxfp4", "gpt_oss_mxfp4"):
        return None
    if getattr(hf_config, "model_type", None) != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
```
```diff
 moe_weight_dtype = _get_moe_weight_dtype(layer_id=0)

-if moe_weight_dtype == "mxfp4":
+if moe_weight_dtype == "gpt_oss_mxfp4":
```
Hardcoding the check for "gpt_oss_mxfp4" here (and at line 685) breaks compatibility with other quantization methods that use MXFP4, such as Quark. Since QuarkMoEMethod (in quark_moe.py) is being updated to use the same backend selection logic, it likely expects the same alignment and loading behavior. These checks should be more generic (e.g., checking if the method is an MXFP4 variant) or QuarkMoEMethod should also be updated to use the new canonical name when applicable.
- Mxfp4Config.get_quant_method: clarify that the NotImplementedError means no subclass claimed the checkpoint, and hint at the model_type requirement, so the failure is diagnosable at the call site - gpt_oss.py load_weights: document why three separate normalization paths for "mxfp4"->"gpt_oss_mxfp4" exist (each operates on a distinct copy of the quantization config dict) Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- Move get_quant_method to Mxfp4Config as a functional fallback so
the "mxfp4" quant flag works for non-GPT-OSS models (e.g. Quark)
without hitting NotImplementedError
- Fix GptOssMxfp4Config.override_quantization_method:
- Require explicit model_type="gpt_oss" (never claim when hf_config
is None or model_type is unknown)
- Also match "gpt_oss_mxfp4" in hf_quant_cfg so --quantization mxfp4
from the user does not cause a mismatch after verify_and_update_model_config
normalizes the dict value first
- Replace hardcoded "gpt_oss_mxfp4" checks in gpt_oss.py with a
generic _is_mxfp4() helper ("mxfp4" in weight_dtype) to cover all
MXFP4 variants including Quark ("mxfp4") and GPT-OSS ("gpt_oss_mxfp4")
Co-Authored-By: Claude
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
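For reference, the `_is_mxfp4()` helper this commit describes reduces to the substring check quoted in the message; a self-contained sketch:

```python
def _is_mxfp4(weight_dtype: str) -> bool:
    # True for every MXFP4 variant: "mxfp4" (Quark, generic)
    # and "gpt_oss_mxfp4" (GPT-OSS).
    return "mxfp4" in weight_dtype

assert _is_mxfp4("mxfp4") and _is_mxfp4("gpt_oss_mxfp4")
assert not _is_mxfp4("bf16")
```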
- base_config.py: add type annotations to override_quantization_method parameters (hf_quant_cfg, user_quant, hf_config) to fix griffe warnings in mkdocs build - docs/design/moe_kernel_features.md: update cross-reference from Mxfp4MoEMethod to GptOssMxfp4MoEMethod to match the renamed class Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
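A sketch of what the annotated base-class hook plausibly looks like after this commit; the exact annotation types (`dict[str, Any]`, the return type, the `hf_config` type) are assumptions rather than the actual vLLM signatures:

```python
from typing import Any

class QuantizationConfig:
    @classmethod
    def override_quantization_method(
        cls,
        hf_quant_cfg: dict[str, Any] | None,
        user_quant: str | None,
        hf_config: Any | None = None,  # an HF PretrainedConfig in practice
    ) -> str | None:
        # The base hook claims nothing; subclasses return a method name
        # (e.g. "gpt_oss_mxfp4") to redirect the checkpoint.
        return None
```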
Documentation preview: https://vllm--39604.org.readthedocs.build/en/39604/
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Jonathan Chen <chenleejonathan@gmail.com>
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
…_config parameter to HPU quantization config overrides (#1349)

## Summary

Fixes a regression introduced by upstream vLLM that breaks all quantization tests using HPU-specific GPTQ and AWQ backends (e.g. `run_qwen3_inc_dynamic_load_generate_test`).

## Changes

1. **Add `hf_config` parameter to `override_quantization_method()` in `GPTQHPUConfig` and `AWQHPUConfig`** — upstream changed the call site in `vllm/config/model.py` to pass `hf_config=self.hf_config`, but plugin implementations still used the old 2-parameter signature, causing `TypeError`.
2. **Re-enable `build_nixl_dockerfile` CI test** in pre-merge workflow.

## Upstream PR that introduced the regression

- vllm-project/vllm#39604 — added `hf_config` keyword argument to `override_quantization_method()` call and updated all upstream implementations, but plugin implementations were not updated.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
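A minimal sketch of the plugin-side fix described above: accept the new keyword so the upstream call site no longer raises `TypeError`. The class names follow the PR text; the base-class stub and the elided body are illustrative, not the actual plugin code:

```python
class GPTQConfig:  # stand-in for the upstream base class
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        return None

class GPTQHPUConfig(GPTQConfig):  # AWQHPUConfig gets the same treatment
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        # Accepting hf_config (even if unused) matches the new upstream call
        # site that passes hf_config=self.hf_config, avoiding a TypeError.
        # The HPU-specific selection logic itself is elided here.
        return None
```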
```diff
-def select_mxfp4_moe_backend(
+def select_gpt_oss_mxfp4_moe_backend(
```
@zyongye IMO these are not gpt-oss specific. quark_moe.py is being refactored to adopt the mxfp4 oracle, which supports any model with an MXFP4 MoE. (You might see duplicate comments from me; I sent them in the wrong place earlier.)
We are temporarily routing all mxfp4 to gpt_oss_mxfp4 for compatibility reasons. We will later create another MXFP4 MoE method that is decoupled from this one. Once that method is added, we can route the AMD-related changes to the new MoE class.
### What this PR does / why we need it?

Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`

1. fix vllm-project/vllm#32936

That PR causes an issue in the capture phase (`_dummy_run`); see vllm-project/vllm#28207 (comment). We should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12]. However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6 — causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.

You can also reproduce the issue with this script:

```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)

def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs

def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")

if __name__ == "__main__":
    main()
```

2. fix vllm-project/vllm#39604

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

1. For 310P, we are

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

Signed-off-by: wangli <wangli858794774@gmail.com>
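A sketch of the fix this PR describes, re-reading the possibly rewritten capture sizes after the upstream hook runs; the class, method, and attribute names are taken from the PR text and the base-class stub is illustrative, not the actual vllm-ascend code:

```python
class GPUModelRunner:  # stand-in for the upstream vLLM base runner
    def _check_and_update_cudagraph_mode(self, *args, **kwargs) -> None:
        # Upstream may rewrite compilation_config.cudagraph_capture_sizes
        # here (e.g. rounding for speculative decoding).
        pass

class NPUModelRunner(GPUModelRunner):
    def _check_and_update_cudagraph_mode(self, *args, **kwargs) -> None:
        super()._check_and_update_cudagraph_mode(*args, **kwargs)
        # Re-sync with the possibly rewritten sizes (e.g. [5, 12] -> [6, 12])
        # so set_graph_params keys graph_params.events by the same values
        # the CudagraphDispatcher will use during capture.
        self.cudagraph_batch_sizes = list(
            self.compilation_config.cudagraph_capture_sizes
        )
```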
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
(Generated with Claude)
This PR scopes the MXFP4 quantization implementation to GPT-OSS checkpoints
while keeping generic MXFP4 support for other models (e.g. Quark).
Changes
- `quantization/mxfp4.py` (renamed from `gpt_oss_mxfp4.py`):
  - `Mxfp4Config` as the canonical base class (`get_name()="mxfp4"`) with a functional `get_quant_method()` fallback so the `"mxfp4"` quant flag works for non-GPT-OSS models
  - `GptOssMxfp4Config(Mxfp4Config)` (`get_name()="gpt_oss_mxfp4"`) following the `ModelOptQuantConfigBase`/`ModelOptFp8Config` pattern
  - `GptOssMxfp4Config.override_quantization_method()` redirects GPT-OSS checkpoints (`quant_method="mxfp4"` + `model_type="gpt_oss"`) to `"gpt_oss_mxfp4"`; requires explicit `model_type="gpt_oss"` to avoid claiming other MXFP4 checkpoints
- `quantization/__init__.py`: registers `"gpt_oss_mxfp4"` in `QuantizationMethods` and `method_to_config`; `"mxfp4"` → `Mxfp4Config`, `"gpt_oss_mxfp4"` → `GptOssMxfp4Config`
- `base_config.py`: extends `override_quantization_method` with `hf_config=None` and updates all subclasses, enabling model-type-aware override decisions (see the sketch after this list)
- `models/config.py`: `GptOssForCausalLMConfig.verify_and_update_model_config` normalizes `hf_config.quantization_config["quant_method"]` `"mxfp4"` → `"gpt_oss_mxfp4"`
- `models/gpt_oss.py`: normalizes `quant_method` in `load_weights` for direct JSON reads; replaces `== "gpt_oss_mxfp4"` checks with an `_is_mxfp4()` helper (`"mxfp4" in weight_dtype`) covering all variants (GPT-OSS: `"gpt_oss_mxfp4"`, Quark: `"mxfp4"`)
- `oracle/mxfp4.py` (renamed from `gpt_oss_mxfp4.py`): keeps the generic names (`make_mxfp4_moe_kernel`, `make_mxfp4_moe_quant_config`, `Mxfp4MoeBackend`); renames `select_mxfp4_moe_backend` → `select_gpt_oss_mxfp4_moe_backend` and `convert_to_mxfp4_moe_kernel_format` → `convert_gpt_oss_weight_to_mxfp4_moe_kernel_format`
"quant_method": "mxfp4"in JSON is unchanged"mxfp4"entry inQuantizationMethodspreservedTest Plan
gpt-oss-20b gpqa score with low reasoning effort
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.