[Quantization] [Refactor] Create special "GptOssMxfp4MoeMethod" #39604
mgoin merged 10 commits into vllm-project:main
Conversation
Rename the GPT-OSS-specific MXFP4 quantization files and classes to make ownership explicit: - oracle/mxfp4.py → oracle/gpt_oss_mxfp4.py; Mxfp4MoeBackend → GptOssMxfp4MoeBackend - quantization/mxfp4.py → quantization/gpt_oss_mxfp4.py; Mxfp4Config → GptOssMxfp4Config, Mxfp4MoEMethod → GptOssMxfp4MoEMethod - get_name() now returns "gpt_oss_mxfp4"; registry keeps "mxfp4" alias for backward compat with existing model JSON configs - Update all importers (compressed_tensors_moe_w4a4_mxfp4, quark_moe, __init__, layer.py, gpt_oss.py) Kernel class names are unchanged. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Yongye Zhu <yongyezhu@meta.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… method - Rename quantization/gpt_oss_mxfp4.py back to mxfp4.py; split into a canonical Mxfp4Config base (get_name()="mxfp4", get_quant_method raises NotImplementedError) and GptOssMxfp4Config subclass (get_name()= "gpt_oss_mxfp4") following the ModelOptQuantConfigBase pattern - Register "gpt_oss_mxfp4" in QuantizationMethods and method_to_config; "mxfp4" maps to the base class, "gpt_oss_mxfp4" to the GPT-OSS impl - Add GptOssMxfp4Config.override_quantization_method: redirects checkpoints with quant_method="mxfp4" + model_type="gpt_oss" to "gpt_oss_mxfp4" - Extend QuantizationConfig.override_quantization_method signature with hf_config=None; update all subclasses and the call-site in model.py to pass hf_config so model-type can inform override decisions - Add GptOssForCausalLMConfig.verify_and_update_model_config to normalize hf_config.quantization_config quant_method "mxfp4"->"gpt_oss_mxfp4" - Normalize quant_method in gpt_oss.py load_weights for direct reads from checkpoint JSON - Guard transformers/base.py NotImplementedError for both "mxfp4" and "gpt_oss_mxfp4" Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… oracle Rename make_mxfp4_moe_quant_config and make_mxfp4_moe_kernel to make_gpt_oss_mxfp4_moe_quant_config and make_gpt_oss_mxfp4_moe_kernel in oracle/gpt_oss_mxfp4.py; update all callers in mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format - select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend Update all callers in mxfp4.py and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The backend enum is a generic MXFP4 concept, not GPT-OSS specific. Revert to the original Mxfp4MoeBackend name across oracle/gpt_oss_mxfp4.py and all callers (mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, quark_moe.py). Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The oracle file and its functions (make_mxfp4_moe_kernel, make_mxfp4_moe_quant_config, select_mxfp4_moe_backend, convert_to_mxfp4_moe_kernel_format) are generic MXFP4 utilities, not GPT-OSS specific. Revert to oracle/mxfp4.py and restore the original function names; update all imports accordingly. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Keep make_mxfp4_moe_kernel and make_mxfp4_moe_quant_config as generic names; restore gpt_oss prefix on the GPT-OSS-specific functions: - convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format - select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend Update all callers in mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py, and quark_moe.py. Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Code Review
This pull request introduces a model-specific quantization configuration for GPT-OSS models using MXFP4, renaming internal methods and adding normalization logic to distinguish it from generic MXFP4. Feedback indicates that the changes to the base Mxfp4Config class cause a regression for non-GPT-OSS models by raising a NotImplementedError. Furthermore, the override logic for GPT-OSS is identified as being too aggressive and potentially incompatible with normalized configuration strings. Finally, hardcoded checks for the new gpt_oss_mxfp4 name in the model executor may break compatibility with other MXFP4 variants like Quark.
```python
def get_quant_method(
    self, layer: torch.nn.Module, prefix: str
) -> "QuantizeMethodBase | None":
    raise NotImplementedError(
        f"{type(self).__name__} does not implement get_quant_method. "
        "Use a model-specific subclass (e.g. GptOssMxfp4Config)."
    )
```
The Mxfp4Config.get_quant_method implementation now raises NotImplementedError, which is a regression for generic MXFP4 support. Previously, this method provided a functional implementation for FusedMoE layers. If the intention is to make Mxfp4Config a base class, it should still provide a default implementation or be marked as abstract in a way that doesn't break the user-facing "mxfp4" quantization flag for non-GPT-OSS models.
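A minimal self-contained sketch of the fallback this review asks for; the stub classes below stand in for vLLM's `QuantizationConfig`, `FusedMoE`, and the generic MoE method class, and the constructor signature is an assumption, not the actual vLLM code:

```python
import torch

# Stand-ins for vLLM internals; real signatures differ.
class QuantizationConfig: ...
class FusedMoE(torch.nn.Module): ...

class Mxfp4MoEMethod:
    def __init__(self, quant_config: "Mxfp4Config") -> None:
        self.quant_config = quant_config

class Mxfp4Config(QuantizationConfig):
    def get_quant_method(
        self, layer: torch.nn.Module, prefix: str
    ) -> "Mxfp4MoEMethod | None":
        # Functional fallback: serve FusedMoE layers instead of raising,
        # so the user-facing "mxfp4" flag keeps working for non-GPT-OSS models.
        if isinstance(layer, FusedMoE):
            return Mxfp4MoEMethod(self)
        return None
```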
```python
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not (
        isinstance(hf_quant_cfg, dict)
        and hf_quant_cfg.get("quant_method") == "mxfp4"
    ):
        return None
    model_type = getattr(hf_config, "model_type", None)
    if model_type is not None and model_type != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
```
The override_quantization_method logic in GptOssMxfp4Config has two critical issues:
- It returns `"gpt_oss_mxfp4"` if `hf_config` is `None` (since `model_type` will be `None`), which is too aggressive for a model-specific config. It should only return the override if it can confirm the model is `"gpt_oss"`.
- It only checks for `"quant_method": "mxfp4"`. However, `GptOssForCausalLMConfig.verify_and_update_model_config` (in `models/config.py`) normalizes this to `"gpt_oss_mxfp4"` before this check runs. If a user explicitly passes `--quantization mxfp4`, this method will return `None`, leading to a mismatch error in `vllm/config/model.py` because `self.quantization` (`"mxfp4"`) won't match the normalized `quant_method` (`"gpt_oss_mxfp4"`).
Suggested change:

```python
@classmethod
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not isinstance(hf_quant_cfg, dict):
        return None
    quant_method = hf_quant_cfg.get("quant_method")
    if quant_method not in ("mxfp4", "gpt_oss_mxfp4"):
        return None
    if getattr(hf_config, "model_type", None) != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
```
```diff
 moe_weight_dtype = _get_moe_weight_dtype(layer_id=0)

-if moe_weight_dtype == "mxfp4":
+if moe_weight_dtype == "gpt_oss_mxfp4":
```
Hardcoding the check for "gpt_oss_mxfp4" here (and at line 685) breaks compatibility with other quantization methods that use MXFP4, such as Quark. Since QuarkMoEMethod (in quark_moe.py) is being updated to use the same backend selection logic, it likely expects the same alignment and loading behavior. These checks should be more generic (e.g., checking if the method is an MXFP4 variant) or QuarkMoEMethod should also be updated to use the new canonical name when applicable.
- Mxfp4Config.get_quant_method: clarify that the NotImplementedError means no subclass claimed the checkpoint, and hint at the model_type requirement, so the failure is diagnosable at the call site - gpt_oss.py load_weights: document why three separate normalization paths for "mxfp4"->"gpt_oss_mxfp4" exist (each operates on a distinct copy of the quantization config dict) Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- Move get_quant_method to Mxfp4Config as a functional fallback so
the "mxfp4" quant flag works for non-GPT-OSS models (e.g. Quark)
without hitting NotImplementedError
- Fix GptOssMxfp4Config.override_quantization_method:
- Require explicit model_type="gpt_oss" (never claim when hf_config
is None or model_type is unknown)
- Also match "gpt_oss_mxfp4" in hf_quant_cfg so --quantization mxfp4
from the user does not cause a mismatch after verify_and_update_model_config
normalizes the dict value first
- Replace hardcoded "gpt_oss_mxfp4" checks in gpt_oss.py with a
generic _is_mxfp4() helper ("mxfp4" in weight_dtype) to cover all
MXFP4 variants including Quark ("mxfp4") and GPT-OSS ("gpt_oss_mxfp4")
Co-Authored-By: Claude
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
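For reference, the `_is_mxfp4()` helper this commit describes reduces to the substring check quoted in the message; a self-contained sketch:

```python
def _is_mxfp4(weight_dtype: str) -> bool:
    # True for every MXFP4 variant: "mxfp4" (Quark, generic)
    # and "gpt_oss_mxfp4" (GPT-OSS).
    return "mxfp4" in weight_dtype

assert _is_mxfp4("mxfp4") and _is_mxfp4("gpt_oss_mxfp4")
assert not _is_mxfp4("bf16")
```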
- base_config.py: add type annotations to override_quantization_method parameters (hf_quant_cfg, user_quant, hf_config) to fix griffe warnings in mkdocs build - docs/design/moe_kernel_features.md: update cross-reference from Mxfp4MoEMethod to GptOssMxfp4MoEMethod to match the renamed class Co-Authored-By: Claude Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
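A sketch of what the annotated base-class hook plausibly looks like after this commit; the exact annotation types (`dict[str, Any]`, the return type, the `hf_config` type) are assumptions rather than the actual vLLM signatures:

```python
from typing import Any

class QuantizationConfig:
    @classmethod
    def override_quantization_method(
        cls,
        hf_quant_cfg: dict[str, Any] | None,
        user_quant: str | None,
        hf_config: Any | None = None,  # an HF PretrainedConfig in practice
    ) -> str | None:
        # The base hook claims nothing; subclasses return a method name
        # (e.g. "gpt_oss_mxfp4") to redirect the checkpoint.
        return None
```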
Documentation preview: https://vllm--39604.org.readthedocs.build/en/39604/
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Jonathan Chen <chenleejonathan@gmail.com>
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
…_config parameter to HPU quantization config overrides (#1349)

## Summary

Fixes a regression introduced by upstream vLLM that breaks all quantization tests using HPU-specific GPTQ and AWQ backends (e.g. `run_qwen3_inc_dynamic_load_generate_test`).

## Changes

1. **Add `hf_config` parameter to `override_quantization_method()` in `GPTQHPUConfig` and `AWQHPUConfig`** — upstream changed the call site in `vllm/config/model.py` to pass `hf_config=self.hf_config`, but plugin implementations still used the old 2-parameter signature, causing `TypeError`.
2. **Re-enable `build_nixl_dockerfile` CI test** in pre-merge workflow.

## Upstream PR that introduced the regression

- vllm-project/vllm#39604 — added `hf_config` keyword argument to `override_quantization_method()` call and updated all upstream implementations, but plugin implementations were not updated.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
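A minimal sketch of the plugin-side fix described above: accept the new keyword so the upstream call site no longer raises `TypeError`. The class names follow the PR text; the base-class stub and the elided body are illustrative, not the actual plugin code:

```python
class GPTQConfig:  # stand-in for the upstream base class
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        return None

class GPTQHPUConfig(GPTQConfig):  # AWQHPUConfig gets the same treatment
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        # Accepting hf_config (even if unused) matches the new upstream call
        # site that passes hf_config=self.hf_config, avoiding a TypeError.
        # The HPU-specific selection logic itself is elided here.
        return None
```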
```diff
-def select_mxfp4_moe_backend(
+def select_gpt_oss_mxfp4_moe_backend(
```
@zyongye IMO these are not gpt-oss specific. quark_moe.py is being refactored to adopt the mxfp4 oracle, which supports any model with an MXFP4 MoE. (You might see duplicate comments from me; I sent them in the wrong place earlier.)
We are temporarily routing all mxfp4 to gpt_oss_mxfp4 for compatibility reasons. We will later create another MXFP4 MoE method that is decoupled from this one. Once that method is added, we can route the AMD-related changes to the new MoE class.
### What this PR does / why we need it?

Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`

1. fix vllm-project/vllm#32936

That PR causes an issue in the capture phase (`_dummy_run`); see vllm-project/vllm#28207 (comment). We should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12]. However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6 — causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.

You can also reproduce the issue with this script:

```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)

def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs

def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")

if __name__ == "__main__":
    main()
```

2. fix vllm-project/vllm#39604

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

1. For 310P, we are

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

Signed-off-by: wangli <wangli858794774@gmail.com>
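A sketch of the fix this PR describes, re-reading the possibly rewritten capture sizes after the upstream hook runs; the class, method, and attribute names are taken from the PR text and the base-class stub is illustrative, not the actual vllm-ascend code:

```python
class GPUModelRunner:  # stand-in for the upstream vLLM base runner
    def _check_and_update_cudagraph_mode(self, *args, **kwargs) -> None:
        # Upstream may rewrite compilation_config.cudagraph_capture_sizes
        # here (e.g. rounding for speculative decoding).
        pass

class NPUModelRunner(GPUModelRunner):
    def _check_and_update_cudagraph_mode(self, *args, **kwargs) -> None:
        super()._check_and_update_cudagraph_mode(*args, **kwargs)
        # Re-sync with the possibly rewritten sizes (e.g. [5, 12] -> [6, 12])
        # so set_graph_params keys graph_params.events by the same values
        # the CudagraphDispatcher will use during capture.
        self.cudagraph_batch_sizes = list(
            self.compilation_config.cudagraph_capture_sizes
        )
```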
…-project#39604) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
(Generated with Claude)
This PR scopes the MXFP4 quantization implementation to GPT-OSS checkpoints
while keeping generic MXFP4 support for other models (e.g. Quark).
Changes
- `quantization/mxfp4.py` (renamed from `gpt_oss_mxfp4.py`):
  - `Mxfp4Config` as the canonical base class (`get_name()="mxfp4"`) with a functional `get_quant_method()` fallback so the `"mxfp4"` quant flag works for non-GPT-OSS models
  - `GptOssMxfp4Config(Mxfp4Config)` (`get_name()="gpt_oss_mxfp4"`) following the `ModelOptQuantConfigBase`/`ModelOptFp8Config` pattern
  - `GptOssMxfp4Config.override_quantization_method()` redirects GPT-OSS checkpoints (`quant_method="mxfp4"` + `model_type="gpt_oss"`) to `"gpt_oss_mxfp4"`; requires explicit `model_type="gpt_oss"` to avoid claiming other MXFP4 checkpoints
- `quantization/__init__.py`: registers `"gpt_oss_mxfp4"` in `QuantizationMethods` and `method_to_config`; `"mxfp4"` → `Mxfp4Config`, `"gpt_oss_mxfp4"` → `GptOssMxfp4Config`
- `base_config.py`: extends `override_quantization_method` with `hf_config=None` and updates all subclasses, enabling model-type-aware override decisions (see the sketch after this list)
- `models/config.py`: `GptOssForCausalLMConfig.verify_and_update_model_config` normalizes `hf_config.quantization_config["quant_method"]` `"mxfp4"` → `"gpt_oss_mxfp4"`
- `models/gpt_oss.py`: normalizes `quant_method` in `load_weights` for direct JSON reads; replaces `== "gpt_oss_mxfp4"` checks with an `_is_mxfp4()` helper (`"mxfp4" in weight_dtype`) covering all variants (GPT-OSS: `"gpt_oss_mxfp4"`, Quark: `"mxfp4"`)
- `oracle/mxfp4.py` (renamed from `gpt_oss_mxfp4.py`): keeps the generic names (`make_mxfp4_moe_kernel`, `make_mxfp4_moe_quant_config`, `Mxfp4MoeBackend`); renames `select_mxfp4_moe_backend` → `select_gpt_oss_mxfp4_moe_backend` and `convert_to_mxfp4_moe_kernel_format` → `convert_gpt_oss_weight_to_mxfp4_moe_kernel_format`
"quant_method": "mxfp4"in JSON is unchanged"mxfp4"entry inQuantizationMethodspreservedTest Plan
gpt-oss-20b gpqa score with low reasoning effort
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.