
[Quantization] [Refactor] Create special "GptOssMxfp4MoeMethod"#39604

Merged
mgoin merged 10 commits into vllm-project:main from zyongye:rename_gpt_oss_mxfp4
Apr 13, 2026

Conversation

@zyongye
Member

@zyongye zyongye commented Apr 12, 2026

Purpose

(Generated with Claude)

This PR scopes the MXFP4 quantization implementation to GPT-OSS checkpoints
while keeping generic MXFP4 support for other models (e.g. Quark).

Changes

quantization/mxfp4.py (renamed from gpt_oss_mxfp4.py):

  • Introduces Mxfp4Config as the canonical base class (get_name() = "mxfp4")
    with a functional get_quant_method() fallback so the "mxfp4" quant flag
    works for non-GPT-OSS models
  • Introduces GptOssMxfp4Config(Mxfp4Config) (get_name() = "gpt_oss_mxfp4"),
    following the ModelOptQuantConfigBase/ModelOptFp8Config pattern
  • GptOssMxfp4Config.override_quantization_method() redirects GPT-OSS
    checkpoints (quant_method="mxfp4" + model_type="gpt_oss") to
    "gpt_oss_mxfp4"; it requires an explicit model_type="gpt_oss" so it never
    claims other MXFP4 checkpoints (both classes are sketched below)
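
A minimal sketch of the class split described above, using the names from this PR; the real classes subclass QuantizationConfig in base_config.py, the fallback body lives in quantization/mxfp4.py, and the override logic is quoted verbatim in the review thread below:

```python
class Mxfp4Config:  # the real class subclasses QuantizationConfig
    @classmethod
    def get_name(cls) -> str:
        return "mxfp4"

    def get_quant_method(self, layer, prefix: str):
        # Functional fallback so generic MXFP4 checkpoints (e.g. Quark) keep
        # working under the plain "mxfp4" flag; actual body is in mxfp4.py.
        ...


class GptOssMxfp4Config(Mxfp4Config):
    @classmethod
    def get_name(cls) -> str:
        return "gpt_oss_mxfp4"
    # override_quantization_method() (shown in the review thread below)
    # redirects quant_method="mxfp4" + model_type="gpt_oss" to this class.
```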

quantization/__init__.py:

  • Registers "gpt_oss_mxfp4" in QuantizationMethods and method_to_config;
    "mxfp4"Mxfp4Config, "gpt_oss_mxfp4"GptOssMxfp4Config

base_config.py:

  • Extends override_quantization_method with an hf_config=None parameter and
    updates all subclasses, enabling model-type-aware override decisions (see
    the sketch below)
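
A minimal sketch of the widened hook, assuming the parameter described above (the stand-in class and default body are illustrative; concrete configs override it):

```python
class QuantizationConfig:  # illustrative stand-in for the real base class
    @classmethod
    def override_quantization_method(
        cls,
        hf_quant_cfg: dict | None,
        user_quant: str | None,
        hf_config: object | None = None,  # new: e.g. the HF PretrainedConfig
    ) -> str | None:
        # Default: claim nothing; subclasses inspect hf_config.model_type
        # to make model-type-aware decisions.
        return None
```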

models/config.py:

  • GptOssForCausalLMConfig.verify_and_update_model_config normalizes
    hf_config.quantization_config["quant_method"] from "mxfp4" to
    "gpt_oss_mxfp4" (illustrated below)
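
A rough sketch of that normalization, assuming quantization_config is a plain dict (the hook name is from the PR; the body is illustrative):

```python
def verify_and_update_model_config(hf_config) -> None:
    # Rewrite the checkpoint's quant_method so vLLM selects the GPT-OSS
    # specific config class instead of the generic MXFP4 one.
    quant_cfg = getattr(hf_config, "quantization_config", None)
    if isinstance(quant_cfg, dict) and quant_cfg.get("quant_method") == "mxfp4":
        quant_cfg["quant_method"] = "gpt_oss_mxfp4"
```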

models/gpt_oss.py:

  • Normalizes quant_method in load_weights for direct JSON reads
  • Replaces hardcoded == "gpt_oss_mxfp4" checks with an _is_mxfp4() helper
    ("mxfp4" in weight_dtype) covering all variants (GPT-OSS: "gpt_oss_mxfp4",
    Quark: "mxfp4"); see the sketch below

oracle/mxfp4.py (renamed from gpt_oss_mxfp4.py):

  • Generic functions/enum kept with generic names (make_mxfp4_moe_kernel,
    make_mxfp4_moe_quant_config, Mxfp4MoeBackend)
  • GPT-OSS-specific functions renamed: select_mxfp4_moe_backend →
    select_gpt_oss_mxfp4_moe_backend, convert_to_mxfp4_moe_kernel_format →
    convert_gpt_oss_weight_to_mxfp4_moe_kernel_format

Backward compatibility

  • Checkpoint string "quant_method": "mxfp4" in JSON is unchanged
  • "mxfp4" entry in QuantizationMethods preserved
  • No kernel names changed

Test Plan

gpt-oss-20b GPQA score with low reasoning effort

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

zyongye added 7 commits April 11, 2026 19:59
Rename the GPT-OSS-specific MXFP4 quantization files and classes to
make ownership explicit:

- oracle/mxfp4.py → oracle/gpt_oss_mxfp4.py; Mxfp4MoeBackend → GptOssMxfp4MoeBackend
- quantization/mxfp4.py → quantization/gpt_oss_mxfp4.py; Mxfp4Config → GptOssMxfp4Config, Mxfp4MoEMethod → GptOssMxfp4MoEMethod
- get_name() now returns "gpt_oss_mxfp4"; registry keeps "mxfp4" alias for backward compat with existing model JSON configs
- Update all importers (compressed_tensors_moe_w4a4_mxfp4, quark_moe, __init__, layer.py, gpt_oss.py)

Kernel class names are unchanged.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongyezhu@meta.com>

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… method

- Rename quantization/gpt_oss_mxfp4.py back to mxfp4.py; split into a
  canonical Mxfp4Config base (get_name()="mxfp4", get_quant_method raises
  NotImplementedError) and GptOssMxfp4Config subclass (get_name()=
  "gpt_oss_mxfp4") following the ModelOptQuantConfigBase pattern
- Register "gpt_oss_mxfp4" in QuantizationMethods and method_to_config;
  "mxfp4" maps to the base class, "gpt_oss_mxfp4" to the GPT-OSS impl
- Add GptOssMxfp4Config.override_quantization_method: redirects checkpoints
  with quant_method="mxfp4" + model_type="gpt_oss" to "gpt_oss_mxfp4"
- Extend QuantizationConfig.override_quantization_method signature with
  hf_config=None; update all subclasses and the call-site in model.py to
  pass hf_config so model-type can inform override decisions
- Add GptOssForCausalLMConfig.verify_and_update_model_config to normalize
  hf_config.quantization_config quant_method "mxfp4"->"gpt_oss_mxfp4"
- Normalize quant_method in gpt_oss.py load_weights for direct reads from
  checkpoint JSON
- Guard transformers/base.py NotImplementedError for both "mxfp4" and
  "gpt_oss_mxfp4"

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
… oracle

Rename make_mxfp4_moe_quant_config and make_mxfp4_moe_kernel to
make_gpt_oss_mxfp4_moe_quant_config and make_gpt_oss_mxfp4_moe_kernel
in oracle/gpt_oss_mxfp4.py; update all callers in mxfp4.py,
compressed_tensors_moe_w4a4_mxfp4.py, and quark_moe.py.

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format
- select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend

Update all callers in mxfp4.py and quark_moe.py.

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The backend enum is a generic MXFP4 concept, not GPT-OSS specific.
Revert to the original Mxfp4MoeBackend name across oracle/gpt_oss_mxfp4.py
and all callers (mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py,
quark_moe.py).

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
The oracle file and its functions (make_mxfp4_moe_kernel,
make_mxfp4_moe_quant_config, select_mxfp4_moe_backend,
convert_to_mxfp4_moe_kernel_format) are generic MXFP4 utilities,
not GPT-OSS specific. Revert to oracle/mxfp4.py and restore the
original function names; update all imports accordingly.

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Keep make_mxfp4_moe_kernel and make_mxfp4_moe_quant_config as generic
names; restore gpt_oss prefix on the GPT-OSS-specific functions:
- convert_to_mxfp4_moe_kernel_format → convert_gpt_oss_weight_to_mxfp4_moe_kernel_format
- select_mxfp4_moe_backend → select_gpt_oss_mxfp4_moe_backend

Update all callers in mxfp4.py, compressed_tensors_moe_w4a4_mxfp4.py,
and quark_moe.py.

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye changed the title [Quantization] Create special "GptOssMxfp4MoeMethod" [Quantization] [Refactor] Create special "GptOssMxfp4MoeMethod" Apr 12, 2026
@mergify mergify Bot added the gpt-oss (Related to GPT-OSS models) and cpu (Related to CPU backends) labels Apr 12, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a model-specific quantization configuration for GPT-OSS models using MXFP4, renaming internal methods and adding normalization logic to distinguish it from generic MXFP4. Feedback indicates that the changes to the base Mxfp4Config class cause a regression for non-GPT-OSS models by raising a NotImplementedError. Furthermore, the override logic for GPT-OSS is identified as being too aggressive and potentially incompatible with normalized configuration strings. Finally, hardcoded checks for the new gpt_oss_mxfp4 name in the model executor may break compatibility with other MXFP4 variants like Quark.

Comment on lines +71 to +77
def get_quant_method(
    self, layer: torch.nn.Module, prefix: str
) -> "QuantizeMethodBase | None":
    raise NotImplementedError(
        f"{type(self).__name__} does not implement get_quant_method. "
        "Use a model-specific subclass (e.g. GptOssMxfp4Config)."
    )
Contributor


high

The Mxfp4Config.get_quant_method implementation now raises NotImplementedError, which is a regression for generic MXFP4 support. Previously, this method provided a functional implementation for FusedMoE layers. If the intention is to make Mxfp4Config a base class, it should still provide a default implementation or be marked as abstract in a way that doesn't break the user-facing "mxfp4" quantization flag for non-GPT-OSS models.

Comment on lines +97 to +108
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not (
        isinstance(hf_quant_cfg, dict)
        and hf_quant_cfg.get("quant_method") == "mxfp4"
    ):
        return None
    model_type = getattr(hf_config, "model_type", None)
    if model_type is not None and model_type != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
Contributor


high

The override_quantization_method logic in GptOssMxfp4Config has two critical issues:

  1. It returns "gpt_oss_mxfp4" if hf_config is None (since model_type will be None), which is too aggressive for a model-specific config. It should only return the override if it can confirm the model is "gpt_oss".
  2. It only checks for "quant_method": "mxfp4". However, GptOssForCausalLMConfig.verify_and_update_model_config (in models/config.py) normalizes this to "gpt_oss_mxfp4" before this check runs. If a user explicitly passes --quantization mxfp4, this method will return None, leading to a mismatch error in vllm/config/model.py because self.quantization ("mxfp4") won't match the normalized quant_method ("gpt_oss_mxfp4").
Suggested change
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not (
        isinstance(hf_quant_cfg, dict)
        and hf_quant_cfg.get("quant_method") == "mxfp4"
    ):
        return None
    model_type = getattr(hf_config, "model_type", None)
    if model_type is not None and model_type != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"
@classmethod
def override_quantization_method(
    cls, hf_quant_cfg, user_quant, hf_config=None
) -> QuantizationMethods | None:
    if not isinstance(hf_quant_cfg, dict):
        return None
    quant_method = hf_quant_cfg.get("quant_method")
    if quant_method not in ("mxfp4", "gpt_oss_mxfp4"):
        return None
    if getattr(hf_config, "model_type", None) != "gpt_oss":
        return None
    return "gpt_oss_mxfp4"

Comment thread vllm/model_executor/models/gpt_oss.py Outdated
moe_weight_dtype = _get_moe_weight_dtype(layer_id=0)

if moe_weight_dtype == "mxfp4":
if moe_weight_dtype == "gpt_oss_mxfp4":
Contributor


high

Hardcoding the check for "gpt_oss_mxfp4" here (and at line 685) breaks compatibility with other quantization methods that use MXFP4, such as Quark. Since QuarkMoEMethod (in quark_moe.py) is being updated to use the same backend selection logic, it likely expects the same alignment and loading behavior. These checks should be more generic (e.g., checking if the method is an MXFP4 variant) or QuarkMoEMethod should also be updated to use the new canonical name when applicable.

zyongye added 3 commits April 12, 2026 03:39
- Mxfp4Config.get_quant_method: clarify that the NotImplementedError
  means no subclass claimed the checkpoint, and hint at the model_type
  requirement, so the failure is diagnosable at the call site
- gpt_oss.py load_weights: document why three separate normalization
  paths for "mxfp4"->"gpt_oss_mxfp4" exist (each operates on a
  distinct copy of the quantization config dict)

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- Move get_quant_method to Mxfp4Config as a functional fallback so
  the "mxfp4" quant flag works for non-GPT-OSS models (e.g. Quark)
  without hitting NotImplementedError
- Fix GptOssMxfp4Config.override_quantization_method:
  - Require explicit model_type="gpt_oss" (never claim when hf_config
    is None or model_type is unknown)
  - Also match "gpt_oss_mxfp4" in hf_quant_cfg so --quantization mxfp4
    from the user does not cause a mismatch after verify_and_update_model_config
    normalizes the dict value first
- Replace hardcoded "gpt_oss_mxfp4" checks in gpt_oss.py with a
  generic _is_mxfp4() helper ("mxfp4" in weight_dtype) to cover all
  MXFP4 variants including Quark ("mxfp4") and GPT-OSS ("gpt_oss_mxfp4")

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
- base_config.py: add type annotations to override_quantization_method
  parameters (hf_quant_cfg, user_quant, hf_config) to fix griffe warnings
  in mkdocs build
- docs/design/moe_kernel_features.md: update cross-reference from
  Mxfp4MoEMethod to GptOssMxfp4MoEMethod to match the renamed class

Co-Authored-By: Claude

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify
Contributor

mergify Bot commented Apr 12, 2026

Documentation preview: https://vllm--39604.org.readthedocs.build/en/39604/

@mergify mergify Bot added the documentation (Improvements or additions to documentation) label Apr 12, 2026
@mgoin mgoin added the nvidia label Apr 13, 2026
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 13, 2026
@mgoin mgoin merged commit 739e594 into vllm-project:main Apr 13, 2026
86 of 88 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 13, 2026
jonathanc-n pushed a commit to jonathanc-n/vllm that referenced this pull request Apr 13, 2026
…-project#39604)

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Jonathan Chen <chenleejonathan@gmail.com>
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
tzielinski-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Apr 15, 2026
…_config parameter to HPU quantization config overrides (#1349)

## Summary

Fixes a regression introduced by upstream vLLM that breaks all
quantization tests using HPU-specific GPTQ and AWQ backends (e.g.
`run_qwen3_inc_dynamic_load_generate_test`).

## Changes

1. **Add `hf_config` parameter to `override_quantization_method()` in
`GPTQHPUConfig` and `AWQHPUConfig`** — upstream changed the call site in
`vllm/config/model.py` to pass `hf_config=self.hf_config`, but plugin
implementations still used the old 2-parameter signature, causing
`TypeError`.
2. **Re-enable `build_nixl_dockerfile` CI test** in pre-merge workflow.

## Upstream PR that introduced the regression

- vllm-project/vllm#39604 — added `hf_config`
keyword argument to `override_quantization_method()` call and updated
all upstream implementations, but plugin implementations were not
updated.

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
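
A hypothetical sketch of the plugin-side fix described in item 1 of the commit above: the override only needs to accept the new keyword so the upstream call no longer raises TypeError (class name from the commit text; the body is illustrative):

```python
class GPTQHPUConfig:  # the same change applies to AWQHPUConfig
    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg, user_quant, hf_config=None
    ) -> str | None:
        # Accepting the new hf_config keyword (even if unused here) means the
        # upstream call site, which now passes hf_config=self.hf_config, no
        # longer raises TypeError against this plugin.
        ...
```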


def select_mxfp4_moe_backend(
def select_gpt_oss_mxfp4_moe_backend(
Contributor


@zyongye IMO these are not gpt-oss specific. quark_moe.py is being refactored to adopt the mxfp4 oracle, which supports any model with an MXFP4 MoE. (Might be dup comments from me, I sent this in the wrong place earlier.)

Member Author


We are temporarily routing all mxfp4 to gpt_oss_mxfp4 for compatibility reasons. We will later create another MXFP4 MoE method that is decoupled from this one. Once that method is added, we can route the AMD-related changes to the new MoE class.

wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 17, 2026
### What this PR does / why we need it?
Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`
1. fix vllm-project/vllm#32936

This PR leads to an issue in the capture phase (_dummy_run),
vllm-project/vllm#28207 (comment):
we should re-read `compilation_config.cudagraph_capture_sizes` after the
super() call in `_check_and_update_cudagraph_mode` to keep
`self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes
in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and
`cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's
`_check_and_update_cudagraph_mode` calls
`adjust_cudagraph_sizes_for_spec_decode`, which rounds
`cudagraph_capture_sizes` up to a multiple of
(num_speculative_tokens + 1). For example, with
`num_speculative_tokens=2`, [5, 12] becomes [6, 12].

However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during
__init__ with the original [5, 12]. When
`set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates
`graph_params.events` keyed by {5, 12}. Meanwhile, the
`CudagraphDispatcher` uses the updated [6, 12] from
`compilation_config`, so it tries to capture at num_tokens=6 — causing
KeyError: 6 in `graph_params.events[num_tokens]` inside
`full_graph_fia`.
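
A hypothetical sketch of the fix described above (the method and attribute names follow the commit text; the stand-in parent class is illustrative, and the real NPUModelRunner code may differ):

```python
class _UpstreamModelRunner:
    """Stand-in for vLLM's model runner; the super() call may rewrite sizes."""

    def _check_and_update_cudagraph_mode(self) -> None:
        # e.g. adjust_cudagraph_sizes_for_spec_decode rounds [5, 12] up to [6, 12]
        self.compilation_config.cudagraph_capture_sizes = [6, 12]


class NPUModelRunner(_UpstreamModelRunner):
    def _check_and_update_cudagraph_mode(self) -> None:
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) capture sizes so the cached batch
        # sizes used by set_graph_params stay in sync with the dispatcher.
        self.cudagraph_batch_sizes = list(
            self.compilation_config.cudagraph_capture_sizes
        )
```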

You can also reproduce the issue with this script:
```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()

```


3. fix vllm-project/vllm#39604
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
1. For 310P, we are
- vLLM version: 
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
bmyrcha pushed a commit to bmyrcha/vllm-gaudi that referenced this pull request Apr 17, 2026
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 20, 2026
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
yeonsily pushed a commit to yeonsily/vllm-gaudi that referenced this pull request Apr 21, 2026
bmyrcha pushed a commit to bmyrcha/vllm-gaudi that referenced this pull request Apr 22, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…-project#39604)

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026

Labels

cpu (Related to CPU backends), documentation (Improvements or additions to documentation), gpt-oss (Related to GPT-OSS models), nvidia, quantization, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

Development


3 participants