
[XPU][4/N] add mxfp4 moe model support#33679

Merged
jikunshang merged 2 commits into vllm-project:main from jikunshang:kunshang/mxfp4
Feb 6, 2026

Conversation

@jikunshang
Collaborator

@jikunshang jikunshang commented Feb 3, 2026

Purpose

[4/N] of #33214
Add mxfp4 MoE support. We can also refactor the XPU part once mxfp4 adopts the kernel abstraction.

Test Plan

```shell
python3 examples/offline_inference/basic/generate.py --model openai/gpt-oss-20b --temperature 0 --enforce-eager
```

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang jikunshang changed the title [XPU][3/N] add mxfp4 moe model support [XPU][4/N] add mxfp4 moe model support Feb 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the IPEX-specific MXFP4 MoE method to a more generic XPU implementation by renaming IpexMxfp4MoEMethod to XpuMxfp4MoEMethod and replacing the IPEX-dependent logic in apply_monolithic with a call to the xpu_fused_moe kernel.

My review identifies a critical issue in the new apply_monolithic implementation where input padding is missing, which will likely lead to a shape mismatch and runtime errors. I've provided a detailed comment with a suggested fix for this. Additionally, I've noted a minor performance concern regarding an unused tensor allocation.
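The padding issue the bot describes can be illustrated with a small sketch (hypothetical helper, not code from this PR; in the real fix the block size would come from the kernel, not from this example):

```python
def pad_rows(rows: int, block: int) -> int:
    """Round a row count up to the next multiple of the kernel block size."""
    return (rows + block - 1) // block * block

# A fused-MoE-style kernel that expects block-aligned inputs would be called on
# a padded buffer, and the padding rows sliced off afterwards, e.g.:
#   padded = pad_rows(num_tokens, BLOCK)    # allocate a (padded, hidden) buffer
#   out = kernel(buffer)[:num_tokens]       # drop the padding rows
```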

@jikunshang
Collaborator Author

@robertgshaw2-redhat @mgoin can you help review? Thanks!

@jikunshang jikunshang added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 4, 2026
@jikunshang jikunshang merged commit 7439e4f into vllm-project:main Feb 6, 2026
58 checks passed
@marvind

marvind commented Feb 12, 2026

Hi! I am regularly getting gibberish output running gpt-oss-20b with reasoning set to high after this change. I run an Intel Arc Pro B60 using the v0.16.0 tag which introduced it. v0.15.1 worked fine. I use the docker/Dockerfile.xpu docker image and define the docker compose service as given below.
A good test prompt seems to be: Create a long and complex excel formula and explain it.

The model does not manage to leave reasoning anymore and starts to produce gibberish. Important to have reasoning set to high. Nothing strange in the log.
Could you have a look? Do you prefer a separate issue?

[screenshot: gibberish output]
Details
```yaml
  vllm:
    image: vllm-xpu-env
    ports:
      - "8000:8000"
    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/card0:/dev/dri/card0
    volumes:
      - /root/models:/llm/models
      - type: bind
        source: /dev/dri/by-path/pci-0000:03:00.0-card
        target: /dev/dri/by-path/pci-0000:03:00.0-card
      - type: bind
        source: /dev/dri/by-path/pci-0000:03:00.0-render
        target: /dev/dri/by-path/pci-0000:03:00.0-render
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
    shm_size: '32gb'
    entrypoint:
      - /bin/bash
      - -c
      - |
        source /opt/intel/oneapi/setvars.sh --force && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model /llm/models/gpt-oss-20b \
          --served-model-name gpt-oss-20b \
          --enforce-eager \
          --port 8000 \
          --host 0.0.0.0 \
          --gpu-memory-util=0.88 \
          --block-size 64 \
          -tp=1
```

@jikunshang
Collaborator Author

@marvind thanks for reporting this.
I used 0.16.0rc3 to build a docker container and ran a subset of gsm8k; the results look reasonable.
[screenshot: gsm8k results]

Can you share your client command? Or try with `python3 examples/online_serving/openai_chat_completion_client.py`?

@marvind

marvind commented Feb 13, 2026

Thank you for the swift feedback!
This is a minimal example which triggers this behavior using only curl and jq:

```shell
curl -N -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Create a long and complex excel formula and explain it."}
    ],
    "stream": true,
    "reasoning_effort": "high"
  }' \
| grep --line-buffered "^data: " \
| grep -v "\[DONE\]" \
| sed -u 's/^data: //' \
| jq -jr --unbuffered '
    .choices[0].delta |
    if .reasoning then
      "\u001b[93m" + .reasoning + "\u001b[0m"
    elif .content then
      .content
    else
      empty
    end
'
```

For me it consistently ends up outputting exclamation marks (!!!!) after a longer reasoning phase, and the output gets more and more chaotic:
sample_gpt-oss-20b_vllm-v0.16.0.txt

I will also run lm_eval but have time for it first tomorrow.

@marvind

marvind commented Feb 14, 2026

@jikunshang

local-chat-completions ({'model': 'gpt-oss-20b', 'base_url': 'http://localhost:8000/v1/chat/completions', 'max_gen_toks': 4096, 'num_concurrent': 24}), gen_kwargs: ({}), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.08|±  |0.0273|

```shell
docker-compose exec vllm lm_eval --model local-chat-completions --tasks gsm8k --num_fewshot 5 --batch_size 1 --model_args "model=gpt-oss-20b,base_url=http://localhost:8000/v1/chat/completions,max_gen_toks=4096,num_concurrent=24" --apply_chat_template --output_path ./lm_eval_output --log_sample --limit 100
```

strict-match looks off compared to yours, doesn't it? I also get `WARNING [models.api_models:822] API returned null content. Check reasoning_content field or generation limits.` from time to time. This is on v0.16.0.
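For context on the strict vs. flexible gap, a sketch of how the two gsm8k filters differ (the regexes below only approximate lm_eval's filters, they are not copied from it): strict-match only accepts answers in the canonical `#### <number>` format, while flexible-extract takes the last number anywhere in the completion, which is why a chat model answering in prose can score well on flexible-extract and poorly on strict-match.

```python
import re

# Approximation of lm_eval's gsm8k filters (assumed, not verbatim):
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")

def strict_extract(text: str):
    """Accept only the canonical '#### <number>' answer format."""
    m = STRICT.search(text)
    return m.group(1) if m else None

def flexible_extract(text: str):
    """Take the last number appearing anywhere in the completion."""
    nums = re.findall(r"-?\d[\d,\.]*", text)
    return nums[-1].rstrip(".,") if nums else None
```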

v0.15.1 actually looks similar but I do not get the warnings and the curl command from my previous message works fine:

local-chat-completions ({'model': 'gpt-oss-20b', 'base_url': 'http://localhost:8000/v1/chat/completions', 'max_gen_toks': 4096, 'num_concurrent': 24}), gen_kwargs: ({}), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.87|±  |0.0338|
|     |       |strict-match    |     5|exact_match|↑  | 0.05|±  |0.0219|

```shell
docker-compose exec vllm lm_eval --model local-chat-completions --tasks gsm8k --num_fewshot 5 --batch_size 1 --model_args "model=gpt-oss-20b,base_url=http://localhost:8000/v1/chat/completions,max_gen_toks=4096,num_concurrent=24" --apply_chat_template --output_path ./lm_eval_output --log_sample --limit 100
```

Not sure how to interpret this. Please let me know if I can provide more information.

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang
Collaborator Author

@marvind
Sorry for the late reply. We did find an accuracy issue after switching to the latest vllm-xpu-kernels implementation, and narrowed it down to the decode attention kernel. We will try to provide a solid fix in the next release.
Some workarounds for you:

  1. roll back to v0.15.1 with IPEX
  2. use triton attention instead (performance will not be optimal)
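For workaround 2, a sketch of how the backend could be forced (assumption: vLLM selects its attention backend via the `VLLM_ATTENTION_BACKEND` environment variable, and the exact value naming the Triton backend may vary between releases; it could equally go in the docker-compose `environment:` list above):

```python
import os

# Assumed selector env var and backend name -- check your vLLM version's docs.
# Must be set before the server process starts.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
```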

@marvind

marvind commented Feb 24, 2026

Thank you, @jikunshang. v0.15.1 works fine for the time being.
#33214 looks great, looking forward to testing it as you progress. 🙂

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang
Collaborator Author

@marvind #35984 upgrades the kernel deps; this should resolve the accuracy issue. Can you test the latest main branch?

@marvind

marvind commented Mar 8, 2026

Thank you for the update, @jikunshang!
The outputs look much better with the fix in v0.17.0. Unfortunately, the issue does not seem fully resolved.

I will attach two sample outputs to show what I mean (v0.15.1 with IPEX vs. v0.17.0, both include reasoning and content):
sample_gpt-oss-20b_vllm-v0.15.1.txt
sample_gpt-oss-20b_vllm-v0.17.0.txt
Especially the final lines of the v0.17.0 output are scrambled.

I ran this command to obtain the outputs:

```shell
curl -N -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Create a long and complex excel formula and explain it."}
    ],
    "stream": true,
    "reasoning_effort": "high"
  }' \
| grep --line-buffered "^data: " \
| grep -v "\[DONE\]" \
| sed -u 's/^data: //' \
| jq -jr --unbuffered '
    .choices[0].delta |
    if .reasoning then
      "\u001b[93m" + .reasoning + "\u001b[0m"
    elif .content then
      .content
    else
      empty
    end
'
```

@jikunshang
Collaborator Author

@marvind we will continue investigating. thanks!

@jikunshang
Collaborator Author

@marvind would you mind trying the latest per-commit wheel? You can find it here: https://github.com/vllm-project/vllm-xpu-kernels/actions/runs/22834791643
I took a quick try with your commands and didn't get any garbage output.

@marvind

marvind commented Mar 9, 2026

@jikunshang, I tested vLLM v0.17.0 plus the updated vllm-xpu-kernels per-commit wheel you mentioned. This fixes the issue! Great work, thanks. 😊

