
[XPU][4/N] add mxfp4 moe model support#33679

Merged
jikunshang merged 2 commits into vllm-project:main from jikunshang:kunshang/mxfp4
Feb 6, 2026

Conversation

@jikunshang
Collaborator

@jikunshang jikunshang commented Feb 3, 2026

Purpose

[4/N] of #33214
Add mxfp4 MoE support. We can also refactor the XPU part once mxfp4 adopts the kernel abstraction.

Test Plan

```shell
python3 examples/offline_inference/basic/generate.py --model openai/gpt-oss-20b --temperature 0 --enforce-eager
```

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang jikunshang changed the title [XPU][3/N] add mxfp4 moe model support [XPU][4/N] add mxfp4 moe model support Feb 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the IPEX-specific MXFP4 MoE method to a more generic XPU implementation by renaming IpexMxfp4MoEMethod to XpuMxfp4MoEMethod and replacing the IPEX-dependent logic in apply_monolithic with a call to the xpu_fused_moe kernel.

My review identifies a critical issue in the new apply_monolithic implementation where input padding is missing, which will likely lead to a shape mismatch and runtime errors. I've provided a detailed comment with a suggested fix for this. Additionally, I've noted a minor performance concern regarding an unused tensor allocation.
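The padding issue the bot describes can be illustrated with a small sketch (hypothetical helper, not code from this PR; in the real fix the block size would come from the kernel, not from this example):

```python
def pad_rows(rows: int, block: int) -> int:
    """Round a row count up to the next multiple of the kernel block size."""
    return (rows + block - 1) // block * block

# A fused-MoE-style kernel that expects block-aligned inputs would be called on
# a padded buffer, and the padding rows sliced off afterwards, e.g.:
#   padded = pad_rows(num_tokens, BLOCK)    # allocate a (padded, hidden) buffer
#   out = kernel(buffer)[:num_tokens]       # drop the padding rows
```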

@jikunshang
Collaborator Author

@robertgshaw2-redhat @mgoin can you help review? Thanks!

@jikunshang jikunshang added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 4, 2026
@jikunshang jikunshang merged commit 7439e4f into vllm-project:main Feb 6, 2026
58 checks passed
@marvind

marvind commented Feb 12, 2026

Hi! I am regularly getting gibberish output running gpt-oss-20b with reasoning set to high after this change. I run an Intel Arc Pro B60 using the v0.16.0 tag which introduced it. v0.15.1 worked fine. I use the docker/Dockerfile.xpu docker image and define the docker compose service as given below.
A good test prompt seems to be: Create a long and complex excel formula and explain it.

The model does not manage to leave reasoning anymore and starts to produce gibberish. Important to have reasoning set to high. Nothing strange in the log.
Could you have a look? Do you prefer a separate issue?

[screenshot: gibberish output]
Details
```yaml
  vllm:
    image: vllm-xpu-env
    ports:
      - "8000:8000"
    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/card0:/dev/dri/card0
    volumes:
      - /root/models:/llm/models
      - type: bind
        source: /dev/dri/by-path/pci-0000:03:00.0-card
        target: /dev/dri/by-path/pci-0000:03:00.0-card
      - type: bind
        source: /dev/dri/by-path/pci-0000:03:00.0-render
        target: /dev/dri/by-path/pci-0000:03:00.0-render
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
    shm_size: '32gb'
    entrypoint:
      - /bin/bash
      - -c
      - |
        source /opt/intel/oneapi/setvars.sh --force && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model /llm/models/gpt-oss-20b \
          --served-model-name gpt-oss-20b \
          --enforce-eager \
          --port 8000 \
          --host 0.0.0.0 \
          --gpu-memory-util=0.88 \
          --block-size 64 \
          -tp=1
```

@jikunshang
Collaborator Author

@marvind thanks for reporting this.
I used 0.16.0rc3 to build a docker container and ran a subset of gsm8k; the results look reasonable.
[screenshot: gsm8k results]

Can you share your client command? Or try with `python3 examples/online_serving/openai_chat_completion_client.py`?

@marvind

marvind commented Feb 13, 2026

Thank you for the swift feedback!
This is a minimal example which triggers this behavior using only curl and jq:

```shell
curl -N -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Create a long and complex excel formula and explain it."}
    ],
    "stream": true,
    "reasoning_effort": "high"
  }' \
| grep --line-buffered "^data: " \
| grep -v "\[DONE\]" \
| sed -u 's/^data: //' \
| jq -jr --unbuffered '
    .choices[0].delta |
    if .reasoning then
      "\u001b[93m" + .reasoning + "\u001b[0m"
    elif .content then
      .content
    else
      empty
    end
'
```

For me it consistently ends up outputting exclamation marks (!!!!) after a longer reasoning phase, and the output gets more and more chaotic:
sample_gpt-oss-20b_vllm-v0.16.0.txt

I will also run lm_eval but have time for it first tomorrow.

@marvind

marvind commented Feb 14, 2026

@jikunshang

local-chat-completions ({'model': 'gpt-oss-20b', 'base_url': 'http://localhost:8000/v1/chat/completions', 'max_gen_toks': 4096, 'num_concurrent': 24}), gen_kwargs: ({}), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.08|±  |0.0273|

```shell
docker-compose exec vllm lm_eval --model local-chat-completions --tasks gsm8k --num_fewshot 5 --batch_size 1 --model_args "model=gpt-oss-20b,base_url=http://localhost:8000/v1/chat/completions,max_gen_toks=4096,num_concurrent=24" --apply_chat_template --output_path ./lm_eval_output --log_sample --limit 100
```

strict-match looks off compared to yours, doesn't it? I also get `WARNING [models.api_models:822] API returned null content. Check reasoning_content field or generation limits.` from time to time. This is on v0.16.0.
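For context on the strict vs. flexible gap, a sketch of how the two gsm8k filters differ (the regexes below only approximate lm_eval's filters, they are not copied from it): strict-match only accepts answers in the canonical `#### <number>` format, while flexible-extract takes the last number anywhere in the completion, which is why a chat model answering in prose can score well on flexible-extract and poorly on strict-match.

```python
import re

# Approximation of lm_eval's gsm8k filters (assumed, not verbatim):
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")

def strict_extract(text: str):
    """Accept only the canonical '#### <number>' answer format."""
    m = STRICT.search(text)
    return m.group(1) if m else None

def flexible_extract(text: str):
    """Take the last number appearing anywhere in the completion."""
    nums = re.findall(r"-?\d[\d,\.]*", text)
    return nums[-1].rstrip(".,") if nums else None
```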

v0.15.1 actually looks similar but I do not get the warnings and the curl command from my previous message works fine:

local-chat-completions ({'model': 'gpt-oss-20b', 'base_url': 'http://localhost:8000/v1/chat/completions', 'max_gen_toks': 4096, 'num_concurrent': 24}), gen_kwargs: ({}), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.87|±  |0.0338|
|     |       |strict-match    |     5|exact_match|↑  | 0.05|±  |0.0219|

```shell
docker-compose exec vllm lm_eval --model local-chat-completions --tasks gsm8k --num_fewshot 5 --batch_size 1 --model_args "model=gpt-oss-20b,base_url=http://localhost:8000/v1/chat/completions,max_gen_toks=4096,num_concurrent=24" --apply_chat_template --output_path ./lm_eval_output --log_sample --limit 100
```

Not sure how to interpret this. Please let me know if I can provide more information.

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang
Collaborator Author

@marvind
Sorry for the late reply. We did find an accuracy issue after switching to the latest vllm-xpu-kernels implementation, and narrowed it down to the decode attention kernel. We will try to provide a solid fix in the next release.
Some workarounds for you:

  1. roll back to v0.15.1 with IPEX
  2. use triton attention instead (performance will not be optimal)
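For workaround 2, a sketch of how the backend could be forced (assumption: vLLM selects its attention backend via the `VLLM_ATTENTION_BACKEND` environment variable, and the exact value naming the Triton backend may vary between releases; it could equally go in the docker-compose `environment:` list above):

```python
import os

# Assumed selector env var and backend name -- check your vLLM version's docs.
# Must be set before the server process starts.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
```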

@marvind

marvind commented Feb 24, 2026

Thank you, @jikunshang. v0.15.1 works fine for the time being.
#33214 looks great, looking forward to testing it as you progress. 🙂

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@jikunshang
Collaborator Author

@marvind #35984 upgrades the kernel deps; this should resolve the accuracy issue. Can you test the latest main branch?

@marvind

marvind commented Mar 8, 2026

Thank you for the update, @jikunshang!
The outputs look much better with the fix in v0.17.0. Unfortunately, the issue does not seem fully resolved.

I will attach two sample outputs to show what I mean (v0.15.1 with IPEX vs. v0.17.0, both include reasoning and content):
sample_gpt-oss-20b_vllm-v0.15.1.txt
sample_gpt-oss-20b_vllm-v0.17.0.txt
Especially the final lines of the v0.17.0 output are scrambled.

I ran this command to obtain the outputs:

```shell
curl -N -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Create a long and complex excel formula and explain it."}
    ],
    "stream": true,
    "reasoning_effort": "high"
  }' \
| grep --line-buffered "^data: " \
| grep -v "\[DONE\]" \
| sed -u 's/^data: //' \
| jq -jr --unbuffered '
    .choices[0].delta |
    if .reasoning then
      "\u001b[93m" + .reasoning + "\u001b[0m"
    elif .content then
      .content
    else
      empty
    end
'
```

@jikunshang
Collaborator Author

@marvind we will continue investigating. thanks!

@jikunshang
Collaborator Author

@marvind would you mind trying the latest per-commit wheel? You can find it here: https://github.com/vllm-project/vllm-xpu-kernels/actions/runs/22834791643
I took a quick try with your commands and didn't get any garbage output.

@marvind

marvind commented Mar 9, 2026

@jikunshang, I tested vLLM v0.17.0 plus the updated vllm-xpu-kernels per-commit wheel you mentioned. This fixes the issue! Great work, thanks. 😊

