[v1] Add PrefixLM support to FlexAttention backend #27938

Merged
Isotr0py merged 25 commits into vllm-project:main from Isotr0py:flex-prefixlm
Dec 7, 2025


@Isotr0py Isotr0py commented Nov 2, 2025

Purpose

  • Currently, no attention backend in vLLM supports image-bidirectional attention, so Gemma3 and PaliGemma can't generate correct outputs, and models like Moondream are blocked by the missing backend support.
  • This PR adds image-bidirectional (PrefixLM) support to the FlexAttention backend to fill that gap.
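
For readers unfamiliar with the term, here is a minimal sketch of what an image-bidirectional (PrefixLM-style) mask looks like. It is not the code from this PR: the helper names `build_span_ids` and `make_prefix_lm_mask_mod` and the toy image span are made up for illustration. The rule follows the `mask_mod(b, h, q_idx, kv_idx)` convention used by PyTorch's FlexAttention, but is written in plain Python so it runs without a GPU. With real FlexAttention the indices are tensors, so the lookup and the boolean logic would need tensor operations instead.

```python
def build_span_ids(seq_len, image_spans):
    """Label each position with its image span index, or -1 for text tokens.

    image_spans is a list of (start, end) pairs, with end exclusive.
    """
    span_ids = [-1] * seq_len
    for span_index, (start, end) in enumerate(image_spans):
        for pos in range(start, end):
            span_ids[pos] = span_index
    return span_ids


def make_prefix_lm_mask_mod(span_ids):
    """Return a mask rule: causal everywhere, bidirectional inside one image."""

    def mask_mod(b, h, q_idx, kv_idx):
        # b and h (batch and head) are unused in this toy version.
        causal = kv_idx <= q_idx
        same_image = span_ids[q_idx] >= 0 and span_ids[q_idx] == span_ids[kv_idx]
        return causal or same_image

    return mask_mod


# Toy sequence (assumed layout): 2 text tokens, 3 image tokens, 2 text tokens.
span_ids = build_span_ids(7, [(2, 5)])
mask_mod = make_prefix_lm_mask_mod(span_ids)
mask = [[mask_mod(0, 0, q, kv) for kv in range(7)] for q in range(7)]

# Print the mask: "x" means the query row may attend to the key column.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Image tokens (positions 2 to 4 here) can see each other in both directions, while text tokens stay strictly causal. A purely causal backend drops the "look ahead within the image" entries, which is the kind of mismatch against HF this PR targets.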

Test Plan

pytest -s -v tests/models/multimodal/generation/test_common.py -k gemma3

Test Result

vLLM results (both the native and Transformers backends) should now match HF.

PR results

tests/models/multimodal/generation/test_common.py::test_single_image_models[gemma3-test_case0]
  /home/mozf/develop-projects/vllm/tests/models/multimodal/generation/vlm_utils/core.py:157: UserWarning: Test0:
  hf:   "Here's what's in the center of the image:\n\nIt's a traditional Chinese gate or archway. It's red and gold, with Chinese characters written on it. It's a prominent feature of the street scene.<end_of_turn>"
  vllm: "Here's what's in the center of the image:\n\nIt's a traditional Chinese gate or archway. It's red and gold, with Chinese characters written on it. It's a prominent feature of the street scene."
    comparator(

tests/models/multimodal/generation/test_common.py::test_single_image_models[gemma3-test_case0]
  /home/mozf/develop-projects/vllm/tests/models/multimodal/generation/vlm_utils/core.py:157: UserWarning: Test1:
  hf:   'The center of the image features a vibrant Chinese-themed archway or gate. It\'s decorated with red and gold colors, traditional Chinese characters (likely meaning "Chinese Town"), and red lanterns.  There are also two white stone lion statues flanking the entrance.<end_of_turn>'
  vllm: 'The center of the image features a vibrant Chinese-themed archway or gate. It\'s decorated with red and gold colors, traditional Chinese characters (likely meaning "Chinese Town"), and red lanterns.  There are also two white stone lion statues flanking the entrance.'
    comparator(

Main branch

tests/models/multimodal/generation/test_common.py::test_single_image_models[gemma3-test_case0]
  /home/mozf/develop-projects/vllm/tests/models/multimodal/generation/vlm_utils/core.py:157: UserWarning: Test0:
  Matched tokens:       [8291, 236789, 236751, 1144, 236789, 236751, 528, 506, 3988, 529, 506, 2471, 236787, 108]
  hf:   "Here's what's in the center of the image:\n\nIt's a traditional Chinese gate or archway. It's red and gold, with Chinese characters written on it. It's a prominent feature of the street scene.<end_of_turn>"      {1509: -0.03537141531705856, 236776: -3.7853713035583496, 818: -4.53537130355835, 3810: -6.78537130355835, 236829: -8.535371780395508}
  vllm: 'Here\'s what\'s in the center of the image:\n\nA vibrant Chinese-themed archway with red and gold decorations, featuring the Chinese characters "中华" (Zhōnghuá - meaning "China"). It\'s part of a Chinatown area.'       {236776: Logprob(logprob=-0.6554893255233765, rank=1, decoded_token='A'), 818: Logprob(logprob=-1.1554893255233765, rank=2, decoded_token='The'), 1509: Logprob(logprob=-2.155489444732666, rank=3, decoded_token='It'), 236829: Logprob(logprob=-3.155489444732666, rank=4, decoded_token='*'), 3810: Logprob(logprob=-4.905489444732666, rank=5, decoded_token='There')}
    comparator(

tests/models/multimodal/generation/test_common.py::test_single_image_models[gemma3-test_case0]
  /home/mozf/develop-projects/vllm/tests/models/multimodal/generation/vlm_utils/core.py:157: UserWarning: Test1:
  Matched tokens:       []
  hf:   'The center of the image features a vibrant Chinese-themed archway or gate. It\'s decorated with red and gold colors, traditional Chinese characters (likely meaning "Chinese Town"), and red lanterns.  There are also two white stone lion statues flanking the entrance.<end_of_turn>'    {818: -0.5989671945571899, 8291: -0.8489671945571899, 117494: -4.5989670753479, 6481: -5.3489670753479, 19058: -5.5989670753479}
  vllm: 'Here\'s what\'s in the center of the image:\n\n*   **A large, ornate Chinese gate or archway.** It\'s painted red and features traditional Chinese characters ("中华" - meaning "China") and decorative elements.\n*   **Two white stone lion statues** flanking the gate.\n*   **A black SUV** is parked in front of the gate.\n\nLet me know if you want me to describe any other specific elements in the image!'     {8291: Logprob(logprob=-0.3420378267765045, rank=1, decoded_token='Here'), 818: Logprob(logprob=-1.3420377969741821, rank=2, decoded_token='The'), 117494: Logprob(logprob=-4.342037677764893, rank=3, decoded_token='Certainly'), 6481: Logprob(logprob=-4.842037677764893, rank=4, decoded_token='Let'), 19058: Logprob(logprob=-5.592037677764893, rank=5, decoded_token='Okay')}
    comparator(
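
To help read the "Matched tokens" lines in the main-branch logs: they appear to list the token IDs that both outputs share before the first divergence (an empty list means the very first generated token already differs). A minimal sketch of that idea, assuming it is a plain longest-common-prefix over token IDs; `matched_prefix` is a hypothetical helper, not the test suite's actual comparator, and the IDs below are toy values rather than real tokens.

```python
def matched_prefix(reference_ids, candidate_ids):
    """Return the leading token IDs shared by both sequences."""
    matched = []
    for ref_tok, cand_tok in zip(reference_ids, candidate_ids):
        if ref_tok != cand_tok:
            break  # stop at the first divergence
        matched.append(ref_tok)
    return matched


# Toy token IDs (not taken from the logs above).
hf_ids = [10, 11, 12, 13]
vllm_ids = [10, 11, 99, 13]
print(matched_prefix(hf_ids, vllm_ids))
```

At the divergence point the logs also print each side's top logprobs, which shows how close the two backends were when they picked different tokens.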

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify bot added the v1 label Nov 2, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 3, 2025
Isotr0py and others added 9 commits November 20, 2025 15:14
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify bot added nvidia rocm Related to AMD ROCm labels Nov 21, 2025
@mergify mergify bot added the tpu Related to Google TPUs label Nov 21, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify bot commented Nov 21, 2025

Documentation preview: https://vllm--27938.org.readthedocs.build/en/27938/

@mergify mergify bot added the documentation Improvements or additions to documentation label Nov 21, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py changed the title [Draft][v1] Add PrefixLM support to FlexAttention backend [v1] Add PrefixLM support to FlexAttention backend Nov 21, 2025
@Isotr0py Isotr0py marked this pull request as ready for review November 21, 2025 17:29
@DarkLight1337
Member

Let's get this merged then, can you fix the merge conflicts?

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 6, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify bot commented Dec 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 7, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify bot removed the needs-rebase label Dec 7, 2025
@Isotr0py Isotr0py enabled auto-merge (squash) December 7, 2025 12:27
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py merged commit b952f4d into vllm-project:main Dec 7, 2025
59 of 60 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 7, 2025
@Isotr0py Isotr0py deleted the flex-prefixlm branch December 7, 2025 15:58
penfree pushed a commit to penfree/vllm that referenced this pull request Dec 8, 2025
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Dec 8, 2025
Culprit: vllm-project/vllm#29665 and vllm-project/vllm#27938

---------

Signed-off-by: Dobrzyniewicz, Agata <agata.dobrzyniewicz@intel.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 15, 2025
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix vllm-project/vllm#27938
2. fix vllm-project/vllm#27145
pooling models now support chunked prefill and prefix caching.
3. fix vllm-project/vllm#30181
define the CPU fields in the field config where they really belong.
4. fix vllm-project/vllm#28168
define the CPU fields in the field config where they really belong.
5. fix vllm-project/vllm#30201
some module renames
6. fix vllm-project/vllm#29067
fusedmoe module refactor
7. fix vllm-project/vllm#29066
fusedmoe module refactor
8. fix vllm-project/vllm#29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>