Qwen3vl accuracy fixes by libinta · Pull Request #884 · vllm-project/vllm-gaudi

libinta · 2026-01-27T00:25:10Z

For Qwen3-VL, there is an accuracy issue when a single request contains multiple images; this PR fixes it. After the fix, vision attention takes one of 3 paths depending on the number of images in the request:

1. Single image: use FusedSDPA without an attention mask.
3. Multiple images with threshold: use FusedSDPA without attn_mask, one image at a time.

This PR also enables Qwen3-VL MoE.
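
The "one by one" path can be illustrated with a small, framework-free sketch. It assumes that `cu_seqlens` holds cumulative token offsets per image and uses a plain softmax attention as a stand-in for FusedSDPA; the helper names `attend`, `per_image_attention` and `block_diagonal_mask` are hypothetical and are not taken from the PR's code. The point it tries to show is that attending image by image with no mask should match a single call with a block-diagonal mask, so tokens of one image never attend to another image.

```python
import math


def attend(q, k, v, mask=None):
    """Plain softmax attention over lists of vectors (stand-in for FusedSDPA)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Score every key; keys disallowed by the mask get -inf (zero weight).
        scores = [
            sum(a * b for a, b in zip(qi, kj)) * scale
            if mask is None or mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def block_diagonal_mask(cu_seqlens):
    """Boolean mask that lets each token see only tokens of its own image."""
    size = cu_seqlens[-1]
    mask = [[False] * size for _ in range(size)]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


def per_image_attention(q, k, v, cu_seqlens):
    """Run unmasked attention separately on each image's token slice."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(attend(q[start:end], k[start:end], v[start:end]))
    return out


# Two images with 2 and 3 tokens; values are arbitrary illustration data.
cu_seqlens = [0, 2, 5]
x = [[float(i + 1), float((i * 3) % 4)] for i in range(5)]

one_by_one = per_image_attention(x, x, x, cu_seqlens)
masked = attend(x, x, x, block_diagonal_mask(cu_seqlens))
unmasked = attend(x, x, x)  # wrong for multi-image input: images leak into each other
```

In this sketch `one_by_one` and `masked` agree, while `unmasked` does not, which is the kind of discrepancy a multi-image request would hit if it were sent through a single-image, mask-free path.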

github-actions · 2026-01-27T13:00:59Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7de043d55d1dd629554467e23874097e1c48993

github-actions · 2026-01-28T16:19:31Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions · 2026-01-29T01:30:50Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7de043d55d1dd629554467e23874097e1c48993

github-actions · 2026-01-29T09:51:01Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions · 2026-01-29T10:10:10Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Signed-off-by: slokesha <slokeshappa@habana.ai>

…ct#885) Due to MambaMixer2 implementation requirements, all buckets used for mamba must be a multiple of mamba chunk size. Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai> Signed-off-by: slokesha <slokeshappa@habana.ai>
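
As a rough illustration of the constraint in this commit message, one way to satisfy it is to round each bucket size up to the next multiple of the chunk size. The function name `align_bucket` and the chunk size of 256 are assumptions made for the example; they are not values or APIs taken from vllm-gaudi.

```python
def align_bucket(bucket_size: int, chunk_size: int) -> int:
    """Round bucket_size up to the nearest multiple of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division without floats, then scale back up.
    return -(-bucket_size // chunk_size) * chunk_size


# Hypothetical bucket sizes and chunk size, for illustration only.
aligned = [align_bucket(b, 256) for b in [128, 256, 300, 1000]]
print(aligned)  # [256, 256, 512, 1024]
```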

…roject#888) Reverts vllm-project#780 --------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Signed-off-by: slokesha <slokeshappa@habana.ai>

1. vllm-project#805 2. vllm-project#837 3. vllm-project#855 4. vllm-project#862 --------- Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com> Signed-off-by: linoy buchnik <lbuchnik@habana.ai> Signed-off-by: Iryna Boiko <iboiko@habana.ai> Signed-off-by: Artur Fierka <artur.fierka@intel.com> Co-authored-by: Linoy Buchnik <linoybu@gmail.com> Co-authored-by: Iryna Boiko <iboiko@habana.ai> Co-authored-by: Artur Fierka <artur.fierka@intel.com> Signed-off-by: slokesha <slokeshappa@habana.ai>

Signed-off-by: slokesha <slokeshappa@habana.ai>

github-actions · 2026-01-29T17:19:37Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions · 2026-01-30T01:19:38Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7de043d55d1dd629554467e23874097e1c48993

wpyszka

Approved for 0.14.1

* Prevent cu_seqlens/mask mix-ups that can trigger performance regressions or incorrect attention behavior.
* Remove the lens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist() computation from the Qwen2.5 path. This calculation is not required for Qwen2.5 and was causing a performance regression after PR #884. Removing it restores the previous performance without changing model behavior.
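
For reference, the removed expression converts cumulative offsets into per-segment lengths. The sketch below reproduces it on a plain Python list; the function name `segment_lengths` and the sample offsets are made up for illustration. On a device tensor, `.tolist()` also copies the values back to the host, which is a plausible (though not confirmed here) reason the extra step was costly on a path that never used the result.

```python
def segment_lengths(cu_seqlens):
    """List equivalent of (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()."""
    return [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]


# Hypothetical offsets for three images with 4, 6 and 2 tokens.
lens = segment_lengths([0, 4, 10, 12])
print(lens)  # [4, 6, 2]
```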

For Qwen3-VL, there is an accuracy issue when a single request contains multiple images; this PR fixes it. After the fix, vision attention takes one of 3 paths depending on the number of images in the request:

1. Single image: use FusedSDPA without an attention mask.
3. Multiple images with threshold: use FusedSDPA without attn_mask, one image at a time.

This PR also enables Qwen3-VL MoE.

---------

Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: slokesha <slokeshappa@habana.ai>
Co-authored-by: Seunghyuk Park (shepark) <seunghyuk.h.park@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Katarzyna Fojcik <kfojcik@habana.ai>
Co-authored-by: Krzysztof Smusz <ksmusz@habana.ai>
Co-authored-by: Jozef Mamza <jmamzax@habana.ai>


Cherry pick Llama4 missing fixes from #881 #862 #884 on releases/ branch Signed-off-by: Luca Calabria <luca.calabria@intel.com>

Added Llama4 missing fixes from #881 #862 #884 on main branch --------- Signed-off-by: Luca Calabria <luca.calabria@intel.com> Co-authored-by: Wojciech Pyszka <wpyszka@habana.ai>


libinta requested review from mgawarkiewicz-intel, piotrbocian and wpyszka as code owners January 27, 2026 00:25

github-actions Bot mentioned this pull request Jan 27, 2026

🚦 Team Review Dashboard #701

Open

afierka-intel approved these changes Jan 28, 2026

View reviewed changes

libinta and others added 20 commits January 29, 2026 17:15

add mask for couple of images in the same request

4a2a025

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix accuracy issue for attn_mask path

644454b

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix accuracy issue

7afab16

Signed-off-by: slokesha <slokeshappa@habana.ai>

remove not needed code

6875827

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix precommit

750fcbc

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix precommit

d839cbd

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix precommit

c6e16a7

Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

fe5e05f

Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

2dce3d8

Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

fd78ad6

Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

b79ae59

Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

1ecb518

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix precommit

c6e09ce

Signed-off-by: slokesha <slokeshappa@habana.ai>

Enable qwen3_vl_moe model

0f98033

Signed-off-by: slokesha <slokeshappa@habana.ai>

Fix device mismatch in mrope tensor assignment

47f790b

Signed-off-by: slokesha <slokeshappa@habana.ai>

fix precommits

e9c2052

Signed-off-by: slokesha <slokeshappa@habana.ai>

Update qwen3_vl.py for create_block_diagonal_mask optimization

c37d68f

Signed-off-by: slokesha <slokeshappa@habana.ai>

Revert "skip HPU graphs for long prefills" (vllm-project#850) (vllm-p…

a8def02

…roject#888) Reverts vllm-project#780 --------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Signed-off-by: slokesha <slokeshappa@habana.ai>

precommit fix

c6668b2

Signed-off-by: slokesha <slokeshappa@habana.ai>

slokesha force-pushed the libinta/add_mask branch from 7017751 to c6668b2 Compare January 29, 2026 17:19

slokesha and others added 7 commits January 29, 2026 09:28

Merge branch 'releases/v0.14.1' into libinta/add_mask

a7072bb

fix gemma issue

83669dd

fix gemma3 issue

6371ea4

fix precommit

19aac64

remove unnecessary code

1b7d074

precommit fix

8f7ab84

fix precommit comment

1439d0b

slokesha approved these changes Jan 30, 2026

View reviewed changes

michalkuligowski approved these changes Jan 30, 2026

View reviewed changes

wpyszka approved these changes Jan 30, 2026

View reviewed changes

wpyszka merged commit 20703dd into vllm-project:releases/v0.14.1 Jan 30, 2026
53 checks passed

slokesha mentioned this pull request Jan 30, 2026

removed lens calculation for qwen2_5 #906

Closed

shepark mentioned this pull request Jan 31, 2026

Update qwen2_5_vl attention forward #908

Merged

This was referenced Feb 6, 2026

Missing updates for Llama4 on main #940

Merged

cherry pick Llama4 on apply patches + QK flatten pos + perf drop #942

Merged

wpyszka pushed a commit that referenced this pull request Feb 9, 2026

cherry pick Llama4 on apply patches + QK flatten pos + perf drop (#942)

c0675d9

Cherry pick Llama4 missing fixes from #881 #862 #884 on releases/ branch Signed-off-by: Luca Calabria <luca.calabria@intel.com>


Qwen3vl accuracy fixes #884
wpyszka merged 40 commits into vllm-project:releases/v0.14.1 from libinta:libinta/add_mask


10 participants
