Qwen3vl accuracy fixes #884
Signed-off-by: slokesha <slokeshappa@habana.ai>
…ct#885)
Due to MambaMixer2 implementation requirements, all buckets used for mamba must be a multiple of mamba chunk size.

Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: slokesha <slokeshappa@habana.ai>
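A minimal sketch of the bucket constraint described in that commit, assuming buckets are rounded up to the next multiple of the chunk size. The names `align_to_chunk` and `align_buckets` are hypothetical and do not come from this repository, and the actual change may enforce the constraint differently.

```python
def align_to_chunk(bucket: int, chunk_size: int) -> int:
    """Round a bucket size up to the nearest multiple of chunk_size.

    Hypothetical helper: rounding up is an assumption, chosen so that
    an aligned bucket is never smaller than the requested one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division written with integer arithmetic only.
    return -(-bucket // chunk_size) * chunk_size


def align_buckets(buckets: list[int], chunk_size: int) -> list[int]:
    """Align every bucket and drop duplicates created by the rounding."""
    return sorted({align_to_chunk(b, chunk_size) for b in buckets})


if __name__ == "__main__":
    print(align_buckets([100, 256, 300, 500], 256))  # [256, 512]
```

Deduplicating after alignment keeps the bucket list short, since several nearby sizes can collapse onto the same multiple.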
…roject#888)
Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
1. vllm-project#805
2. vllm-project#837
3. vllm-project#855
4. vllm-project#862

---------

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
7017751 to c6668b2
* Prevent cu_seqlens/mask mix-ups that can trigger performance regressions or incorrect attention behavior.
* Remove the lens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist() computation from the Qwen2.5 path. This calculation is not required for Qwen2.5 and was causing a performance regression after PR #884. Removing it restores the previous performance without changing model behavior.
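For readers unfamiliar with the removed expression, here is a small pure-Python sketch of what it computes: per-sequence lengths recovered from cumulative sequence lengths. The function name `seq_lens_from_cu` is hypothetical. The original operates on a tensor, where `.tolist()` copies values back to the host; that copy is one plausible source of the regression, though the commit message does not say so.

```python
def seq_lens_from_cu(cu_seqlens: list[int]) -> list[int]:
    """Return the length of each sequence given cumulative offsets.

    Pure-Python equivalent of
        (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
    cu_seqlens[i] is the start offset of sequence i, and the final
    entry is the total number of tokens.
    """
    return [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]


if __name__ == "__main__":
    # Three packed sequences of 64, 128 and 64 tokens.
    print(seq_lens_from_cu([0, 64, 192, 256]))  # [64, 128, 64]
```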
For Qwen3-VL, there is an accuracy issue when a single request contains multiple images; this PR fixes it. After the fix, there are 3 paths for vision attention, depending on the number of images in a request:
1. single image: use fusedsdpa without attn_mask
3. multi-images with threshold: use fusedsdpa without attn_mask, one by one

This PR also enables qwen3vl moe.

---------

Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: slokesha <slokeshappa@habana.ai>
Co-authored-by: Seunghyuk Park (shepark) <seunghyuk.h.park@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Katarzyna Fojcik <kfojcik@habana.ai>
Co-authored-by: Krzysztof Smusz <ksmusz@habana.ai>
Co-authored-by: Jozef Mamza <jmamzax@habana.ai>
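To illustrate why running attention image by image without a mask can be a correct alternative to one masked call over the packed sequence, here is a reference sketch in plain Python. It is not the fusedsdpa kernel and does not reproduce this PR's code; `attention`, `block_diagonal_mask` and `attention_per_image` are hypothetical names. It assumes the images of one request are packed back to back and delimited by `cu_seqlens`.

```python
import math


def attention(q, k, v, mask=None):
    """Softmax attention over lists of float vectors.

    mask[i][j] is True when query i may attend to key j.
    """
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if mask is not None and not mask[i][j]:
                scores.append(float("-inf"))  # masked keys get zero weight
            else:
                dot = sum(a * b for a, b in zip(qi, kj))
                scores.append(dot / math.sqrt(dim))
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def block_diagonal_mask(cu_seqlens):
    """Allow attention only between tokens of the same image."""
    n = cu_seqlens[-1]
    mask = [[False] * n for _ in range(n)]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


def attention_per_image(q, k, v, cu_seqlens):
    """Run unmasked attention on each image segment, one by one."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(attention(q[start:end], k[start:end], v[start:end]))
    return out


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
    cu = [0, 2, 5]  # two images: 2 tokens and 3 tokens
    one_by_one = attention_per_image(x, x, x, cu)
    masked = attention(x, x, x, block_diagonal_mask(cu))
    print(all(
        abs(a - b) < 1e-9
        for row_a, row_b in zip(one_by_one, masked)
        for a, b in zip(row_a, row_b)
    ))
```

Calling `attention(x, x, x)` on the packed tokens with no mask lets tokens of one image attend to another image, which changes the result; that is the kind of cross-image leakage a multi-image path has to avoid.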