sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) #21845
|
@masonmilby Could you share the performance test cmd? Thank you! |
|
EDIT: I've been doing some experimenting, and I believe this to be the most reproducible setup to demonstrate the core issue of speculative decoding on SYCL being dramatically under-optimized. No cache, no reasoning, just SD now working as intended:
Try building & running with this compose
Then test with this script
Workflow
Checkout
Compare
Speculative decoding on
|
|
OK, got it! Thank you! |
|
@masonmilby I hope others can help verify this PR! Thank you! |
|
@arthw |
|
@masonmilby Thank you! |
|
@NeoZhangJianyu |
|
@masonmilby 77.88 -> 64.03 tokens per second. This PR had no impact before the rebase. Could you check it? Thank you! |
|
@NeoZhangJianyu I can't replicate the regression you're seeing - pre or post rebase. |
I built and tested locally. Thank you!
arthw
left a comment
@masonmilby
I tested it on B60 with these models: gemma-4-E2B-it-UD-Q4_K_XL.gguf, Qwen3.5-4B-Q4_K_M.gguf
PP and TG performance is unchanged.
Code comes from: https://github.com/masonmilby/llama.cpp
Base:
commit 9789512 (HEAD)
Author: leonardHONG 2695316095@qq.com
Date: Tue Apr 21 05:30:38 2026 +0800
PR:
commit 32cc081 (HEAD -> sycl-mmvq-multicol
Test cmd:
./build/bin/llama-bench -m ../models/gemma-4-E2B-it-UD-Q4_K_XL.gguf
Could you check it?
|
Single-model inference is not affected by this PR. Your results are correct. You will only see a difference when running with both --model and --model-draft (assuming compatible model architecture and a properly sized draft model) |
|
@masonmilby Thank you! |
|
Tested this PR on Intel Arc Pro B70 (PCI 8086:e223, 32GB), Qwen3.6-35B-A3B-Q4_K_M with SYCL backend (oneAPI 2025.3, mainline b9187 + GDN K>1 fix from #23174). 20 of 21 hunks applied cleanly (1 dispatch hunk failed). Partial application compiled successfully but model output is broken — single-word replies with nonsensical timing stats. Reverting restores correct output. The 45% speedup claim is exactly what SYCL speculative decoding needs. Would love to see a rebased version that applies cleanly to current master (b9187+). Happy to retest. |
mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.
ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
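The reorder gate change described in this commit message can be illustrated with a small predicate. This is a simplified sketch, not the PR's code: the real should_reorder_tensor presumably checks more than the column count, and the names bootstraps_reorder_old, bootstraps_reorder_new and kMaxReorderCols are hypothetical. Only the two conditions (ne[1] == 1 before, ne[1] <= 8 after) come from the message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant; the value 8 is the limit quoted in the commit message.
constexpr std::int64_t kMaxReorderCols = 8;

// Old behaviour (simplified): only single-token mat-vec bootstraps the reorder.
bool bootstraps_reorder_old(std::int64_t ne1) {
    assert(ne1 >= 1);  // ne1 stands in for ne[1], the number of columns
    return ne1 == 1;
}

// New behaviour (simplified): small multi-column batches bootstrap it too.
bool bootstraps_reorder_new(std::int64_t ne1) {
    assert(ne1 >= 1);
    return ne1 <= kMaxReorderCols;
}
```

For a hypothetical 3-column verify batch, the old gate returns false and the new one returns true, which is the case the message says never triggered the reorder before.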
d5ca092 to
113d79e
|
@R-SITES Rebased, and fixed a related MTP warmup issue. Mind giving it another shot?
My router presets for your reference: |
|
@masonmilby — tested the rebased PR on Intel Arc Pro B70 (Battlemage, PCI 8086:e223, 32GB) with Qwen3.6-27B dense Q4_K_S. Works. Dense MTP (PR #21845 applied):
+37% over no-MTP on SYCL. First time MTP has beaten pure AR decode on this hardware. Draft gen is ~480ms for 76 calls (6.3ms/call), and the MMVQ improvement is what makes it viable. n_max=3 regresses to ~29 tok/s on B70 (same pattern you saw — >2 degrades). Dense-only as expected — MoE (35B-A3B) is unaffected. Happy to test additional configs if you need more data points. Test setup:
|
|
@R-SITES That's FANTASTIC! Thank you for testing! I'm working on gathering data across more quants (Q4_K_XL and Q8_K_XL), and updating the write-ups. Should be ready soon. I see you originally tested with 35B; MoE paths will likely come as a separate PR once this foundation is merged. More data is always welcome, enjoy! |
arthw
left a comment
Here is my test result on B60:
./build/bin/llama-server -m ../models/Qwen3.6-27B-MTP-Q4_K_S.gguf -fa on --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 2 -c 262144
6.32 -> 8.48
./build/bin/llama-server -m ../models/Qwen3.6-27B-MTP-Q4_K_S.gguf -fa on --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 2
19.51 -> 27.78
Good job!
MTP really is faster now.
Thank you!
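For reference, the relative gain implied by the two before/after pairs above can be worked out with a one-line helper. This is only a sketch: percent_change is a hypothetical name, and the inputs are the figures quoted in this comment (their unit is not stated there, presumably tokens per second).

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent.
double percent_change(double before, double after) {
    assert(before > 0.0);  // a zero or negative baseline makes the ratio meaningless
    return (after - before) / before * 100.0;
}
```

Applied to the pairs above, percent_change(6.32, 8.48) comes out at roughly +34% and percent_change(19.51, 27.78) at roughly +42%.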
|
@arthw Thank you! I'm happy to contribute. @NeoZhangJianyu Think you could give this another look? |
|
It's OK with me! I have no comments. :) Thank you! |
|
error loading model: unknown model architecture: 'gemma4_assistant'. When will Gemma4_MTP be supported for loading files such as gemma-4-E2B-it-assistant? |
|
@tac39us-stack That work is happening on PR #23398 |
|
Amazing! Getting a steady 20 t/s with high context on a B70! Great work! |
mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.
ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
(cherry picked from commit 7fe2ae4)
Problem
Speculative decoding on SYCL is currently slower than single-token-prediction because the MMVQ dispatch launches a separate kernel per column, reading the full weight matrix N times.
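The cost difference can be sketched on the CPU in plain C++. This is not the SYCL kernel from the PR and it ignores quantization entirely; MatVecResult, matvec_per_column and matvec_multi_column are hypothetical names, and weight_reads is an illustrative counter standing in for memory traffic. The first function re-reads the weight matrix for every column, the second loads each weight once and reuses it across all columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MatVecResult {
    std::vector<float> dst;        // nrows * ncols outputs, stored column by column
    std::size_t weight_reads = 0;  // number of weight elements loaded
};

// One pass per column: the full weight matrix is read ncols times.
MatVecResult matvec_per_column(const std::vector<float>& w, const std::vector<float>& x,
                               std::size_t nrows, std::size_t k, std::size_t ncols) {
    assert(w.size() == nrows * k && x.size() == k * ncols);
    MatVecResult r;
    r.dst.assign(nrows * ncols, 0.0f);
    for (std::size_t c = 0; c < ncols; ++c) {  // stands in for one launch per column
        for (std::size_t row = 0; row < nrows; ++row) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < k; ++i) {
                acc += w[row * k + i] * x[c * k + i];
                ++r.weight_reads;
            }
            r.dst[c * nrows + row] = acc;
        }
    }
    return r;
}

// Single pass: each weight is loaded once and applied to every column.
MatVecResult matvec_multi_column(const std::vector<float>& w, const std::vector<float>& x,
                                 std::size_t nrows, std::size_t k, std::size_t ncols) {
    assert(w.size() == nrows * k && x.size() == k * ncols);
    MatVecResult r;
    r.dst.assign(nrows * ncols, 0.0f);
    for (std::size_t row = 0; row < nrows; ++row) {
        for (std::size_t i = 0; i < k; ++i) {
            const float wv = w[row * k + i];  // the only load of this weight
            ++r.weight_reads;
            for (std::size_t c = 0; c < ncols; ++c) {
                r.dst[c * nrows + row] += wv * x[c * k + i];
            }
        }
    }
    return r;
}
```

Both functions produce the same outputs; only the number of weight loads differs, by a factor of ncols.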
Solution
Port the multi-column optimization from the CUDA backend (ggml/src/ggml-cuda/mmvq.cu) so weights are read once and all columns are computed in a single dispatch.
AND
Relax should_reorder_tensor from ne[1] == 1 to ne[1] <= 8 to bootstrap the reorder and take advantage of the reorder-multicol kernel path.
Testing
GPU(s): Intel Arc Pro B70 (2x)
Model: Qwen3.6-27B(-MTP)
Quant: UD-Q4_K_XL
Single vs multi token-prediction (speculative decoding).
Average t/s @ average-acceptance across all 15 runs.
master | sycl-mmvq-multicol
Multi-token-prediction vs multi-token-prediction.
Average t/s @ average-acceptance across 5 runs per type.
master | sycl-mmvq-multicol
Quant: UD-Q8_K_XL
master | sycl-mmvq-multicol
master | sycl-mmvq-multicol
Validation
test-backend-ops MUL_MAT tests passed (920/920)
To Reproduce
With Docker Compose
With Router Preset
With Args
Notes
spec-draft-n-max > 2 degrades performance
Scope
In
Out
vec_dot signatures
mmvq.cpp.
Requirements