sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

masonmilby · 2026-04-13T08:44:40Z

Problem

Speculative decoding on SYCL is currently slower than single-token prediction (STP) because the MMVQ dispatch launches a separate kernel per column, so the full weight matrix is read once per column (N times for N columns).

Solution

Port the multi-column optimization from the CUDA backend (ggml/src/ggml-cuda/mmvq.cu) so weights are read once and all columns are computed in a single dispatch.

AND

Relax should_reorder_tensor from ne[1] == 1 to ne[1] <= 8 to bootstrap the reorder and take advantage of the reorder-multicol kernel path.
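
To make the first part of the solution concrete, here is a minimal sketch in plain Python, not the SYCL kernel itself. It assumes list-of-lists weights stand in for quantized blocks and counts "row reads" as a stand-in for global-memory traffic. The function names `mmv_per_column` and `mmv_multi_column` are hypothetical and do not appear in the PR.

```python
# Illustrative sketch only: plain Python lists stand in for quantized weight
# blocks, and "row reads" stand in for global-memory traffic.

def mmv_per_column(weights, columns):
    """One full pass over the weight matrix per input column."""
    row_reads = 0
    out = []
    for col in columns:                      # one "kernel launch" per column
        result = []
        for row in weights:
            row_reads += 1                   # row is fetched again for every column
            result.append(sum(w * x for w, x in zip(row, col)))
        out.append(result)
    return out, row_reads


def mmv_multi_column(weights, columns):
    """One pass over the weight matrix, all columns computed together."""
    row_reads = 0
    out = [[] for _ in columns]
    for row in weights:                      # single "kernel launch"
        row_reads += 1                       # row is fetched once
        for j, col in enumerate(columns):
            out[j].append(sum(w * x for w, x in zip(row, col)))
    return out, row_reads


weights = [[1, 2], [3, 4], [5, 6]]           # 3 rows x 2 columns of weights
columns = [[1, 0], [0, 1], [1, 1]]           # 3 input columns

per_col_out, per_col_reads = mmv_per_column(weights, columns)
multi_out, multi_reads = mmv_multi_column(weights, columns)
print(per_col_out == multi_out, per_col_reads, multi_reads)
```

Both functions produce the same outputs; only the number of weight-row reads differs (rows times columns versus rows), which is the effect the multi-column dispatch is after.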

Testing

GPU(s): Intel Arc Pro B70 (2x)
Model: Qwen3.6-27B(-MTP)

Quant: UD-Q4_K_XL

Single vs multi token-prediction (speculative decoding).
Average t/s @ average-acceptance across all 15 runs.

| Branch | STP | MTP | Speedup |
| --- | --- | --- | --- |
| `master` | 22.47 | 19.71@73.3 | -12.3% |
| `sycl-mmvq-multicol` | 22.44 | 31.44@73.9 | +40.1% |

Multi-token-prediction vs multi-token-prediction.
Average t/s @ average-acceptance across 5 runs per type.

| Prompt Type | `master` | `sycl-mmvq-multicol` | Speedup |
| --- | --- | --- | --- |
| JSON generation | 18.67@74.9 | 31.69@74.9 | +69.7% |
| Technical explanation | 21.9@82.7 | 33.61@82.5 | +53.5% |
| Creative writing | 18.55@62.2 | 29.02@64.2 | +56.4% |
| Average | 19.71@73.3 | 31.44@73.9 | +59.5% |

Quant: UD-Q8_K_XL

| Branch | STP | MTP | Speedup |
| --- | --- | --- | --- |
| `master` | 13.3 | 13.95@75.2 | +4.9% |
| `sycl-mmvq-multicol` | 13.36 | 26.14@76.7 | +95.7% |

| Prompt Type | `master` | `sycl-mmvq-multicol` | Speedup |
| --- | --- | --- | --- |
| JSON generation | 14.33@78.6 | 26.43@78.2 | +84.4% |
| Technical explanation | 15.04@85 | 28.3@87.2 | +88.2% |
| Creative writing | 12.48@61.9 | 23.68@64.8 | +89.7% |
| Average | 13.95@75.2 | 26.14@76.7 | +87.4% |
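
The Speedup columns above appear to be plain relative changes in t/s: MTP against STP on the same branch in the Branch tables, and `sycl-mmvq-multicol` against `master` in the Prompt Type tables. The following sketch recomputes a few of them under that assumption; `speedup_pct` is a hypothetical helper, not part of the PR.

```python
def speedup_pct(baseline_tps: float, new_tps: float) -> float:
    """Relative change from baseline to new throughput, in percent."""
    return round((new_tps / baseline_tps - 1.0) * 100.0, 1)


# Q4_K_XL, master: MTP (19.71 t/s) against STP (22.47 t/s)
print(speedup_pct(22.47, 19.71))   # -12.3
# Q4_K_XL, average MTP: sycl-mmvq-multicol (31.44) against master (19.71)
print(speedup_pct(19.71, 31.44))   # 59.5
# Q8_K_XL, average MTP: sycl-mmvq-multicol (26.14) against master (13.95)
print(speedup_pct(13.95, 26.14))   # 87.4
```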

Validation

test-backend-ops MUL_MAT tests passed (920/920)
Batch=1 generation: no regression
Speculative decoding: correct output, improved generation speed

To Reproduce

With Docker Compose

services:
  llama-sycl:
    build:
      context: ./llama.cpp
      dockerfile: .devops/intel.Dockerfile
      target: server
      args:
        GGML_SYCL_F16: "ON"
    container_name: llama-sycl
    restart: unless-stopped

    ports:
      - 8080:8080

    devices:
      - /dev/dri/:/dev/dri/

    volumes:
      - ./models:/models

    environment:
      ### Server
      LLAMA_ARG_HOST: 0.0.0.0
      LLAMA_ARG_PORT: 8080
      LLAMA_ARG_ENDPOINT_METRICS: 1
      ### Model
      LLAMA_ARG_ALIAS: Qwen3.6-Dense-MTP
      LLAMA_ARG_MODEL: /models/Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf
      LLAMA_ARG_CTX_SIZE: 65536
      LLAMA_ARG_N_GPU_LAYERS: -1
      LLAMA_ARG_CHAT_TEMPLATE_KWARGS: '{"preserve_thinking":true}'
      LLAMA_ARG_SPEC_TYPE: draft-mtp
      LLAMA_ARG_SPEC_DRAFT_N_MAX: 2

With Router Preset

[Qwen3.6-Dense-MTP]
load-on-startup = true
ctx-size = 65536
model = /models/Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf
n-gpu-layers = -1
chat-template-kwargs = {"preserve_thinking":true}
spec-type = draft-mtp
spec-draft-n-max = 2

With Args

--ctx-size 65536
--model /models/Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf
--n-gpu-layers -1
--chat-template-kwargs '{"preserve_thinking":true}'
--spec-type draft-mtp
--spec-draft-n-max 2
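
The three configurations above carry the same settings under different spellings. As a rough sketch, assuming the naming pattern visible in these snippets (preset key `ctx-size`, flag `--ctx-size`, environment variable `LLAMA_ARG_CTX_SIZE`) holds for the keys shown, a preset can be translated mechanically. The helpers `preset_to_args` and `preset_to_env` are hypothetical, and the pattern is not verified for options outside this list.

```python
def preset_to_args(preset: dict) -> list:
    """Turn preset keys into '--key value' command-line pairs."""
    args = []
    for key, value in preset.items():
        args += [f"--{key}", value]
    return args


def preset_to_env(preset: dict) -> dict:
    """Turn preset keys into LLAMA_ARG_* names (pattern assumed from the snippets above)."""
    return {"LLAMA_ARG_" + key.upper().replace("-", "_"): value
            for key, value in preset.items()}


# Only the keys that appear in all three configurations above.
preset = {
    "ctx-size": "65536",
    "model": "/models/Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf",
    "n-gpu-layers": "-1",
    "spec-type": "draft-mtp",
    "spec-draft-n-max": "2",
}

print(" ".join(preset_to_args(preset)))
for name, value in preset_to_env(preset).items():
    print(f"{name}: {value}")
```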

Notes

spec-draft-n-max > 2 degrades performance
MoE paths are untouched; MoE-MTP will be slow.

Scope

In

Standard path: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ4_XS, MXFP4, NVFP4
Reorder path: Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K

Out

MoE paths
IQ types (except IQ4_XS) due to incompatible vec_dot signatures
Consolidation into a reusable macro. Keeping the changes explicit matches the CUDA backend's style, and the style within mmvq.cpp.

Requirements

I have read and agree with the contributing guidelines
AI-assisted: Yes. Claude Code (Opus 4.8) was used for debugging and for generating the large amount of boilerplate code. All validation and benchmarking was run and analyzed by me, on my hardware.

arthw · 2026-04-15T05:04:34Z

@masonmilby
I tested with gemma-4-31B-it-UD-Q4_K_XL.gguf, gemma-4-E2B-it-UD-Q4_K_XL.gguf and Qwen3.5-4B-Q4_K_M.gguf on B60 and Arc770.
Whether built with fp32 or fp16, there is no performance increase in any case.

Could you share the performance test command?
I will try again.

Thank you!

masonmilby · 2026-04-15T07:53:51Z

@arthw

EDIT: I've been doing some experimenting, and I believe this to be the most reproducible setup to demonstrate the core issue of speculative decoding on SYCL being dramatically under-optimized. No cache, no reasoning, just SD now working as intended:

Try building & running with this compose

services:
  llama-sycl:
    build:
      context: ./llama.cpp
      dockerfile: .devops/intel.Dockerfile
      target: server
      args:
        GGML_SYCL_F16: "ON"
    container_name: llama-sycl
    restart: unless-stopped

    ports:
      - 8080:8080

    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/card0:/dev/dri/card0

    volumes:
      - ./models:/models

    environment:
      ### Server
      LLAMA_ARG_MODELS_MAX: 1
      LLAMA_ARG_HOST: 0.0.0.0
      LLAMA_ARG_PORT: 8080
      LLAMA_ARG_ENDPOINT_METRICS: 1
      ### Repeatability
      LLAMA_ARG_CACHE_RAM: 0
      LLAMA_ARG_CACHE_PROMPT: false
      LLAMA_ARG_REASONING: off
      ### Main
      LLAMA_ARG_MODEL: /models/gemma-4-31B-it-UD-Q4_K_XL.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_GPU_LAYERS: -1
      LLAMA_ARG_FLASH_ATTN: on
      LLAMA_ARG_CACHE_TYPE_K: q8_0
      LLAMA_ARG_CACHE_TYPE_V: q8_0
      ### Draft
      LLAMA_ARG_MODEL_DRAFT: /models/gemma-4-E2B-it-UD-Q4_K_XL.gguf
      LLAMA_ARG_N_GPU_LAYERS_DRAFT: -1
      LLAMA_ARG_DRAFT_MIN: 0
      LLAMA_ARG_DRAFT_MAX: 4

Then test with this script

import argparse
import json
import urllib.request

def post(host: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"http://{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="localhost:8080")
    p.add_argument("--runs", type=int, default=3)
    p.add_argument("--warmup", type=int, default=1)
    p.add_argument("--temp", type=float, default=0.7)
    p.add_argument("--max-tokens", type=int, default=256)
    p.add_argument("--snippet", type=int, default=25)
    args = p.parse_args()

    payload = {
        "messages": [{"role": "user", "content": "Explain how a CPU executes a single instruction, from fetch to retire."}],
        "max_tokens": args.max_tokens,
        "temperature": args.temp,
    }

    for _ in range(args.warmup):
        print("Warming up...", end=" ", flush=True)
        post(args.host, payload)
        print("done")

    for i in range(1, args.runs + 1):
        r = post(args.host, payload)
        t = r.get("timings", {})
        line = f"Run {i}/{args.runs}: {t.get('predicted_per_second', 0):.2f} t/s ({t.get('predicted_n', 0)} tokens)"
        draft_n = t.get("draft_n", 0)
        accept = t.get("draft_n_accepted")
        if draft_n and accept is not None:
            line += f", {100*accept/draft_n:.1f}% accept ({accept}/{draft_n})"
        print(line)
        if args.snippet > 0:
            msg = r["choices"][0]["message"]
            content = (msg.get("content") or msg.get("reasoning_content") or "").strip().replace("\n", " ")
            snippet = content[: args.snippet] + ("..." if len(content) > args.snippet else "")
            print(f"    {snippet}")


if __name__ == "__main__":
    main()

Workflow

Checkout master
docker compose build --no-cache
docker compose up
Test
Checkout sycl-mmvq-multicol
docker compose build --no-cache
docker compose up
Test

Compare

Speculative decoding on `master`:

Run 1/3: 8.53 t/s (256 tokens), 50.2% accept (150/299)
The process of executing ...
Run 2/3: 8.78 t/s (256 tokens), 53.3% accept (154/289)
Executing a single instru...
Run 3/3: 8.52 t/s (256 tokens), 50.7% accept (153/302)
The process of executing ...

Speculative decoding on `sycl-mmvq-multicol`:

Run 1/3: 15.94 t/s (256 tokens), 52.0% accept (154/296)
The execution of a single...
Run 2/3: 19.20 t/s (256 tokens), 51.0% accept (153/300)
To understand how a CPU e...
Run 3/3: 18.92 t/s (256 tokens), 50.3% accept (149/296)
The process of executing ...
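
To compare the two runs above at a glance, the per-run t/s figures can be averaged. This is a small sketch that parses the `Run i/n` lines exactly as printed by the test script; `mean_tps` is a hypothetical helper, and the result is a simple mean over the three measured runs on each branch.

```python
import re
from statistics import mean

# Matches the "Run i/n: X t/s" prefix printed by the test script above.
RUN_RATE = re.compile(r"^Run \d+/\d+: ([\d.]+) t/s", re.MULTILINE)


def mean_tps(log: str) -> float:
    """Average the t/s figure over every 'Run i/n' line in a log."""
    return mean(float(rate) for rate in RUN_RATE.findall(log))


master_log = """\
Run 1/3: 8.53 t/s (256 tokens), 50.2% accept (150/299)
Run 2/3: 8.78 t/s (256 tokens), 53.3% accept (154/289)
Run 3/3: 8.52 t/s (256 tokens), 50.7% accept (153/302)
"""

branch_log = """\
Run 1/3: 15.94 t/s (256 tokens), 52.0% accept (154/296)
Run 2/3: 19.20 t/s (256 tokens), 51.0% accept (153/300)
Run 3/3: 18.92 t/s (256 tokens), 50.3% accept (149/296)
"""

master = mean_tps(master_log)
branch = mean_tps(branch_log)
# prints: master: 8.61 t/s, branch: 18.02 t/s, ratio: 2.09x
print(f"master: {master:.2f} t/s, branch: {branch:.2f} t/s, ratio: {branch / master:.2f}x")
```

Acceptance rates are nearly identical on both branches (about 50 to 53%), so the difference comes from the speed of the verify step rather than from draft quality.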

arthw · 2026-04-16T07:41:42Z

OK, got it!
I will try again!

Thank you!

arthw · 2026-04-16T14:17:59Z

@masonmilby
Sorry, there is not enough memory to load two LLMs: gemma-4-31B-it-UD-Q4_K_XL.gguf and gemma-4-E2B-it-UD-Q4_K_XL.gguf.
I can't see the benefit with only one of them loaded.

I hope others can help verify this PR!

Thank you!

masonmilby · 2026-04-16T19:24:37Z

@arthw
Understood, thank you for trying!

NeoZhangJianyu · 2026-04-21T02:48:43Z

@masonmilby
When this PR is ready to review and test again, please inform us!

Thank you!

masonmilby · 2026-04-21T04:46:14Z

@NeoZhangJianyu
Ready for review, thank you!

NeoZhangJianyu · 2026-04-22T14:38:02Z

@masonmilby
The latest code is rebased.
I find that performance is lower than the base on B60.

./examples/sycl/test.sh -m ../models/gemma-4-E2B-it-UD-Q4_K_XL.gguf

77.88 -> 64.03 tokens per second

This PR had no impact before the rebase.
I guess that after the rebase, some optimization code conflicts with this PR.

Could you check it?

Thank you!

masonmilby · 2026-04-24T23:15:31Z

@NeoZhangJianyu
Are you building and testing locally, or with .devops/intel.Dockerfile?

I can't replicate the regression you're seeing - pre or post rebase.

NeoZhangJianyu · 2026-04-25T03:31:34Z

> @NeoZhangJianyu Are you building and testing locally, or with .devops/intel.Dockerfile?
>
> I can't replicate the regression you're seeing - pre or post rebase.

I build and test locally.

Thank you!

arthw

@masonmilby
I tested it on B60 with gemma-4-E2B-it-UD-Q4_K_XL.gguf and Qwen3.5-4B-Q4_K_M.gguf.
There is no impact on PP or TG performance.

Code comes from: https://github.com/masonmilby/llama.cpp
Base:
commit 9789512 (HEAD)
Author: leonardHONG 2695316095@qq.com
Date: Tue Apr 21 05:30:38 2026 +0800

PR:
commit 32cc081 (HEAD -> sycl-mmvq-multicol

Test cmd:

./build/bin/llama-bench -m ../models/gemma-4-E2B-it-UD-Q4_K_XL.gguf

Could you check it?

masonmilby · 2026-05-09T02:47:20Z

@arthw

Single-model inference is not affected by this PR. Your results are correct.

You will only see a difference when running with both --model and --model-draft (assuming compatible model architecture and a properly sized draft model)

arthw · 2026-05-11T09:48:34Z

@masonmilby
OK!
Could you share the test commands for the two-model setup in your test?
I will try to reproduce your test result if possible.

Thank you!

R-SITES · 2026-05-18T15:52:53Z

Tested this PR on Intel Arc Pro B70 (PCI 8086:e223, 32GB), Qwen3.6-35B-A3B-Q4_K_M with SYCL backend (oneAPI 2025.3, mainline b9187 + GDN K>1 fix from #23174).

20 of 21 hunks applied cleanly (1 dispatch hunk failed). Partial application compiled successfully but model output is broken — single-word replies with nonsensical timing stats. Reverting restores correct output.

The 45% speedup claim is exactly what SYCL speculative decoding needs. Would love to see a rebased version that applies cleanly to current master (b9187+). Happy to retest.

mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.

ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
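
The reorder gating described in the commit message can be sketched as a before/after predicate. This is a simplified illustration in Python, assuming the column count `ne[1]` is the only condition that changed; the real `should_reorder_tensor` in the SYCL backend presumably checks other things as well, and the function names below are hypothetical.

```python
MAX_BOOTSTRAP_COLS = 8  # threshold stated in the PR text (ne[1] <= 8)


def bootstraps_reorder_before(ne1: int) -> bool:
    """Old gate: only single-column (single-token) mat-vec triggers the reorder."""
    return ne1 == 1


def bootstraps_reorder_after(ne1: int) -> bool:
    """New gate: small multi-column batches trigger the reorder too."""
    return ne1 <= MAX_BOOTSTRAP_COLS


# Hypothetical column counts: 1 = plain decode, 3 = a small verify batch,
# 16 = a larger batch that stays on the non-reorder path.
for ne1 in (1, 3, 16):
    print(ne1, bootstraps_reorder_before(ne1), bootstraps_reorder_after(ne1))
```

Under the old gate, a workload that only ever issues multi-column mat-vec never flips the weights into the reordered layout, which matches the slow path the commit message describes.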

masonmilby · 2026-06-04T09:24:41Z

@R-SITES Rebased, and fixed a related MTP warmup issue. Mind giving it another shot?

13.31 t/s --> 25.26 t/s (Dense -> Dense-MTP)
spec-draft-n-max = 2 seems to be the limit on my hardware, >2 degrades performance.
Tested and noted uplift on Qwen3.6-27B-MTP:Q4_K as well (@arthw)
MoE models are not affected by this PR

My router presets for your reference:

[Qwen3.6-Dense]
load-on-startup = false
ctx-size = 262144
model = /models/Qwen3.6-27B-UD-Q8_K_XL.gguf
n-gpu-layers = -1
flash-attn = on
chat-template-kwargs = {"preserve_thinking":true}

[Qwen3.6-Dense-MTP]
load-on-startup = true
ctx-size = 262144
model = /models/Qwen3.6-27B-UD-Q8_K_XL-MTP.gguf
n-gpu-layers = -1
flash-attn = on
chat-template-kwargs = {"preserve_thinking":true}
spec-type = draft-mtp
spec-draft-n-max = 2

R-SITES · 2026-06-04T14:42:58Z

@masonmilby — tested the rebased PR on Intel Arc Pro B70 (Battlemage, PCI 8086:e223, 32GB) with Qwen3.6-27B dense Q4_K_S. Works.

Dense MTP (PR #21845 applied):

| Config | Server tok/s | Draft Accept |
| --- | --- | --- |
| no-MTP baseline | 25.23 | — |
| MTP n_max=2 (cold) | 34.52 | 80.3% |
| MTP n_max=2 (warm) | 32.93 | 74.8% |

+37% over no-MTP on SYCL. First time MTP has beaten pure AR decode on this hardware. Draft gen is ~480ms for 76 calls (6.3ms/call), and the MMVQ improvement is what makes it viable.

n_max=3 regresses to ~29 tok/s on B70 (same pattern you saw — >2 degrades). Dense-only as expected — MoE (35B-A3B) is unaffected.

Happy to test additional configs if you need more data points.

Test setup:

Build: b9484 (63e66fd) + #21845
Compiler: Intel oneAPI 2025.3
GGML_SYCL_F16=1
-t 24 -ngl 99 -ub 512 -b 4096 -np 1

masonmilby · 2026-06-04T15:28:55Z

@R-SITES That's FANTASTIC! Thank you for testing!

I'm working on gathering data across more quants (Q4_K_XL and Q8_K_XL), and updating the write-ups. Should be ready soon.

I see you originally tested with 35B – MoE paths will likely come as a separate PR once this foundation is merged.

More data is always welcome, enjoy!

arthw

Here is my test result on B60:

./build/bin/llama-server -m ../models/Qwen3.6-27B-MTP-Q4_K_S.gguf -fa on --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 2 -c 262144

6.32 -> 8.48

./build/bin/llama-server -m ../models/Qwen3.6-27B-MTP-Q4_K_S.gguf -fa on --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 2

19.51 -> 27.78

Good job!
MTP really does speed things up.

Thank you!

masonmilby · 2026-06-05T03:51:40Z

@arthw Thank you! I'm happy to contribute

@NeoZhangJianyu Think you could give this another look?

NeoZhangJianyu · 2026-06-05T04:17:37Z

It's OK to me! I have no comments. :)

Thank you!

tac39us-stack · 2026-06-05T06:48:27Z

error loading model: unknown model architecture: 'gemma4_assistant'

When will Gemma4_MTP be supported for running these files?

gemma-4-E2B-it-assistant
gemma-4-E4B-it-assistant
gemma-4-26B-A4B-it-assistant
gemma-4-31B-it-assistant
gemma-4-12B-it-assistant

masonmilby · 2026-06-05T07:08:40Z

@tac39us-stack That work is happening on PR #23398

sheigl · 2026-06-06T02:10:18Z

Amazing! Getting a steady 20 t/s with high context on a B70! Great work!

mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.

ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.

(cherry picked from commit 7fe2ae4)

masonmilby requested a review from a team as a code owner April 13, 2026 08:44

github-actions Bot added the ggml and SYCL labels Apr 13, 2026

masonmilby changed the title ~~sycl : port multi-column MMVQ from CUDA backend (~75% speculative decoding speedup on Intel Arc)~~ sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) Apr 15, 2026

arthw reviewed May 9, 2026


masonmilby marked this pull request as draft May 9, 2026 06:07

R-SITES mentioned this pull request May 23, 2026

SYCL MTP on Intel Arc: correct output but no speed gain over baseline #23533

Open

masonmilby force-pushed the sycl-mmvq-multicol branch from d5ca092 to 113d79e Compare June 4, 2026 08:52

masonmilby marked this pull request as ready for review June 4, 2026 17:46

masonmilby requested a review from arthw June 4, 2026 17:46

arthw approved these changes Jun 5, 2026


arthw added the merge ready label Jun 5, 2026

ggerganov merged commit 7fe2ae4 into ggml-org:master Jun 5, 2026
23 of 25 checks passed
