
chore: sync with upstream ggml-org/llama.cpp (106 commits) #46

Merged

TheTom merged 89 commits into feature/turboquant-kv-cache from pr/upstream-sync-april on Apr 2, 2026
Conversation

TheTom (Owner) commented Apr 2, 2026

Summary

Merge 106 upstream commits from ggml-org/llama.cpp master into our fork.

Conflicts Resolved

4 files, 6 conflict regions:

  • fattn-tile.cu: added upstream case 512 before our HIP guard, kept our case 576/640 inside the guard (shape sketched after this list)
  • fattn.cu: took the union of the WMMA/MFMA head-dim exclusion lists (512 + 576 + 640)
  • llama-graph.cpp: kept both the TurboQuant V unpad and the upstream self_v_rot (ordered: unpad first, then v_rot)
  • llama-kv-cache.cpp: kept both the InnerQ externs and the upstream WHT helper functions
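
A minimal sketch of the merged head-dim dispatch shape in fattn-tile.cu. Illustrative only: the real switch, kernel launches, and guard differ, and `GGML_USE_HIP` here stands in for whatever guard the fork actually uses.

```cpp
// Shape of the resolved fattn-tile.cu conflict (illustrative, not the
// actual code): upstream's case 512 sits before the fork's HIP-guarded
// TurboQuant head dims.
static bool fattn_tile_supports_head_dim(int head_dim) {
    switch (head_dim) {
        case 512:            // upstream addition, all targets
            return true;
#ifdef GGML_USE_HIP          // hypothetical stand-in for the fork's guard
        case 576:            // fork's TurboQuant head dims,
        case 640:            // compiled only under the HIP guard
            return true;
#endif
        default:
            return false;
    }
}
```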

Before merge

  • Metal regression tests (M5 Max)
  • Metal regression tests (M2 Pro)
  • CUDA build verification
  • HIP build verification

DO NOT MERGE until regression tests pass and the CUDA folks confirm the build.

🤖 Generated with Claude Code

Xuan-Son Nguyen and others added 30 commits March 26, 2026 19:49
* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size
…r deepseek-ocr (ggml-org#21027)

* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: update docker images to 8.5.0

- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0

Move to newer stable releases.

* cann: update CANN.md

* Update CANN.md to include BF16 support

Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.

* Fix formatting issues in CANN.md

Fix 234: Trailing whitespace
…ml-org#21048)

Updates the Metal tensor API test probe to fix the dimension-constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
…ng_content API field (ggml-org#21036)

* webui: send reasoning_content back to model in context

Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).

Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.

* webui: add syncable parameter for excludeReasoningFromContext

* chore: update webui build output
…correct) (ggml-org#20917)

The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
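
For context, the bug class is an STL insert over an empty iterator range; a hedged reconstruction of the described fix, not the verbatim llama.cpp code:

```cpp
#include <vector>

// With both iterators at embd.begin(), the range [first, last) is
// empty and insert() is a no-op, so the session never grows.
void update_session(std::vector<int> & session_tokens, const std::vector<int> & embd) {
    // broken: empty range, session_tokens never updated
    // session_tokens.insert(session_tokens.end(), embd.begin(), embd.begin());

    // fixed: copy the whole decoded batch
    session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
}
```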
The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is, which prevents proper validation on the server side. This patch fixes that by serializing the data pointer as 0 for non-RPC buffers and doing proper validation on the server side.

closes: ggml-org#21006
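
A hedged sketch of the serialization rule described above; the struct and field names are simplified stand-ins for ggml-rpc's actual wire format:

```cpp
#include <cstdint>

// Simplified stand-in for the RPC tensor wire representation.
struct rpc_tensor_wire {
    uint64_t buffer; // 0 when the tensor lives in a non-RPC (e.g. CPU) buffer
    uint64_t data;   // now also 0 for non-RPC buffers, so the server can validate it
};

rpc_tensor_wire serialize(const void * data_ptr, bool is_rpc_buffer, uint64_t remote_buffer) {
    rpc_tensor_wire w;
    w.buffer = is_rpc_buffer ? remote_buffer : 0;
    // previously: w.data carried the raw host pointer even for CPU
    // buffers, which the server had no way to validate meaningfully
    w.data = is_rpc_buffer ? (uint64_t) (uintptr_t) data_ptr : 0;
    return w;
}
```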
* wip: server_tools

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* change arg to --tools all

* add readme mention

* llama-gen-docs
* server: respect the verbose_prompt parameter

* Revert "server: respect the verbose_prompt parameter"

This reverts commit 8ed885c.

* Remove --verbose-prompt parameter from llama-server

* Using set_examples instead of set_excludes
… lazy loading with transitions to content blocks (ggml-org#20999)

* refactor: Always use agentic content renderer for Assistant Message

* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks

* chore: update webui build output
* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support

- Add IQ4_NL quantization type support to Hexagon backend (buffer
  set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
  LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
  including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
  Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path

* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models

* Fix the pragma indent
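
As background on the LUT-based dequantization these kernels use, a minimal scalar reference for IQ4_NL. The 16-entry codebook matches ggml's kvalues_iq4nl; the block layout shown is a simplified sketch of the nibble packing, which the HVX kernels perform with vector table lookups instead.

```cpp
#include <cstdint>

// IQ4_NL maps each 4-bit index to a non-linear int8 codebook value.
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// One 32-element block stores 16 bytes of packed indices: low nibbles
// hold elements 0..15, high nibbles elements 16..31 (ggml's Q4-family
// convention). The per-block fp16 scale is applied separately.
static void dequant_iq4nl_block(const uint8_t qs[16], int8_t out[32]) {
    for (int j = 0; j < 16; ++j) {
        out[j]      = kvalues_iq4nl[qs[j] & 0x0F];
        out[j + 16] = kvalues_iq4nl[qs[j] >> 4];
    }
}
```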
… embedded web ui (ggml-org#20158)

* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* MIssed this

* Add useWebUi to package.nix
…-org#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
…21056)

* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
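
For reference, the option setup the traces above reflect is plain POSIX; a minimal sketch of the conditional SO_REUSEPORT path (illustrative names, not llama-server's actual socket code):

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Sketch of the socket setup shown in the strace output above;
// SO_REUSEPORT is now opt-in via --reuse-port instead of always on.
int make_listen_socket(bool reuse_port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,  &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET,  SO_REUSEADDR, &one, sizeof(one));
    if (reuse_port) {
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    }
    return fd; // caller binds and listens as usual
}
```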

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix windows

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* add /glob command

* output error when max files reached

* support globbing outside curdir
…ml-org#21085)

* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker
* vulkan: add noncontiguous GLU support

* fix compile issue
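
For context, GLU-family ops gate one tensor with an activation of another; a scalar sketch of what "noncontiguous" support means here, reading through element strides instead of assuming dense rows. SwiGLU is picked as the example variant, and the actual change is of course in the Vulkan shaders, not scalar code:

```cpp
#include <cmath>
#include <cstddef>

// dst[i] = silu(gate[i]) * val[i], with gate and value addressed
// through arbitrary element strides rather than assumed-dense memory.
static void swiglu_strided(const float * val,  size_t val_stride,
                           const float * gate, size_t gate_stride,
                           float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float g    = gate[i * gate_stride];
        const float silu = g / (1.0f + std::exp(-g));
        dst[i] = silu * val[i * val_stride];
    }
}
```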
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…schema pattern converter (ggml-org#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
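
A self-contained sketch of the two skip behaviors the fix describes. The helper names are hypothetical, not the actual _visit_pattern internals:

```cpp
#include <cstddef>
#include <string>

// After consuming '(' at index i: returns how many chars to skip so a
// non-capturing group "(?:...)" is parsed like a plain group.
static size_t skip_noncapturing_prefix(const std::string & pat, size_t i) {
    if (i + 1 < pat.size() && pat[i] == '?' && pat[i + 1] == ':') {
        return 2; // step past "?:"; the group body then parses normally
    }
    return 0;
}

// For unsupported "(?=", "(?!", ... : advance to the matching ')',
// treating escaped characters as literals so the depth count stays
// correct. Returns the index just past the matching ')'.
static size_t skip_to_matching_paren(const std::string & pat, size_t i) {
    int depth = 1;
    while (i < pat.size() && depth > 0) {
        if (pat[i] == '\\' && i + 1 < pat.size()) {
            i += 2; // escaped char never counts toward depth
            continue;
        }
        if (pat[i] == '(') depth++;
        if (pat[i] == ')') depth--;
        i++;
    }
    return i;
}
```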
martin-klacer-arm and others added 8 commits April 1, 2026 20:02
* kleidiai: add cpu feature detection to CI run script

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a

* kleidiai: revert unrelated requirements change

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* kleidiai: removed cpu feature detection from CI run script

 * As per the maintainers' suggestion, removed cpu feature detection
   from CI run script as CMake handles it already

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
…l-org#21269)

* fix: Bypass API Key validation for static bundle assets

* refactor: All bypassed routes in `public_endpoints`

* test: Update static assets API Key test
…gml-org#21270)

* contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage

* permit AI for writing code
* hexagon : add cumsum op support

* hexagon: enable dma for cumsum op

* Fix line-ending

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
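
For reference, the op being offloaded is an inclusive prefix sum; a scalar reference of what the Hexagon (and now DMA-enabled) path computes:

```cpp
// dst[i] = src[0] + src[1] + ... + src[i]
static void cumsum_f32(const float * src, float * dst, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += src[i];
        dst[i] = acc;
    }
}
```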
Resolved 4 conflict files:
- fattn-tile.cu: added upstream case 512 before HIP guard, kept our 576/640
- fattn.cu: union of WMMA/MFMA exclusion lists (512+576+640)
- llama-graph.cpp: keep both turbo V unpad and upstream self_v_rot (ordered: unpad first)
- llama-kv-cache.cpp: keep both InnerQ externs and upstream WHT helpers

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom (Owner, Author) commented Apr 2, 2026

Regression Test Results — PR #46 (Upstream Sync)

Branch: pr/upstream-sync-april (04eeabb)
Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)

M5 Max — Speed

| Model | Config | pp512 (t/s) | tg128 (t/s) | Status |
|---|---|---|---|---|
| Qwen2.5-1.5B Q8_0 | q8_0/q8_0 | 11,036 | 214 | ✅ +8% improved |
| Qwen2.5-1.5B Q8_0 | q8_0/turbo4 | 10,485 | 138 | |
| Qwen2.5-1.5B Q8_0 | q8_0/turbo3 | 10,448 | 132 | |
| Phi-4 14B Q8_0 | q8_0/q8_0 | 1,099 | 33.6 | |
| Phi-4 14B Q8_0 | q8_0/turbo4 | CRASH | CRASH | ❌ REGRESSION |
| Qwen3.5-27B Q8_0 | q8_0/q8_0 | 560 | 18.1 | |
| Qwen3.5-27B Q8_0 | q8_0/turbo4 | 554 | 17.4 | |
| Qwen3.5-35B MoE Q8_0 | q8_0/q8_0 | 2,916 | 90.4 | ✅ +18% improved |
| Qwen3.5-35B MoE Q8_0 | q8_0/turbo4 | 2,859 | 78.2 | ✅ +12% improved |

M5 Max — PPL

| Model | Config | PPL | Status |
|---|---|---|---|
| Qwen2.5-1.5B | q8_0/q8_0 | 10.31 | ✅ matches known-good |
| Qwen2.5-1.5B | q8_0/turbo4 | 10.42 | |
| Phi-4 14B | q8_0/q8_0 | 6.54 | ✅ matches known-good |
| Qwen3.5-27B | q8_0/q8_0 | 6.87 | ✅ matches known-good |
| Qwen3.5-35B MoE | q8_0/q8_0 | 6.53 | ✅ matches known-good |

Mac Mini M2 Pro — Speed

| Model | Config | pp512 (t/s) | tg128 (t/s) | Status |
|---|---|---|---|---|
| Qwen2.5-7B Q4_K_M | q8_0/q8_0 | 352 | 35.2 | |
| Qwen2.5-7B Q4_K_M | q8_0/turbo4 | 349 | 30.1 | |
| Qwen2.5-7B Q4_K_M | q8_0/turbo3 | 349 | 29.2 | |
| Qwen2.5-7B Q4_K_M | turbo3/turbo3 | 345 | 25.8 | |

❌ Regression: Phi-4 + turbo4 KV crash

ggml/src/ggml.c:3620: GGML_ASSERT(ggml_is_contiguous(a)) failed
in ggml_reshape_2d, called from build_attn_mha (line 1915)

This crash does NOT occur on the pre-sync base (PR #45 branch, commit 6c3e503). Phi-4 + turbo4 works fine there (30.9 t/s decode). The crash is introduced by one of the 106 upstream commits interacting with our TurboQuant KV code in the build_attn_mha flash attention path.

Needs investigation before merge. All other models and configs pass.
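
For whoever picks this up, a hedged pointer at the usual shape of a fix for this assert, not a confirmed diagnosis of the regression: ggml_reshape_* requires a contiguous source, and a non-contiguous view is normally materialized with ggml_cont first.

```cpp
#include "ggml.h"

// If an upstream change now hands build_attn_mha a non-contiguous
// view, forcing a dense copy before the reshape is the standard remedy.
struct ggml_tensor * reshape_safe(struct ggml_context * ctx,
                                  struct ggml_tensor * t,
                                  int64_t ne0, int64_t ne1) {
    if (!ggml_is_contiguous(t)) {
        t = ggml_cont(ctx, t); // copy into a dense layout
    }
    return ggml_reshape_2d(ctx, t, ne0, ne1);
}
```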

Improvements noted

  • Qwen 1.5B decode +8% (198 → 214 t/s)
  • MoE 35B decode +18% (76.6 → 90.4 t/s)

Likely from upstream Metal optimizations.

@TheTom TheTom merged commit 04eeabb into feature/turboquant-kv-cache Apr 2, 2026
37 of 75 checks passed