
chore: sync with upstream ggml-org/llama.cpp (106 commits) #46

Merged

TheTom merged 89 commits into feature/turboquant-kv-cache from pr/upstream-sync-april on Apr 2, 2026
Conversation

TheTom (Owner) commented Apr 2, 2026

Summary

Merge 106 upstream commits from ggml-org/llama.cpp master into our fork.

Conflicts Resolved

4 files, 6 conflict regions:

  • fattn-tile.cu: added upstream case 512 before our HIP guard, kept our case 576/640 inside the guard (shape sketched after this list)
  • fattn.cu: took the union of the WMMA/MFMA head-dim exclusion lists (512 + 576 + 640)
  • llama-graph.cpp: kept both the TurboQuant V unpad and the upstream self_v_rot (ordered: unpad first, then v_rot)
  • llama-kv-cache.cpp: kept both the InnerQ externs and the upstream WHT helper functions
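
A minimal sketch of the merged head-dim dispatch shape in fattn-tile.cu. Illustrative only: the real switch, kernel launches, and guard differ, and `GGML_USE_HIP` here stands in for whatever guard the fork actually uses.

```cpp
// Shape of the resolved fattn-tile.cu conflict (illustrative, not the
// actual code): upstream's case 512 sits before the fork's HIP-guarded
// TurboQuant head dims.
static bool fattn_tile_supports_head_dim(int head_dim) {
    switch (head_dim) {
        case 512:            // upstream addition, all targets
            return true;
#ifdef GGML_USE_HIP          // hypothetical stand-in for the fork's guard
        case 576:            // fork's TurboQuant head dims,
        case 640:            // compiled only under the HIP guard
            return true;
#endif
        default:
            return false;
    }
}
```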

Before merge

  • Metal regression tests (M5 Max)
  • Metal regression tests (M2 Pro)
  • CUDA build verification
  • HIP build verification

DO NOT MERGE until regression tests pass and the CUDA folks confirm the build.

🤖 Generated with Claude Code

Xuan-Son Nguyen and others added 30 commits March 26, 2026 19:49
* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size
…r deepseek-ocr (ggml-org#21027)

* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: update docker images to 8.5.0

- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0

Move to newer stable releases.

* cann: update CANN.md

* Update CANN.md to include BF16 support

Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.

* Fix formatting issues in CANN.md

Fix 234: Trailing whitespace
…ml-org#21048)

Updates the Metal tensor API test probe to fix the dimension-constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
…ng_content API field (ggml-org#21036)

* webui: send reasoning_content back to model in context

Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).

Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.

* webui: add syncable parameter for excludeReasoningFromContext

* chore: update webui build output
…correct) (ggml-org#20917)

The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
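
For context, the bug class is an STL insert over an empty iterator range; a hedged reconstruction of the described fix, not the verbatim llama.cpp code:

```cpp
#include <vector>

// With both iterators at embd.begin(), the range [first, last) is
// empty and insert() is a no-op, so the session never grows.
void update_session(std::vector<int> & session_tokens, const std::vector<int> & embd) {
    // broken: empty range, session_tokens never updated
    // session_tokens.insert(session_tokens.end(), embd.begin(), embd.begin());

    // fixed: copy the whole decoded batch
    session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
}
```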
The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is, which prevents proper validation on the server side. This patch fixes that by serializing the data pointer as 0 for non-RPC buffers and doing proper validation on the server side.

closes: ggml-org#21006
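
A hedged sketch of the serialization rule described above; the struct and field names are simplified stand-ins for ggml-rpc's actual wire format:

```cpp
#include <cstdint>

// Simplified stand-in for the RPC tensor wire representation.
struct rpc_tensor_wire {
    uint64_t buffer; // 0 when the tensor lives in a non-RPC (e.g. CPU) buffer
    uint64_t data;   // now also 0 for non-RPC buffers, so the server can validate it
};

rpc_tensor_wire serialize(const void * data_ptr, bool is_rpc_buffer, uint64_t remote_buffer) {
    rpc_tensor_wire w;
    w.buffer = is_rpc_buffer ? remote_buffer : 0;
    // previously: w.data carried the raw host pointer even for CPU
    // buffers, which the server had no way to validate meaningfully
    w.data = is_rpc_buffer ? (uint64_t) (uintptr_t) data_ptr : 0;
    return w;
}
```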
* wip: server_tools

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* change arg to --tools all

* add readme mention

* llama-gen-docs
* server: respect the verbose_prompt parameter

* Revert "server: respect the verbose_prompt parameter"

This reverts commit 8ed885c.

* Remove --verbose-prompt parameter from llama-server

* Using set_examples instead of set_excludes
… lazy loading with transitions to content blocks (ggml-org#20999)

* refactor: Always use agentic content renderer for Assistant Message

* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks

* chore: update webui build output
* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support

- Add IQ4_NL quantization type support to Hexagon backend (buffer
  set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
  LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
  including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
  Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path

* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models

* Fix the pragma indent
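
As background on the LUT-based dequantization these kernels use, a minimal scalar reference for IQ4_NL. The 16-entry codebook matches ggml's kvalues_iq4nl; the block layout shown is a simplified sketch of the nibble packing, which the HVX kernels perform with vector table lookups instead.

```cpp
#include <cstdint>

// IQ4_NL maps each 4-bit index to a non-linear int8 codebook value.
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// One 32-element block stores 16 bytes of packed indices: low nibbles
// hold elements 0..15, high nibbles elements 16..31 (ggml's Q4-family
// convention). The per-block fp16 scale is applied separately.
static void dequant_iq4nl_block(const uint8_t qs[16], int8_t out[32]) {
    for (int j = 0; j < 16; ++j) {
        out[j]      = kvalues_iq4nl[qs[j] & 0x0F];
        out[j + 16] = kvalues_iq4nl[qs[j] >> 4];
    }
}
```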
… embedded web ui (ggml-org#20158)

* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* MIssed this

* Add useWebUi to package.nix
…-org#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
…21056)

* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
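
For reference, the option setup the traces above reflect is plain POSIX; a minimal sketch of the conditional SO_REUSEPORT path (illustrative names, not llama-server's actual socket code):

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Sketch of the socket setup shown in the strace output above;
// SO_REUSEPORT is now opt-in via --reuse-port instead of always on.
int make_listen_socket(bool reuse_port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,  &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET,  SO_REUSEADDR, &one, sizeof(one));
    if (reuse_port) {
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    }
    return fd; // caller binds and listens as usual
}
```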

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix windows

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* add /glob command

* output error when max files reached

* support globbing outside curdir
…ml-org#21085)

* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker
* vulkan: add noncontiguous GLU support

* fix compile issue
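
For context, GLU-family ops gate one tensor with an activation of another; a scalar sketch of what "noncontiguous" support means here, reading through element strides instead of assuming dense rows. SwiGLU is picked as the example variant, and the actual change is of course in the Vulkan shaders, not scalar code:

```cpp
#include <cmath>
#include <cstddef>

// dst[i] = silu(gate[i]) * val[i], with gate and value addressed
// through arbitrary element strides rather than assumed-dense memory.
static void swiglu_strided(const float * val,  size_t val_stride,
                           const float * gate, size_t gate_stride,
                           float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float g    = gate[i * gate_stride];
        const float silu = g / (1.0f + std::exp(-g));
        dst[i] = silu * val[i * val_stride];
    }
}
```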
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…schema pattern converter (ggml-org#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
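
A self-contained sketch of the two skip behaviors the fix describes. The helper names are hypothetical, not the actual _visit_pattern internals:

```cpp
#include <cstddef>
#include <string>

// After consuming '(' at index i: returns how many chars to skip so a
// non-capturing group "(?:...)" is parsed like a plain group.
static size_t skip_noncapturing_prefix(const std::string & pat, size_t i) {
    if (i + 1 < pat.size() && pat[i] == '?' && pat[i + 1] == ':') {
        return 2; // step past "?:"; the group body then parses normally
    }
    return 0;
}

// For unsupported "(?=", "(?!", ... : advance to the matching ')',
// treating escaped characters as literals so the depth count stays
// correct. Returns the index just past the matching ')'.
static size_t skip_to_matching_paren(const std::string & pat, size_t i) {
    int depth = 1;
    while (i < pat.size() && depth > 0) {
        if (pat[i] == '\\' && i + 1 < pat.size()) {
            i += 2; // escaped char never counts toward depth
            continue;
        }
        if (pat[i] == '(') depth++;
        if (pat[i] == ')') depth--;
        i++;
    }
    return i;
}
```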
martin-klacer-arm and others added 8 commits April 1, 2026 20:02
* kleidiai: add cpu feature detection to CI run script

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a

* kleidiai: revert unrelated requirements change

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* kleidiai: removed cpu feature detection from CI run script

 * As per the maintainers' suggestion, removed cpu feature detection
   from CI run script as CMake handles it already

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
…l-org#21269)

* fix: Bypass API Key validation for static bundle assets

* refactor: All bypassed routes in `public_endpoints`

* test: Update static assets API Key test
…gml-org#21270)

* contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage

* permit AI for writing code
* hexagon : add cumsum op support

* hexagon: enable dma for cumsum op

* Fix line-ending

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
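
For reference, the op being offloaded is an inclusive prefix sum; a scalar reference of what the Hexagon (and now DMA-enabled) path computes:

```cpp
// dst[i] = src[0] + src[1] + ... + src[i]
static void cumsum_f32(const float * src, float * dst, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += src[i];
        dst[i] = acc;
    }
}
```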
Resolved 4 conflict files:
- fattn-tile.cu: added upstream case 512 before HIP guard, kept our 576/640
- fattn.cu: union of WMMA/MFMA exclusion lists (512+576+640)
- llama-graph.cpp: keep both turbo V unpad and upstream self_v_rot (ordered: unpad first)
- llama-kv-cache.cpp: keep both InnerQ externs and upstream WHT helpers

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom (Owner, Author) commented Apr 2, 2026

Regression Test Results — PR #46 (Upstream Sync)

Branch: pr/upstream-sync-april (04eeabb)
Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)

M5 Max — Speed

| Model | Config | pp512 (t/s) | tg128 (t/s) | Status |
|---|---|---|---|---|
| Qwen2.5-1.5B Q8_0 | q8_0/q8_0 | 11,036 | 214 | ✅ +8% improved |
| Qwen2.5-1.5B Q8_0 | q8_0/turbo4 | 10,485 | 138 | |
| Qwen2.5-1.5B Q8_0 | q8_0/turbo3 | 10,448 | 132 | |
| Phi-4 14B Q8_0 | q8_0/q8_0 | 1,099 | 33.6 | |
| Phi-4 14B Q8_0 | q8_0/turbo4 | CRASH | CRASH | ❌ REGRESSION |
| Qwen3.5-27B Q8_0 | q8_0/q8_0 | 560 | 18.1 | |
| Qwen3.5-27B Q8_0 | q8_0/turbo4 | 554 | 17.4 | |
| Qwen3.5-35B MoE Q8_0 | q8_0/q8_0 | 2,916 | 90.4 | ✅ +18% improved |
| Qwen3.5-35B MoE Q8_0 | q8_0/turbo4 | 2,859 | 78.2 | ✅ +12% improved |

M5 Max — PPL

| Model | Config | PPL | Status |
|---|---|---|---|
| Qwen2.5-1.5B | q8_0/q8_0 | 10.31 | ✅ matches known-good |
| Qwen2.5-1.5B | q8_0/turbo4 | 10.42 | |
| Phi-4 14B | q8_0/q8_0 | 6.54 | ✅ matches known-good |
| Qwen3.5-27B | q8_0/q8_0 | 6.87 | ✅ matches known-good |
| Qwen3.5-35B MoE | q8_0/q8_0 | 6.53 | ✅ matches known-good |

Mac Mini M2 Pro — Speed

| Model | Config | pp512 (t/s) | tg128 (t/s) | Status |
|---|---|---|---|---|
| Qwen2.5-7B Q4_K_M | q8_0/q8_0 | 352 | 35.2 | |
| Qwen2.5-7B Q4_K_M | q8_0/turbo4 | 349 | 30.1 | |
| Qwen2.5-7B Q4_K_M | q8_0/turbo3 | 349 | 29.2 | |
| Qwen2.5-7B Q4_K_M | turbo3/turbo3 | 345 | 25.8 | |

❌ Regression: Phi-4 + turbo4 KV crash

ggml/src/ggml.c:3620: GGML_ASSERT(ggml_is_contiguous(a)) failed
in ggml_reshape_2d, called from build_attn_mha (line 1915)

This crash does NOT occur on the pre-sync base (PR #45 branch, commit 6c3e503). Phi-4 + turbo4 works fine there (30.9 t/s decode). The crash is introduced by one of the 106 upstream commits interacting with our TurboQuant KV code in the build_attn_mha flash attention path.

Needs investigation before merge. All other models and configs pass.
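
For whoever picks this up, a hedged pointer at the usual shape of a fix for this assert, not a confirmed diagnosis of the regression: ggml_reshape_* requires a contiguous source, and a non-contiguous view is normally materialized with ggml_cont first.

```cpp
#include "ggml.h"

// If an upstream change now hands build_attn_mha a non-contiguous
// view, forcing a dense copy before the reshape is the standard remedy.
struct ggml_tensor * reshape_safe(struct ggml_context * ctx,
                                  struct ggml_tensor * t,
                                  int64_t ne0, int64_t ne1) {
    if (!ggml_is_contiguous(t)) {
        t = ggml_cont(ctx, t); // copy into a dense layout
    }
    return ggml_reshape_2d(ctx, t, ne0, ne1);
}
```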

Improvements noted

  • Qwen 1.5B decode +8% (198 → 214 t/s)
  • MoE 35B decode +18% (76.6 → 90.4 t/s)

Likely from upstream Metal optimizations.

@TheTom TheTom merged commit 04eeabb into feature/turboquant-kv-cache Apr 2, 2026
37 of 75 checks passed