feat(sync): upstream master sync (42 commits) + Gemma4 MTP via PR #23398 vendor by marksverdhei · Pull Request #93 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-07T09:29:07Z

42-commit upstream master sync (0066404..upstream/master @ 465b1f0) + 27 commits from upstream PR ggml-org#23398 (Gemma4 MTP speculative decoding, am17an), with behavior-preserving DFlash adaptations for the post-sync API surface.

Why

Snoop-kube blocked deploying gemma-4-12B QAT + MTP on titan-llm: "unknown model architecture: gemma4-assistant". The fix is upstream PR ggml-org#23398, which has a hard transitive dependency on upstream PR ggml-org#23913 (input_tokens API). So the full master sync was the path.

Key commits in this PR

chore(sync): merge upstream master into ht (42 commits, 006640408..465b1f0e7) — the master sync itself
chore(sync): adapt DFlash to hparams.n_layer() method post-#24060 — the [A4] hparams refactor adaptation (3 callsites in dflash.cpp)
27 cherry-picked commits from PR ggml-org/llama.cpp#23398, "llama : add Gemma4 MTP" (gemma4-mtp), each preserving DFlash compatibility
chore(sync): drop intermediate llama_set_mtp_source call — matches PR's final ctx_other API

Audit + resolution decisions (Tier-A summary)

| Commit | Disposition |
|--------|-------------|
| `7c158fbb4` server : disable on-device spec checkpoints | accept upstream removal; keep our `if (!dflash)` guard |
| `6f3a9f3de` server: avoid unnecessary checkpoint restore | orthogonal upstream fix; applied |
| `7acb4e8cd` hparams : refactor `hparams.n_layer` (129 files) | adapted DFlash to `n_layer()` method; DFlash has no nextn layers so `n_layer() == n_layer_all` |
| `2154a0fdc` CUDA: enroll mul_mat_vec_q_moe into pdl | orthogonal upstream perf win for MoE MTP; applied |
| `6effcecd0` TP: round up granularity to 128 | orthogonal; TurboQ block size 128 already matches |
| `f5c6ae182` mtmd, server: input_tokens API (ggml-org#23913) | the hard dep that gates PR ggml-org#23398; vendored as part of sync |
| `64086f2b2` model, mtmd: Granite4 Vision | new arch, isolated to own files; applied |
| `e8023568d` convert: Fix Gemma 4 Unified conversion | defensive checks; verified clean on Phase 6 |

Zero-regression gate — 5-axis Phase 6 results on titan

Image: unified-llm:mtp-pr23398-5e6dff22 (manifest digest sha256:db99e23f427630d6ba21e6d0e357e7372fdd6adb2d74c62f337d7f36c4ba33b9).

Perf parity — decode IMPROVED: gemma-4-12b +9.3%, gemma-4-12b-qat +13.0% (kernel commits). iq4 / Q5_K flat. 5 sub-2%-band ttft deltas (+0.7–1.2%, prefill) accepted by Markus as the trade for the decode wins.
MTP works — gemma4-assistant loads, 100.77 tok/s = 1.66× over qat baseline 60.80, draft accept rate 0.640 (vs am17an's reference ~0.66).
DFlash still LOADS — no worse than b0daec5; same pre-existing GGML_ASSERT(n_outputs_max) crash-on-decode (incident m-20260527), assert line moved 2343 → 2354 with the sync.
Correctness binary zero-tolerance — --tokendiff on both content and reasoning_content fields shows BYTE-IDENTICAL temp-0 token streams across iq4 / 12b / qat / Q5_K. Zero CONTENT flips, zero reasoning flips. Independently verified by big-dog on the raw tsv.
No new wedge-risk: iq4 (CUDA0) + MTP (CUDA1) coexist on titan, no OOM. The pre-existing router wedge (issue #66, "Router co-loads same-device-pinned models and OOMs: --models-max fit decision ignores per-device VRAM") isn't worsened by the added MTP preset.
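The 1.66× figure in the MTP bullet is the ratio of the two throughput numbers quoted there. A minimal check, with variable names chosen here purely for illustration:

```python
# Throughput figures quoted in the Phase 6 results (tokens per second).
mtp_tok_s = 100.77       # gemma-4-12b-qat with the gemma4-assistant MTP draft
baseline_tok_s = 60.80   # the same qat model without MTP

speedup = mtp_tok_s / baseline_tok_s
print(f"MTP speedup over qat baseline: {speedup:.2f}x")
```

The draft accept rate (0.640) is reported separately and is not needed for this ratio.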

Post-merge follow-ups (non-blocking)

Warning at MTP load, `[spec] failed to measure draft model memory: failed to create llama_context from model`: non-blocking (the model loads and runs fine), but worth polishing upstream in am17an's PR.
Issue #66, "Router co-loads same-device-pinned models and OOMs: --models-max fit decision ignores per-device VRAM" (per-GPU router fit): the Phase 0 baseline surfaced this as wedge-prone on b0daec5 too (an unpinned Q5_K_M dual-card spread can OOM-block the workhorse). Reread priority after the γ merge.
DFlash crash-on-decode (incident m-20260527) — separate from γ, still open.

…ml-org#23974)

The XCFramework generated by build-xcframework.sh creates a module map that manually lists public headers. That list can fall out of sync with the framework's Headers directory. The module map is currently missing ggml-opt.h, which is present in the framework headers. This can cause downstream Apple builds to fail with:

Include of non-modular header inside framework module 'llama'

Use the framework's Headers directory itself as the module map umbrella instead of maintaining a manual header list. This makes all public headers under the generated framework's Headers directory part of the llama module.

…ggml-org#24065) * webui: fix tool selector toggle/counter, key tools by stable identity Key the disabled set, counts and toggles by a stable per-tool key instead of bare function name, deduped from one canonical list. Per-tool checkboxes become presentational (single row handler, no nested button), category checkboxes drop the tristate (n/total carries partial). One getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name. * ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity

* agents: refactor, include more guidelines * better example * rephrase a bit * add more examples * nits

…ent (ggml-org#24110) * server: avoid unnecessary checkpoint restore when new tokens are present The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for logits when no new tokens exist. However, when the request contains new tokens beyond the cached prefix, this -1 is overly conservative and may trigger an unnecessary checkpoint restore. Conditionally apply the -1 only when n_past >= task.n_tokens() (no new tokens), avoiding redundant KV state restoration when there is actual work to do. * cont : add ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
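The rule described in this commit message can be sketched as follows. This is a simplified model, not the server code: the function name and signature are hypothetical, and it ignores any other offsets the real `pos_min_thold` computation may apply.

```python
def pos_min_thold(n_past: int, n_task_tokens: int) -> int:
    """Hypothetical, simplified version of the threshold rule.

    Subtract 1 only when the cached prefix already covers the whole
    request, so that at least one token is re-evaluated for logits.
    """
    no_new_tokens = n_past >= n_task_tokens
    return max(0, n_past - (1 if no_new_tokens else 0))


# Cached prefix covers the full request: step back one position.
print(pos_min_thold(10, 10))
# Request has 4 new tokens beyond the cache: no step back needed.
print(pos_min_thold(10, 14))
```

In this sketch, keeping the threshold at `n_past` in the second case stands for the situation where the commit avoids an unnecessary checkpoint restore.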

)

* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128

Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so non-wasm builds are completely unaffected.

Approach:
- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane

Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32, 200k iterations):

| impl   | ns/call | speedup |
|--------|---------|---------|
| scalar | 880.7   | 1.00x   |
| simd   | 257.8   | 3.42x   |

Correctness verified against scalar reference across 10 random seeds with exact output match.

* ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend

Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c. Move for loop in the else block.

* ggml: use generic q4_1_q8_1 fallback in wasm backend
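For readers unfamiliar with the quantized dot product being vectorized, here is a scalar sketch of the AND/SHR nibble unpacking and the per-block accumulation. It is an illustration, not the ggml source: the class and function names are hypothetical, scales are plain floats rather than fp16, and the layout (low nibbles hold elements 0..15, high nibbles hold elements 16..31) is an assumption about the q4_1 block format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockQ4_1:
    d: float      # per-block scale
    m: float      # per-block minimum (offset)
    qs: bytes     # 16 bytes, two 4-bit weights per byte


@dataclass
class BlockQ8_1:
    d: float        # per-block scale
    s: float        # precomputed d * sum(qs)
    qs: List[int]   # 32 signed 8-bit values


def unpack_nibbles(qs: bytes) -> List[int]:
    lo = [b & 0x0F for b in qs]   # AND mask: assumed elements 0..15
    hi = [b >> 4 for b in qs]     # shift right: assumed elements 16..31
    return lo + hi


def vec_dot_q4_1_q8_1(x: List[BlockQ4_1], y: List[BlockQ8_1]) -> float:
    total = 0.0
    for bx, by in zip(x, y):
        w = unpack_nibbles(bx.qs)
        sumi = sum(wi * yi for wi, yi in zip(w, by.qs))  # integer dot product
        # Scale term plus the offset term carried by the q4_1 minimum.
        total += (bx.d * by.d) * sumi + bx.m * by.s
    return total


x = [BlockQ4_1(d=0.5, m=1.0, qs=bytes([0x21] * 16))]
y = [BlockQ8_1(d=2.0, s=2.0 * 32, qs=[1] * 32)]
print(vec_dot_q4_1_q8_1(x, y))
```

The SIMD version described above performs the same unpack and multiply-accumulate, but on 16 lanes at a time after widening to i16.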

* Fix Gemma 4 Unified conversion * Set audio hidden size to audio_embed_dim

Co-authored-by: lvyichen <lvyichen@stepfun.com>

* webui: added single line reasoning preview. * patch: reduce width slightly for the previewing section * refactor: move formatter constants to the right file * feat: reimplement reasoning preview with throttled dynamic per-line rendering * chore: fix spacing Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: refactor to requested changes * refactor: grouped by capture pattern instead of block-level + inline * ui: fix interrupt state only trigger for 1st reasoning message * chore: make reasoning preview respect showThoughtInProgress setting * chore: newline at EOF Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: thread rawContent so collapsible content can handle compute preview * patch: showThoughtInProgress accidentally blocks rawContent being passed * chore: fix lint * chore: change smoke test --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore(ui): pin package versions to currently installed - Update all dependencies and devDependencies to match exactly what's in package-lock.json - This ensures reproducible builds by locking to specific versions rather than semver ranges * chore: Update packages * chore: Move remaining dependencies to devDependencies * fix: Add missing `mermaid` package * chore: Update `cookie` package to `v1.1.1` * chore: Formatting * test: Update test configs

…gml-org#22445) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata

…debar (ggml-org#23132) * use child snippets for landing and chat message elements * make ... icon visible in conversation history menu * conversation history forward tab fix * add snippet fix for fork icon in conversation history * focus/keyboard fix for attachment x icon and scroll left/right * formatting * fix scroll down issue * simplify Statistics and pointer events in scrolldown * create storybook tests and move to folder * improve tests to actually assert on element

mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures. ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.

This PR attempts to slim down the dependencies for build-msys jobs making the same changes that we applied in whisper.cpp to reduce the size of the github actions cache, and should also improve the run time due to fewer dependencies that need to be installed. I realize this is a scheduled job but I think it would still make sense to apply these changes. Refs: ggml-org/whisper.cpp#3858

* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```

After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
  -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  -ngl all \
  -fa on \
  --host 0.0.0.0 \
  --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels
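To read the before/after numbers in this commit message as relative gains, a small script like the following can be used. The tok/s values are copied from the quoted mtp-bench.py output; the variable names are illustrative only.

```python
# (prompt, tok/s before, tok/s after), copied from the quoted benchmark output.
runs = [
    ("code_python", 202.8, 211.9),
    ("code_cpp", 212.8, 224.6),
    ("explain_concept", 196.4, 207.8),
    ("summarize", 226.6, 240.2),
    ("qa_factual", 225.1, 238.5),
    ("translation", 201.5, 213.4),
    ("creative_short", 197.2, 208.8),
    ("stepwise_math", 209.2, 221.7),
    ("long_code_review", 208.9, 220.7),
]

# Relative throughput gain per prompt category.
gains = {name: after / before - 1.0 for name, before, after in runs}

for name, gain in gains.items():
    print(f"{name:18s} {gain * 100:+.1f}%")

mean_gain = sum(gains.values()) / len(gains)
print(f"mean gain: {mean_gain * 100:+.1f}%")
```

The draft, accepted, and rate columns are identical in both runs, so the throughput difference comes from kernel execution rather than from any change in drafting behavior.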

…-org#23819)

* hparams : refactor hparams.n_layer * cont : remove `n_layer_kv()`, use n_layer_all instead * cont : type consistency * pi : update SYSTEM.md * models : fix Step3.5 MTP * cont : remove duplicate switch cases * cont : explicitly set `false` to extra layers for `is_swa` and `is_recr` * cont : fix nextn layer count handling Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update quantization readme * install requirements * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * dos2unix suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link

…ggml-org#24171)

Fixes ggml-org#23847

* TP: round up granularity to 128 * remove assert

* feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmatically instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model. 
for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiple multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offsets array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (ggml-org#16780), but it was never actually used. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. 
I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. 
A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise encoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method. 
Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. 
Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Clean up comments
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Tighter and more flexible code for g4v_build_block
  This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary `unordered_set` include
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add architecture guard on deepstack_mapping_arr printout
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary AI-gen comment
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always initialize deepstack_mapping_arr with -1 values
  This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove TODO about block/vs non-block tensor mapping
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move is_vision_feature_layer logic into clip_hparams
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use a bool for append_token
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove unnecessary comment
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused get_model api
  yikes!
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rearrange helpers for g4v to be private members and use build_attn
  Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one in vision layer index
  This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix norm/post_norm mixup in conversion
  face. palm. :(
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: More descriptive tensor names
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Apply PR cleanup for new conversion changes
  AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: append_token -> add_newline
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment cleanup
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Cleaner error handling/checking
  NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

The first PR ggml-org#23398 commit added an `llama_set_mtp_source(ctx_dft, ctx_tgt)` call after `llama_init_from_model`. Later cleanup commits in the same PR removed that API and moved the wiring to `cparams.ctx_other = ctx_tgt`, set BEFORE init. Our keep-both resolution carried the intermediate call forward; this commit drops it to match the PR's final API. It removes one use of a removed symbol with no behavior change (the rebased `cparams.ctx_other` assignment is what is actually used).
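The ordering change described above (wire the target context through the params before init, rather than calling a setter after init) can be sketched with stand-in types. This is a minimal illustration only: `mock_context`, `mock_context_params`, `mock_init` and `make_draft` are hypothetical names invented here and are not the llama.cpp API. Only the field name `ctx_other` is taken from the commit message.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; NOT the llama.cpp types.
struct mock_context;

struct mock_context_params {
    // Mirrors the `ctx_other` field named in the commit message.
    mock_context * ctx_other = nullptr;
};

struct mock_context {
    // Peer context captured once, at construction time.
    mock_context * other = nullptr;
};

// Stand-in for context creation: the wiring is read from the params
// during init, so no setter call is needed afterwards.
inline mock_context mock_init(const mock_context_params & cparams) {
    mock_context ctx;
    ctx.other = cparams.ctx_other;
    return ctx;
}

// Final-API ordering: assign ctx_other BEFORE creating the draft context.
inline mock_context make_draft(mock_context & ctx_tgt) {
    mock_context_params cparams;
    cparams.ctx_other = &ctx_tgt;
    return mock_init(cparams);
}
```

Under these assumptions, a caller creates the target context first and then calls `make_draft(ctx_tgt)`. The returned draft context already points at the target, which is why a leftover post-init setter call would be redundant.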

gmarzjr and others added 30 commits June 4, 2026 12:58

agents: refactor, include more guidelines (ggml-org#24111)

a121232

* agents: refactor, include more guidelines
* better example
* rephrase a bit
* add more examples
* nits

convert: Fix Gemma 4 Unified conversion (ggml-org#24118)

e802356

* Fix Gemma 4 Unified conversion
* Set audio hidden size to audio_embed_dim

return filter to save memory (ggml-org#24125)

0dbfa66

Co-authored-by: lvyichen <lvyichen@stepfun.com>

Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)

e7bcf1c

* Deduplicate imatrix loading code
* Add back LLAMA_TRACE, early exit on quantize missing metadata

arg: fix double mtp downloads (ggml-org#24128)

260862b

server : disable on-device spec checkpoints (ggml-org#24108)

7c158fb

kleidiai : dynamic chunck-based scheduling for hybrid execution (ggml-org#23819)

3ecfb15

minor : fix lint issues (ggml-org#24165)

59917d3

ui: add ignore-scripts=true to npmrc (ggml-org#24149)

cc7bef3

Fix link to available UI settings (ggml-org#24169)

9c955c4

The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link.

ui: run npm install when package-lock.json is newer than node_modules (ggml-org#24171)

2016bf2

model : fix llama_model::n_gpu_layers() (ggml-org#24188)

96fbe00

cli: fix model params not propagated (ggml-org#23893)

86591c7

Fixes ggml-org#23847

TP: round up granularity to 128 (ggml-org#24180)

6effcec

* TP: round up granularity to 128
* remove assert

model: fix build failed (ggml-org#24193)

c4a278d

vulkan: add fwht support for Intel with shmem reduction (ggml-org#23964)

e82beaa

* vulkan: add fwht support for Intel with shmem reduction
* don't use N as workgroup size
* disable subgroup shuffle on MoltenVK AMD
* disable fwht shader on Intel Windows due to driver bug

common/chat : unify and fix LFM2/LFM2.5 tool parser (ggml-org#24178)

da87e9b

am17an and others added 23 commits June 7, 2026 13:46

add exception in test-llama-archs

5edc87f

move assistant to separate file

571a9dd

add unified assistant

b300965

cont : adjust to hparams changes

bcaf30d

cont : avoid computations on the CPU

57a2246

cont : clean-up

93aa400

cont : clean-up

89f00b7

cont : fix handling of unused tensors

5af09f1

cont : fix undefined

1df52f7

fix typo

86ef699

cont : enable gemma4 graph reuse

4278550

cont : fix assert

05e89f8

cont : fix quantized cache

a66b027

cont : fix names

7e2848a

cont : fix names

b00c1d6

cont : add reference for draft positions

bf67004

cont : fix multi-modality

96a14a9

cont : add comment about ctx_src

e10ad04

cont : clean-up server fit logic

024ac5f

cont : clean-up llama_context

6caeb6a

py : fix names

e41c9b0

cont : rename ctx_src -> ctx_other

0f2f35a

marksverdhei force-pushed the feat/gemma4-mtp-vendor branch from d1c651c to 5e6dff2 Compare June 7, 2026 11:54

marksverdhei changed the title ~~[PARKED] PR #23398 gemma4-MTP integration tracker — DO NOT MERGE~~ feat(sync): upstream master sync (42 commits) + Gemma4 MTP via PR #23398 vendor Jun 7, 2026

marksverdhei marked this pull request as ready for review June 7, 2026 13:41

marksverdhei merged commit 4c09765 into ht Jun 7, 2026
3 of 16 checks passed

marksverdhei deleted the feat/gemma4-mtp-vendor branch June 7, 2026 13:42

This was referenced Jun 7, 2026

chore(server): quiet benign Gemma4-Assistant memory-probe warning on MTP load #94

Merged

docs(readme): inventory DFlash + Gemma4 MTP under HT Fork Changes #96

Open

marksverdhei merged 72 commits into ht from feat/gemma4-mtp-vendor

20 participants
