Xsn/mtmd placeholder chunks by ngxson · Pull Request #106 · ngxson/llama.cpp

ngxson · 2026-05-30T16:37:07Z

For AI review

Summary by CodeRabbit

New Features
- Added OpenAI-compatible token-counting endpoints for chat completions and responses; router mode now proxies these.
Documentation
- Added docs and examples for the new token-counting endpoints.
Tests
- Added unit tests covering token-counting for chat and vision flows.
Bug Fixes / Improvements
- Media handling refactored to encapsulate bitmaps/images with placeholder-aware sizing, safe buffer access, and robust preprocessing/normalization.
Breaking Changes
- Removed several legacy raw-pixel and memory-sizing helpers and updated bitmap helper signatures to accept a placeholder flag.

coderabbitai · 2026-05-30T16:37:25Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


📝 Walkthrough

Refactors CLIP/MTMD image and bitmap types to encapsulated accessors with placeholder support; updates image preprocessing, CLIP usage across vision graphs, media ingestion, removes legacy CLIP helpers, and adds server-side token-counting endpoints with tests and documentation.

Changes

Image Container Encapsulation and CLIP API Update

Layer / File(s)	Summary
Image container refactor `tools/mtmd/clip-impl.h`, `tools/mtmd/clip.h`	`clip_image_u8` and `clip_image_f32` move from public fields to private storage with accessor/mutator methods including placeholder detection, size queries, pixel get/set, conversion, normalization, and `clip_image_size::operator==`.
CLIP API removals and public function updates `tools/mtmd/clip.h`, `tools/mtmd/clip.cpp`	Removes deprecated functions (`clip_embd_nbytes`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`); updates debug writers and conversion helpers to use accessor-based APIs.
Core CLIP function and image handling updates `tools/mtmd/clip.cpp`	Debug PPM/BMP writers, f32→u8 conversion, vision patch-count/position-embedding math, raw input tensor creation, warmup sizing, and batch encode paths now use `get_size()/nx()/ny()/get_ro_buf()/get_pixel()` APIs.

Media Bitmap and Tokenization Refactoring

Layer / File(s)	Summary
Bitmap class refactoring and placeholder support `tools/mtmd/mtmd.h`, `tools/mtmd/mtmd.cpp`, `tools/mtmd/mtmd-helper.h`, `tools/mtmd/mtmd-helper.cpp`	`mtmd_bitmap` becomes an initialized container that copies input data, exposes `get_ro_buf()`, `is_placeholder()`, and `n_bytes()`; helper initializers gain a `bool placeholder` parameter.
Token object placeholder detection and initialization `tools/mtmd/mtmd.cpp`	`mtmd_image_tokens` and `mtmd_audio_tokens` add `is_placeholder()` helpers; `mtmd_encode_chunk` and `mtmd_encode` reject null or placeholder batches.
Image/audio preprocessing and media encoding `tools/mtmd/mtmd.cpp`	Image/audio ingestion now validates bitmap dimensions, uses `set_size()`/`cpy_buf()` and `get_ro_buf()` for population, and marks mel/image buffers as placeholders when appropriate; debug helpers updated accordingly.
Image preprocessing tool refactoring `tools/mtmd/mtmd-image.cpp`	`img_u8_to_f32`, `resize`, `crop`, `composite`, `fill`, and resizing algorithms refactored to use `get_size()`, `get_pixel()`, `set_pixel()`, `cpy_buf()` and to handle placeholders; various preprocessors updated to use the accessor API.

Vision Graph and Model Updates

Layer / File(s)	Summary
Vision graph and batch encoding refactor `tools/mtmd/clip.cpp`	Patch-count, token-count, projector math, and `clip_image_batch_encode` vision/audio staging updated to read per-entry `nx()/ny()` and `get_ro_buf()`; conversion and sizing use `set_size()`/`cpy_buf()`.
Model graph builder dimension accessor updates `tools/mtmd/models/*`	Vision model graph builders (conformer, glm4v, granite-speech, kimik25, mimovl, qwen2vl, qwen3vl, whisper-enc) updated to call `img.nx()`/`img.ny()` instead of reading public fields.
CLI media loading updates `tools/mtmd/mtmd-cli.cpp`	`mtmd_cli_context::load_media` now calls `mtmd_helper_bitmap_init_from_file(..., false)` with explicit placeholder argument.

Server Token Counting Endpoints

Layer / File(s)	Summary
Token counting route handlers and implementation `tools/server/server-context.cpp`, `tools/server/server-context.h`	Adds `post_chat_completions_tok` and `post_responses_tok_oai` handlers; implements `handle_count_tokens()` to parse requests, convert payloads (OAI/Anthropic/Responses), extract prompts, and compute `input_tokens` via MTMD or fallback tokenization.
Server endpoint registration and wiring `tools/server/server.cpp`	Registers `POST /chat/completions/input_tokens`, `POST /responses/input_tokens` (and /v1 variants), wires proxy routes in router mode, and groups token-counting routes.
Token counting helper and documentation `tools/server/server-common.h`, `tools/server/server-common.cpp`, `tools/server/README.md`	`process_mtmd_prompt` now takes `const` references and an optional `is_placeholder` flag; README documents the new OpenAI-compatible token-counting endpoints and example responses.
Unit tests for token counting endpoints `tools/server/tests/unit/test_chat_completion.py`, `tools/server/tests/unit/test_vision_api.py`	New tests exercise `/chat/completions/input_tokens` for text-only and text+image payloads and assert successful responses with non-trivial token counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through pixels, hid widths and heights away,

Accessors whisper where the buffers play,
Placeholders nap until token counts call,
Routes listen closely and tests check them all,
A tiny rabbit applauds this tidy refactor day.

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is minimal ('For AI review') and lacks the required template sections (Overview, Additional information, Requirements). It does not explain what the PR accomplishes or why the changes are necessary.	Complete the PR description following the template: add an Overview section explaining the purpose, provide Additional information linking to upstream PR `#23913`, and confirm the Requirements checklist.
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.30% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Xsn/mtmd placeholder chunks' is vague and does not clearly summarize the main changes, which include refactoring image/audio handling with encapsulation, API deprecations, and new token-counting endpoints.	Use a more descriptive title that clearly indicates the primary change, such as 'Refactor CLIP image/audio handling with encapsulation and add token-counting endpoints'.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mtmd/clip-impl.h (1)
7-15: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make tools/mtmd/clip-impl.h self-contained.

tools/mtmd/clip-impl.h throws std::runtime_error (435-437, 544-546) but doesn’t include <stdexcept>. tools/mtmd/clip.cpp includes clip-impl.h before its own <stdexcept>, so the build currently depends on transitive include order rather than the header.
Proposed fix

 #include <array>
 #include <climits>
+#include <stdexcept>
 #include <cstdarg>
 #include <cinttypes>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/clip-impl.h` around lines 7 - 15, The header
tools/mtmd/clip-impl.h is not self-contained because it throws
std::runtime_error in functions that raise exceptions (see uses around the throw
sites), but doesn't include <stdexcept>; add `#include` <stdexcept> at the top of
clip-impl.h so the declarations that use or throw std::runtime_error compile
without relying on transitive includes (ensure the include sits with the other
standard headers already present).

🧹 Nitpick comments (2)

tools/server/tests/unit/test_vision_api.py (1)

101-117: ⚡ Quick win

Verify image content actually affects token counting.

input_tokens > 10 is a weak proxy. Add a text-only baseline and assert the multimodal request counts more tokens.

Proposed assertion upgrade

 def test_vision_chat_completion_token_count():
     global server
     server.start()
-    res = server.make_request("POST", "/chat/completions/input_tokens", data={
+    res = server.make_request("POST", "/chat/completions/input_tokens", data={
         "temperature": 0.0,
         "top_k": 1,
         "messages": [
             {"role": "user", "content": [
                 {"type": "text", "text": "What is this:"},
                 {"type": "image_url", "image_url": {
                     "url": get_img_url("IMG_URL_0"),
                 }},
             ]},
         ],
     })
     assert res.status_code == 200
+    assert res.body["object"] == "response.input_tokens"
     assert res.body["input_tokens"] > 10
+
+    text_only = server.make_request("POST", "/chat/completions/input_tokens", data={
+        "messages": [{"role": "user", "content": "What is this:"}],
+    })
+    assert text_only.status_code == 200
+    assert res.body["input_tokens"] > text_only.body["input_tokens"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_vision_api.py` around lines 101 - 117, The test
test_vision_chat_completion_token_count currently only asserts
res.body["input_tokens"] > 10; add a text-only baseline request using
server.make_request to the same "/chat/completions/input_tokens" endpoint with
an equivalent messages payload that contains only the text part (e.g.,
{"role":"user","content":[{"type":"text","text":"What is this:"}]}) and capture
its input_tokens, then assert the multimodal response's input_tokens is greater
than the text-only baseline (res.body["input_tokens"] >
baseline["input_tokens"]). Ensure you reuse the same request parameters
(temperature, top_k) and message ordering so the only difference is the
image_url content.

tools/server/tests/unit/test_chat_completion.py (1)

578-592: ⚡ Quick win

Strengthen token-count contract assertions.

This test currently only checks status and a loose lower bound. It should also validate the response discriminator and deterministic count across identical requests.

Proposed test hardening

 def test_chat_completions_token_count():
     global server
     server.start()
-    # make sure cache can be reused across multiple choices and multiple requests
-    # ref: https://github.com/ggml-org/llama.cpp/pull/18663
-    for _ in range(2):
+    counts = []
+    for _ in range(2):
         res = server.make_request("POST", "/chat/completions/input_tokens", data={
             "messages": [
                 {"role": "system", "content": "Book"},
                 {"role": "user", "content": "What is the best book"},
             ],
         })
         assert res.status_code == 200
-        assert res.body["input_tokens"] > 5
+        assert res.body["object"] == "response.input_tokens"
+        assert isinstance(res.body["input_tokens"], int)
+        assert res.body["input_tokens"] > 5
+        counts.append(res.body["input_tokens"])
+    assert counts[0] == counts[1]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_chat_completion.py` around lines 578 - 592, The
test test_chat_completions_token_count only asserts status and a loose lower
bound; update it to also assert the response discriminator and deterministic
token counts: after calling server.make_request("POST",
"/chat/completions/input_tokens", ...) verify res.body contains a discriminator
(e.g., res.body["discriminator"] == "chat.completion" or the expected
discriminator key/value used by the API) and capture res.body["input_tokens"] on
the first request then assert on the second identical request that
res.body["input_tokens"] equals the first captured value (in addition to the
existing > 5 check) to ensure deterministic counts across identical requests.
Ensure asserts reference the test function name
test_chat_completions_token_count and the server.make_request response fields
res.body["input_tokens"] and res.body["discriminator"] (or the API's exact
discriminator key).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mtmd/clip.cpp`:
- Around line 3432-3439: The GGML_ASSERT is comparing bytes to element count:
change the assertion to check element counts (use GGML_ASSERT(n_step * n_mel ==
buf.size());) because mel_inp->get_ro_buf() returns a std::vector<float> where
buf.size() is number of floats; keep the memcpy as-is (it should still copy
n_step * n_mel * sizeof(float) bytes into inp_raw).

In `@tools/server/server-context.cpp`:
- Around line 4838-4843: The token-counting path for /input_tokens currently
calls process_mtmd_prompt(mctx, prompt.get<std::string>(), files) which runs
full MTMD preprocessing; change this to the placeholder-mode path so counting
uses cheap placeholder chunks (e.g., call a placeholder variant such as
process_mtmd_prompt_placeholder(mctx, prompt.get<std::string>(), files) or add a
boolean flag to process_mtmd_prompt like process_mtmd_prompt(mctx,
prompt.get<std::string>(), files, /*placeholder=*/true)) so that when mctx is
non-null in the /input_tokens flow you compute n_tokens from the
placeholder-mode result instead of performing full preprocessing.

In `@tools/server/server.cpp`:
- Around line 192-193: Duplicate POST route registration for
"/responses/input_tokens" exists: locate the ctx_http.post calls that reference
routes.post_responses_tok_oai (the two entries registering
"/responses/input_tokens") and remove the redundant registration so only a
single ctx_http.post("/responses/input_tokens",
ex_wrapper(routes.post_responses_tok_oai)) remains; also scan the nearby block
(the other occurrence around the 207-211 region) to ensure no other duplicate
registrations remain and consolidate them to a single registration to avoid
route ambiguity.
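The first inline comment above (the `GGML_ASSERT` in `tools/mtmd/clip.cpp`) turns on the difference between an element count and a byte count. The following is a minimal sketch of that distinction, not the code from the PR: `copy_mel` and its parameters are hypothetical names chosen for illustration, and only `std::vector::size()` and `std::memcpy` are real library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the mel staging step; not the real clip.cpp code.
// Returns false when buf does not hold exactly n_step * n_mel floats.
bool copy_mel(const std::vector<float> & buf,
              std::size_t n_step, std::size_t n_mel,
              std::vector<float> & dst) {
    // std::vector::size() counts elements, so the check compares against
    // the element count, not n_step * n_mel * sizeof(float).
    if (buf.size() != n_step * n_mel) {
        return false;
    }
    dst.resize(buf.size());
    // std::memcpy takes a byte count, so sizeof(float) belongs here.
    std::memcpy(dst.data(), buf.data(), buf.size() * sizeof(float));
    assert(dst.size() == n_step * n_mel);
    return true;
}
```

Under these assumptions, a 2 x 3 mel buffer of six floats passes the check, while the same buffer checked against 2 x 4 is rejected. Comparing `buf.size()` against a byte count would instead reject every correctly sized buffer.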



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abc7b4b0-c5df-4e97-862f-f86950f9c5c9

📥 Commits

Reviewing files that changed from the base of the PR and between d38d50e and 447e418.

📒 Files selected for processing (25)

tools/mtmd/clip-impl.h
tools/mtmd/clip.cpp
tools/mtmd/clip.h
tools/mtmd/models/conformer.cpp
tools/mtmd/models/glm4v.cpp
tools/mtmd/models/granite-speech.cpp
tools/mtmd/models/kimik25.cpp
tools/mtmd/models/mimovl.cpp
tools/mtmd/models/qwen2vl.cpp
tools/mtmd/models/qwen3vl.cpp
tools/mtmd/models/whisper-enc.cpp
tools/mtmd/mtmd-cli.cpp
tools/mtmd/mtmd-helper.cpp
tools/mtmd/mtmd-helper.h
tools/mtmd/mtmd-image.cpp
tools/mtmd/mtmd.cpp
tools/mtmd/mtmd.h
tools/server/README.md
tools/server/server-common.cpp
tools/server/server-common.h
tools/server/server-context.cpp
tools/server/server-context.h
tools/server/server.cpp
tools/server/tests/unit/test_chat_completion.py
tools/server/tests/unit/test_vision_api.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mtmd/mtmd-image.cpp (1)
1265-1270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate the placeholder flag into dst.

Line 1265 hardcodes dst as non-placeholder, but the function immediately returns on placeholder inputs. That breaks the new placeholder flow for the Step3VL preprocessing path by leaving dst shaped like a real image without populated pixels. Pass src.is_placeholder() through when sizing dst instead of hardcoding false.
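The idea behind this comment can be sketched with a simplified container. Everything below is an assumption made for illustration: `image_u8`, its two-dimension `set_size` signature, and `resize_to` are hypothetical and do not match the real `clip_image_u8` API, which the comment shows taking a size struct and two flags. The sketch only demonstrates passing the source's placeholder state through to the destination instead of hardcoding `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified RGB image container; not the real clip_image_u8.
struct image_u8 {
    int nx = 0;
    int ny = 0;
    bool placeholder = false;
    std::vector<std::uint8_t> buf;

    // A placeholder keeps its dimensions but allocates no pixel storage.
    void set_size(int w, int h, bool is_ph) {
        nx = w;
        ny = h;
        placeholder = is_ph;
        buf.assign(is_ph ? 0 : static_cast<std::size_t>(w) * h * 3, 0);
    }
    bool is_placeholder() const { return placeholder; }
};

// Resize sketch: dst inherits the placeholder state of src.
void resize_to(const image_u8 & src, image_u8 & dst, int target_w, int target_h) {
    assert(target_w > 0 && target_h > 0);
    dst.set_size(target_w, target_h, src.is_placeholder());
    if (src.is_placeholder() || src.buf.empty()) {
        return; // nothing to sample; dst only carries the target dimensions
    }
    // Nearest-neighbour sampling, enough to show the non-placeholder path.
    for (int y = 0; y < target_h; ++y) {
        for (int x = 0; x < target_w; ++x) {
            const int sx = x * src.nx / target_w;
            const int sy = y * src.ny / target_h;
            for (int c = 0; c < 3; ++c) {
                dst.buf[(static_cast<std::size_t>(y) * target_w + x) * 3 + c] =
                    src.buf[(static_cast<std::size_t>(sy) * src.nx + sx) * 3 + c];
            }
        }
    }
}
```

With this shape, a placeholder source yields a placeholder destination that reports the target dimensions and owns no pixels, which is the state a token-counting path would need. A real source still produces a fully populated buffer.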
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/mtmd-image.cpp` around lines 1265 - 1270, The dst image is always
being created as non-placeholder by calling dst.set_size({target_width,
target_height}, false, false) even when src.is_placeholder() is true; change the
call in the function that handles Step3VL preprocessing so that the placeholder
flag is propagated (pass src.is_placeholder() as the placeholder argument to
dst.set_size) instead of hardcoding false, ensuring dst retains the placeholder
state when src.is_placeholder() returns true.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c10f5ec2-d209-4a85-886d-8d9c18d2baff

📥 Commits

Reviewing files that changed from the base of the PR and between 447e418 and 8f67dfb.

📒 Files selected for processing (4)

tools/mtmd/clip-impl.h
tools/mtmd/clip.cpp
tools/mtmd/mtmd-image.cpp
tools/mtmd/mtmd.cpp

🚧 Files skipped from review as they are similar to previous changes (2)

tools/mtmd/clip-impl.h
tools/mtmd/mtmd.cpp

* qwen35: use post-norm hidden state for MTP
* rename pre_norm to nextn
* fix step35

* Tidy up SYCL doc a bit
  - Add explicit links to referenced items
  - Fix spelling errors
* Correct documented default for GGML_SYCL_GRAPH
  The default is ON, not OFF:
  $ cmake -LAH -B build | grep GGML_SYCL_GRAPH
  ...
  GGML_SYCL_GRAPH:BOOL=ON
* Move docker instructions from SYCL.md to docker.md
  This makes them directly accessible from the Quick Start section of the top-level README.md.
* Refer to intel.Dockerfile for ARGs and their defaults
  The defaults are always changing; this avoids accuracy errors from duplicating the information.
* Remove mention of Nvidia in SYCL row of backend table
  This support was removed in 2026.02 - refer to the SYCL.md News.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* ggml-cpu: add rvv 512b,1024b impls for iq4_xs
* ggml-cpu: refactor; add rvv 512b, 1024b impls for q6_K, i-quants
* ggml-cpu: refactor; add 512 and 1024 implementations of tq3_s, iq3_xxs, iq2_s, iq2_xs, iq2_xxs
  improve iq2_xs impl for rvv 256

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

…rt (#23834)
* Start work on flash_attn refactor
* Refactor
* Split k/v quantization
* Refactor and abstract quantization logic for flash_attn and mul_mat
* Add quantization support to tile path
* formatting
* Move to functions, add a check

* tests : refactor test-save-load-state to accept token input
  - Default prompt is now empty; when not provided, generate n_batch random tokens (useful for models without a tokenizer)
  - Tokenization happens once upfront; pass token vector to test functions
  - generate_tokens prints token IDs instead of decoded pieces
  - Use llama_model_get_vocab / llama_vocab_n_tokens API
  - Upgrade log level from LOG_TRC to LOG_INF for visibility
  Assisted-by: llama.cpp:local pi
* cont : use llama_tokens alias

* mtmd: handle Gemma 4 audio projector embedding size
* rm projection_dim from clip_n_mmproj_embd

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.

ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.

This PR attempts to slim down the dependencies for build-msys jobs making the same changes that we applied in whisper.cpp to reduce the size of the github actions cache, and should also improve the run time due to fewer dependencies that need to be installed. I realize this is a scheduled job but I think it would still make sense to apply these changes. Refs: ggml-org/whisper.cpp#3858

* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python       pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
code_cpp          pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
explain_concept   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
summarize         pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
qa_factual        pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
translation       pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
creative_short    pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
stepwise_math     pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
long_code_review  pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```

After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python       pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
code_cpp          pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
explain_concept   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
summarize         pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
qa_factual        pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
translation       pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
creative_short    pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
stepwise_math     pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
long_code_review  pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels

* hparams : refactor hparams.n_layer
* cont : remove `n_layer_kv()`, use n_layer_all instead
* cont : type consistency
* pi : update SYSTEM.md
* models : fix Step3.5 MTP
* cont : remove duplicate switch cases
* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`
* cont : fix nextn layer count handling

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update quantization readme
* install requirements
* Apply suggestions from code review
* dos2unix suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

The current link points to a non-existent file. I looked through the repo, found the file containing the UI configuration key, and updated the link.

…#24171)

Fixes #23847

* TP: round up granularity to 128
* remove assert

* feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new harams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmaticly instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model. 
for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiipule multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offests array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (#16780), but it was never actually used. 
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: n_deepstack_layers -> deepstack_layer_arr
  The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use try/catch for single/multi valued deepstack info
  The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add deepstack injection point for granite LLM
  The use of ggml_add here assumes that the elements of inp_embd will be pre-arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: add missing vision attn layernorm eps
  Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR
  It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add embedding scale and image grid pinpoints hparams in conversion
  Also remove dead parsing for self._deepstack_layer_arr
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add mtmd KEY_ section for hparams shared with the LLM
  In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Implement c++ hparam parsing
  Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Flatten pinpoints in conversion
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing break
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: No reason to have modality prefix for img_pos
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tensor loading
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the right portion of speech for tensor loading!
  Also plumb through the layernorm -> post_norm naming change
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add logging of deepstack_layers_arr if set
  I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Make sure input embeddings are cont before f_embedding_scale
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add init and mmproj_embd cases for g4v
  The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors
  Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Invert (h, w) -> (w, h) pinpoints
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Reorder projectors based on llm index and skip the first injection
  The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix mmproj hparams in conversion
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix ordering/logic for deepstack injection in granite
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix preprocessing config to match what the model needs
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* wip: Partial port of Eli's implementation
  This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still.
  I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate.
  Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale
  We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: invert embedding multiplier -> base_scale at load
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix setting image_resize_pad after new enum introduced
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add G4V to mmproj mapping in conversion
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Re-add padding disable for non-hybrid hybrid models
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Simplify G4V n_tokens computation
  This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add new clip APIs for post-tile-encoding assembly
  Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid.
  A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise encoding.
  Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add model interfaces for granite 4 vision assembler
  I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder).
  Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler
  Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(mtmd): Consolidate assembler logic into clip_assembler class family
  Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method.
  Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment improvement
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: granite_vision -> granite4_vision
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack
  These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Oops! I did not mean to commit one of my prompt files
  But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :(
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing <algorithm> include for std::find
  It seems that this was already pulled in on some platforms, but not on others
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix Flake8 warnings in granite conversion module
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove clip_assembler in favor of clip_image_f32.append_token
  Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image.
  Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)
  Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys
  Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix GGUF key for deepstack_layers_arr
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs
  This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*
  Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The downside is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert removal of "is_deepstack_layers" GGUF KV
  This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Clean up comments
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Tighter and more flexible code for g4v_build_block
  This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary `unordered_set` include
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add architecture guard on deepstack_mapping_arr printout
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary AI-gen comment
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always initialize deepstack_mapping_arr with -1 values
  This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove TODO about block/vs non-block tensor mapping
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move is_vision_feature_layer logic into clip_hparams
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use a bool for append_token
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove unnecessary comment
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused get_model api
  yikes!
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rearrange helpers for g4v to be private members and use build_attn
  Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one in vision layer index
  This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix norm/post_norm mixup in conversion
  face. palm.
  :(
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: More descriptive tensor names
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Apply PR cleanup for new conversion changes
  AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: append_token -> add_newline
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment cleanup
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Cleaner error handling/checking
  NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer.
  Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
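
The commit message above describes two ways a deepstack layer table can be populated (a single count covering the first N layers, or an explicit array that lets injection layers be spaced out) and a later fix that always initializes the table with -1. A minimal sketch of that idea follows. The function names, the 0-based slot indexing, and the use of `std::vector` are assumptions made for illustration; this is not the loader code from the branch, and later commits in the same message keep `n_deepstack_layers` and `deepstack_mapping_arr` as separate hparams.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not llama.cpp API). Entry i of the returned table is the
// deepstack slot injected at LLM layer i, or -1 when nothing is injected there.

// Single-valued form: the first n_deepstack layers map to slots 0..n_deepstack-1.
std::vector<int32_t> deepstack_mapping_from_count(std::size_t n_layer, std::size_t n_deepstack) {
    assert(n_deepstack <= n_layer);
    std::vector<int32_t> mapping(n_layer, -1); // always initialized, never left undefined
    for (std::size_t i = 0; i < n_deepstack; ++i) {
        mapping[i] = static_cast<int32_t>(i);
    }
    return mapping;
}

// Multi-valued form: the stored array is copied as-is, so injection points
// can sit anywhere in the model rather than only in the first layers.
std::vector<int32_t> deepstack_mapping_from_array(std::size_t n_layer,
                                                  const std::vector<int32_t> & stored) {
    assert(stored.size() <= n_layer);
    std::vector<int32_t> mapping(n_layer, -1); // layers past the stored array stay at -1
    for (std::size_t i = 0; i < stored.size(); ++i) {
        mapping[i] = stored[i];
    }
    return mapping;
}
```

With four layers, `deepstack_mapping_from_count(4, 2)` fills only the first two entries, while `deepstack_mapping_from_array(4, {-1, 0, -1, 1})` places the two slots at layers 1 and 3.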

ngxson added 9 commits May 30, 2026 16:18

mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing

924bbab

fast path skip preproc for placeholder

064c2d7

fix build

d1a098d

correct the api

58171a6

add server endpoint + tests

f1503cf

add object name

aec9eff

update docs

035d72c

add proxy handling

3cb2d8c

fix build

447e418
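
The first two commits in this group introduce a "placeholder bitmap" so tokens can be counted without preprocessing. A rough sketch of the concept, under stated assumptions: `sketch_bitmap` and `count_image_tokens` are hypothetical names, and the one-token-per-patch formula is an illustrative stand-in for whatever sizing rule a given model actually uses. It is not the mtmd or clip API from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical type for illustration only.
class sketch_bitmap {
public:
    // Real bitmap: owns nx * ny RGB pixels.
    sketch_bitmap(uint32_t nx, uint32_t ny, std::vector<uint8_t> rgb)
        : nx_(nx), ny_(ny), placeholder_(false), data_(std::move(rgb)) {
        assert(data_.size() == static_cast<std::size_t>(nx) * ny * 3);
    }

    // Placeholder: records the dimensions only, with no pixel storage.
    static sketch_bitmap placeholder(uint32_t nx, uint32_t ny) {
        return sketch_bitmap(nx, ny);
    }

    bool     is_placeholder() const { return placeholder_; }
    uint32_t nx() const { return nx_; }
    uint32_t ny() const { return ny_; }

    // Guarded buffer access: reading pixels of a placeholder is a programming error.
    const std::vector<uint8_t> & pixels() const {
        assert(!placeholder_);
        return data_;
    }

private:
    sketch_bitmap(uint32_t nx, uint32_t ny) : nx_(nx), ny_(ny), placeholder_(true) {}

    uint32_t nx_;
    uint32_t ny_;
    bool     placeholder_;
    std::vector<uint8_t> data_;
};

// The count depends only on the dimensions, so a placeholder is sufficient.
// Assumption: one token per patch_size x patch_size patch, rounding up.
std::size_t count_image_tokens(const sketch_bitmap & bmp, uint32_t patch_size) {
    assert(patch_size > 0);
    const std::size_t px = (static_cast<std::size_t>(bmp.nx()) + patch_size - 1) / patch_size;
    const std::size_t py = (static_cast<std::size_t>(bmp.ny()) + patch_size - 1) / patch_size;
    return px * py;
}
```

Under these assumptions, a 224x224 placeholder with a patch size of 14 counts as 16 x 16 patches without any pixel buffer being allocated or decoded.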

github-actions Bot added the examples, python, and server labels May 30, 2026

coderabbitai Bot reviewed May 30, 2026

Comment thread tools/mtmd/clip.cpp Outdated

Comment thread tools/server/server-context.cpp

Comment thread tools/server/server.cpp Outdated

ngxson added 3 commits May 30, 2026 18:58

fix audio input path

8f67dfb

use is_placeholder in process_mtmd_prompt()

8351aaf

nits

1945165

coderabbitai Bot reviewed May 30, 2026

ngxson and others added 12 commits May 30, 2026 19:43

nits (2)

c72ef5c

docs: clarify chat/completions/input_tokens is not official

53e3e88

mtmd: enable non-causal vision for gemma 4 unified (#24082)

c8d6a00

qwen35: use post-norm hidden state for MTP (#24025)

166fe29

* qwen35: use post-norm hidden state for MTP * rename pre_norm to nextn * fix step35

mtmd: fix Gemma 4 unified FPE (#24088)

94a220c

metal : reduce rset heartbeat from 500ms -> 5ms (#24074)

3d19986

readme : add status badges (#24104)

6ddc943

fix(mtmd): handle Gemma 4 audio projector embedding size (#24091)

e3ba22d

* mtmd: handle Gemma 4 audio projector embedding size * rm projection_dim from clip_n_mmproj_embd --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

ngxson and others added 19 commits June 4, 2026 19:23

arg: fix double mtp downloads (#24128)

260862b

server : disable on-device spec checkpoints (#24108)

7c158fb

kleidiai : dynamic chunck-based scheduling for hybrid execution (#23819)

3ecfb15

minor : fix lint issues (#24165)

59917d3

ui: add ignore-scripts=true to npmrc (#24149)

cc7bef3

Fix link to available UI settings (#24169)

9c955c4

The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link

ui: run npm install when package-lock.json is newer than node_modules (#24171)

2016bf2

model : fix llama_model::n_gpu_layers() (#24188)

96fbe00

cli: fix model params not propagated (#23893)

86591c7

Fixes #23847

TP: round up granularity to 128 (#24180)

6effcec

* TP: round up granularity to 128 * remove assert

model: fix build failed (#24193)

c4a278d

Merge branch 'master' into xsn/mtmd_placeholder_chunks

acca080

fix merge problem

5b0cfdf

github-actions Bot added the documentation, ggml, SYCL, Nvidia GPU, testing, devops, script, model, Apple Metal, server/ui, and WebGPU labels Jun 5, 2026

Xsn/mtmd placeholder chunks#106

ngxson wants to merge 56 commits into ngxson:master from ggml-org:xsn/mtmd_placeholder_chunks


❌ Failed checks (2 warnings, 1 inconclusive)


20 participants
