
Add Kimi-K2.5 support #19170

Merged
ngxson merged 16 commits into ggml-org:master from AesSedai:kimi-k2.5
Feb 11, 2026

Conversation

@AesSedai
Contributor

@AesSedai AesSedai commented Jan 29, 2026

Adding support for https://huggingface.co/moonshotai/Kimi-K2.5

Since this model includes compressed-tensors (INT4 for the conditional experts), I moved the dequant_model call into prepare_tensors at @compilade's suggestion. The model conversion fails otherwise because the quantization_config is nested under text_config in config.json.

Additionally, this model adds some new keys for the vision tower, prefixed with vt_, and the preprocessor_config.json has the expected fields nested under the media_proc_cfg key.
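The nested-lookup problem described above can be sketched roughly like this; a minimal illustration only, where find_nested is a hypothetical helper (the key and nesting names come from this PR description, not from llama.cpp's actual converter API):

```python
import json

def find_nested(cfg: dict, key: str):
    """Look up key at the top level, then under the nesting points
    described above. Hypothetical helper, not llama.cpp API."""
    if key in cfg:
        return cfg[key]
    for parent in ("text_config", "media_proc_cfg"):
        sub = cfg.get(parent)
        if isinstance(sub, dict) and key in sub:
            return sub[key]
    return None

# Kimi-K2.5 nests quantization_config under text_config, so a
# top-level-only lookup comes back empty and conversion fails.
cfg = json.loads('{"text_config": {"quantization_config": {"format": "int4"}}}')
print(find_nested(cfg, "quantization_config"))
```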

This PR does not include the "hacked" Q4_0 changes by @jukofyork, referred to in this comment.

I have added a first pass at vision support, heavily aided by LLM assistance. I entirely expect @ngxson to tear it to shreds or call me a dummy and show me an easier way to add that vision support :)

@AesSedai AesSedai requested a review from CISC as a code owner January 29, 2026 06:28
@github-actions github-actions bot added the python python script changes label Jan 29, 2026
@AesSedai
Contributor Author

AesSedai commented Jan 29, 2026

image

I was testing the VL locally with this image, and the thinking block shows the oddness:

This is an interesting image. The user is asking me to identify what's in the picture. Let me look carefully.

The image shows what appears to be a furry animal resting on a tree branch. It looks like it has grayish-brown fur, and there are distinctive white markings on its face/ears that look like little tufts or tassels. The animal appears to be sleeping or resting with its face tucked in.

Based on the appearance - specifically the white ear tufts and the overall body shape - this looks like a **Common Marmoset** (Callithrix jacchus), also known as the White-tufted-ear Marmoset. These are small New World monkeys native to Brazil. The white ear tufts are their most distinctive feature.

However, I should also consider if it could be something else. Could it be a raccoon? No, the body shape is wrong and raccoons don't have white ear tufts like that. Could it be a type of lemur? Some lemurs have ear tufts, but the body shape here looks more like a marmoset or tamarin. The Common Marmoset fits best with those prominent white ear tufts.

The animal is curled up on a branch, likely sleeping or resting, with its tail visible (appears to be a long, furry tail). The background is green foliage, suggesting a natural or semi-natural habitat.

So my answer should identify it as a Common Marmoset or White-tufted Marmoset, pointing out the distinctive ear tufts and the fact that it appears to be resting on a branch.

Wait, looking closer at the image... is it possible this is multiple animals? Let me check. The pattern seems to show overlapping forms. Actually, looking carefully, this might be TWO marmosets huddled together? Or is it just one curled up?

Actually, looking at the fur pattern and the way the bodies are arranged, this appears to be two common marmosets huddled together on the branch, sleeping. I can see what looks like two heads with white ear tufts close together. Yes, that makes sense - they often sleep huddled together for warmth and security.

So the answer is: Common Marmosets (White-tufted Marmosets), likely two of them, sleeping or resting on a branch.

I should mention:
- Species: Common Marmoset (Callithrix jacchus) or White-tufted-ear Marmoset
- Distinctive white ear tufts
- They appear to be sleeping/resting on a tree branch
- Possibly two individuals huddled together
- Native to Brazil
- Small New World monkeys

This seems accurate based on the visual evidence.

Specifically, this line led me to believe there's a problem:

The pattern seems to show overlapping forms

and I've seen that in other local generations with this image too, eg:

Actually, looking at the pattern again, this looks like it might be multiple images stacked or a glitch effect? No, looking closely, it appears to be a single image of a raccoon resting on a branch, but there might be some artifacting or the image is showing the raccoon in a specific pose. The repetition pattern suggests it might be a glitch or the image is tiled strangely? No, actually, looking carefully, it seems like the image might be corrupted or displaying incorrectly, or it's an artistic effect?

Wait, no, looking at the image again - it appears to be a raccoon lying on a tree branch, and the image quality or compression creates a somewhat repetitive pattern, but it's clearly a raccoon. The distinctive facial markings (black mask, white nose/forehead), the gray fur, the bushy tail possibly visible - this is definitely a Common Raccoon (Procyon lotor).

In comparison, this is a bit of the thinking from the OpenRouter API for Kimi-K2.5:

The user wants to know what's in the picture. Looking at the image, it's clearly a raccoon lying on a tree branch. The raccoon has the distinctive black mask around its eyes, gray fur, and is draped over the branch in a relaxed or tired pose. The background shows a forest or wooded area with green foliage.

This is a straightforward image description task. I should identify the animal correctly as a raccoon and describe what it's doing (resting on a branch). I don't need to overcomplicate this or add fictional elements since the user asked a direct question about the image content.

Vastly different feel in the confidence of its answer based on what the VL sees.

@CISC
Member

CISC commented Jan 29, 2026

While the mmproj conversion appears to work and the model loads and can decode images, I've got some weird output when using the vision component that leads me to believe there is a conversion issue somewhere or some other missing component. I think I need some review from @ngxson to help get it working correctly.

Yep, seems something is not quite right yet.

Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_MM_INP_NORM for new mm_projector.pre_norm key
@AesSedai
Contributor Author

AesSedai commented Feb 1, 2026

Vision is working now for images, uploaded MMPROJ files to my repo.

@ngxson I left comments about the places that confused me the most.

  1. the resize_position_embeddings_3d - might be combinable with the clip_graph::resize_position_embeddings if the tensors are handled better?
  2. clip_graph::build_rope_2d_interleaved roughly makes sense to me from a 10,000 foot view, but I was thinking that maybe zipping or transposing the pos_w / pos_h tensors might make the square peg fit in the round hole with a bit of a different math approach?
  3. I have no idea why inp = ggml_add(ctx0, inp, learned_pos_embd); wasn't working in build_vit when passing in the learned_pos_embd.

I think the rest of the changes are pretty sane.

@AesSedai
Contributor Author

AesSedai commented Feb 1, 2026

Some test samples that I ran locally. A very basic OCR test:
chatlog (13)

A more complicated OCR test that includes transcription:
chatlog (14)

And the interpretation of the raccoon photo from earlier:
chatlog (15)

The two things that concern me still are:

  • Image 1: there is a mention of a "vertical black line/border on the left side" in the thinking, plus mention of a "Border: There is a thick vertical black line running along the left side of the image" in the response. The image padding is black, so perhaps something related to that?
  • Image 3: In the thinking, item 6 mentions: "There's a visible seam or line in the image, suggesting it might be a composite or stitched image, or perhaps just an artifact". There isn't a seam like that, so I'm concerned.

@AesSedai AesSedai marked this pull request as ready for review February 1, 2026 11:27
@AesSedai AesSedai requested a review from ngxson as a code owner February 1, 2026 11:27
@segmond

segmond commented Feb 1, 2026

Great work AesSedai! I just downloaded the BF16 for mmproj. Is there any reason to get anything higher than Q8_0? And for ctk/ctv, is there any good reason to run them in f16 instead of lower, since the model is INT4?

@segmond

segmond commented Feb 1, 2026

I'm happy to report that I have tested this branch and it works great. I ran it with the Q4_X quant and my ctk/ctv at q8_0. Using the BF16 mmproj.

Screen Shot 2026-02-01 at 12 31 36 PM

@AesSedai
Contributor Author

AesSedai commented Feb 1, 2026

@segmond Thanks! For the MMPROJ, some cards are more or less compatible with different versions; the BF16s don't work very well on my 3090s, IIRC. The Q8_0 should be fine to use quality-wise.

Regarding CTK / CTV, you do not want to quantize the cache on this model at all. The model weight quantization is different from the cache quantization, and MLA / GQA already compresses the cache quite severely, so quantizing it further will only degrade it more. Besides, the context is very lightweight anyway: something like 165k of context in FP16 is ballpark 10GB or so.
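For a rough sanity check of that ballpark, the back-of-envelope arithmetic looks like this; the MLA dimensions below are assumptions based on DeepSeek-V3-style architectures, not values confirmed in this thread (check the model's config.json):

```python
# Back-of-envelope MLA KV-cache size. All dims below are assumptions
# (DeepSeek-V3-style MLA); verify against the model's config.json.
n_ctx        = 165_000   # tokens of context
n_layer      = 61        # assumed transformer layer count
kv_lora_rank = 512       # assumed compressed KV latent per token per layer
rope_dims    = 64        # assumed decoupled RoPE key dims
bytes_fp16   = 2

cache_bytes = n_ctx * n_layer * (kv_lora_rank + rope_dims) * bytes_fp16
print(f"{cache_bytes / 2**30:.1f} GiB")  # roughly 10.8 GiB
```

Under these assumptions the FP16 cache lands near 11 GiB, consistent with the "ballpark 10GB" figure above.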

@tempgidam

Hello,

thank you for your work.
I've been testing this PR on my own server and I'm still getting some odd comments/observations from the model running some vision tests.

In my most basic test (showing the model an image of a simple black circle on a white background), my local version always insists that it's seeing two circles. When I counter-check the same image using K2.5 over openrouter (provider limited to MoonshotAI for consistency), this straight up never happens.
It does this for pretty much any image showing a simple circle that I've tried.

With more complex images like photos of people that I know the model recognizes or random artwork, my local variant consistently sees images as stitched, fragmented or artifacted with seams or similar things going through them. For any given image, the artifacts/effects my local version seems to see are consistent so it'll always note the same things over several attempts.
Again, K2.5 via the API doesn't show any of these problems for me.

In general, asking the model to comment on the image quality seems to reliably reproduce this issue. When I first tried the raccoon image above, the answer actually seemed fine at first glance. But adding an inquiry about quality makes it clear that the model is seeing some heavy issues like lines and even claiming that it appears to be rotated.

local w/o prompt about image quality

image

local w/prompt about image quality

image

api w/ prompt about the image quality

image

Here's two more examples, but as mentioned, the issues are present across numerous retries and several different images/styles of images.

local

image

api

image

local

image

api

image

I'm running your Q4_X quant and the bf16 mmproj. I built this PR from scratch just a few hours ago so it's the most recent version. I'm not using any form of kv-quantization.

Here's what I'm using to load the model:
./llama-server-k2.5 --model Kimi-K2.5-Q4_X-00001-of-00014.gguf --ctx-size 64000 -ngl 99 --n-cpu-moe 99 -ub 4096 -b 4096 --threads 32 --parallel 1 --host 0.0.0.0 --port 5001 --jinja --chat-template-file chat_template.jinja --special -mm mmproj-Kimi-K2.5-BF16.gguf

For testing purposes, I'm not running any sort of system prompt or similar in Sillytavern so that the only thing the model processes is my question and the image.

Sorry for the image spam but I hope this helps.

@AesSedai
Contributor Author

AesSedai commented Feb 3, 2026

@tempgidam Huh, interesting. I haven't seen any double-vision issues on my setup. I did update the first shard yesterday evening to adjust the chat template since MoonshotAI updated the media token. There's a minimal prompt on my ST setup ("A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.").

There was also a fix for the thinking block issue in the chat template that required an update a couple of days ago. Could you download a fresh copy of the first shard of the quant and re-test?

I'd be surprised if there wasn't a bug lurking in there somewhere, which is why I'm waiting on ngxson to review.

@z80maniac
Contributor

I used AesSedai's model (IQ2_XXS + Q8_0 mmproj) and tempgidam's image of a circle. Then prompted this:

Look at this picture and create an approximate SVG file of what you see.

And it generated me this:

<svg viewBox="0 0 400 300" xmlns="http://www.w3.org/2000/svg">
  <circle cx="100" cy="150" r="80" fill="none" stroke="black" stroke-width="15"/>
  <circle cx="320" cy="150" r="80" fill="none" stroke="black" stroke-width="15"/>
</svg>
This is how it looks in Inkscape:

image

I had to disable the thinking, because with thinking enabled it just rambled on and on for 3000 tokens, endlessly correcting itself, and then gave me an unusable SVG in the end. However, in its ramblings it said this:

The left circle appears to have its left side cut off by the image boundary

...which doesn't correspond to the image above, but still mentions that one of the circles is cut off. Only in the generated image, it's the right circle that is slightly cut off. So maybe there is something to it, or maybe it's just a hallucination because the quant is too small.

The whole thinking process of the failed run (which was NOT used to generate the above image)

The user wants me to create an approximate SVG file of the image shown. The image displays two black outlined circles (or ring shapes) on a white background. They appear to be positioned side by side, with the left one partially cut off on the left side and the right one fully visible or partially cut off on the right. Actually, looking more carefully, it seems like two circles with thick black strokes, positioned horizontally next to each other.

Let me analyze the image:

  1. Two circles with thick black borders
  2. The left circle appears to be cut off on the left side (or it's just positioned that way)
  3. The right circle is complete or also positioned similarly
  4. They are side by side horizontally
  5. The circles have white interiors (or transparent, but appear white against the background)

To recreate this in SVG:

  • I need two <circle> elements (or <path> or <ellipse>)
  • Thick stroke (probably around 10-15% of radius)
  • No fill (or white fill)
  • Black stroke
  • Positioned side by side

Looking at the proportions:

  • The circles seem to have roughly the same size
  • They are separated by some distance
  • The left one seems to be partially visible (maybe only the right half or two-thirds)
  • The right one seems fully visible or mostly visible

Actually, looking again, it seems the left circle is positioned so we see mostly its right side, and the right circle is positioned so we see its left side? No, wait. Let me reconsider.

The image shows:

  • Left side: a thick black arc that looks like the right side of a circle (or a full circle positioned partially off-screen to the left)
  • Right side: a thick black circle/ring that's positioned more to the right

Wait, looking carefully at the original image description and the visual content:

  • It appears to be two circles
  • The left circle is positioned with its left side cut off by the image boundary, showing mostly a "C" shape or partial circle on the left
  • The right circle is a complete ring positioned to the right
  • Or both are partial?

Actually, looking at the image again, it seems like:

  • Left element: A thick circular arc, open on the left side (like a "C" facing left, or rather, the right half of a circle)
  • Right element: A full circle or nearly full circle to the right

But wait, the left shape looks like it could be a full circle with the left side cut off by the image edge, and the right shape is a full circle.

Let me create an SVG with:

  • Two circles with thick strokes
  • Left circle centered at around (30, 50) with radius 40, stroke-width 8
  • Right circle centered at around (80, 50) with radius 40, stroke-width 8
  • ViewBox 0 0 100 100 (or similar)

Actually, looking at the aspect ratio, the image is wider than it is tall? No, the provided image seems to have a certain aspect ratio. Let me assume the viewBox is something like 0 0 200 100.

Better approximation:

  • Left circle: cx="50" cy="50" r="40" stroke-width="8"
  • Right circle: cx="150" cy="50" r="40" stroke-width="8"
  • But the left one is cut off, so maybe cx="20" so only the right part is visible?
  • Or maybe both are positioned so they overlap slightly or are separated?

Looking at the image once more:

  • The left circle appears to have its left side cut off by the image boundary
  • The right circle appears complete
  • They are separate, not touching
  • The left one is lower? No, they seem aligned vertically or nearly so.

Actually, looking at the bottom alignment:

  • The left circle seems to extend lower than the right circle? No, they look roughly aligned.

Let me go with:

  • ViewBox: proprietary 0 Circle 1: cx="25%" cy="50%" r="20%" stroke-width="5%"
  • Circle 2: cx="75%" cy="50%" r="20%" stroke-width="5%"

But to make it look like the image where the left one is cut off:

  • Circle 1: cx="20" cy="50" r="35" (so left edge is at -15, cut off)
  • Circle 2: cx="80" cy="50" r="35" (fully visible)

Wait, looking at the image provided, the left circle appears to be positioned so we see the right side of it (like a "D" shape), and the right circle is a full "O" shape. Actually, no - the left one looks like it might be a full circle with the left side cut off by the frame.

Let me create an SVG that captures the essence:

  • Two circles with thick black strokes
  • White fill (or transparent)
  • Left one partially visible (cut off on the left)
  • Right one fully visible
  • Vertical alignment roughly center
<svg width="200" height="100" viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
  <!-- Left circle, partially cut off -->
  <circle cx="40" cy="50" r="35" fill="none" stroke="black" stroke-width="8" />
  <!-- Right circle -->
  <circle cx="140" cy="50" r="35" fill="none" stroke="black" stroke-width="8" />
</svg>

But looking at the image, the left circle might be positioned at around x=30 (if width is 100), and the right at x=70.

Actually, the image looks like it might be showing:

  • Left: A circle where only the right half is visible (cx=0 or negative)
  • Right: A full circle positioned to the right

Let me adjust:

  • Circle 1: cx="35" cy="50" r="40" (so left edge is at -5, cut off)
  • Circle 2: cx="120" cy="50" r="40" (fully visible)

Stroke width should be roughly 1/5 to 1/4 of the radius? Looking at the image, the stroke is quite thick. If radius is 40, stroke might be 8-10.

Let me provide a responsive SVG that approximates the visual:

<svg viewBox="0 0 160 100" xmlns="http://www.w3.org/2000/svg">
  <!-- Left circle (partially visible) -->
  <circle cx="35" cy="50" r="30" fill="none" stroke="black" stroke-width="8" />
  <!-- Right circle -->
  <circle cx="115" cy="50" r="30" fill="none" stroke="black" stroke-width="8" />
</svg>

This should create two thick-outlined circles, with the left one partially cut off on the left side if the viewBox starts at 0 and the circle extends to x=5 (35-30).

Actually, looking at the image one more time, I think the left circle is positioned such that its left side is cut off, showing a "C" shape (or rather, the right side of a circle), and the right circle is fully visible. The vertical position seems roughly centered.

Let me refine:

  • Left circle: cx=30, cy=50, r=25, stroke=6 (so left edge at 5, mostly visible but left side cut off)
  • Right circle: cx=80, cy=50, r=25, stroke=6

Wait, the spacing between them looks like roughly one radius or so.

Final check:

  • The image shows two black ring shapes
  • Left one appears to be cut off on the left edge of the image
  • Right one is fully visible
  • Both have same line thickness and diameter
  • They are side by side horizontally

So my SVG should be:

<svg width="300" height="150" viewBox="0 0 300 150" xmlns="http://www.w3.org/2000/svg">
  <circle cx="75" cy="75" r="60" fill="none" stroke="black" stroke-width="12" />
  <circle cx="225" cy="75" r="60" fill="none" stroke="black" stroke-width="12" />
</svg>

But with the left one positioned so its left side is cut off. So maybe cx="60" so left edge is at 0.

Actually, looking at the original image, the left circle seems to be positioned with roughly 1/4 to 1/3 of it visible? No, it looks like more than half is visible. Maybe 2/3 or 3/4 of the circle is visible on the left.

Let me go with:

  • Left circle: cx="50", r="40", so left edge at 10 (mostly visible)
  • Actually, looking at the crop, the left circle seems to extend from near the left edge to about the middle, and the right circle is from middle to right edge.

Wait, looking at the provided image crops:

  • First crop shows the left circle's right side and the gap between them
  • Second crop shows the right circle

So the left circle is indeed partially cut off on the left side of the image.

I'll provide an SVG with viewBox="0 0 200alpXD circle at (50,50) with r=40 and another at (150,50) with r=40, stroke-width around 8.

Actually, to make the left one cut off: cx=40, r=40 (so x from 0 to 80).

Final SVG:

<svg viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
  <circle cx="···" cy="50" r="40" fill="none" stroke="black" stroke-width="8" />
  <circle cx="160" cy="50" r="40" fill="none" stroke="black" stroke-width="8" />
</svg>

With cx="40" for the left one (so it extends from 0 to 80), and cx="160" for the right one (80 to 160? No, 120 to 200). Wait, if cx=160 and r=40, it goes from 120 to 200.

Spacing: from 80 (right edge of left) to 120 (left edge of right) is 40 units gap. That seems like a big gap. In the image, the gap looks smaller than the radius.

Maybe:

  • Left: cx="45", r="35" (range: 10 to 80)
  • Right: cx="135", r="35" (range: 100 to 170)
  • Gap: 20 units
  • Stroke: 7

This looks better.

Or simply:

<svg viewBox="0 0 180 100" xmlns="http://www.w3.org/2000/svg">
  <circle cx···" cy="50" r="35" fill="none" stroke="black" stroke-width="8"/>
  <circle cx="130" cy="50" r="35" fill="none" stroke="black" stroke-width="8"/>
</svg>

Left circle at 45: range 10-80
Right circle at 130: range 95-165
Gap: 15 units (small gap)

Actually, looking at the image, the gap seems to be about one stroke-width or so, or maybe 1/4 of the diameter.

I'll provide a clean, simple approximation that captures the essence.

<svg viewBox="0 0 200 operativeCircle cx="45" cy="50" r="35" fill="none" stroke="black" stroke-width="8"/>
  <circle cx="140" cy="50" r="35" fill="none" stroke="black" stroke-width="8"/>
</svg>

@tempgidam

tempgidam commented Feb 3, 2026

Okay, I replaced the first shard with the updated one (and removed the --chat-template-file parameter which was loading the modified template from the hf discussion thread to fix the thinking tags with the old version). Sadly, this doesn't seem to change anything for me. I'm getting the same results across the board.

I also tried some other things:
- redownloading/rebuilding the PR
- trying the other mmproj files (fp16, q8)
- the same setup I described above, except with the UD-Q4_K-XL release by Unsloth
- updated ST for good measure
- tried switching the "Inline Image Quality" parameter in ST from "High" to "Low" and back

However, nothing changed. The model still sees artifacts and seams across all of them.

@AesSedai
Contributor Author

AesSedai commented Feb 3, 2026

Thanks both for the feedback, I'll continue to tinker with this and see if I can figure out the issue :)

@AesSedai
Contributor Author

AesSedai commented Feb 4, 2026

@tempgidam / @z80maniac I've got something that might be worth testing on your setups.

In the following file: tools/mtmd/models/kimik25.cpp

diff --git a/tools/mtmd/models/kimik25.cpp b/tools/mtmd/models/kimik25.cpp
index d79b2f39c..6db47e2c9 100644
--- a/tools/mtmd/models/kimik25.cpp
+++ b/tools/mtmd/models/kimik25.cpp
@@ -26,7 +26,7 @@ ggml_tensor * clip_graph_kimik25::resize_position_embeddings_3d(uint32_t interpo
 
     pos_embd = ggml_permute(ctx0, pos_embd, 2, 1, 0, 3);
     pos_embd = ggml_interpolate(ctx0, pos_embd, height, width, n_embd, 1, mode);
-    pos_embd = ggml_permute(ctx0, pos_embd, 1, 2, 0, 3);
+    pos_embd = ggml_permute(ctx0, pos_embd, 2, 1, 0, 3);
     pos_embd = ggml_cont_2d(ctx0, pos_embd, n_embd, width * height);
     return pos_embd;
 }

Try swapping the numbers 1, 2, 0, 3 on line 29 to 2, 1, 0, 3, then recompile and test again please. I've tried both the RMS picture and the circle picture on my setup with this change and I'm not seeing any "double circle" or "glitch" thinking.

I might have goofed that permute and it's swapping w/h incorrectly before the ggml_cont_2d reshapes it.
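As a rough numpy analogy (not ggml's exact axis convention) of why that permute order matters: with a non-square grid, flattening after the two different axis orders assigns the per-position embeddings to different grid cells, which fits the "stitched/doubled image" symptoms reported above.

```python
import numpy as np

# Toy grid of "position embeddings": height x width x n_embd.
height, width, n_embd = 3, 2, 4
pos = np.arange(height * width * n_embd).reshape(height, width, n_embd)

# Two candidate axis orders before flattening to (width*height, n_embd);
# only an analogy for the two ggml_permute orders discussed above.
a = np.transpose(pos, (1, 0, 2)).reshape(width * height, n_embd)
b = pos.reshape(width * height, n_embd)

# Same set of per-position rows, but attached to different grid cells,
# so the model receives scrambled spatial information.
print(np.array_equal(a, b))                            # False
print(sorted(map(tuple, a)) == sorted(map(tuple, b)))  # True
```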

@z80maniac
Contributor

@AesSedai Yes, with this patch my test creates only one circle:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">
  <circle cx="100" cy="100" r="80" fill="none" stroke="black" stroke-width="16"/>
</svg>

And the viewBox is a square, not a rectangle, like in the previous test.

This SVG is almost exact representation of the original picture:

image

And in the reasoning it doesn't say that there are two circles:

The image is a simple black circle outline (a ring) on a white background. It's a circle with a thick stroke and no fill (or white fill).

@AesSedai
Contributor Author

AesSedai commented Feb 8, 2026

@vaulter Honestly I'm not sure that this can be easily ported to ik_llama.cpp because of the changes for mtmd. I haven't reviewed the ik_llama codebase so I can't really estimate how much work that would be.

@AesSedai
Contributor Author

AesSedai commented Feb 8, 2026

The newest set of changes addresses PR feedback and requires downloading updated mmproj files to be compatible. I've uploaded new mmprojs to my repo: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF

@jukofyork
Collaborator

@vaulter Honestly I'm not sure that this can be easily ported to ik_llama.cpp because of the changes for mtmd. I haven't reviewed the ik_llama codebase so I can't really estimate how much work that would be.

https://huggingface.co/gghfez said he managed to "vibe merge" it into ik_llama.cpp; Claude reported that it couldn't find the function to do bi-cubic interpolation, so it used bi-linear instead, and AFAIK he said it worked fine.

@Lissanro

Lissanro commented Feb 8, 2026

@jukofyork Has https://huggingface.co/gghfez shared the patch version for ik_llama.cpp somewhere? And did he make it based on the version here before or after today's update by AesSedai?

@jukofyork
Collaborator

@jukofyork Has https://huggingface.co/gghfez shared the patch version for ik_llama.cpp somewhere? And did he make it based on the version here before or after today's update by AesSedai?

I don't think so - he just mentioned it here: https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0/discussions/15#698044e0f6c43685d5426b03

I don't know his GitHub handle, or I would ping him to ask.

@AesSedai
Contributor Author

AesSedai commented Feb 8, 2026

Uploaded a new set of mmproj files after removing the V/O permutes (they should have canceled out anyway, and there aren't any changes to the cpp code here, so the old files should still work; their format is just a tad different).

Comment on lines +658 to +662
// Ensure input is contiguous (needed when using merged QKV with ggml_view)
if (!ggml_is_contiguous(cur)) {
    cur = ggml_cont(ctx0, cur);
}

Member

Are you sure #19299/#19338 didn't fix this?

Contributor Author

Since #19338 only merged a few hours ago, I didn't have that one merged into this branch. I'll merge master and retry without that ggml_cont. Thanks for the callout!

Contributor Author

@CISC I merged master locally, recompiled, and tested w/o the ggml_cont and the embeddings are different on my raccoon image test.

./build/bin/llama-mtmd-cli -m /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf --chat-template /mnt/srv/snowdrift/fp16/Kimi-K2.5/chat_template.jinja --jinja --mmproj /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/mmproj-Kimi-K2.5-F16.gguf --n-gpu-layers 999 --threads 54 --override-tensor "blk\..*_exps\.=CPU" --flash-attn on --image Kimi-K2.5/raccoon.png -p "Describe the contents of this image" --seed 42 --no-warmup

Before, working:

Token 0 (first 16 values): -0.022020 -0.102026 -0.162646 0.098094 -0.076318 -0.260308 -0.064937 0.102250 0.065213 0.213322 -0.105310 -0.137638 -0.101015 -0.098639 -0.135186 0.053588 
Token 0 (last 16 values):  -0.017177 0.046693 0.107003 0.026340 -0.026663 -0.013917 0.131319 -0.144277 -0.192712 0.093361 -0.096736 0.163151 0.216722 0.159008 -0.039639 0.178694 
Stats: mean=0.000457, std=0.078247, min=-4.923960, max=8.580477, sum=6362.352051

After, broken:

Token 0 (first 16 values): -0.004227 0.017700 0.045288 0.014412 0.019241 -0.000450 -0.011505 0.052048 -0.007355 -0.014572 0.018112 -0.027519 -0.005363 -0.020798 0.010681 -0.027069 
Token 0 (last 16 values):  -0.025635 0.009247 -0.005234 0.031509 -0.006050 -0.023453 -0.008224 -0.012344 -0.011505 -0.009567 -0.009521 -0.001862 0.034729 -0.016602 -0.026917 -0.002815 
Stats: mean=0.000353, std=0.068569, min=-1.949219, max=1.311523, sum=4924.728027
And the text output now sees the image as garbled:
<think>The user wants me to describe the contents of the image. Looking at the image, I see a repeating pattern of what appears to be a stylized logo or emblem. 

The pattern consists of:
1. A circular gear-like or cog-like shape with teeth around the perimeter
2. Inside the circle, there appears to be a stylized design that looks like a DNA double helix or some kind of molecular structure in the center
3. The colors seem to be shades of blue, purple, and possibly some green or teal, with a gradient effect
4. The pattern is tiled across the entire image in a grid formation
5. The background appears to be dark (black or very dark gray)

The design looks like it could be related to:
- Bioinformatics
- DNA/genetics
- Engineering/mechanical gears
- A specific software logo or institutional logo

The image is essentially a wallpaper or background pattern made up of this repeating emblem. The emblem has a technical, scientific appearance combining mechanical (gear) and biological (DNA helix) elements.

I should describe this clearly and objectively, noting the repeating pattern, the gear-like shapes, the central helix design, and the color scheme.</think>The image displays a **repeating tiled pattern** consisting of a stylized emblem arranged in a grid across a dark background.

Each individual emblem features:
- **An outer gear or cog-like ring** with evenly spaced teeth around the circumference, rendered in shades of blue and cyan
- **A central design** resembling a DNA double helix or molecular structure, depicted in purple, pink, and light blue tones
- **A gradient color scheme** transitioning from cyan/blue on the outer ring to purple/pink in the center

The pattern creates a wallpaper-like effect with rows of these circular logos aligned both horizontally and vertically, filling the entire frame. The overall aesthetic suggests themes of **bioinformatics, biotechnology, genetics, or synthetic biology**, combining mechanical (gear) and biological (DNA) imagery. The dark background makes the cool-colored emblems stand out prominently.

Member
Ok, so we have another bug. @ORippler @JohannesGaessler

@jeffbolznv Mind testing on Vulkan?

Contributor Author

@AesSedai AesSedai Feb 8, 2026

Tested one more time with e06088da0fa86aa444409f38dff274904931c507 (the non-contiguous RoPE fix) with the ggml_cont re-enabled, and the embedding values look right again:

Token 0 (first 16 values): -0.029358 -0.105713 -0.161133 0.100708 -0.075317 -0.264465 -0.062134 0.100281 0.069458 0.216919 -0.102356 -0.134155 -0.098579 -0.098022 -0.130737 0.051514 
Token 0 (last 16 values):  -0.015167 0.048523 0.111225 0.024429 -0.022324 -0.014832 0.134521 -0.149582 -0.187012 0.097748 -0.097107 0.166382 0.219757 0.160767 -0.041840 0.179504 
Stats: mean=0.000452, std=0.078249, min=-4.953125, max=8.621094, sum=6304.381348
Correct text output
<think>The user wants me to describe the contents of the image. Let me analyze what I see:

 1. **Main subject**: A raccoon is the central focus of the image.
 2. **Position**: The raccoon is lying on a tree branch, draped over it in a relaxed or lazy manner.
 3. **Physical appearance**: 
    - Gray fur with the characteristic black "mask" around its eyes
    - White fur on its snout/muzzle area
    - Ringed tail (though partially obscured)
    - Small rounded ears with white fur on the edges
    - Front paws hanging down from the branch
 4. **Setting**: 
    - Outdoor, natural environment
    - Tree branch is thick and brown with some moss or lichen
    - Background is blurred (bokeh effect) showing green foliage and trees
    - Appears to be a forest or wooded area
 5. **Lighting**: Natural daylight, soft lighting
 6. **Mood**: The raccoon looks sleepy, relaxed, or resting. Its posture suggests it's comfortable and possibly napping.

 I should provide a clear, detailed description covering these elements.</think>The image shows a **raccoon** draped lazily over a thick, horizontal tree branch in what appears to be a forest or wooded setting. 

The raccoon has the species' characteristic features: grayish-brown fur, a distinctive black "mask" marking around its eyes, and white fur on its snout and the edges of its ears. It is lying with its belly on the branch, its head resting forward, and its front paws hanging down limply on either side, giving it a very relaxed, sleepy, or exhausted appearance.

The tree branch is sturdy and covered with patches of moss or lichen. The background is softly blurred (bokeh effect), showing various shades of green from foliage and the darker brown of tree trunks, suggesting a lush, natural environment with dappled daylight filtering through the leaves. The overall mood of the image is peaceful and endearing, capturing a moment of rest in the wild.
Reference image: [image attachment]

So it's definitely the ggml_cont there making the difference. I've pushed the updated merge from master to this branch so this should be reproducible by just commenting that line out and testing.
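For context on why that cont matters: slicing Q out of a merged QKV tensor with ggml_view keeps the packed layout's strides, so the resulting view is not contiguous, and any code that recomputes strides as if it were contiguous reads the wrong bytes. A rough numpy analogue of the situation (hypothetical shapes, not the model's actual dimensions):

```python
import numpy as np

# Hypothetical merged-QKV buffer: Q, K, V packed along the last dimension.
n_dim, n_head, n_pos = 8, 4, 2
qkv = np.zeros((n_pos, n_head, 3 * n_dim), dtype=np.float32)

# Analogous to ggml_view: take only the Q part of each packed row.
q = qkv[..., :n_dim]

# A freshly allocated (n_pos, n_head, n_dim) tensor has different strides:
contig = np.zeros((n_pos, n_head, n_dim), dtype=np.float32)

print(q.strides, contig.strides)  # the view's strides step over the packed K and V
print(q.flags["C_CONTIGUOUS"])    # False -- which is why ggml_cont papered over the bug
```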

Member

Can you try this patch:

diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index dae17c6fb..422a0e410 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -655,11 +655,6 @@ ggml_tensor * clip_graph::build_rope_2d(
     const int64_t n_head = cur->ne[1];
     const int64_t n_pos  = cur->ne[2];
 
-    // Ensure input is contiguous (needed when using merged QKV with ggml_view)
-    if (!ggml_is_contiguous(cur)) {
-        cur = ggml_cont(ctx0, cur);
-    }
-
     // for example, if we have cur tensor of shape (n_dim=8, n_head, n_pos)
     // we will have a list of 4 inv_freq: 1e-0, 1e-1, 1e-2, 1e-3
     // first half of cur will use 1e-0, 1e-2 (even)
@@ -677,8 +672,8 @@ ggml_tensor * clip_graph::build_rope_2d(
     {
         first = ggml_view_3d(ctx0, cur,
             n_dim/2, n_head, n_pos,
-            ggml_row_size(cur->type, n_dim),
-            ggml_row_size(cur->type, n_dim*n_head),
+            cur->nb[1],
+            cur->nb[2],
             0);
         first = ggml_rope_ext(
             ctx0,
@@ -696,8 +691,8 @@ ggml_tensor * clip_graph::build_rope_2d(
     {
         second = ggml_view_3d(ctx0, cur,
             n_dim/2, n_head, n_pos,
-            ggml_row_size(cur->type, n_dim),
-            ggml_row_size(cur->type, n_dim*n_head),
+            cur->nb[1],
+            cur->nb[2],
             n_dim/2 * ggml_element_size(cur));
         second = ggml_rope_ext(
             ctx0,
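The comment in build_rope_2d describes the frequency layout these views feed into: for n_dim = 8 there are four inverse frequencies, and the first half of the head dimension is rotated with the even-indexed ones. A quick sketch of that split (assuming the standard RoPE base theta = 10000, which yields exactly the 1e-0 … 1e-3 values the comment mentions; the model's configured base may differ):

```python
# Standard RoPE inverse frequencies: inv_freq[i] = theta**(-2*i/n_dim)
n_dim = 8
theta = 10000.0  # assumed base; gives the 1e-0 .. 1e-3 values from the comment

inv_freq = [theta ** (-2 * i / n_dim) for i in range(n_dim // 2)]
# inv_freq is approximately [1.0, 0.1, 0.01, 0.001]

# build_rope_2d rotates the first half of the head dim with the even-indexed
# frequencies and the second half with the odd-indexed ones:
first_half  = inv_freq[0::2]   # ~ [1e-0, 1e-2]
second_half = inv_freq[1::2]   # ~ [1e-1, 1e-3]

print(first_half, second_half)
```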

Contributor Author

Sure, I'll give it a shot in a few hours once I'm back home from the office.

Contributor

> Fine by me, not many models using this anyway I think?

@CISC Just repeating myself earlier, but this is the first model to use the build_rope_2d + merged QKV combo.

Other models seem to use the combo ggml_rope_ext + merged QKV, so they're fine.

Member

> Fine by me, not many models using this anyway I think?
>
> @CISC Just repeating myself earlier, but this is the first model to use the build_rope_2d + merged QKV combo.
>
> Other models seem to use the combo ggml_rope_ext + merged QKV, so they're fine.

Sure, I meant build_rope_2d in general.

Contributor Author

Without the ggml_cont, running CPU-only (CUDA_VISIBLE_DEVICES=) still produces the wrong embeddings and text output:

CPU only, no ggml_cont
$ CUDA_VISIBLE_DEVICES= MTMD_DEBUG_EMBEDDINGS=1 ./build/bin/llama-mtmd-cli -m /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf --chat-template /mnt/srv/snowdrift/fp16/Kimi-K2.5/chat_template.jinja --jinja --mmproj /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/mmproj-Kimi-K2.5-F16.gguf --threads 54 --flash-attn on --image Kimi-K2.5/raccoon.png -p "Describe the contents of this image" --seed 42 --no-warmup 2>&1 | tee ggml-cpu-only.log
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
build: 7986 (16010cba6) with GNU 14.2.1 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 19.65 seconds
llama_model_loader: loaded meta data with 49 key-value pairs and 1096 tensors from /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 384x14B
llama_model_loader: - kv   3:                            general.license str              = other
llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 262144
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 64
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  12:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  13:              deepseek2.rope.scaling.factor f32              = 64.000000
llama_model_loader: - kv  14: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  15:      deepseek2.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  16:      deepseek2.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  17:                   deepseek2.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  18: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  20:               deepseek2.expert_group_count u32              = 1
llama_model_loader: - kv  21:          deepseek2.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  23:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  24:                       deepseek2.vocab_size u32              = 163840
llama_model_loader: - kv  25:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  26:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  27:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  28:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  29:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  30:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  31:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  32:                     deepseek2.expert_count u32              = 384
llama_model_loader: - kv  33:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  34:             deepseek2.expert_weights_scale f32              = 2.827000
llama_model_loader: - kv  35:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  36:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  37: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  38:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  39:                         tokenizer.ggml.pre str              = kimi-k2
llama_model_loader: - kv  40:                      tokenizer.ggml.tokens arr[str,163840]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  41:                  tokenizer.ggml.token_type arr[i32,163840]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  42:                      tokenizer.ggml.merges arr[str,163328]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  43:                tokenizer.ggml.bos_token_id u32              = 163584
llama_model_loader: - kv  44:                tokenizer.ggml.eos_token_id u32              = 163585
llama_model_loader: - kv  45:            tokenizer.ggml.padding_token_id u32              = 163839
llama_model_loader: - kv  46:                    tokenizer.chat_template str              = {%- macro render_content(msg) -%}\n   ...
llama_model_loader: - kv  47:               general.quantization_version u32              = 2
llama_model_loader: - kv  48:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  365 tensors
llama_model_loader: - type q4_0:  180 tensors
llama_model_loader: - type q8_0:  551 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 543.62 GiB (4.55 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 163585 ('[EOS]')
load:   - 163586 ('<|im_end|>')
load:   - 163593 ('[EOT]')
load:   - 163839 ('[PAD]')
load: special tokens cache size = 256
load: token to piece cache size = 1.0606 MB
print_info: arch                  = deepseek2
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 7168
print_info: n_embd_inp            = 7168
print_info: n_layer               = 61
print_info: n_head                = 64
print_info: n_head_kv             = 1
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 576
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = 64
print_info: n_embd_k_gqa          = 576
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 18432
print_info: n_expert              = 384
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 50000.0
print_info: freq_scale_train      = 0.015625
print_info: n_ctx_orig_yarn       = 4096
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 671B
print_info: model params          = 1.03 T
print_info: general.name          = n/a
print_info: n_layer_dense_lead    = 1
print_info: n_lora_q              = 1536
print_info: n_lora_kv             = 512
print_info: n_embd_head_k_mla     = 192
print_info: n_embd_head_v_mla     = 128
print_info: n_ff_exp              = 2048
print_info: n_expert_shared       = 1
print_info: expert_weights_scale  = 2.8
print_info: expert_weights_norm   = 1
print_info: expert_gating_func    = sigmoid
print_info: vocab type            = BPE
print_info: n_vocab               = 163840
print_info: n_merges              = 163328
print_info: BOS token             = 163584 '[BOS]'
print_info: EOS token             = 163585 '[EOS]'
print_info: EOT token             = 163586 '<|im_end|>'
print_info: PAD token             = 163839 '[PAD]'
print_info: LF token              = 198 'Ċ'
print_info: FIM PAD token         = 163839 '[PAD]'
print_info: EOG token             = 163585 '[EOS]'
print_info: EOG token             = 163586 '<|im_end|>'
print_info: EOG token             = 163593 '[EOT]'
print_info: EOG token             = 163839 '[PAD]'
print_info: max token length      = 512
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors:   CPU_Mapped model buffer size = 556663.41 MiB
load_tensors:   CPU_REPACK model buffer size = 544320.00 MiB
....................................................................................................
common_init_result: added [EOS] logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added [EOT] logit bias = -inf
common_init_result: added [PAD] logit bias = -inf
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 50000.0
llama_context: freq_scale    = 0.015625
llama_context:        CPU  output buffer size =     0.62 MiB
llama_kv_cache:        CPU KV buffer size = 17568.00 MiB
llama_kv_cache: size = 17568.00 MiB (262144 cells,  61 layers,  1/1 seqs), K (f16): 17568.00 MiB, V (f16):    0.00 MiB
sched_reserve: reserving ...
sched_reserve:        CPU compute buffer size =   981.01 MiB
sched_reserve: graph nodes  = 4791
sched_reserve: graph splits = 1
sched_reserve: reserve took 8.48 ms, sched copies = 1
mtmd_cli_context: chat template example:
<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>Hello<|im_end|><|im_assistant|>assistant<|im_middle|><think></think>Hi there<|im_end|><|im_user|>user<|im_middle|>How are you?<|im_end|><|im_assistant|>assistant<|im_middle|>
clip_model_loader: model name:   Kimi K2.5
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    335
clip_model_loader: n_kv:         28

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          kimik25
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     7168

--- vision hparams ---
load_hparams: image_size:         896
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   1568
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         908.42 MiB
load_hparams: metadata size:      0.12 MiB
main: loading model: /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
alloc_compute_meta:        CPU compute buffer size =   264.77 MiB
alloc_compute_meta: graph splits = 1, nodes = 1034
warmup: flash attention is enabled

=== MTMD_DEBUG_EMBEDDINGS ===
Shape: [7168, 1944]
Token 0 (first 16 values): -0.031418 0.018485 0.076531 0.027357 0.050587 -0.002738 -0.014581 0.131898 -0.018651 -0.064568 0.038107 -0.068829 -0.001136 -0.065361 0.036655 -0.092244 
Token 0 (last 16 values):  -0.046038 0.025009 -0.036792 0.040030 0.005790 -0.034914 -0.026276 -0.044675 0.006448 0.010157 -0.003982 -0.000043 0.091428 -0.022575 -0.079341 0.009261 
Stats: mean=0.000517, std=0.063481, min=-1.903979, max=1.135730, sum=7206.027344
=== END MTMD_DEBUG_EMBEDDINGS ===

image slice encoded in 15268 ms
decoding image batch 1/1, n_tokens_batch = 1944
image decoded (batch 1/1) in 64037 ms

<think>The user wants me to describe the contents of the image. Looking at the image, I see a repeating pattern of what appears to be a stylized icon or logo. 

The pattern consists of:
1. A purple/lavender colored circular gear-like shape with multiple teeth or cogs around the perimeter
2. Inside the circle, there's a white silhouette of what looks like a classical building or structure with columns (perhaps resembling a temple, courthouse, or government building)
3. The background appears to be a dark green or olive color
4. There's a wavy, translucent overlay in shades of orange, yellow, and red that runs horizontally across the image, creating a distorted or glitch-like effect over the repeating pattern

The pattern is tiled repeatedly across the entire image, creating a wallpaper-like effect. The wavy distortion overlay gives it a somewhat psychedelic or digital art aesthetic.

Let me provide a detailed description of these elements.</think>The image displays a **repeating tiled pattern** consisting of stylized icons set against a dark olive-green background. Each tile features:

**Main Icon Elements:**
- A **purple/lavender gear or cogwheel** with rounded teeth forming a circular border
- Inside each gear is a **white silhouette of a classical building** resembling a Greek or Roman temple with columns and a triangular pediment

**Overlay Effect:**
- A **horizontal, wavy, translucent distortion band** runs across the entire image in shades of orange, yellow, and amber
- This creates a "glitch" or heat-wave effect that slightly distorts the underlying pattern, making the icons appear to shimmer or ripple like a mirage

**Overall Composition:**
The pattern is arranged in a tight grid, with the gear icons touching or nearly touching each other, creating a wallpaper-like texture. The color palette is limited primarily to dark green, purple, and white, with the warm-toned wavy overlay adding contrast and visual movement to the otherwise static geometric repetition. The aesthetic suggests a blend of industrial/technical imagery (gears) with classical architecture, filtered through a digital or psychedelic visual effect.


llama_perf_context_print:        load time =  731834.48 ms
llama_perf_context_print: prompt eval time =   93832.96 ms /  1957 tokens (   47.95 ms per token,    20.86 tokens per second)
llama_perf_context_print:        eval time =   55847.56 ms /   428 runs   (  130.48 ms per token,     7.66 tokens per second)
llama_perf_context_print:       total time =  787807.93 ms /  2385 tokens
llama_perf_context_print:    graphs reused =        425

The patch switching the view strides to cur->nb[1] and cur->nb[2] fixed it: with no ggml_cont, the output is correct:

gpu w/ patch, no ggml_cont
$ MTMD_DEBUG_EMBEDDINGS=1 ./build/bin/llama-mtmd-cli -m /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf --chat-template /mnt/srv/snowdrift/fp16/Kimi-K2.5/chat_template.jinja --jinja --mmproj /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/mmproj-Kimi-K2.5-F16.gguf --threads 54 --flash-attn on --image Kimi-K2.5/raccoon.png -p "Describe the contents of this image" --seed 42 --no-warmup 2>&1 | tee ggml-rope-cur.log
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7986 (16010cba6) with GNU 14.2.1 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24135 total, 288978 used, -265106 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24135 total, 288217 used, -264345 free vs. target of   1024
llama_params_fit_impl: projected to use 577196 MiB of device memory vs. 47743 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 531500 MiB less in total
llama_params_fit_impl: context size reduced from 262144 to 4096 -> need 20614 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 27843 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 62 layers,  11789 MiB used,  12081 MiB free
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  0 layers,   6062 MiB used,  17809 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  2 layers ( 0 overflowing),  15805 MiB used,   8066 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 60 layers (59 overflowing),  20136 MiB used,   3735 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 11.35 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:06:10.0) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:06:11.0) - 23871 MiB free
llama_model_loader: loaded meta data with 49 key-value pairs and 1096 tensors from /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 384x14B
llama_model_loader: - kv   3:                            general.license str              = other
llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 262144
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 64
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  12:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  13:              deepseek2.rope.scaling.factor f32              = 64.000000
llama_model_loader: - kv  14: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  15:      deepseek2.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  16:      deepseek2.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  17:                   deepseek2.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  18: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  20:               deepseek2.expert_group_count u32              = 1
llama_model_loader: - kv  21:          deepseek2.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  23:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  24:                       deepseek2.vocab_size u32              = 163840
llama_model_loader: - kv  25:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  26:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  27:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  28:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  29:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  30:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  31:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  32:                     deepseek2.expert_count u32              = 384
llama_model_loader: - kv  33:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  34:             deepseek2.expert_weights_scale f32              = 2.827000
llama_model_loader: - kv  35:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  36:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  37: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  38:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  39:                         tokenizer.ggml.pre str              = kimi-k2
llama_model_loader: - kv  40:                      tokenizer.ggml.tokens arr[str,163840]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  41:                  tokenizer.ggml.token_type arr[i32,163840]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  42:                      tokenizer.ggml.merges arr[str,163328]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  43:                tokenizer.ggml.bos_token_id u32              = 163584
llama_model_loader: - kv  44:                tokenizer.ggml.eos_token_id u32              = 163585
llama_model_loader: - kv  45:            tokenizer.ggml.padding_token_id u32              = 163839
llama_model_loader: - kv  46:                    tokenizer.chat_template str              = {%- macro render_content(msg) -%}\n   ...
llama_model_loader: - kv  47:               general.quantization_version u32              = 2
llama_model_loader: - kv  48:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  365 tensors
llama_model_loader: - type q4_0:  180 tensors
llama_model_loader: - type q8_0:  551 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 543.62 GiB (4.55 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 163585 ('[EOS]')
load:   - 163586 ('<|im_end|>')
load:   - 163593 ('[EOT]')
load:   - 163839 ('[PAD]')
load: special tokens cache size = 256
load: token to piece cache size = 1.0606 MB
print_info: arch                  = deepseek2
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 7168
print_info: n_embd_inp            = 7168
print_info: n_layer               = 61
print_info: n_head                = 64
print_info: n_head_kv             = 1
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 576
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = 64
print_info: n_embd_k_gqa          = 576
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 18432
print_info: n_expert              = 384
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 50000.0
print_info: freq_scale_train      = 0.015625
print_info: n_ctx_orig_yarn       = 4096
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 671B
print_info: model params          = 1.03 T
print_info: general.name          = n/a
print_info: n_layer_dense_lead    = 1
print_info: n_lora_q              = 1536
print_info: n_lora_kv             = 512
print_info: n_embd_head_k_mla     = 192
print_info: n_embd_head_v_mla     = 128
print_info: n_ff_exp              = 2048
print_info: n_expert_shared       = 1
print_info: expert_weights_scale  = 2.8
print_info: expert_weights_norm   = 1
print_info: expert_gating_func    = sigmoid
print_info: vocab type            = BPE
print_info: n_vocab               = 163840
print_info: n_merges              = 163328
print_info: BOS token             = 163584 '[BOS]'
print_info: EOS token             = 163585 '[EOS]'
print_info: EOT token             = 163586 '<|im_end|>'
print_info: PAD token             = 163839 '[PAD]'
print_info: LF token              = 198 'Ċ'
print_info: FIM PAD token         = 163839 '[PAD]'
print_info: EOG token             = 163585 '[EOS]'
print_info: EOG token             = 163586 '<|im_end|>'
print_info: EOG token             = 163593 '[EOT]'
print_info: EOG token             = 163839 '[PAD]'
print_info: max token length      = 512
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 60 repeating layers to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 555458.51 MiB
load_tensors:        CUDA0 model buffer size =  9733.81 MiB
load_tensors:        CUDA1 model buffer size = 19508.51 MiB
....................................................................................................
common_init_result: added [EOS] logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added [EOT] logit bias = -inf
common_init_result: added [PAD] logit bias = -inf
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 50000.0
llama_context: freq_scale    = 0.015625
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.62 MiB
llama_kv_cache:      CUDA0 KV buffer size =     9.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   265.50 MiB
llama_kv_cache: size =  274.50 MiB (  4096 cells,  61 layers,  1/1 seqs), K (f16):  274.50 MiB, V (f16):    0.00 MiB
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =  6062.75 MiB
sched_reserve:      CUDA1 compute buffer size =   362.00 MiB
sched_reserve:  CUDA_Host compute buffer size =    36.01 MiB
sched_reserve: graph nodes  = 4791
sched_reserve: graph splits = 240 (with bs=512), 121 (with bs=1)
sched_reserve: reserve took 12.79 ms, sched copies = 1
mtmd_cli_context: chat template example:
<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>Hello<|im_end|><|im_assistant|>assistant<|im_middle|><think></think>Hi there<|im_end|><|im_user|>user<|im_middle|>How are you?<|im_end|><|im_assistant|>assistant<|im_middle|>
clip_model_loader: model name:   Kimi K2.5
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    335
clip_model_loader: n_kv:         28

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          kimik25
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     7168

--- vision hparams ---
load_hparams: image_size:         896
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   1568
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         908.42 MiB
load_hparams: metadata size:      0.12 MiB
main: loading model: /mnt/srv/snowdrift/gguf/Kimi-K2.5-GGUF/aes_sedai/Kimi-K2.5-Q4_X.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
alloc_compute_meta:      CUDA0 compute buffer size =   264.77 MiB
alloc_compute_meta:        CPU compute buffer size =    17.50 MiB
alloc_compute_meta: graph splits = 1, nodes = 1034
warmup: flash attention is enabled

=== MTMD_DEBUG_EMBEDDINGS ===
Shape: [7168, 1944]
Token 0 (first 16 values): -0.029358 -0.105713 -0.161133 0.100708 -0.075317 -0.264465 -0.062134 0.100281 0.069458 0.216919 -0.102356 -0.134155 -0.098579 -0.098022 -0.130737 0.051514 
Token 0 (last 16 values):  -0.015167 0.048523 0.111225 0.024429 -0.022324 -0.014832 0.134521 -0.149582 -0.187012 0.097748 -0.097107 0.166382 0.219757 0.160767 -0.041840 0.179504 
Stats: mean=0.000452, std=0.078249, min=-4.953125, max=8.621094, sum=6304.381348
=== END MTMD_DEBUG_EMBEDDINGS ===

image slice encoded in 1017 ms
decoding image batch 1/1, n_tokens_batch = 1944
image decoded (batch 1/1) in 83232 ms

<think>The user wants me to describe the contents of the image. Let me analyze what I see:

 1. **Main subject**: A raccoon is the central focus of the image.
 2. **Position**: The raccoon is lying on a tree branch, draped over it in a relaxed or lazy manner.
 3. **Physical appearance**: 
    - Gray fur with the characteristic black "mask" around its eyes
    - White markings on its face, particularly around the nose and mouth area
    - Ringed tail (though partially visible)
    - Small, rounded ears with white fur on the edges
    - Front paws hanging down from the branch
 4. **Setting**: 
    - Outdoor, natural environment
    - Tree branch is thick and brown with some moss or lichen
    - Background is blurred (bokeh effect) showing green foliage and trees
    - Appears to be a forest or wooded area
 5. **Lighting**: Natural daylight, soft lighting
 6. **Mood**: The raccoon looks relaxed, possibly sleepy or resting. Its posture suggests it's comfortable and at ease.

 The image captures a candid moment of wildlife in a natural habitat. The raccoon's body language suggests it's taking a break or observing its surroundings from a safe perch.

 I should provide a clear, detailed description covering the subject, its appearance, its position/action, and the surrounding environment.</think>The image shows a **raccoon** draped lazily over a thick, horizontal tree branch in what appears to be a forest or wooded environment.

**The raccoon:**
- Has the species' characteristic grayish-brown fur with darker and lighter variations
- Displays the distinctive black "mask" markings around its eyes, contrasted with white fur on its snout and cheeks
- Is lying with its belly on the branch, front paws hanging down loosely, and head resting forward
- Has small, rounded ears with white trim and a slightly pointed snout
- Appears relaxed or sleepy, with a calm, somewhat melancholic expression

**The setting:**
- The tree branch is sturdy and brown with patches of moss or lichen
- Background features soft, out-of-focus greenery (bokeh effect) including leaves and tree trunks
- Natural daylight illuminates the scene, highlighting the texture of the raccoon's fur
- The overall atmosphere is peaceful and natural, suggesting a wildlife habitat or nature reserve

The composition captures an intimate, candid moment of the animal at rest in its natural arboreal environment.


llama_perf_context_print:        load time =  317102.77 ms
llama_perf_context_print: prompt eval time =   84821.87 ms /  1957 tokens (   43.34 ms per token,    23.07 tokens per second)
llama_perf_context_print:        eval time =   43858.18 ms /   523 runs   (   83.86 ms per token,    11.92 tokens per second)
llama_perf_context_print:       total time =  361103.84 ms /  2480 tokens
llama_perf_context_print:    graphs reused =        520
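Two of the numbers in the load log above are easy to cross-check by hand. A minimal sketch (assuming the usual llama.cpp conventions: `freq_scale_train = n_ctx_orig_yarn / n_ctx_train`, and with MLA only the K buffer of width `n_embd_k_gqa` materialized, which is why V shows 0.00 MiB):

```python
# Sanity-check two values printed in the load log.

# YaRN frequency scale: original context / trained context
n_ctx_orig = 4096
n_ctx_train = 262144
freq_scale = n_ctx_orig / n_ctx_train
print(freq_scale)  # 0.015625, matching freq_scale_train

# KV cache size: with MLA only K is stored (the log shows V (f16): 0.00 MiB)
n_cells = 4096        # n_ctx
n_layer = 61
n_embd_k_gqa = 576    # per-cell K width
bytes_f16 = 2
kv_bytes = n_cells * n_layer * n_embd_k_gqa * bytes_f16
print(kv_bytes / 2**20)  # 274.5 MiB, matching the llama_kv_cache line
```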

I can update this PR with that fix applied, @ggerganov.

AesSedai and others added 2 commits February 10, 2026 14:01
ggml_row_size(cur->type, n_dim),
ggml_row_size(cur->type, n_dim*n_head),
cur->nb[1],
cur->nb[2],
Contributor Author

@AesSedai AesSedai Feb 10, 2026

@ngxson making sure you see this change and the one below in the second view too. Adjusting this removed the need for the ggml_cont above.
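The effect of that change can be illustrated outside of ggml. A NumPy analogy (illustrative only; `ggml_view_3d` operates on ggml tensors with byte strides from `ggml_row_size`, not on NumPy arrays): slicing the packed QKV buffer yields strided views that share memory, whereas a `ggml_cont`-style step would have materialized a fresh contiguous copy.

```python
import numpy as np

n_tokens, n_head, n_dim = 4, 16, 72
# Packed QKV as produced by a merged projection: one row per token,
# laid out [Q | K | V] along the feature axis.
qkv = np.arange(n_tokens * 3 * n_head * n_dim, dtype=np.float32)
qkv = qkv.reshape(n_tokens, 3 * n_head * n_dim)

# Views into the packed buffer -- roughly analogous to ggml_view_3d
# with row offsets of ggml_row_size(type, n_dim * n_head): no data moves.
q = qkv[:, 0 * n_head * n_dim : 1 * n_head * n_dim].reshape(n_tokens, n_head, n_dim)
k = qkv[:, 1 * n_head * n_dim : 2 * n_head * n_dim].reshape(n_tokens, n_head, n_dim)
v = qkv[:, 2 * n_head * n_dim : 3 * n_head * n_dim].reshape(n_tokens, n_head, n_dim)

# All three still alias the packed tensor's storage.
print(np.shares_memory(q, qkv), np.shares_memory(k, qkv), np.shares_memory(v, qkv))
```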

Contributor

@ngxson ngxson left a comment


Nice, thanks!!

@ngxson ngxson requested a review from CISC February 10, 2026 22:43
Member

@CISC CISC left a comment


Merge at will.

@ngxson ngxson merged commit e463bbd into ggml-org:master Feb 11, 2026
81 of 82 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Move dequant_model to after the text_config merge
Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_M_IMP_NORM for new mm_projector.pre_norm key

* Fix a couple of oversights

* Add image support for Kimi-K2.5

* Revert changes to KimiVLForConditionalGeneration

* Fix an assert crash

* Fix permute swapping w / h on accident

* Kimi-K2.5: Use merged QKV for vision

* Kimi-K2.5: pre-convert vision QK to use build_rope_2d

* Kimi-K2.5: support non-interleaved rope for vision

* Kimi-K2.5: fix min / max pixel

* Kimi-K2.5: remove v/o permutes, unnecessary

* Kimi-K2.5: update permute name to match

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Kimi-K2.5: replace build_rope_2d ggml_cont with ggml_view_3d pointers

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
Labels

examples, python (python script changes)
