
Feat - add kimi 2.5 Vision #1280

Merged
ikawrakow merged 2 commits into ikawrakow:main from SamuelOliveirads:feat/kimi-vision
Feb 19, 2026

Conversation

@SamuelOliveirads (Contributor)

Ports Kimi K2.5 Vision support based on the request in issue #1264.

This implementation references logic from upstream llama.cpp PRs 19170, 16891, and 17022, and includes small updates in clip.cpp.

Tested in the default webui using IQ1-KT with the f16 mmproj.

@ikawrakow (Owner) left a comment


Is the code for the nearest neighbor and bilinear interpolation taken from llama.cpp?

}

// other attention tensors (output / norms / ln) left as-is
layer.k_w = get_tensor(string_format(TN_ATTN_K, prefix, il, "weight"), false);
@ikawrakow (Owner)


What was the purpose of this change? In the original it is clear that we need to have either a merged qkv_w, or each of q_w, k_w, v_w. In the new version it is perfectly fine to have none of these, or only a subset of the required tensors.

@SamuelOliveirads (Contributor, Author)


It is just a small refactor to align with current upstream llama.cpp. There, clip_model_loader tries to load every tensor regardless of whether it exists, and the validation logic was moved into build_vit. The same applies to k_b.
This isn't strictly needed for Kimi vision; it is to keep ik_llama aligned with upstream and simplify future model ports. However, since this refactor could impact other models I haven't tested, I can simply revert this change if you prefer.

}

// keep other optional biases as before
layer.k_b = get_tensor(string_format(TN_ATTN_K, prefix, il, "bias"), false);
@ikawrakow (Owner)


Same comment as above


const float * x = (float *)((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
float * y = (float *)((char *) dst->data + i0*nb0 + i1*nb1 + i2*nb2 + i3*nb3);
if (mode == GGML_SCALE_MODE_NEAREST || mode == GGML_SCALE_MODE_BILINEAR) {
@ikawrakow (Owner)


It says nearest neighbor or bilinear. The code does neither of these two things, not to mention that it somehow assumes nearest neighbor interpolation is the same thing as bilinear interpolation.

@SamuelOliveirads (Contributor, Author)


In llama.cpp they have the full bilinear code, but it wasn't necessary for Kimi, so I didn't include it. Since the current codebase only uses nearest neighbor, I assumed existing vision models were already relying on that logic. However, since I modified how clip.cpp handles ggml_scale_mode, I can include the proper bilinear implementation as a safety measure to avoid breaking other vision models.

@SamuelOliveirads (Contributor, Author)

Is the code for the nearest neighbor and bilinear interpolation taken from llama.cpp?

Yes, although I didn't port the bilinear implementation since Kimi doesn't use it.

@vaulter

vaulter commented Feb 18, 2026

Is the code for the nearest neighbor and bilinear interpolation taken from llama.cpp?

Yes, although I didn't port the bilinear implementation since Kimi doesn't use it.

Just wanted to highlight: Qwen 3.5 also has a vision modality. I'm not aware which implementation it uses, but it might make sense to check for a possible future implementation.

@SamuelOliveirads (Contributor, Author)

Just wanted to highlight: Qwen 3.5 also has a vision modality. I'm not aware which implementation it uses, but it might make sense to check for a possible future implementation.

Qwen 3.5 uses the same vision logic as 3V, which we already support.

@Skelectric

Skelectric commented Feb 19, 2026

Hi, I tested this with both qwen3-vl and kimi-k2.5 and can confirm it works. Thanks for the PR!
