Conversation
ikawrakow left a comment
Is the code for the nearest neighbor and bilinear interpolation taken from llama.cpp?
```cpp
// other attention tensors (output / norms / ln) left as-is
layer.k_w = get_tensor(string_format(TN_ATTN_K, prefix, il, "weight"), false);
```
What was the purpose of this change? In the original it is clear that we need to have either a merged qkv_w, or each of k_w, q_w, v_w. In the new version it is perfectly fine to not have any of these, or to have a subset of the required tensors.
It is just a small refactor to align with current upstream llama.cpp. There, they try to load every tensor in clip_model_loader regardless of whether it exists, and move the validation logic inside build_vit. The same applies to k_b.
This isn't strictly for Kimi vision, but rather to keep ik_llama updated with upstream and simplify future model ports. However, since this refactor could impact other models I haven't tested, I can simply revert this change if you prefer.
```cpp
// keep other optional biases as before
layer.k_b = get_tensor(string_format(TN_ATTN_K, prefix, il, "bias"), false);
```
```cpp
const float * x = (float *)((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
float * y = (float *)((char *) dst->data + i0*nb0 + i1*nb1 + i2*nb2 + i3*nb3);
if (mode == GGML_SCALE_MODE_NEAREST || mode == GGML_SCALE_MODE_BILINEAR) {
```
It says nearest neighbor or bilinear. The code does neither of these two things, not to mention that it somehow assumes nearest neighbor interpolation is the same thing as bilinear interpolation.
In llama.cpp they have the full bilinear code, but it wasn't necessary for Kimi, so I didn't include it. Since the current codebase only uses nearest neighbor, I assumed existing vision models were already relying on that logic. However, since I modified how clip.cpp handles ggml_scale_mode, I can include the proper bilinear implementation as a safety measure to avoid breaking other vision models.
Yes, although I didn't port the bilinear implementation since Kimi doesn't use it.
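For reference, here is a minimal single-channel sketch of the two interpolation modes under discussion. This is not the ggml/llama.cpp implementation (the upstream kernel works on strided 4-D tensors); the function names and the half-pixel-center convention here are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// src is a row-major w x h float image; the result is w2 x h2.

// Nearest neighbor: each destination pixel copies the closest source pixel.
static std::vector<float> scale_nearest(const std::vector<float> & src,
                                        int w, int h, int w2, int h2) {
    std::vector<float> dst(w2 * h2);
    for (int y = 0; y < h2; ++y) {
        for (int x = 0; x < w2; ++x) {
            const int sx = std::min(x * w / w2, w - 1);
            const int sy = std::min(y * h / h2, h - 1);
            dst[y * w2 + x] = src[sy * w + sx];
        }
    }
    return dst;
}

// Bilinear: each destination pixel is a weighted average of the 2x2 source
// neighborhood around its back-projected position (half-pixel centers,
// clamped at the borders).
static std::vector<float> scale_bilinear(const std::vector<float> & src,
                                         int w, int h, int w2, int h2) {
    std::vector<float> dst(w2 * h2);
    for (int y = 0; y < h2; ++y) {
        for (int x = 0; x < w2; ++x) {
            const float fx = std::clamp((x + 0.5f) * w / w2 - 0.5f, 0.0f, (float) (w - 1));
            const float fy = std::clamp((y + 0.5f) * h / h2 - 0.5f, 0.0f, (float) (h - 1));
            const int x0 = (int) fx, x1 = std::min(x0 + 1, w - 1);
            const int y0 = (int) fy, y1 = std::min(y0 + 1, h - 1);
            const float dx = fx - x0, dy = fy - y0;
            const float top = src[y0 * w + x0] * (1 - dx) + src[y0 * w + x1] * dx;
            const float bot = src[y1 * w + x0] * (1 - dx) + src[y1 * w + x1] * dx;
            dst[y * w2 + x] = top * (1 - dy) + bot * dy;
        }
    }
    return dst;
}
```

The two modes diverge as soon as the scale factor is not 1: nearest neighbor produces blocks of repeated values, while bilinear blends the four surrounding pixels, which is why routing `GGML_SCALE_MODE_BILINEAR` through the nearest-neighbor path silently changes the output for any model that requests bilinear.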
Just wanted to highlight: Qwen 3.5 also has a vision modality. I'm not aware which implementation it uses, but it might make sense to check for a possible future implementation.
Qwen 3.5 uses the same vision logic as 3V, which we already support.
Hi, I tested this with both qwen3-vl and kimi-k2.5 and can confirm it works. Thanks for the PR!
Ports Kimi K2.5 Vision support based on the request in issue #1264.
This implementation references logic from upstream PR 19170, PR 16891 and PR 17022, and includes small updates in clip.cpp.
Tested in the default webui using IQ1-KT with the f16 mmproj.