UPSTREAM PR #19170: Add Kimi-K2.5 support#1119

Open
loci-dev wants to merge 5 commits into main from loci/pr-19170-kimi-k2.5

Conversation

@loci-dev loci-dev commented Feb 1, 2026

Note

Source pull request: ggml-org/llama.cpp#19170

Adding support for https://huggingface.co/moonshotai/Kimi-K2.5

Since this model uses compressed-tensors (INT4 for the conditional experts), I moved the dequant_model call into prepare_tensors at @compilade's suggestion. Model conversion fails otherwise, because the quantization_config is nested under text_config in the config.json.
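To illustrate the nesting issue described above, here is a minimal sketch of a lookup that tolerates both layouts. The helper name and the sample config structure are illustrative assumptions, not the converter's actual API:

```python
def get_quant_config(config: dict):
    """Return quantization_config whether it sits at the top level (the
    common case) or nested under text_config (as in Kimi-K2.5's config.json).
    Hypothetical helper for illustration only."""
    if "quantization_config" in config:
        return config["quantization_config"]
    return config.get("text_config", {}).get("quantization_config")

# Kimi-K2.5-style nesting resolves the same as a flat layout would:
cfg = {"text_config": {"quantization_config": {"format": "compressed-tensors", "bits": 4}}}
print(get_quant_config(cfg))  # {'format': 'compressed-tensors', 'bits': 4}
```

A converter that only checks the top level sees no quantization_config at all for this model, which is why the dequantization step has to run after the full config is available.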

Additionally, this model adds some new keys for the vision tower, prefixed as vt_, and the preprocessor_config.json has the expected fields nested in the media_proc_cfg key.
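A small sketch of handling that preprocessor nesting, assuming the usual flat layout as the fallback; the function name is hypothetical:

```python
def load_preprocessor_config(raw: dict) -> dict:
    """Return the image-preprocessing fields from preprocessor_config.json,
    unwrapping the media_proc_cfg key that Kimi-K2.5 nests them under.
    Illustrative sketch, not the converter's actual API."""
    return raw.get("media_proc_cfg", raw)

# Both the nested (Kimi-K2.5) and flat (typical) layouts resolve:
nested = {"media_proc_cfg": {"image_mean": [0.5, 0.5, 0.5]}}
flat = {"image_mean": [0.48, 0.46, 0.41]}
print(load_preprocessor_config(nested)["image_mean"])  # [0.5, 0.5, 0.5]
print(load_preprocessor_config(flat)["image_mean"])    # [0.48, 0.46, 0.41]
```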

This PR does not include the "hacked" Q4_0 changes by @jukofyork, referred to in this comment.

I have added a first pass at vision support, heavily aided by LLM assistance. I entirely expect @ngxson to tear it to shreds or call me a dummy and show me an easier way to add that vision support :)

Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_MM_INP_NORM for new mm_projector.pre_norm key
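The tensor-mapping commits above amount to prefix renames from the Hugging Face names to GGUF names. A hypothetical sketch in the style of gguf-py's TensorNameMap, where the HF-side prefixes come from the PR description but the GGUF-side names are illustrative placeholders, not the actual constants:

```python
# HF-name prefix -> GGUF-name prefix (GGUF side is illustrative only)
KIMI_K25_MMPROJ_MAP = {
    "mm_projector.proj": "mm.model.fc",       # mapped via V_MMPROJ
    "mm_projector.pre_norm": "mm.input_norm", # mapped via the pre-norm tensor
}

def map_tensor_name(hf_name: str) -> str:
    """Rewrite an HF tensor name to its GGUF equivalent by prefix match,
    preserving the suffix (e.g. '.weight', '.bias')."""
    for prefix, gguf_prefix in KIMI_K25_MMPROJ_MAP.items():
        if hf_name == prefix or hf_name.startswith(prefix + "."):
            return gguf_prefix + hf_name[len(prefix):]
    raise KeyError(f"unmapped tensor name: {hf_name}")
```

For example, `map_tensor_name("mm_projector.proj.weight")` yields `"mm.model.fc.weight"` under these placeholder names.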

loci-review bot commented Feb 1, 2026

Overview

Analysis of 115,396 functions (35 modified, 69 new, 4 removed) across 15 binaries shows minimal performance impact from adding Kimi-K2.5 vision model support. Only build.bin.libmtmd.so exhibits a measurable change, with a +0.77% power consumption increase (180,399.2 nJ vs 179,022.4 nJ). All other binaries remain unchanged:

  • build.bin.libllama.so: 249,105.8 nJ
  • build.bin.libggml-cpu.so: 157,685.9 nJ
  • build.bin.libggml-base.so: 73,208.7 nJ
  • build.bin.llama-tts: 360,000.0 nJ
  • build.bin.llama-cvector-generator: 354,510.6 nJ
  • build.bin.llama-bench: 60,119.5 nJ
  • build.bin.llama-quantize: 43,714.7 nJ
  • build.bin.llama-gguf-split: 40,060.0 nJ
  • build.bin.llama-tokenize: 38,524.7 nJ
  • build.bin.libggml.so: 5,124.4 nJ
  • CLI tools: 277.2 nJ each

Core inference libraries show zero performance change, confirming no impact on critical paths.

Function Analysis

Six static initialization functions show expected startup overhead increases of 450-560ns response time (+1.9% to +2.4%) and 42-52ns throughput time (+10.1% to +12.5%) from adding one PROJECTOR_TYPE_KIMIK25 map entry. These are one-time costs during program startup with zero runtime impact.

STL utility functions demonstrate compiler optimization benefits: _M_const_cast improved response time by 68.4% (-181ns), begin by 51.2% (-88ns), and clip_projector_type_from_string by 11.8% (-92ns) despite gaining functionality. string_replace_all likewise improved response time by 7.8% (-27ns) from compiler optimizations.

All changes occur in non-critical model loading and initialization code. Matrix operations, attention mechanisms, KV cache management, and token generation paths remain completely unmodified.

Additional Findings

Zero impact on GPU backends (CUDA, Metal, HIP, Vulkan, SYCL) and inference hot paths. The 0.77% power increase in libmtmd.so represents negligible one-time startup cost, fully justified by new model architecture support. Changes demonstrate effective isolation with no propagation to performance-critical inference operations.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.


loci-review bot commented Feb 1, 2026

Overview

Analysis of 115,396 functions across 14 binaries reveals minimal performance impact from Kimi-K2.5 multimodal model support addition. Modified: 35 functions (0.03%), New: 69 (0.06%), Removed: 4 (0.003%), Unchanged: 115,288 (99.91%).

Power Consumption Changes:

  • build.bin.libmtmd.so: +1,973 nJ (+1.1%) - only affected binary
  • build.bin.libllama.so: +0.72 nJ (0.0%)
  • build.bin.llama-tts: -0.57 nJ (0.0%)
  • build.bin.llama-cvector-generator: -0.61 nJ (0.0%)
  • build.bin.llama-bench: 0 nJ (0.0%)
  • build.bin.llama-tokenize: 0 nJ (0.0%)
  • build.bin.llama-quantize: 0 nJ (0.0%)
  • build.bin.llama-qwen2vl-cli: 0 nJ (0.0%)
  • build.bin.libggml.so: 0 nJ (0.0%)
  • build.bin.libggml-base.so: 0 nJ (0.0%)
  • build.bin.libggml-cpu.so: 0 nJ (0.0%)
  • build.bin.llama-gemma3-cli: 0 nJ (0.0%)
  • build.bin.llama-gguf-split: 0 nJ (0.0%)
  • build.bin.llama-llava-cli: 0 nJ (0.0%)
  • build.bin.llama-minicpmv-cli: 0 nJ (0.0%)

Function Analysis

Static Initialization Functions (5 functions in cogvlm.cpp, glm4v.cpp, internvl.cpp, minicpmv.cpp, clip.cpp):

  • __static_initialization_and_destruction_0 variants show consistent regressions
  • Response time: +549-558ns (+2.33-2.37%)
  • Throughput time: +44-53ns (+10.36-12.80%)
  • Cause: Added PROJECTOR_TYPE_KIMIK25 entry to PROJECTOR_TYPE_NAMES static map (30→31 entries)
  • Context: One-time startup initialization before main(), zero inference impact

STL Utility Functions (improvements despite no code changes):

  • std::_Rb_tree_const_iterator::_M_const_cast: -181ns response (-68.4%), -182ns throughput (-74.0%)
  • std::vector<llava_uhd::slice_coordinates>::begin: -88ns response (-51.2%), -88ns throughput (-58.5%)
  • std::vector<mobilenetv5_block>::_M_check_len: -55ns response (-3.5%), -56ns throughput (-26.0%)
  • Cause: Compiler optimizations (GCC 13, ARM64)

String Processing Functions:

  • clip_projector_type_from_string: -92ns response (-11.8%), -92ns throughput (-38.6%)
  • string_replace_all: -27ns response (-7.8%), -27ns throughput (-10.6%)
  • Cause: Compiler optimizations in string operations

Other analyzed functions showed negligible changes.

Additional Findings

Changes are architecturally isolated to MTMD (multimodal) subsystem. Core inference libraries (libllama.so, libggml.so, all GPU backends) show zero change, confirming no impact on performance-critical paths: matrix operations, attention computation, KV cache management, or token generation. The 1.1% power increase in libmtmd.so represents ~250ns one-time initialization overhead, fully offset by ~329ns in compiler optimization gains across other functions. Net startup performance is neutral to positive. Five commits added Kimi-K2.5 vision model support with proper stability fixes (assert crash fix, selective revert for backward compatibility). Implementation follows established patterns for projector type additions.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 19 times, most recently from 40ccb9a to d9cffb7 on February 2, 2026 08:22
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 on February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a92fe2a to 6495042 on February 27, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from ef246cc to 8c889a6 on March 2, 2026 02:17