
Conversation

@DajanaV (Collaborator) commented Nov 13, 2025

Mirrored from ggml-org/llama.cpp#17251

The implementation might support Kimi-K2-Instruct too, but I don't have enough disk space to test it right now :(

This is almost a verbatim copy-paste of the DeepSeek V3.1 implementation (ggml-org/llama.cpp#15533), modified according to https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/tool_call_guidance.md: it matches the function id instead of the plain function name.
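
For reference, below is a minimal sketch (not the actual chat.cpp code) of the id-matching difference: Kimi-K2 identifies a tool call with an id of the form functions.{name}:{index}, so the parser extracts the function name from the id instead of matching a bare name the way the DeepSeek V3.1 parser does. The id shape follows the tool_call_guidance doc; everything else here is illustrative.

```cpp
// Illustrative only: pull the function name and call index out of a
// Kimi-K2 tool-call id such as "functions.get_weather:0".
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string id = "functions.get_weather:0";
    // Kimi-K2 ids look like "functions.<name>:<index>"; DeepSeek V3.1
    // emits the plain function name instead.
    static const std::regex id_re(R"(functions\.([\w.\-]+):(\d+))");
    std::smatch m;
    if (std::regex_match(id, m, id_re)) {
        std::cout << "name = " << m[1] << ", index = " << m[2] << "\n";
    }
    return 0;
}
```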

Considerations:

  1. The official template does not contain any <think> tag at the end, so thinking_forced_open is false. Should we test it by modifying the template manually?
  2. I did not add a template update instruction to models/templates/README.md for now, because their template contains tojson(separators=(',', ':')). Although that separators value is the same as the default, it must be removed for the template to work with minja.
  3. It might be possible for DeepSeek V3.1 to generate <|tool▁calls▁begin|>tool... while skipping <|tool▁call▁begin|>, but I have not observed such behavior in Kimi-K2-Thinking: I always get <|tool_calls_section_begin|><|tool_call_begin|>. I am therefore removing the ? in the function regex (see the sketch after this list).
    https://github.com/ggml-org/llama.cpp/blob/c4abcb2457217198efdd67d02675f5fddb7071c2/common/chat.cpp#L1751
    In fact, keeping the ? always produced an extra <|tool_calls_section_end|> that I was not able to fix, so I removed the ? in the end.
  4. I have not tested lower-quantized variants; they might behave differently and require adapting the current parser.
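
As mentioned in (3), here is a hedged sketch of the regex difference. The token strings follow Kimi-K2's output format, but the patterns are illustrative; the real ones live in common/chat.cpp and go through the parser's own regex machinery.

```cpp
// Illustrative only: the effect of the optional "?" from consideration (3).
// Kimi-K2-Thinking always emits the section marker before the first call,
// so the strict pattern is enough; the lenient DeepSeek-style pattern left
// an unconsumed <|tool_calls_section_end|> behind in my tests.
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string out =
        "<|tool_calls_section_begin|><|tool_call_begin|>"
        "functions.get_weather:0<|tool_call_argument_begin|>"
        "{\"city\":\"Beijing\"}<|tool_call_end|><|tool_calls_section_end|>";

    // Strict: section marker required (what this PR uses).
    static const std::regex strict(
        R"(<\|tool_calls_section_begin\|><\|tool_call_begin\|>)");
    // Lenient: section marker optional, i.e. with the "?" (DeepSeek V3.1 style).
    static const std::regex lenient(
        R"((?:<\|tool_calls_section_begin\|>)?<\|tool_call_begin\|>)");

    std::cout << "strict matches:  " << std::regex_search(out, strict)  << "\n";
    std::cout << "lenient matches: " << std::regex_search(out, lenient) << "\n";
    return 0;
}
```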

For maintainers: I may have a busy weekend, so feel free to edit directly if I'm not able to reply in time.

Closes #17155.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 4f972376 compared to baseline c8ed6535 reveals a significant performance regression in STL vector operations within the llama-cvector-generator binary, coinciding with the implementation of Kimi K2 tool calling format support.

Key Findings

Highest Performance Impact:

  • std::vector<common_chat_msg>::end() shows a 217% Response Time increase (82 ns → 261 ns) and a 299% Throughput degradation (60 ns → 239 ns)
  • This function is not part of the core inference pipeline (llama_decode, llama_encode, llama_tokenize), so tokens-per-second performance remains unaffected

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: +0.67% increase (329,915 nJ → 332,137 nJ)
  • build.bin.llama-run: +0.65% increase (282,849 nJ → 284,693 nJ)
  • build.bin.llama-tts: +0.39% increase (339,098 nJ → 340,418 nJ)
  • All other binaries show no measurable change

Root Cause Analysis:

  • Flame Graph: Reveals __stack_chk_fail activation (8 ns overhead), indicating that stack protection is triggering during vector operations
  • CFG Comparison: Shows compiler optimization regression where function prologue was split into two basic blocks, adding unnecessary branching overhead
  • Code Review: The performance regression is not caused by the Kimi K2 implementation but appears to be a compiler optimization artifact

Technical Details:
The degraded function exhibits stack buffer overflow protection activation and suboptimal code generation with additional unconditional branching, suggesting build configuration changes rather than functional code issues.
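
For context on __stack_chk_fail: the stack protector stores a canary in the function prologue and re-checks it in the epilogue, and that extra compare-and-branch is the overhead visible in the flame graph. A rough C++ approximation is below; read_canary and report_failure are stand-ins, as real compilers read the canary from thread-local storage and call __stack_chk_fail directly.

```cpp
// Schematic approximation of a stack-protector prologue/epilogue.
// Not llama.cpp code; read_canary/report_failure are illustrative stand-ins.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static uintptr_t fake_tls_canary = 0xDEADBEEF; // stand-in for the TLS canary slot

static uintptr_t read_canary() { return fake_tls_canary; }

[[noreturn]] static void report_failure() {    // stand-in for __stack_chk_fail
    std::fputs("stack smashing detected\n", stderr);
    std::abort();
}

void copy_name(const char * src) {
    uintptr_t canary = read_canary();  // prologue: stash the canary on the stack
    char buf[16];
    std::strncpy(buf, src, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    std::printf("%s\n", buf);
    if (canary != read_canary()) {     // epilogue: the extra compare + branch
        report_failure();              // reported as __stack_chk_fail activation
    }
}

int main() { copy_name("hello"); return 0; }
```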

Actionable Recommendations

  1. Investigate Build Configuration: Compare compiler optimization flags between versions to identify changes affecting STL code generation
  2. Validate Compiler Settings: Ensure consistent optimization levels and verify no unintended stack protection modifications
  3. Monitor Vector Operations: Profile other vector-intensive operations in chat message processing for similar regressions

The Kimi K2 functionality itself is well-implemented and isolated from core inference paths, with the performance issue stemming from build system changes rather than the new feature implementation.

@DajanaV force-pushed the main branch 22 times, most recently from 4ab2d66 to 4fb52c0 on November 16, 2025 19:06
@DajanaV force-pushed the main branch 8 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10
@loci-dev force-pushed the main branch 14 times, most recently from 8457f25 to a794ad0 on November 23, 2025 06:11