UPSTREAM PR #19249: support infill for Falcon-H1-Tiny-Coder#1123

Open
loci-dev wants to merge 1 commit into main from loci/pr-19249-infill-falcon-h1

Conversation

@loci-dev loci-dev commented Feb 1, 2026

Note

Source pull request: ggml-org/llama.cpp#19249

Added the FIM tokens used by Falcon-H1-Tiny-Coder (see https://huggingface.co/tiiuae/Falcon-H1-Tiny-Coder-90M-GGUF#usage) so that the llama-server POST /infill endpoint works with this model.
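For context, a minimal sketch of what a client sends to the /infill endpoint once FIM support is in place. The server wraps `input_prefix`/`input_suffix` with the model's FIM tokens itself, so the client supplies only raw text; the host/port and the sample snippet below are assumptions, not part of this PR.

```python
import json

def build_infill_payload(prefix: str, suffix: str, n_predict: int = 64) -> str:
    """Assemble the JSON body for llama-server's POST /infill handler."""
    payload = {
        "input_prefix": prefix,   # code before the cursor
        "input_suffix": suffix,   # code after the cursor
        "n_predict": n_predict,   # cap on generated tokens
    }
    return json.dumps(payload)

body = build_infill_payload("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
# POST this body to e.g. http://localhost:8080/infill (default port is an assumption):
#   curl -X POST http://localhost:8080/infill -d '<body>'
```

The model then generates the "middle" text that fits between prefix and suffix, which is what the added FIM vocabulary tokens enable for Falcon-H1-Tiny-Coder.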


loci-review bot commented Feb 1, 2026

Overview

Analysis of llama.cpp across 115,331 functions (9 modified, 4 new, 4 removed, 115,314 unchanged) reveals negligible performance impact from a single commit adding Falcon-H1-Tiny-Coder FIM vocabulary tokens.

Power Consumption Changes:

  • build.bin.libllama.so: +0.03% (+75.13 nJ)
  • build.bin.llama-cvector-generator: +0.00%
  • build.bin.llama-tts: +0.00%
  • build.bin.libmtmd.so: +0.00%
  • build.bin.llama-tokenize: 0.00%
  • build.bin.llama-quantize: 0.00%
  • build.bin.llama-qwen2vl-cli: 0.00%
  • build.bin.libggml-base.so: 0.00%
  • build.bin.libggml-cpu.so: 0.00%
  • build.bin.libggml.so: 0.00%
  • build.bin.llama-gguf-split: 0.00%
  • build.bin.llama-llava-cli: 0.00%
  • build.bin.llama-minicpmv-cli: 0.00%
  • build.bin.llama-gemma3-cli: 0.00%
  • build.bin.llama-bench: 0.00%

System-wide power consumption increased by 0.005% (+78.52 nJ).

Function Analysis

All measurable changes affect C++ standard library functions, not llama.cpp application code:

_M_swap_data (std::vector internal, build.bin.libllama.so): Response time improved 243.10ns → 169.67ns (-30.2%), throughput time 150.81ns → 77.38ns (-48.7%). Compiler optimization improvement in vector swap operations used during regex tokenization.

operator+ (iterator arithmetic, build.bin.libllama.so): Response time regressed 141.11ns → 177.26ns (+25.6%), throughput time 119.84ns → 155.99ns (+30.2%). Used in BPE tokenization paths with llm_symbol vectors. Regression likely from compiler version differences.

_M_insert_character_class_matcher (regex compiler, build.bin.libllama.so): Response time 27,224.06ns → 27,319.64ns (+0.35%), throughput time 251.63ns → 347.50ns (+38.1%). A significant self-time regression with negligible overall impact: this function runs only during initialization (argument parsing and template loading), not in inference paths.

Source code changes (vocabulary token additions in llama-vocab.cpp) do not directly affect these standard library functions. Performance variations stem from build environment differences rather than code modifications.

Additional Findings

Core inference operations remain unchanged: matrix multiplication (70-90% of inference time), attention computation, KV cache management, and the GPU acceleration backends show zero modification. The vocabulary addition is purely additive and does not touch performance-critical paths. Tokenization may see a 1-2% regression in worst-case scenarios, which translates to <0.1% overall inference impact.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from cbda11a to 03fef13 Compare February 3, 2026 00:46
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a92fe2a to 6495042 Compare February 27, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 4298c74 to 0db6c47 Compare March 7, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 8019888 to 17452e3 Compare March 9, 2026 02:17