UPSTREAM PR #19249: support infill for Falcon-H1-Tiny-Coder #1123
Conversation
Overview

Analysis of llama.cpp across 115,331 functions (9 modified, 4 new, 4 removed, 115,314 unchanged) reveals negligible performance impact from a single commit adding Falcon-H1-Tiny-Coder FIM vocabulary tokens.

Power Consumption Changes

System-wide power consumption increased by 0.005% (+78.52 nJ).

Function Analysis

All measurable changes affect C++ standard library functions, not llama.cpp application code:

- _M_swap_data (std::vector internal, build.bin.libllama.so): response time improved 243.10 ns → 169.67 ns (-30.2%), throughput time 150.81 ns → 77.38 ns (-48.7%). A compiler optimization improvement in the vector swap operations used during regex tokenization.
- operator+ (iterator arithmetic, build.bin.libllama.so): response time regressed 141.11 ns → 177.26 ns (+25.6%), throughput time 119.84 ns → 155.99 ns (+30.2%). Used in BPE tokenization paths with llm_symbol vectors; the regression likely stems from compiler version differences.
- _M_insert_character_class_matcher (regex compiler, build.bin.libllama.so): response time 27,224.06 ns → 27,319.64 ns (+0.35%), throughput time 251.63 ns → 347.50 ns (+38.1%). A significant self-time regression with negligible overall impact; this function runs only during initialization for argument parsing and template loading, not in inference paths.

The source code changes (vocabulary token additions in llama-vocab.cpp) do not directly affect these standard library functions; the performance variations stem from build environment differences rather than code modifications.

Additional Findings

Core inference operations remain unchanged: matrix multiplication (70-90% of inference time), attention computation, KV cache management, and GPU acceleration backends show zero modification. The vocabulary addition is purely additive and does not touch performance-critical paths. Tokenization may see a 1-2% regression in worst-case scenarios, representing <0.1% overall inference impact.

🔎 Full breakdown: Loci Inspector.
Force-pushed: cbda11a to 03fef13
Force-pushed: 823244c to bab7d39
Force-pushed: a92fe2a to 6495042
Force-pushed: 4298c74 to 0db6c47
Force-pushed: 8019888 to 17452e3
Note
Source pull request: ggml-org/llama.cpp#19249
Added the FIM tokens used by Falcon-H1-Tiny-Coder (see https://huggingface.co/tiiuae/Falcon-H1-Tiny-Coder-90M-GGUF#usage) to make the llama-server POST /infill handler work.
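As a sketch of what this change enables, the snippet below builds a request body for llama-server's POST /infill endpoint, which generates the code between a prefix and a suffix using the model's FIM tokens. The field names `input_prefix`, `input_suffix`, and `n_predict` follow the llama-server examples; verify them against the server README for your build, as this is an illustration rather than the PR's own test.

```python
import json

def build_infill_request(prefix: str, suffix: str, n_predict: int = 64) -> str:
    """Build the JSON body for a llama-server POST /infill request.

    The server tokenizes the prefix and suffix, wraps them in the model's
    FIM vocabulary tokens (the ones this PR adds for Falcon-H1-Tiny-Coder),
    and generates the middle portion.
    """
    body = {
        "input_prefix": prefix,    # code before the cursor
        "input_suffix": suffix,    # code after the cursor
        "n_predict": n_predict,    # cap on generated middle tokens
    }
    return json.dumps(body)

# Example: ask the model to fill in the body of a small function.
payload = build_infill_request("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
print(payload)
```

The resulting payload would then be sent to a running server, e.g. `curl http://localhost:8080/infill -d "$payload"`, assuming llama-server was started with a Falcon-H1-Tiny-Coder GGUF model.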