UPSTREAM PR #19460: model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) #1164
Overview

Analysis of 115,605 functions across 14 binaries revealed 46 modified functions (0.04%) with a neutral to slightly positive performance impact. Power consumption decreased 0.021% in build.bin.libllama.so (252,210.46 nJ → 252,158.25 nJ), while all other binaries showed zero measurable change:

- build.bin.llama-tts (361,514 nJ)
- build.bin.llama-cvector-generator (356,031 nJ)
- build.bin.libmtmd.so (179,023 nJ)
- build.bin.libggml-base.so (73,290 nJ)
- build.bin.libggml-cpu.so (157,834 nJ)
- build.bin.libggml.so (5,124 nJ)
- build.bin.llama-gemma3-cli (277 nJ)
- build.bin.llama-gguf-split (40,087 nJ)
- build.bin.llama-llava-cli (277 nJ)
- build.bin.llama-minicpmv-cli (277 nJ)
- build.bin.llama-quantize (43,735 nJ)
- build.bin.llama-tokenize (38,552 nJ)
- build.bin.llama-qwen2vl-cli (277 nJ)
- build.bin.llama-bench (60,106 nJ)

Zero functions were added or removed; 115,559 were unchanged.

Function Analysis

Nine functions improved significantly:

- std::_Rb_tree::_S_key(): -75.7% throughput time (-186.5 ns)
- std::_Rb_tree::_M_const_cast(): -74.0% (-181.5 ns)
- std::make_move_iterator(): -68.4% (-168.5 ns)
- std::make_error_condition(): -63.2% (-187.2 ns)
- std::__new_allocator::deallocate(): -49.1% (-21.8 ns)
- std::_Hashtable_alloc::_M_allocate_buckets(): -38.2% (-68.5 ns)
- std::_Rb_tree::find(): -37.1% (-62.7 ns)
- llama_grammar_accept(): -36.4% throughput (-99.1 ns)
- std::vector<llama_layer>::operator[]: -33.5% (-6.9 ns)

Two functions regressed: std::vector<wchar_t>::end() at +306.7% throughput (+183.3 ns, Windows initialization only) and std::function<bool(char)>::operator=() at +109.9% (+85.7 ns, Jinja template parsing only). The improvements stem from compiler optimizations and the removal of sanitizer flags, while the regressions affect non-critical initialization paths. The vector layer accessor improvement directly benefits inference loops (5-11 μs per token). Other analyzed functions showed negligible changes.

Additional Findings

The 60 commits added support for three new model architectures (Kimi-Linear with MLA, GLM-DSA, Step3.5-Flash) and delivered extensive GPU backend improvements: 24 commits across the Vulkan, Metal, CUDA, SYCL, WebGPU, and VirtGPU backends. Flash Attention optimizations (FP16 accumulators, mask preprocessing, spec constants) and small-batch CUDA optimizations provide estimated 10-20% throughput gains in GPU-accelerated attention workloads. Critical bug fixes addressed non-contiguous RoPE (CUDA, Vulkan), MSVC regex undefined behavior, and multi-GPU device enumeration. No changes were detected in inference hot paths (GEMM, attention kernels, KV cache operations, quantization kernels), confirming performance stability in critical areas.

🔎 Full breakdown: Loci Inspector.
Force-pushed from f998d1f to 30ef9d0
Note
Source pull request: ggml-org/llama.cpp#19460
Ref upstream vllm PR: vllm-project/vllm#34124
Important
This PR allows converting safetensors to GGUF while keeping the indexer tensors (for DeepSeek sparse attention), but they are left unused by the C++ code, so output quality will be suboptimal for now (see the sketch below this note).
Support for the indexer tensors will come in a follow-up PR. The GGUF will NOT need to be regenerated.
The arch should be exactly the same as GlmMoeLite (aka GLM 4.7 Flash, PR: ggml-org/llama.cpp#18936), but I'm taking the time to properly move it to a new arch while preserving the MTP tensors.
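For illustration only, here is a minimal Python sketch of the conversion-side idea described above: indexer tensors are passed through into the GGUF under their own names, even though the C++ inference code ignores them for now. The tensor names, the map_tensor_name helper, and the blk. rewrite are hypothetical placeholders, not the actual convert_hf_to_gguf.py implementation.

```python
# Hypothetical sketch (NOT the real convert_hf_to_gguf.py code) of the idea:
# indexer tensors are written into the GGUF so the file never has to be
# regenerated, even though the C++ side does not read them yet.

INDEXER_MARKERS = ("self_attn.indexer.",)  # assumed naming; real names may differ

def map_tensor_name(hf_name: str) -> str:
    """Map a safetensors tensor name to a GGUF name, keeping indexer tensors."""
    if any(marker in hf_name for marker in INDEXER_MARKERS):
        # Keep the tensor as-is: stored in the GGUF, simply unused at inference.
        return hf_name
    # Normal GLM MoE mapping would go here; this rewrite is purely illustrative.
    return hf_name.replace("model.layers.", "blk.")

if __name__ == "__main__":
    for name in (
        "model.layers.0.self_attn.indexer.weights_proj.weight",
        "model.layers.0.mlp.experts.0.up_proj.weight",
    ):
        print(f"{name} -> {map_tensor_name(name)}")
```

The only point this sketch tries to convey is that the conversion keeps the tensors in the file, so a later llama.cpp release can start using them without requiring a re-conversion.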
Testing
Because the model is not public, I tried using GLM 4.7 Flash as the test subject.
From my tests, compare-logprobs.py reports 0.0 difference between the two.
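For context, a check of this kind boils down to comparing per-token logprobs from two runs. The sketch below assumes the logprobs have been dumped to two JSONL files with one {"logprob": ...} object per line; that format and the function names are assumptions for illustration, not necessarily compare-logprobs.py's actual interface.

```python
# Illustrative sketch of a logprob comparison between two runs: load per-token
# logprobs from two JSONL dumps and report the largest absolute difference.
import json
import sys

def load_logprobs(path: str) -> list[float]:
    """Read one logprob per line from a JSONL dump ({"logprob": ...})."""
    with open(path) as f:
        return [json.loads(line)["logprob"] for line in f]

def max_abs_diff(a: list[float], b: list[float]) -> float:
    assert len(a) == len(b), "token counts differ between the two runs"
    return max(abs(x - y) for x, y in zip(a, b))

if __name__ == "__main__":
    ref_path, new_path = sys.argv[1], sys.argv[2]
    diff = max_abs_diff(load_logprobs(ref_path), load_logprobs(new_path))
    print(f"max |logprob difference| = {diff}")
```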