UPSTREAM PR #19460: model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) #1164
Overview

Analysis of 115,605 functions across 14 binaries revealed 46 modified functions (0.04%) with a neutral to slightly positive performance impact. Power consumption decreased 0.021% in build.bin.libllama.so (252,210.46 nJ → 252,158.25 nJ), while all other binaries showed zero measurable change:

- build.bin.llama-tts (361,514 nJ)
- build.bin.llama-cvector-generator (356,031 nJ)
- build.bin.libmtmd.so (179,023 nJ)
- build.bin.libggml-base.so (73,290 nJ)
- build.bin.libggml-cpu.so (157,834 nJ)
- build.bin.libggml.so (5,124 nJ)
- build.bin.llama-gemma3-cli (277 nJ)
- build.bin.llama-gguf-split (40,087 nJ)
- build.bin.llama-llava-cli (277 nJ)
- build.bin.llama-minicpmv-cli (277 nJ)
- build.bin.llama-quantize (43,735 nJ)
- build.bin.llama-tokenize (38,552 nJ)
- build.bin.llama-qwen2vl-cli (277 nJ)
- build.bin.llama-bench (60,106 nJ)

Zero functions were added or removed; 115,559 were unchanged.

Function Analysis

Nine functions improved significantly:

- std::_Rb_tree::_S_key(): -75.7% throughput time (-186.5 ns)
- std::_Rb_tree::_M_const_cast(): -74.0% (-181.5 ns)
- std::make_move_iterator(): -68.4% (-168.5 ns)
- std::make_error_condition(): -63.2% (-187.2 ns)
- std::__new_allocator::deallocate(): -49.1% (-21.8 ns)
- std::_Hashtable_alloc::_M_allocate_buckets(): -38.2% (-68.5 ns)
- std::_Rb_tree::find(): -37.1% (-62.7 ns)
- llama_grammar_accept(): -36.4% throughput (-99.1 ns)
- std::vector<llama_layer>::operator[]: -33.5% (-6.9 ns)

Two functions regressed: std::vector<wchar_t>::end() at +306.7% throughput (+183.3 ns, Windows initialization only) and std::function<bool(char)>::operator=() at +109.9% (+85.7 ns, Jinja template parsing only). The improvements stem from compiler optimizations and the removal of sanitizer flags, while the regressions affect non-critical initialization paths. The vector layer accessor improvement directly benefits inference loops (5-11 μs per token). Other analyzed functions showed negligible changes.

Additional Findings

The 60 commits added support for three new model architectures (Kimi-Linear with MLA, GLM-DSA, Step3.5-Flash) and delivered extensive GPU backend improvements: 24 commits across the Vulkan, Metal, CUDA, SYCL, WebGPU, and VirtGPU backends. Flash Attention optimizations (FP16 accumulators, mask preprocessing, spec constants) and small-batch CUDA optimizations provide estimated 10-20% throughput gains in GPU-accelerated attention workloads. Critical bug fixes addressed non-contiguous RoPE (CUDA, Vulkan), MSVC regex undefined behavior, and multi-GPU device enumeration. No changes were detected in inference hot paths (GEMM, attention kernels, KV cache operations, quantization kernels), confirming performance stability in critical areas.

🔎 Full breakdown: Loci Inspector.
Force-pushed from f998d1f to 30ef9d0
Note
Source pull request: ggml-org/llama.cpp#19460
Ref upstream vllm PR: vllm-project/vllm#34124
Important
This PR allows converting safetensors to GGUF while keeping the indexer tensors (for DeepSeek sparse attention), but they are left unused by the C++ code, so output quality will be suboptimal for now (see the sketch below this note).
Support for the indexer tensors will come in a follow-up PR. The GGUF will NOT need to be regenerated.
The arch should be exactly the same as GlmMoeLite (aka GLM 4.7 Flash, PR: ggml-org/llama.cpp#18936), but I'm taking the time to properly move it to a new arch while preserving the MTP tensors.
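For illustration only, here is a minimal Python sketch of the conversion-side idea described above: indexer tensors are passed through into the GGUF under their own names, even though the C++ inference code ignores them for now. The tensor names, the map_tensor_name helper, and the blk. rewrite are hypothetical placeholders, not the actual convert_hf_to_gguf.py implementation.

```python
# Hypothetical sketch (NOT the real convert_hf_to_gguf.py code) of the idea:
# indexer tensors are written into the GGUF so the file never has to be
# regenerated, even though the C++ side does not read them yet.

INDEXER_MARKERS = ("self_attn.indexer.",)  # assumed naming; real names may differ

def map_tensor_name(hf_name: str) -> str:
    """Map a safetensors tensor name to a GGUF name, keeping indexer tensors."""
    if any(marker in hf_name for marker in INDEXER_MARKERS):
        # Keep the tensor as-is: stored in the GGUF, simply unused at inference.
        return hf_name
    # Normal GLM MoE mapping would go here; this rewrite is purely illustrative.
    return hf_name.replace("model.layers.", "blk.")

if __name__ == "__main__":
    for name in (
        "model.layers.0.self_attn.indexer.weights_proj.weight",
        "model.layers.0.mlp.experts.0.up_proj.weight",
    ):
        print(f"{name} -> {map_tensor_name(name)}")
```

The only point this sketch tries to convey is that the conversion keeps the tensors in the file, so a later llama.cpp release can start using them without requiring a re-conversion.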
Testing
Because the model is not public, I tried using GLM 4.7 Flash as the test subject.
From my tests, compare-logprobs.py reports 0.0 difference between the two.
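For context, a check of this kind boils down to comparing per-token logprobs from two runs. The sketch below assumes the logprobs have been dumped to two JSONL files with one {"logprob": ...} object per line; that format and the function names are assumptions for illustration, not necessarily compare-logprobs.py's actual interface.

```python
# Illustrative sketch of a logprob comparison between two runs: load per-token
# logprobs from two JSONL dumps and report the largest absolute difference.
import json
import sys

def load_logprobs(path: str) -> list[float]:
    """Read one logprob per line from a JSONL dump ({"logprob": ...})."""
    with open(path) as f:
        return [json.loads(line)["logprob"] for line in f]

def max_abs_diff(a: list[float], b: list[float]) -> float:
    assert len(a) == len(b), "token counts differ between the two runs"
    return max(abs(x - y) for x, y in zip(a, b))

if __name__ == "__main__":
    ref_path, new_path = sys.argv[1], sys.argv[2]
    diff = max_abs_diff(load_logprobs(ref_path), load_logprobs(new_path))
    print(f"max |logprob difference| = {diff}")
```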