
UPSTREAM PR #19460: model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) #1164

Open

loci-dev wants to merge 6 commits into main from loci/pr-19460-xsn-glm_dsa

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19460

Ref upstream vllm PR: vllm-project/vllm#34124

Important

This PR allows converting safetensors to GGUF while keeping the indexer tensors (for DeepSeek sparse attention), but they are left unused by the cpp code, so output quality will be suboptimal.
Support for the indexer tensors will be added in a follow-up PR; the GGUF will NOT need to be regenerated.
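A quick way to confirm that the indexer tensors survive the conversion is to list the tensor names in the resulting GGUF. A minimal sketch, assuming the gguf-py package that ships with llama.cpp is installed and that the indexer tensor names contain the substring "indexer" (the exact naming scheme is an assumption):

```python
# Sketch: list tensors in a converted GGUF and flag likely indexer tensors.
# Assumes `pip install gguf` (the gguf-py package from llama.cpp) and that
# indexer tensor names contain "indexer" -- the actual naming is an assumption.
from gguf import GGUFReader

reader = GGUFReader("glm-moe-dsa.gguf")  # hypothetical output path
indexer_tensors = [t.name for t in reader.tensors if "indexer" in t.name]

print(f"total tensors: {len(reader.tensors)}")
print(f"indexer tensors kept: {len(indexer_tensors)}")
for name in indexer_tensors:
    print(" ", name)
```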

The arch should be exactly the same as GlmMoeLite (aka GLM 4.7 Flash, PR: ggml-org/llama.cpp#18936), but I'm taking the time to properly move it to a new arch while preserving the MTP tensors.

Testing

Because the model is not public, I tried using GLM 4.7 Flash as the test subject.

  1. Download https://huggingface.co/zai-org/GLM-4.7-Flash
  2. Change config.json: Glm4MoeLiteForCausalLM --> GlmMoeDsaForCausalLM (a sketch of this edit follows below)
  3. Convert it to GGUF
  4. Test against the "normal" version of GLM 4.7 Flash GGUF (the one with deepseek2 arch)

From my tests, compare-logprobs.py reports a difference of 0.0 between the two.
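For step 2, swapping the architecture name is a one-field edit in config.json. A minimal sketch, assuming the model was downloaded into a local GLM-4.7-Flash/ directory (the path is illustrative):

```python
# Sketch: point config.json at GlmMoeDsaForCausalLM so the converter treats
# GLM 4.7 Flash as a GLM MoE DSA model (paths are illustrative).
import json
from pathlib import Path

config_path = Path("GLM-4.7-Flash/config.json")  # hypothetical local checkout
config = json.loads(config_path.read_text())

# Replace the original class name with the DSA one wherever it appears.
config["architectures"] = [
    "GlmMoeDsaForCausalLM" if arch == "Glm4MoeLiteForCausalLM" else arch
    for arch in config.get("architectures", [])
]

config_path.write_text(json.dumps(config, indent=2))
print("architectures:", config["architectures"])
```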

@loci-review

loci-review bot commented Feb 12, 2026

Overview

Analysis of 115,605 functions across 14 binaries revealed 46 modified functions (0.04%) with neutral to slightly positive performance impact. Power consumption decreased 0.021% in build.bin.libllama.so (252,210.46 nJ → 252,158.25 nJ), while all other binaries showed zero measurable change: build.bin.llama-tts (361,514 nJ), build.bin.llama-cvector-generator (356,031 nJ), build.bin.libmtmd.so (179,023 nJ), build.bin.libggml-base.so (73,290 nJ), build.bin.libggml-cpu.so (157,834 nJ), build.bin.libggml.so (5,124 nJ), build.bin.llama-gemma3-cli (277 nJ), build.bin.llama-gguf-split (40,087 nJ), build.bin.llama-llava-cli (277 nJ), build.bin.llama-minicpmv-cli (277 nJ), build.bin.llama-quantize (43,735 nJ), build.bin.llama-tokenize (38,552 nJ), build.bin.llama-qwen2vl-cli (277 nJ), and build.bin.llama-bench (60,106 nJ). Zero functions added or removed; 115,559 unchanged.

Function Analysis

Nine functions improved significantly: std::_Rb_tree::_S_key() showed -75.7% throughput time (-186.5ns), std::_Rb_tree::_M_const_cast() -74.0% (-181.5ns), std::make_move_iterator() -68.4% (-168.5ns), std::make_error_condition() -63.2% (-187.2ns), std::__new_allocator::deallocate() -49.1% (-21.8ns), std::_Hashtable_alloc::_M_allocate_buckets() -38.2% (-68.5ns), std::_Rb_tree::find() -37.1% (-62.7ns), llama_grammar_accept() -36.4% throughput (-99.1ns), and std::vector<llama_layer>::operator[] -33.5% (-6.9ns). Two functions regressed: std::vector<wchar_t>::end() +306.7% throughput (+183.3ns, Windows initialization only) and std::function<bool(char)>::operator=() +109.9% (+85.7ns, Jinja template parsing only). Improvements stem from compiler optimizations and removal of sanitizer flags, while regressions affect non-critical initialization paths. The vector layer accessor improvement directly benefits inference loops (5-11μs per token). Other analyzed functions showed negligible changes.

Additional Findings

The 60 commits added support for three new model architectures (Kimi-Linear with MLA, GLM-DSA, Step3.5-Flash) and delivered extensive GPU backend improvements: 24 commits across Vulkan, Metal, CUDA, SYCL, WebGPU, and VirtGPU backends. Flash Attention optimizations (FP16 accumulators, mask preprocessing, spec constants) and small-batch CUDA optimizations provide estimated 10-20% throughput gains in GPU-accelerated attention workloads. Critical bug fixes addressed non-contiguous RoPE (CUDA, Vulkan), MSVC regex undefined behavior, and multi-GPU device enumeration. No changes detected in inference hot paths (GEMM, attention kernels, KV cache operations, quantization kernels), confirming performance stability in critical areas.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 5 times, most recently from f998d1f to 30ef9d0 on February 16, 2026 at 02:17.