[fix][wip] GlmMoeDsa: try implement DSA #43912
JaredforReal wants to merge 9 commits into huggingface:main
Conversation
Known bug, as follows:

(transformers) root@develop-20260210102829-0bprk:/vllm-workspace/transformers# python test.py
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2559/2559 [02:22<00:00, 17.97it/s, Materializing param=model.norm.weight]
GlmMoeDsaForCausalLM LOAD REPORT from: /workspace/glm5-0210-fp8/
Key | Status | Details
--------------------------------------------------------------+------------+--------
model.layers.78.mlp.shared_experts.up_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.indexer.wk.weight | UNEXPECTED |
model.layers.78.self_attn.indexer.wk.weight_scale_inv | UNEXPECTED |
model.layers.78.hnorm.weight | UNEXPECTED |
model.layers.78.self_attn.indexer.k_norm.bias | UNEXPECTED |
model.layers.78.input_layernorm.weight | UNEXPECTED |
model.layers.78.self_attn.kv_b_proj.weight | UNEXPECTED |
model.layers.78.self_attn.q_b_proj.weight | UNEXPECTED |
model.layers.78.self_attn.q_a_layernorm.weight | UNEXPECTED |
model.layers.78.mlp.shared_experts.up_proj.weight | UNEXPECTED |
model.layers.78.eh_proj.weight | UNEXPECTED |
model.layers.78.mlp.shared_experts.gate_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.o_proj.weight | UNEXPECTED |
model.layers.78.self_attn.kv_a_proj_with_mqa.weight_scale_inv | UNEXPECTED |
model.layers.78.mlp.experts.down_proj_scale_inv | UNEXPECTED |
model.layers.78.self_attn.kv_a_layernorm.weight | UNEXPECTED |
model.layers.78.mlp.gate.e_score_correction_bias | UNEXPECTED |
model.layers.78.mlp.gate.weight | UNEXPECTED |
model.layers.78.self_attn.kv_b_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.indexer.wq_b.weight | UNEXPECTED |
model.layers.78.mlp.shared_experts.down_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.indexer.weights_proj.weight | UNEXPECTED |
model.layers.78.self_attn.indexer.k_norm.weight | UNEXPECTED |
model.layers.78.mlp.shared_experts.gate_proj.weight | UNEXPECTED |
model.layers.78.self_attn.q_b_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.q_a_proj.weight | UNEXPECTED |
model.layers.78.self_attn.q_a_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.self_attn.kv_a_proj_with_mqa.weight | UNEXPECTED |
model.layers.78.self_attn.o_proj.weight_scale_inv | UNEXPECTED |
model.layers.78.mlp.experts.gate_up_proj | UNEXPECTED |
model.layers.78.self_attn.indexer.wq_b.weight_scale_inv | UNEXPECTED |
model.layers.78.enorm.weight | UNEXPECTED |
model.layers.78.mlp.experts.down_proj | UNEXPECTED |
model.layers.78.mlp.shared_experts.down_proj.weight | UNEXPECTED |
model.layers.78.shared_head.norm.weight | UNEXPECTED |
model.layers.78.post_attention_layernorm.weight | UNEXPECTED |
model.layers.78.mlp.experts.gate_up_proj_scale_inv | UNEXPECTED |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Hi, introduce yourself
? true EM...
EB EM external? real and all
: new a an the to re ",?
local?
ernal a //?...
or all
false an or // or new
real in
an ".
local dark
in re or
//
the lost: else the or or
the.
a //
true // you else a as...
new...
or an.
://, E
an all
lost in //, self saidernal the.
then
to
and
? ":// the...
you // else
a reernal
or
(transformers) root@develop-20260210102829-0bprk:/vllm-workspace/transformers#
Pull request overview
This PR updates the GLM-MoE-DSA implementation to add a DSA indexer path and a rewritten MLA attention forward pass, alongside a config refactor to support new indexer parameters.
Changes:
- Add a single-tensor RoPE helper and introduce a standalone `GlmMoeDsaIndexer` module used by attention.
- Rewrite `GlmMoeDsaAttention.forward()` to apply top-k masking (DSA) and adjust caching/FP8-related behavior.
- Refactor `GlmMoeDsaConfig` to inherit from `PreTrainedConfig` and add indexer-related config fields.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py` | Implements the new RoPE helper and indexer module, and rewrites the attention forward + FP8-related init behavior. |
| `src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py` | Generated file reflecting the modular changes. |
| `src/transformers/models/glm_moe_dsa/configuration_glm_moe_dsa.py` | Refactors the config and adds indexer parameters (`index_head_dim`, `index_n_heads`, etc.). |
```diff
         if self.q_lora_rank is None:
-            q_states = self.q_proj(hidden_states)
+            query_states = self.q_proj(hidden_states)
+            q_resid = None
         else:
```
When q_lora_rank is None, this sets q_resid = None, but the indexer path later expects a real [B, S, q_lora_rank] tensor (it feeds it through wq_b). Either compute an appropriate indexer query input for the non-LoRA case or explicitly disallow q_lora_rank=None for this model.
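A minimal sketch of the second option (rejecting the unsupported configuration up front); the projection names mirror the checkpoint keys in the load report above, and the exact placement inside `forward()` is an assumption:

```python
# Hypothetical sketch: fail fast instead of handing q_resid=None to the indexer.
if self.q_lora_rank is None:
    raise ValueError(
        "GlmMoeDsaAttention with the DSA indexer requires `q_lora_rank` to be set: "
        "the indexer feeds the low-rank query residual ([B, S, q_lora_rank]) through `wq_b`."
    )
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states))  # [B, S, q_lora_rank]
query_states = self.q_b_proj(q_resid)
```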
```python
        # Indexer (DSA) parameters
        self.index_topk = index_topk
        self.index_head_dim = index_head_dim
        self.index_n_heads = index_n_heads
        self.indexer_rope_interleave = indexer_rope_interleave
```
index_head_dim/qk_rope_head_dim are used together in the indexer via torch.split(..., [qk_rope_head_dim, index_head_dim - qk_rope_head_dim]); if index_head_dim < qk_rope_head_dim this will fail at runtime. Consider validating these constraints (and index_n_heads > 0, index_topk > 0) in the config init so misconfigurations fail early with a clear error.
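A possible shape for that validation, sketched against the fields above (where exactly it lives in `GlmMoeDsaConfig.__init__` is an assumption):

```python
# Hypothetical validation sketch for GlmMoeDsaConfig.__init__
if index_n_heads <= 0:
    raise ValueError(f"`index_n_heads` must be a positive integer, got {index_n_heads}.")
if index_topk <= 0:
    raise ValueError(f"`index_topk` must be a positive integer, got {index_topk}.")
if index_head_dim < qk_rope_head_dim:
    raise ValueError(
        f"`index_head_dim` ({index_head_dim}) must be >= `qk_rope_head_dim` ({qk_rope_head_dim}), "
        f"since the indexer splits keys as [qk_rope_head_dim, index_head_dim - qk_rope_head_dim]."
    )
```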
```diff
-            Number of top tokens selected by the indexer for retrieval/attention in each step.
+            Number of top tokens selected by the indexer for sparse attention.
         index_head_dim (`int`, *optional*, defaults to 128):
             Head dimension for the indexer projections (DSA).
```
The config docstring lists index_topk, index_head_dim, and indexer_rope_interleave, but it doesn't document the newly added index_n_heads. Please add a short description of what index_n_heads controls and any constraints.
```diff
             Head dimension for the indexer projections (DSA).
+        index_n_heads (`int`, *optional*, defaults to 8):
+            Number of attention heads used by the indexer projections. Should be a positive integer and is typically
+            less than or equal to `num_attention_heads`.
```
Hi @JaredforReal, you didn't attach the test script!
@Rocketknight1 @ArthurZucker

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/workspace/glm5-0210-fp8/"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

prompt = "Hi, introduce yourself"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
ArthurZucker
left a comment
just a few small nits, and if you have an integration test (maybe the one you're using locally) it would be good to add it.
The TP plan probably needs an update as well given the q/k norms. If you don't want to figure that out right now, I recommend skipping TP for the attention part and only doing it on the experts.
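A rough sketch of what that restriction could look like, assuming the config exposes a `base_model_tp_plan` like other recent models and that plain `colwise`/`rowwise` styles fit the shared-expert linears; the grouped expert tensors (`experts.gate_up_proj` / `experts.down_proj`) would need whatever style the MoE TP support expects:

```python
# Hypothetical: keep TP entries only for the MLP side and drop the attention ones for now.
base_model_tp_plan = {
    "layers.*.mlp.shared_experts.gate_proj": "colwise",
    "layers.*.mlp.shared_experts.up_proj": "colwise",
    "layers.*.mlp.shared_experts.down_proj": "rowwise",
}
```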
```python
# q·k^T per head: [B, S, H, D] @ [B, T, D]^T → [B, S, H, T]
scores = torch.einsum("bshd,btd->bsht", q.float(), k_cached.float()) * self.softmax_scale

# Weight per head and sum across heads → [B, S, T]
index_scores = torch.einsum("bsht,bsh->bst", scores, weights)
```
We avoid einsums when we can, but it's not a big deal.
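For reference, a sketch of the same two contractions without einsum, using broadcasting and `matmul` (shapes as annotated above):

```python
# [B, S, H, D] @ [B, 1, D, T] -> [B, S, H, T]
scores = torch.matmul(q.float(), k_cached.float().transpose(1, 2).unsqueeze(1)) * self.softmax_scale

# Weight per head, then sum over heads: [B, S, H, T] * [B, S, H, 1] -> [B, S, T]
index_scores = (scores * weights.unsqueeze(-1)).sum(dim=2)
```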
```python
if self._cached_keys is not None:
    k_cached = torch.cat([self._cached_keys, k], dim=1)  # [B, T, D]
else:
    k_cached = k
self._cached_keys = k_cached
```
Mmm, yeah it works; the main issue is that if you generate again with a new prompt this is going to break, no? (different k shape)
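One hedged workaround, sketched against the snippet above: treat any call that looks like a fresh prefill (more than one incoming token, or a batch-size mismatch) as a reset of the indexer key cache. Tying the indexer keys to the model's `Cache` object instead of module state would be the more robust fix.

```python
# Hypothetical guard: drop stale keys when a new prompt (fresh prefill) comes in.
is_prefill = k.shape[1] > 1
if is_prefill or self._cached_keys is None or self._cached_keys.shape[0] != k.shape[0]:
    k_cached = k
else:
    k_cached = torch.cat([self._cached_keys, k], dim=1)  # [B, T, D]
self._cached_keys = k_cached
```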
@ArthurZucker PTAL
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker
left a comment
Thanks for iterating
[For maintainers] Suggested jobs to run (before merge): run-slow: glm_moe_dsa

What does this PR do?
Current state: `GlmMoeDsaIndexer` is implemented, and `GlmMoeDsaAttention` with MLA and DSA is implemented. No direct bug stops it from running, but something in the implementation is wrong and makes the output nonsense.
help wanted