tests : add support for qwen3 SSM archs by ggerganov · Pull Request #24031 · ggml-org/llama.cpp

ggerganov · 2026-06-02T16:00:12Z

Overview

Enable test-llama-archs for Qwen3 architectures using SSM.

Additional information

|       qwen3next|Apple M2 Ultra|   MoE|  OK (8.53e-08)|       OK|
|       qwen3next|    Accelerate|   MoE|  OK (1.00e-17)|       OK|
|       qwen3next|Apple M2 Ultra|   MoE|  OK (9.61e-14)|       OK|
|       qwen3next|          Meta|   MoE|  OK (8.53e-08)|     SKIP|
|          qwen35|Apple M2 Ultra| Dense|  OK (8.53e-08)|       OK|
|          qwen35|    Accelerate| Dense|  OK (0.00e+00)|       OK|
|          qwen35|Apple M2 Ultra| Dense|  OK (9.18e-14)|       OK|
|          qwen35|          Meta| Dense|  OK (8.53e-08)|     SKIP|
|       qwen35moe|Apple M2 Ultra|   MoE|  OK (8.53e-08)|       OK|
|       qwen35moe|    Accelerate|   MoE|  OK (0.00e+00)|       OK|
|       qwen35moe|Apple M2 Ultra|   MoE|  OK (9.49e-14)|       OK|
|       qwen35moe|          Meta|   MoE|  OK (8.53e-08)|     SKIP|

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

JohannesGaessler · 2026-06-02T17:10:14Z

+        bool try_scalar = false;
+        try {
+            ml.get_key_or_arr(LLM_KV_FULL_ATTENTION_INTERVAL, hparams.recurrent_layer_arr, hparams.n_layer, false);
+        } catch (...) {
+            try_scalar = true;
+        }
+
+        if (try_scalar) {
+            const uint32_t n_main = hparams.n_layer - hparams.nextn_predict_layers;
+
+            uint32_t full_attn_interval = 4;
+            ml.get_key(LLM_KV_FULL_ATTENTION_INTERVAL, full_attn_interval, false);
+            for (uint32_t i = 0; i < hparams.n_layer; ++i) {
+                hparams.recurrent_layer_arr[i] = (i < n_main) && ((i + 1) % full_attn_interval != 0);
+            }


Suggested change

-        bool try_scalar = false;
-        try {
-            ml.get_key_or_arr(LLM_KV_FULL_ATTENTION_INTERVAL, hparams.recurrent_layer_arr, hparams.n_layer, false);
-        } catch (...) {
-            try_scalar = true;
-        }
-
-        if (try_scalar) {
-            const uint32_t n_main = hparams.n_layer - hparams.nextn_predict_layers;
-
-            uint32_t full_attn_interval = 4;
-            ml.get_key(LLM_KV_FULL_ATTENTION_INTERVAL, full_attn_interval, false);
-            for (uint32_t i = 0; i < hparams.n_layer; ++i) {
-                hparams.recurrent_layer_arr[i] = (i < n_main) && ((i + 1) % full_attn_interval != 0);
-            }
+        try {
+            ml.get_key_or_arr(LLM_KV_FULL_ATTENTION_INTERVAL, hparams.recurrent_layer_arr, hparams.n_layer, false);
+        } catch (...) {
+            const uint32_t n_main = hparams.n_layer - hparams.nextn_predict_layers;
+
+            uint32_t full_attn_interval = 4;
+            ml.get_key(LLM_KV_FULL_ATTENTION_INTERVAL, full_attn_interval, false);
+            for (uint32_t i = 0; i < hparams.n_layer; ++i) {
+                hparams.recurrent_layer_arr[i] = (i < n_main) && ((i + 1) % full_attn_interval != 0);
+            }

I think this would be slightly simpler, but either way is fine.

JohannesGaessler · 2026-06-02T17:11:48Z

What is the intended scope of this PR, will there be more changes?

ggerganov · 2026-06-02T17:14:53Z

It's ready now.

JohannesGaessler · 2026-06-02T17:34:36Z

    // by default, all layers are dense
    // note: using uint32_t type for compatibility reason
-    std::array<uint32_t, LLAMA_MAX_LAYERS> swa_layers;
+    std::array<uint32_t, LLAMA_MAX_LAYERS> is_swa_impl;


I'm not sure "is_swa_impl" is a good choice for the variable name. I'm reading it as "is SWA implementation" but then you have the code of the individual models manipulating it which to me would intuitively seem like the models messing with the internals of llama_hparams. Maybe "swa_pattern" to be consistent with set_swa_pattern?

CISC · Jun 2, 2026

I get the is_swa part, to match the function, but agreed, it's confusing. Maybe is_swa_layer?

ggerganov · Jun 3, 2026

I'll follow up with more refactoring of the hparams after this, to keep this PR from growing. The main goal here is to get recurrent models enrolled in test-llama-archs so that small dummy models can be generated for testing purposes.

JohannesGaessler · Jun 3, 2026

Be aware though that currently there is no implementation for creating a dummy vocab for those models - I have a poor understanding of the related code and did not want to delay the unit tests for TP. But this means that you cannot just use the dummy models for e.g. llama-perplexity or llama-completion.

ggerganov · Jun 3, 2026

Yes, I noticed that. I'll be using it with test-save-load-state and I can rework it to not require a vocab.

JohannesGaessler · Jun 3, 2026

For dummy models, wouldn't it be fine to just map ASCII characters to int? I would intuitively assume that this would not be too difficult to implement; the problem for me was just that I would have to read up on the vocab code first.

ggerganov · Jun 3, 2026

Yes, probably. It would definitely be useful to generate some dummy vocabs too. I'll take a look.

ggerganov requested a review from CISC as a code owner June 2, 2026 16:00

ggerganov requested a review from JohannesGaessler June 2, 2026 16:01

github-actions Bot added the model Model specific label Jun 2, 2026

CISC reviewed Jun 2, 2026

Comment thread src/llama-model-loader.cpp Outdated

JohannesGaessler approved these changes Jun 2, 2026

ggerganov added 3 commits June 3, 2026 08:35

tests : add support for qwen3 SSM archs

d145f55

arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS

2a2eaeb

cont : naming + TODOs

433b106

ggerganov force-pushed the gg/test-archs-add-qwen3-ssm branch from 3a20879 to 433b106 Compare June 3, 2026 05:36

JohannesGaessler approved these changes Jun 3, 2026

ggerganov merged commit 06938ac into master Jun 3, 2026
25 of 28 checks passed

ggerganov deleted the gg/test-archs-add-qwen3-ssm branch June 3, 2026 07:15

bobvious mentioned this pull request Jun 5, 2026

[CUDA] PR #23907 flash-attn F16 KV dequant-scratch sized by allocated (not used) KV -> large-context q8_0 decode regression / VRAM thrashing #24166

Closed
