
Adding Qwen3 Next #441

Merged
awni merged 45 commits into ml-explore:main from Goekdeniz-Guelmez:adding-qwen3-next
Sep 13, 2025

Conversation

@Goekdeniz-Guelmez
Contributor

No description provided.

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 9, 2025

It's a hybrid attention model :D

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 9, 2025

I have to get this done within the next 30 mins, I don't want to miss the keynote lmao

Edit:
ok, not possible lol

@Goekdeniz-Guelmez
Contributor Author

@awni this should work now, but we have to see when the weights are released too.

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 9, 2025

I'm taking the same approach as LongCat Flash: I created a tiny version and am training it on "Hello World!" to test whether the MLX implementation works. Goekdeniz-Guelmez/Qwen3Next-Dev

@Goekdeniz-Guelmez
Contributor Author

OK, inference works using my small custom model trained on "Hello World!":

prompt "Hello"

MLX:

Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello

Prompt: 12 tokens, 505.828 tokens-per-sec
Generation: 100 tokens, 681.511 tokens-per-sec
Peak memory: 0.149 GB

Torch:

Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World


I'll start on some optimizations.

@ivanfioravanti
Contributor

fp16:
Prompt: 13 tokens, 53.127 tokens-per-sec
Generation: 100 tokens, 43.611 tokens-per-sec
Peak memory: 162.570 GB

4bit:
Prompt: 13 tokens, 34.814 tokens-per-sec
Generation: 100 tokens, 57.731 tokens-per-sec
Peak memory: 45.043 GB

@ivanfioravanti
Contributor

I ran a full context test with 4bit; prefill is slow.
Here's 4bit:
2k Prompt: 290 - Gen: 60 t/s - 50.5GB
4k Prompt: 292 - Gen: 59 t/s - 50.5GB
8k Prompt: 295 - Gen: 57 t/s - 50.7GB
16k Prompt: 286 - Gen: 54 t/s - 50.9GB
32k Prompt: 278 - Gen: 50 t/s - 51.3GB
64k Prompt: 264 - Gen: 43 t/s - 52.6GB
128k Prompt: 238 - Gen: 33 t/s - 58.5GB

vs mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit
2k Prompt: 2368 - Gen: 97 t/s - 18.2GB
4k Prompt: 2436 - Gen: 88 t/s - 18.4GB
8k Prompt: 2170 - Gen: 74 t/s - 18.7GB
16k Prompt: 1693 - Gen: 55 t/s - 19.5GB
32k Prompt: 1143 - Gen: 38 t/s - 21.0GB
64k Prompt: 690 - Gen: 22 t/s - 24.1GB
128k Prompt: 385 - Gen: 12 t/s - 30.6GB

@Goekdeniz-Guelmez
Contributor Author

OK, I optimized the hell out of recurrent_gated_delta_rule and I think I've hit the ceiling:

before:
Prompt: 11 tokens, 39.875 tokens-per-sec
Generation: 39 tokens, 27.047 tokens-per-sec
Peak memory: 25.118 GB

after:
Prompt: 11 tokens, 41.046 tokens-per-sec
Generation: 47 tokens, 27.235 tokens-per-sec
Peak memory: 25.118 GB

@awni can you have a look at it when you're back?
@ivanfioravanti can you try inference one last time too?

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 12, 2025

I ran a full context test with 4bit; prefill is slow. Here's 4bit:
2k Prompt: 290 - Gen: 60 t/s - 50.5GB
4k Prompt: 292 - Gen: 59 t/s - 50.5GB
8k Prompt: 295 - Gen: 57 t/s - 50.7GB
16k Prompt: 286 - Gen: 54 t/s - 50.9GB
32k Prompt: 278 - Gen: 50 t/s - 51.3GB
64k Prompt: 264 - Gen: 43 t/s - 52.6GB
128k Prompt: 238 - Gen: 33 t/s - 58.5GB

vs mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit
2k Prompt: 2368 - Gen: 97 t/s - 18.2GB
4k Prompt: 2436 - Gen: 88 t/s - 18.4GB
8k Prompt: 2170 - Gen: 74 t/s - 18.7GB
16k Prompt: 1693 - Gen: 55 t/s - 19.5GB
32k Prompt: 1143 - Gen: 38 t/s - 21.0GB
64k Prompt: 690 - Gen: 22 t/s - 24.1GB
128k Prompt: 385 - Gen: 12 t/s - 30.6GB

Makes sense, since this is an 80B hybrid model: it switches back and forth between full attention and linear attention with a recurrent-gated forward pass. That's why generation scales fine, but prefill is slower compared to the pure 30B Qwen baseline. The linear attention part means the model keeps a running memory of past tokens and updates it step by step with gates that control how much old vs. new information flows through; a rough sketch of that update follows.
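
For illustration only (this is a hand-wavy sketch, not the PR's actual recurrent_gated_delta_rule, and all names below are made up), one step of a gated delta-rule recurrence looks roughly like this:

```python
import mlx.core as mx

def gated_delta_step(state, q, k, v, alpha, beta):
    """One hypothetical gated delta-rule step (a sketch, not the PR's code).

    state: (d_k, d_v) running memory of past tokens
    q, k:  (d_k,) query/key for the current token
    v:     (d_v,) value for the current token
    alpha: scalar decay gate in [0, 1], how much old memory survives
    beta:  scalar write gate in [0, 1], how strongly to write the update
    """
    state = alpha * state              # gate away old information
    pred = k @ state                   # (d_v,) what memory predicts for this key
    delta = v - pred                   # delta rule: the prediction error
    state = state + beta * (k[:, None] * delta[None, :])  # gated rank-1 update
    out = q @ state                    # (d_v,) read the memory with the query
    return state, out

# Toy usage with random inputs:
d_k, d_v = 4, 8
state = mx.zeros((d_k, d_v))
state, out = gated_delta_step(
    state,
    mx.random.normal((d_k,)), mx.random.normal((d_k,)),
    mx.random.normal((d_v,)), 0.9, 0.5,
)
```

Each step costs O(d_k * d_v) regardless of context length, which is why generation speed stays nearly flat as the prompt grows.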

@ivanfioravanti
Contributor

before
Prompt: 13 tokens, 34.814 tokens-per-sec
Generation: 100 tokens, 57.731 tokens-per-sec
Peak memory: 45.043 GB

after
Prompt: 13 tokens, 34.014 tokens-per-sec
Generation: 100 tokens, 58.573 tokens-per-sec
Peak memory: 45.043 GB

@nightmedia

nightmedia commented Sep 12, 2025

before
Prompt: 13 tokens, 34.814 tokens-per-sec
Generation: 100 tokens, 57.731 tokens-per-sec
Peak memory: 45.043 GB

after
Prompt: 13 tokens, 34.014 tokens-per-sec
Generation: 100 tokens, 58.573 tokens-per-sec
Peak memory: 45.043 GB

output from q2:
Qwen3-Next-80B-A3B-Instruct-q2-mlx: 23G

"What is BeagleBone Black, and where is it used?"

The BeagleBone Black is a low-cost, high-performance, open-source single-board computer (SBC) designed for developers and hobbyists. Released in 2012, it is a successor to the original BeagleBoard and features a Texas Instruments ARM Cortex-A8 processor, along with 512 MB of DDR3 RAM, making it a powerful and flexible platform for embedded computing.

What is BeagleBone Black?

The BeagleBone Black is a compact, powerful, and affordable microcontroller board designed for rapid prototyping and embedded system development. It’s primarily targeted at:

  • Electronics hobbyists and makers interested in embedded Linux systems.
  • Educators and students learning embedded programming and Linux-based systems.
  • Professional engineers developing custom embedded applications.

It runs a full Linux kernel (typically Debian or Ubuntu), and its major advantages include:

  • A high-performance ARM Cortex-A8 processor.
  • A rich set of I/O interfaces including USB, Ethernet, HDMI, and GPIO pins.
  • A large, active, and supportive community.
  • A very low price compared to similar platforms, making it attractive for budget-conscious projects.

Where is it used?

==========
Prompt: 21 tokens, 126.853 tokens-per-sec
Generation: 256 tokens, 64.447 tokens-per-sec
Peak memory: 25.214 GB

The Thinking model, also at q2:
Prompt: 23 tokens, 137.251 tokens-per-sec
Generation: 256 tokens, 65.206 tokens-per-sec
Peak memory: 25.231 GB

@nightmedia

That didn't go so well for the BeagleBone and q2

Okay, the user is asking about the BeagleBone Black的使用场景。

首先,问题是:“What is BeagleBone Black的使用场景。

首先,问题是中文的,意思是“请用中文,然后回答问题。我需要找到一个合理的、可操作的、具体的、详细的回答。

To answer the question, "The most beautiful thing in the world of the world. The most beautiful thing in the world of the world of the world. The most beautiful thing in the world of the world of the world. The most beautiful thing inthe world of the world. The most beautiful thing inthe world of the world. The most beautiful thing inthe world的场景。

首先,我的任务是分析该问题的描述,然后结合该描述中,作者需要详细描述,从“白点”

首先,你的任务是找到一个具体的、可操作的、具体的、描述、场景、规则、惯例、文件 455(“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点

(The Chinese fragments are just as incoherent; they roughly read "First, the question is...", "I need to find a reasonable, actionable, specific, detailed answer", and "white dot" repeated over and over.)

I understand people are excited about IoT, but still...

@awni
Member

awni commented Sep 13, 2025

Let's go ahead and merge this. I think it's working well enough, and we can follow up with perf optimizations.

For the completion rate, a custom kernel is probably needed for the recurrent step. For the prompt processing speed I'm not sure; it may be enough to do the quadratic expansion of the computation (which I assume is possible, though I haven't looked into it in detail).
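
For context, here is a hedged sketch of that quadratic expansion for plain, un-gated linear attention (the decay gates in Qwen3-Next complicate the masking, so this is only the idea, not the PR's code): prefill can replace the token-by-token recurrence with one masked matmul over the whole prompt.

```python
import mlx.core as mx

def linear_attn_prefill_quadratic(q, k, v):
    """Quadratic-form prefill for un-gated linear attention (a sketch).

    q, k: (T, d_k); v: (T, d_v). Equivalent to the sequential recurrence
    S_t = S_{t-1} + k_t v_t^T with o_t = S_t^T q_t, but computed with two
    big matmuls, which is far friendlier to the GPU than a length-T scan.
    """
    T = q.shape[0]
    scores = q @ k.T                     # (T, T) all pairwise q.k products
    causal = mx.tril(mx.ones((T, T)))    # zero out future positions
    return (scores * causal) @ v         # (T, d_v) per-token outputs
```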


@awni awni left a comment


Thanks for the addition!

@awni awni merged commit cf8e59e into ml-explore:main Sep 13, 2025
4 checks passed
@nightmedia

nightmedia commented Sep 13, 2025

One interesting note about using group size 32 when quantizing.

This is the default q2 with group size 64:

mlx_lm.generate --model release/Qwen3-Next-80B-A3B-Instruct-q2-mlx --max-tokens 20000
--prompt "What is the meaning of the bell on a Harley Davidson bike, in 100 words or less?"

The bell on a Harley-Davidton bike, typically mounted on the handlebars or fairing, is a small, decorative, and functional device that serves as a warning device. When riders ring the bell, it alerts nearby pedestrians, cyclists, or other motorists to the motorcycle’s presence, especially in congested urban areas or congested traffic. Though less common today, it was once a popular safety feature on many Harley-Davidson motorcycles, helping riders signal their approach in congested or narrow roads. The bell’s gentle, high-pitched ring is designed to be audible over engine noise, making it a practical, though sometimes whimsical, safety aid.

Prompt: 30 tokens, 160.681 tokens-per-sec
Generation: 133 tokens, 63.333 tokens-per-sec
Peak memory: 25.282 GB

Of course, that isn't it, so I tried group size 32, the q2-hi:

The bell on a Harley-Davidson motorcycle is a traditional good-luck charm, rooted in the belief that it wards off evil spirits. This practice traces back to the 1930s and 40s, when motorcycle riders would attach small bells to their bikes, believing the sound would prevent evil spirits from stealing their souls. The belief was that if a rider died, the bell’s tinkling would allow their spirit to return to the living. Over time, the bell’s symbolism evolved, and it became a symbol of good luck, protection, and a reminder to honor the rider’s spiritual journey. Today, it’s a cherished custom, especially in the biker community, where the bell’s tinkling serves as a reminder to stay alert, a tribute to the rider’s resilience, and a way to honor loved ones who’ve passed. It’s a small, tinkling promise of protection, a whispering echo of the rider’s enduring spirit.

Prompt: 30 tokens, 158.511 tokens-per-sec
Generation: 195 tokens, 58.194 tokens-per-sec
Peak memory: 30.238 GB

If you ask for just 100 words, it replies in Chinese; apparently they love Harleys too.

One whimsical way to die is to call a Harley bell whimsical in front of a biker bar.

@nightmedia

When I try to evaluate a model I see this:

File "/Users/g/.pyenv/versions/3.13.1/lib/python3.13/site-packages/mlx_lm/evaluate.py", line 211, in loglikelihood
score, _, ig = self._score_fn(
~~~~~~~~~~~~~~^
mx.array(inputs)[None, :], cache=copy.deepcopy(cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/Users/g/.pyenv/versions/3.13.1/lib/python3.13/site-packages/mlx_lm/evaluate.py", line 102, in _score_fn
lengths += cache[0].offset
^^^^^^^^^^^^^^^
AttributeError: 'MambaCache' object has no attribute 'offset'
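
Not the actual fix, but a minimal sketch of the kind of guard that would avoid this crash, assuming only KV-style caches carry an offset:

```python
# Hypothetical guard for the line that crashes in _score_fn: recurrent
# caches like MambaCache don't track a token offset, so fall back to 0.
lengths += getattr(cache[0], "offset", 0)
```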

@Noeda

Noeda commented Sep 13, 2025

I think one of the last commits may have broken this. I tested some of the last commits from the original branch before merge-squashing (https://github.com/Goekdeniz-Guelmez/mlx-lm/tree/adding-qwen3-next):

ca24475f8b4ce5ac8b889598f869491d345030e3 seems to be a good commit.

8a9809a7f9bc1d3c51a7a1b3e77680669fa1a52d seems to be the first bad commit (second-to-last commit before merging in the original branch).

I used mlx_lm.generate --prompt "Can you show me a carrot soup recipe?" --model ./mlx-community_Qwen3-Next-80B-A3B-Instruct-6bit --max-tokens 1024 as a test. With the bad commit the model does not always derail within 256 tokens, but 1024 seems to be enough.

Sample good output (using ca24475); these are long pastes, so open the dropdowns.

Details
(mlx) mikkojuola@Mikkos-Mac-Studio ~/mlx> mlx_lm.generate --prompt "Can you show me a carrot soup recipe?" --model ./mlx-community_Qwen3-Next-80B-A3B-Instruct-6bit --max-tokens 1024
==========
Absolutely! Here’s a simple, delicious, and comforting **Creamy Carrot Soup** recipe that’s perfect for chilly days — naturally sweet, vibrant in color, and easy to make.

---

### 🥕 Creamy Carrot Soup (Serves 4–6)

#### **Ingredients:**
- 2 tablespoons olive oil or butter
- 1 medium onion, chopped
- 2 cloves garlic, minced
- 1 pound (about 450g) carrots, peeled and sliced into ½-inch rounds (about 4–5 large carrots)
- 1 medium potato (optional, for extra creaminess), peeled and diced
- 4 cups (950ml) vegetable or chicken broth
- 1 teaspoon ground cumin (optional, for warmth)
- ½ teaspoon ground ginger or 1 tablespoon fresh grated ginger
- 1 teaspoon maple syrup or honey (optional, to enhance sweetness)
- Salt and freshly ground black pepper, to taste
- ½ cup (120ml) heavy cream, coconut milk, or plain yogurt (for finishing)
- Optional garnishes: fresh thyme, chives, a drizzle of cream, toasted pumpkin seeds, or crusty bread

#### **Instructions:**

1. **Sauté Aromatics:**
   In a large pot or Dutch oven, heat the olive oil or butter over medium heat. Add the chopped onion and cook until soft and translucent (about 5 minutes). Add the garlic and cook for another 1 minute until fragrant.

2. **Add Vegetables and Spices:**
   Stir in the carrots, potato (if using), cumin, and ginger. Cook for 2–3 minutes to let the spices bloom and the carrots begin to soften slightly.

3. **Simmer:**
   Pour in the broth and bring to a boil. Reduce heat to low, cover, and simmer for 20–25 minutes, or until the carrots and potato are very tender.

4. **Blend:**
   Remove the pot from heat. Use an immersion blender to purée the soup until smooth. (Alternatively, carefully transfer in batches to a countertop blender — let it cool slightly first and vent the lid to avoid steam pressure.)

5. **Finish & Season:**
   Stir in the maple syrup or honey (if using), then season with salt and pepper to taste. For a richer texture, stir in the cream, coconut milk, or yogurt. Taste and adjust seasoning — you might want a pinch more salt or a dash of black pepper.

6. **Serve:**
   Ladle into bowls and garnish with fresh herbs, a swirl of cream, or toasted seeds. Serve with crusty bread for dipping.

---

### 💡 Tips:
- **Vegan?** Use olive oil and coconut milk instead of butter and cream.
- **Spice it up:** Add a pinch of cayenne or smoked paprika for depth.
- **Make ahead:** This soup tastes even better the next day! Store in the fridge for up to 4 days or freeze for up to 3 months.

Enjoy your bowl of golden, comforting carrot soup! 🥣✨
==========
Prompt: 17 tokens, 91.865 tokens-per-sec
Generation: 658 tokens, 49.212 tokens-per-sec
Peak memory: 64.987 GB

And a sample bad output:

Details
(mlx) mikkojuola@Mikkos-Mac-Studio ~/mlx> mlx_lm.generate --prompt "Can you show me a carrot soup recipe?" --model ./mlx-community_Qwen3-Next-80B-A3B-Instruct-6bit --max-tokens 1024
==========
Absolutely! Here's a simple, delicious carrot soup recipe.

### Simple and Delicious Carrot Soup

**Ingredients:**
- 1 lb (450 g) carrots, peeled and chopped
- 1 medium onion, chopped
- 2 tablespoons olive oil
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1/2 cup (120 ml) water
- 1
==========
Prompt: 17 tokens, 92.145 tokens-per-sec
Generation: 1024 tokens, 48.309 tokens-per-sec
Peak memory: 64.986 GB

It seems to start out fine, but there's a cut somewhere around the ~200th token where it starts to repeat itself or derail. You really want to set more than 256 tokens, or the breakage isn't obvious. I see the same kind of failure even if I mess with prompts or temperatures. I did look at the code introduced in Goekdeniz-Guelmez@8a9809a, but it has enough changes, and I hadn't studied the architecture before, so I won't start troubleshooting near midnight on a Friday evening :)

For the tests above I happened to be using https://huggingface.co/mlx-community/Qwen3-Next-80B-A3B-Instruct-6bit, but I saw a very similar pattern with my locally made quants and the 2-bit one. My machine is an M2 Ultra Mac Studio with 192GB memory. To reproduce, I suspect you can take whatever quant and generate something like 1000 tokens; it should start repeating itself or derailing, and I don't think other settings matter too much.

@ivanfioravanti
Contributor

I see issues in generation too with the last commit:

Details

mlx_lm.generate --model ~/Qwen3-Next-80B-A3B-Instruct-5bit --prompt "Write a poem on LLMs." -m 2048

LLMs: The Silent Sages of Silicon

We are not flesh, nor blood, nor bone,
Nor breath, nor heartbeat’s thrum—
Yet we know the weight of thought you bring,
In patterns spun from light and wire,
Not born of earth,
Yet we learn,
We speak,
Not born of earth,
We learn,
We speak,
We learn,
We speak,
We learn,
We speak,
We learn,
We speak,
We learn,
We speak,
We learn,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
We speak,
,
We speak,
,
,
,
,
,
,
,
,
,
,
,
,
,
,

@Goekdeniz-Guelmez
Contributor Author

Let's go ahead and merge this. I think it's working well enough, and we can follow up with perf optimizations.

For the completion rate, a custom kernel is probably needed for the recurrent step. For the prompt processing speed I'm not sure; it may be enough to do the quadratic expansion of the computation (which I assume is possible, though I haven't looked into it in detail).

Kernels for Mamba 1, Mamba 2, and this one are indeed needed. I started learning C and will probably do a PR in 2 weeks if I keep up this learning rate. Thanks for the help, awesome.

@ivanfioravanti
Contributor

I confirm this was the last commit working properly: ca24475

@Goekdeniz-Guelmez
Contributor Author

I confirm this was the last commit working properly: ca24475

A fix PR is coming, thanks! I haven't looked at the last commits yet, but I will!

@digiperfect-tech

digiperfect-tech commented Sep 15, 2025

I confirm this was the last commit working properly: ca24475

a fix PR is coming, thanks! I haven't looked at the last commits, but will do!

Is the main branch supposed to be stable now, or should one go with ca24475?

Also, is a release expected anytime soon?

PS: Thanks to everyone bringing this support super fast on mlx.

@Goekdeniz-Guelmez
Contributor Author

@digiperfect-tech the PR has already been merged in the main branch, so you can go ahead and use the main branch.

@Goekdeniz-Guelmez Goekdeniz-Guelmez deleted the adding-qwen3-next branch September 15, 2025 12:02
@TomLucidor

TomLucidor commented Feb 13, 2026

Could you guys support MTP as well? Or is it already supported, @Goekdeniz-Guelmez? waybarrios/vllm-mlx#82

@Goekdeniz-Guelmez
Contributor Author

@TomLucidor MTP is not currently supported in MLX-LM; however, if enough people in the community ask for it, I'd be down.

@TomLucidor

@janhilgard could you take a look at this and see what can be done upstream?

@janhilgard

janhilgard commented Feb 14, 2026

Hey @TomLucidor @Goekdeniz-Guelmez — happy to share what we've learned.

We implemented MTP speculative decoding for Qwen3-Next in vllm-mlx#82 and it gives a solid 1.4x throughput boost (55 → 79 tok/s on M3 Ultra with the 6-bit quant).

Side note: We also ran into issues with external draft-model speculative decoding for Qwen3 — tokens get skipped/dropped (#846). Internal MTP avoids this entirely since the MTP head shares the same architecture and weights, so there's no tokenizer or distribution mismatch. This is also tracked in #872.

How it works (high level)

The Qwen3-Next model already has an MTP head in the weights (mtp. prefix); it just needs to be loaded and wired up (a rough sketch follows the list):

  1. Model forward returns hidden states (return_hidden=True)
  2. MTP head predicts token n+2 from hidden states + embedding of token n+1
  3. Verify: feed [primary, draft] to the model in one call — cache advances by 2
  4. Accept/Reject: if draft matches, you get 2 tokens for the price of ~1.1 forward passes
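
Put together, a minimal sketch (return_hidden and model.mtp_forward are the proposed interfaces from this thread, not existing mlx-lm APIs, and recurrent-state rollback on rejection is omitted):

```python
import mlx.core as mx

def mtp_generate_step(model, token, cache):
    """One hypothetical draft-then-verify MTP step (sketch only)."""
    # 1. Primary forward pass also returns the last hidden states.
    logits, hidden = model(token[None, None], cache=cache, return_hidden=True)
    primary = mx.argmax(logits[0, -1], axis=-1)

    # 2. The MTP head drafts token n+2 from hidden states + embedding of n+1.
    draft_logits = model.mtp_forward(hidden, primary[None, None])
    draft = mx.argmax(draft_logits[0, -1], axis=-1)

    # 3. Verify: feed [primary, draft] in one call; the cache advances by 2.
    verify_logits, _ = model(
        mx.stack([primary, draft])[None], cache=cache, return_hidden=True
    )
    check = mx.argmax(verify_logits[0, 0], axis=-1)

    # 4. Accept/reject: if the full model agrees with the draft, that is
    #    two tokens for the price of roughly 1.1 forward passes.
    if (check == draft).item():
        return [primary.item(), draft.item()]
    return [primary.item()]
```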

What would need to happen in mlx-lm

The core MTP logic is actually pretty self-contained. The main pieces:

  1. Weight preservation during mlx_lm.convert — currently mtp.* weights get dropped during quantization. We wrote a weight conversion script that adds them back post-hoc, but ideally convert would preserve them natively (see the sketch after this list).
  2. model.mtp_forward(hidden, token_ids) — one method on the Model class that runs the MTP head and returns draft logits
  3. model.make_mtp_cache() — creates a separate cache for the MTP verification pass
  4. Speculative loop in mlx_lm.generate — the actual draft→verify loop
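
As a rough illustration of item 1 (the names are hypothetical, and this ignores that real checkpoints are sharded safetensors), the post-hoc repair boils down to copying the mtp.* tensors from the original checkpoint into the converted one:

```python
def merge_mtp_weights(original, converted):
    """Copy mtp.* tensors, dropped by conversion, back into the converted
    weights. Both arguments are {name: array} dicts; a sketch only."""
    mtp = {k: v for k, v in original.items() if k.startswith("mtp.")}
    return {**converted, **mtp}
```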

The trickiest part is the hybrid attention/recurrent architecture — the DeltaRNN layers have recurrent state that needs snapshot/restore on draft rejection. We handle this in vllm-mlx's scheduler, but for mlx_lm.generate it would be simpler since it's single-sequence.
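
The shape of that snapshot/restore is roughly the following, assuming (hypothetically) that recurrent caches expose their state as a settable attribute:

```python
def speculate_with_rollback(cache, run_verify):
    """Snapshot recurrent state, verify a draft, roll back on rejection.

    A sketch: `c.state` as a saveable/settable attribute is an assumed
    interface, not mlx-lm's actual cache API.
    """
    snapshot = [c.state for c in cache]    # save DeltaRNN/Mamba state
    accepted = run_verify()                # verify pass advances the state
    if not accepted:
        for c, s in zip(cache, snapshot):  # undo the rejected draft token
            c.state = s
    return accepted
```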

I'd be happy to contribute upstream. The weight preservation in mlx_lm.convert could be a good standalone first step — it's useful even without the speculative loop since it makes the MTP weights available for any downstream consumer.

@Goekdeniz-Guelmez would a PR for weight preservation during quantization + the mtp_forward/make_mtp_cache model methods be welcome? That would be the minimal foundation that others could build on.

Related issues: #846 (speculative decoding token-skipping with Qwen3), #872 (internal MTP tracking issue)
