Conversation
|
It's a hybrid attention model :D
|
I have to get this done within 30 mins, I don't want to miss the keynote lmao. Edit:
|
@awni this should work now, but we have to see when the weights are released too.
|
I'm doing the same approach as LongCat Flash: I created a tiny version and am training it on "Hello World!" to test whether the MLX implementation is working. Goekdeniz-Guelmez/Qwen3Next-Dev
|
OK, inference works on my custom small model trained on "Hello World!": prompt "Hello" → MLX: Hello World!
|
|
fp16:
|
I ran a full-context test with 4-bit. Prefill is slow vs. mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit.
…exp(g) up to fix gibberish output
|
OK, I optimised the hell out of it. Before: After: @awni, can you have a look when you're back?
Makes sense, since this is an 80B hybrid model: it switches back and forth between full attention and linear attention with a recurrent gated forward pass. That's why generation scales fine, but prefill is slower compared to the pure-attention 30B Qwen baseline. The linear attention part means the model keeps a running memory of past tokens and updates it step by step, with gates that control how much old vs. new information flows through.
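That recurrent gated update can be sketched in a few lines of NumPy. This is an illustrative toy, not the mlx-lm implementation: the function name and the scalar per-token gate are my own simplifications, and the real model uses richer gating and a delta-rule update.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """Toy recurrent linear attention with one scalar decay gate per token."""
    T, d = q.shape
    state = np.zeros((d, d))      # fixed-size running memory of past tokens
    out = np.zeros((T, d))
    for t in range(T):
        # g[t] in (0, 1) controls how much old vs. new information flows through
        state = g[t] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state     # read-out for the current query
    return out
```

Because the state is a fixed d x d matrix, the per-token cost does not grow with context length, which is why generation speed holds up on this model.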
|
(before / after screenshots)
Output from q2, prompt "What is BeagleBone Black, and where is it used?":

The BeagleBone Black is a low-cost, high-performance, open-source single-board computer (SBC) designed for developers and hobbyists. Released in 2012, it is a successor to the original BeagleBoard and features a Texas Instruments ARM Cortex-A8 processor, along with 512 MB of DDR3 RAM, making it a powerful and flexible platform for embedded computing. What is BeagleBone Black? The BeagleBone Black is a compact, powerful, and affordable microcontroller board designed for rapid prototyping and embedded system development. It's primarily targeted at:
It runs a full Linux kernel (typically Debian or Ubuntu), and its major advantages include:
Where is it used? ==========

The Thinking model, also at q2:
|
That didn't go so well for the BeagleBone and q2. The Thinking model derails into mixed English/Chinese gibberish:

Okay, the user is asking about the BeagleBone Black的使用场景。首先,问题是:“What is BeagleBone Black的使用场景。首先,问题是中文的,意思是“请用中文,然后回答问题。我需要找到一个合理的、可操作的、具体的、详细的回答。To answer the question, "The most beautiful thing in the world of the world. The most beautiful thing in the world of the world of the world. The most beautiful thing in the world of the world of the world. The most beautiful thing inthe world of the world. The most beautiful thing inthe world of the world. The most beautiful thing inthe world的场景。首先,我的任务是分析该问题的描述,然后结合该描述中,作者需要详细描述,从“白点”首先,你的任务是找到一个具体的、可操作的、具体的、描述、场景、规则、惯例、文件 455(“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点”“白点

I understand people are excited about IoT, but still...
|
Let's go ahead and merge this. I think it's working well enough to merge, and we can follow on with perf optimizations. For generation speed, probably a custom kernel is needed for the recurrent step. For the prompt processing speed I am not sure; it may be enough to do the quadratic expansion of the computation (which I assume is possible, though I haven't looked into it in detail).
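On the prompt-processing point: the quadratic expansion is indeed possible for a gated linear recurrence, because unrolling it turns the decay gates into a lower-triangular weight mask. A NumPy sketch under my own naming (toy code, not mlx-lm; a production version would need chunking for numerical stability when the gates decay strongly):

```python
import numpy as np

def linear_attn_recurrent(q, k, v, g):
    # token-by-token recurrence: what generation does, O(1) state per token
    T, d = q.shape
    S, out = np.zeros((d, d)), np.zeros((T, d))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def linear_attn_quadratic(q, k, v, g):
    # batched form for prefill: unrolling the recurrence gives
    #   S_t = sum_{s<=t} (prod_{u=s+1..t} g_u) k_s v_s^T,
    # so with c_t = g_1 * ... * g_t the decay weights are W[t, s] = c_t / c_s
    c = np.cumprod(g)
    W = np.tril(c[:, None] / c[None, :])
    return (q @ k.T * W) @ v
```

The quadratic form trades the sequential loop for two big matmuls, which is exactly the kind of work that parallelizes well during prompt processing.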
|
One interesting note about using group size 32 when quantizing. This is the default q2 with group size 64:

mlx_lm.generate --model release/Qwen3-Next-80B-A3B-Instruct-q2-mlx --max-tokens 20000

The bell on a Harley-Davidton bike, typically mounted on the handlebars or fairing, is a small, decorative, and functional device that serves as a warning device. When riders ring the bell, it alerts nearby pedestrians, cyclists, or other motorists to the motorcycle’s presence, especially in congested urban areas or congested traffic. Though less common today, it was once a popular safety feature on many Harley-Davidson motorcycles, helping riders signal their approach in congested or narrow roads. The bell’s gentle, high-pitched ring is designed to be audible over engine noise, making it a practical, though sometimes whimsical, safety aid.

Prompt: 30 tokens, 160.681 tokens-per-sec

Of course, that isn't it, so I tried group size 32, the q2-hi:

The bell on a Harley-Davidson motorcycle is a traditional good-luck charm, rooted in the belief that it wards off evil spirits. This practice traces back to the 1930s and 40s, when motorcycle riders would attach small bells to their bikes, believing the sound would prevent evil spirits from stealing their souls. The belief was that if a rider died, the bell’s tinkling would allow their spirit to return to the living. Over time, the bell’s symbolism evolved, and it became a symbol of good luck, protection, and a reminder to honor the rider’s spiritual journey. Today, it’s a cherished custom, especially in the biker community, where the bell’s tinkling serves as a reminder to stay alert, a tribute to the rider’s resilience, and a way to honor loved ones who’ve passed. It’s a small, tinkling promise of protection, a whispering echo of the rider’s enduring spirit.

Prompt: 30 tokens, 158.511 tokens-per-sec

If you ask for just 100 words, it replies in Chinese; apparently they love Harleys too. One whimsical way to die is to call a Harley bell whimsical in front of a biker bar.
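For context on why group size matters so much at 2 bits: quantization scales and offsets are stored per group of weights, so a smaller group tracks the local weight range more tightly at the cost of more metadata. A rough min-max affine sketch (illustrative only; MLX's actual quantization scheme differs in its details):

```python
import numpy as np

def quantize_dequantize(w, bits=2, group_size=64):
    """Min-max affine quantization per group of weights, then reconstruct."""
    levels = 2 ** bits - 1                 # 3 representable steps at 2 bits
    wg = w.reshape(-1, group_size)
    lo = wg.min(axis=1, keepdims=True)
    hi = wg.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((wg - lo) / scale)        # integer codes in [0, levels]
    return (q * scale + lo).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
err_g64 = np.abs(w - quantize_dequantize(w, bits=2, group_size=64)).mean()
err_g32 = np.abs(w - quantize_dequantize(w, bits=2, group_size=32)).mean()
# smaller groups -> smaller per-group ranges -> lower reconstruction error,
# consistent with the q2-hi (group size 32) output being more coherent
```

The flip side is roughly double the scale/offset metadata, which is why smaller-group quants come out somewhat larger on disk.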
|
When I try to evaluate a model I see this:
File "/Users/g/.pyenv/versions/3.13.1/lib/python3.13/site-packages/mlx_lm/evaluate.py", line 211, in loglikelihood
|
I think one of the last commits may have broken this. I tested some of the last commits from the original branch before the squash merge (https://github.com/Goekdeniz-Guelmez/mlx-lm/tree/adding-qwen3-next): ca24475f8b4ce5ac8b889598f869491d345030e3 seems to be a good commit; 8a9809a7f9bc1d3c51a7a1b3e77680669fa1a52d seems to be the first bad one (second-to-last commit before merging in the original branch). A sample good output (using ca24475) and a sample bad output are in the dropdowns (long pastes). It seems like generation starts out fine, but there's a cut somewhere around the ~200th token where it starts to repeat itself or derail. You really want to generate more than 256 tokens, or the breakage isn't obvious. I see a similar kind of failure even if I mess with prompts or temperatures. I did look at the code introduced in Goekdeniz-Guelmez@8a9809a, but it has enough changes, and I hadn't studied the architecture earlier, so I won't start troubleshooting near midnight on a Friday evening :) For the tests above I happened to be using https://huggingface.co/mlx-community/Qwen3-Next-80B-A3B-Instruct-6bit, but I saw a very similar pattern with my locally made quants and the 2-bit one. My machine is an M2 Ultra Mac Studio with 192GB memory. To reproduce, I suspect you can take whatever quant and generate something like 1000 tokens; it should start repeating itself or derailing, and other settings shouldn't matter too much.
|
I see issues in generation too on the last commit:

mlx_lm.generate --model ~/Qwen3-Next-80B-A3B-Instruct-5bit --prompt "Write a poem on LLMs." -m 2048

LLMs: The Silent Sages of Silicon We are not flesh, nor blood, nor bone,
Kernels for Mamba 1, Mamba 2, and this are indeed needed. I started learning C and will probably do a PR in 2 weeks, if I keep up this learning rate. Thanks for the help, awesome.
|
I confirm this was the last commit that worked properly: ca24475
A fix PR is coming, thanks! I haven't looked at the last commits, but will do!
Is the main branch supposed to be stable now, or should one go with ca24475? Also, is a release expected anytime soon? PS: Thanks to everyone for bringing this support to mlx super fast.
|
@digiperfect-tech the PR has already been merged into the main branch, so you can go ahead and use the main branch.
|
Could you guys support MTP as well? Or is it already supported, @Goekdeniz-Guelmez? waybarrios/vllm-mlx#82
|
@TomLucidor as of now, MTP is not supported in MLX-LM; however, if enough people in the community ask for it, I would be down.
|
@janhilgard could you take a look at this and see what can be done upstream?
|
Hey @TomLucidor @Goekdeniz-Guelmez, happy to share what we've learned. We implemented MTP speculative decoding for Qwen3-Next in vllm-mlx#82, and it gives a solid 1.4x throughput boost (55 → 79 tok/s on M3 Ultra with the 6-bit quant). Side note: we also ran into issues with external draft-model speculative decoding for Qwen3, where tokens get skipped/dropped (#846). Internal MTP avoids this entirely, since the MTP head shares the same architecture and weights, so there's no tokenizer or distribution mismatch. This is also tracked in #872.

How it works (high level): the Qwen3-Next model already has an MTP head in the weights (
What would need to happen in mlx-lm: the core MTP logic is actually pretty self-contained. The main pieces:
The trickiest part is the hybrid attention/recurrent architecture: the DeltaRNN layers have recurrent state that needs snapshot/restore on draft rejection. We handle this in vllm-mlx's scheduler, but for […] I'd be happy to contribute upstream. The weight preservation in […] @Goekdeniz-Guelmez, would a PR for weight preservation during quantization + the […] be welcome? Related issues: #846 (speculative decoding token-skipping with Qwen3), #872 (internal MTP tracking issue)
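To make the snapshot/restore point concrete, here is a toy draft-verify loop with a recurrent "model" whose state must be rolled back to a pre-verification snapshot and replayed over only the accepted tokens. All names and the toy update rule are my own; this is not the vllm-mlx or mlx-lm API.

```python
import numpy as np

class ToyRecurrentLM:
    """Deterministic toy recurrent LM; stands in for the recurrent layers."""
    def __init__(self, vocab=16, seed=0):
        self.W = np.random.default_rng(seed).normal(size=(vocab, vocab))
        self.state = np.zeros(vocab)          # recurrent state to snapshot

    def step(self, tok):
        # consume one token, update state, return the greedy next token
        self.state = 0.9 * self.state + self.W[tok]
        return int(np.argmax(self.state))

def greedy_generate(model, tok, n):
    out = []
    for _ in range(n):
        tok = model.step(tok)
        out.append(tok)
    return out

def speculative_generate(target, draft, tok, n, k=4):
    out = []
    while len(out) < n:
        t_snap, d_snap = target.state.copy(), draft.state.copy()
        # 1) draft proposes k tokens ahead
        proposal, cur = [], tok
        for _ in range(k):
            cur = draft.step(cur)
            proposal.append(cur)
        # 2) target verifies the whole proposal in one sweep
        preds, cur = [], tok
        for p in proposal:
            preds.append(target.step(cur))
            cur = p
        # 3) keep the agreeing prefix; first disagreement becomes target's token
        accepted = []
        for p, t in zip(proposal, preds):
            accepted.append(t)
            if p != t:
                break
        # 4) rollback: restore both states, replay only the kept tokens
        target.state, draft.state = t_snap, d_snap
        cur = tok
        for a in accepted:
            target.step(cur)
            draft.step(cur)
            cur = a
        tok = accepted[-1]
        out.extend(accepted)
    return out[:n]
```

Because the replay covers only accepted tokens, this toy emits exactly the same tokens as plain greedy decoding with the target, which is the invariant a real snapshot/restore implementation has to preserve.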