Conversation
|
Just pulled and built. tl;dr: the perplexity values on this PR look higher than mainline at the moment, and it throws a warning. Full command and debug log here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/raw/main/logs/perplexity-Qwen3.5-397B-A17B-Q3_K.log |
|
Please see the updated comment, and get the latest version that I just pushed. For PPL testing, use the command from the updated comment. If you see that warning in the log, that's no good; it is going to be very slow. |
|
It does produce coherent output that looks normal at first glance, and is 3x faster than mainline. |
I'll pull the most recent changes and check, thanks! |
|
Great, going to try the default batches, e.g. |
|
Yes, so here are my PPL runs. 2x3090 + Ryzen-3995WX. I could have offloaded some of the experts to the GPUs, but didn't want to bother with that.
ik_llama.cpp, GPU/CPU: `./bin/llama-perplexity -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -f wiki.test.raw`
llama.cpp, GPU/CPU: |
|
Here is the beginning of the log. So far so good: it did offload some MoE layers to the GPUs. But then it ends up basically 25% slower, which is not exactly the purpose of using as much of the available VRAM as possible.
|
Great, I'm using this PR to cook a new imatrix from the full BF16 now, and will get some quants added to the collection! |
Interesting, thanks for the heads-up, I'll have to try some different offload strategies and compare speeds then. I recall that on a year-old ktransformers build, offloading additional layers could slow down DeepSeek, supposedly due to CUDA graphs. I believe this is similar to what @magikRUKKOLA is trying to tell me here: #1268 (comment) ... Okay, gonna cook some ik_llama.cpp quants for Qwen3.5 MoE! |
|
...
```
main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
perplexity: calculating perplexity over 580 chunks, n_ctx=512, batch_size=4096, n_seq=8
Final estimate: PPL over 580 chunks for n_ctx=512 = 3.6537 +/- 0.02000
```
[EDIT]: IQ2_KL, 8x3090, f16 kv, 256k ctx: *note: kv cache f16 seems to be faster in decode.
[EDIT#n]: IQ4_KSS, 2x3090, 3975wx, DDR4:
|
|
There's an issue. The model works ok generally, but after a while it will stop any output. Sending a message won't work anymore, swiping doesn't work, changing the context doesn't work either. It won't output any error, but the model will only produce one extremely fast token (4184.10 tokens per second), and the message ends. The problem disappears when I restart ik_llama.cpp. I'm trying to figure out what I'm doing to trigger the bug, but it's hard to replicate. This happened at various context lengths, both at 5000 and 30000. Using IQ2_XS quants. Also, idk if it's expected, but changing -ctk q8_0 -ctv q8_0 to different values, such as q4_0, doesn't affect tg or pp performance at all.
I'll simplify my arguments for testing from now on, but this is what I used up until this point.
|
Ha. Indeed. Another interesting thing is that the LLM stopped working with [EDIT]: I just noticed that it's a topic about Qwen3.5. I had these problems with GLM4.7. Hm ... [EDIT2]: As related to Qwen3.5: in case of the conversation interruption, on sending a new request (with the same data) I do see the LLM as if it is actually continuing the old conversation. Something is wrong with the attention, perhaps? [EDIT3]: illustration (2nd conversation after the interrupted (?) first one, related to coding and debugging): So Qwen3.5 has an imprint of the first conversation in the second one. Ha. |
Well, the thing about these models is that they spend most of their time in the linear attention, so the standard transformer self-attention, which is used only in 1 out of 4 layers, does not play a major role in performance. Only at very long context will it contribute in a more significant way to the observed TG and PP. Hence, yes, it is expected that you do not see significant differences between different KV cache quantization types. |
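A quick Amdahl's-law sketch of the point above. The 25% fraction and the 1.5x local speedup are illustrative assumptions, not measured values for this model:

```python
def overall_speedup(fraction_affected: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only a fraction of the
    per-token work is sped up by local_speedup."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

# Suppose standard (KV-cached) attention is 1 layer in 4 and, generously,
# accounts for ~25% of per-token time at short context. Even if a cheaper
# KV cache type made those layers 1.5x faster, the end-to-end gain is small:
print(round(overall_speedup(0.25, 1.5), 3))  # -> 1.091, i.e. under 10%
```

Which is within run-to-run noise, so identical-looking tg/pp numbers across cache types are unsurprising.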
|
I think the issues that have been observed are related to the fact that the server currently does not handle correctly the recurrent cache. One cannot simply rewind it, as one does with standard transformer KV cache. The recurrent cache is just a blob of floating point values that somehow encode the past context, and there is no rewind operator. Instead, one needs to take frequent snapshots, and then only restart a conversation from the closest snapshot available. If you don't do that, eventually the recurrent cache will contain a salad of unrelated contexts, so it is kind of expected that it will eventually stop working altogether. So, I guess, this is a serious limitation for Qwen3-Next and Qwen-3.5. I have zero interest in the server codebase, so hopefully @firecoperana will want to take it on. |
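A minimal sketch of the snapshot scheme described above (illustrative Python, not the actual ik_llama.cpp data structures): since the recurrent state is an opaque blob with no rewind operator, "rewinding" means restoring the newest snapshot taken at or before the target position and re-processing tokens from there.

```python
import copy

class RecurrentCache:
    """Toy model of a recurrent cache with snapshot-based rewind."""

    def __init__(self):
        self.state = {}      # opaque recurrent state (placeholder)
        self.n_past = 0      # number of tokens encoded into the state
        self.snapshots = {}  # n_past -> deep copy of the state at that point

    def append(self, n_tokens: int):
        # ... the real code would update self.state with n_tokens here ...
        self.n_past += n_tokens
        # take a frequent snapshot so a later rewind has something to restore
        self.snapshots[self.n_past] = copy.deepcopy(self.state)

    def rewind(self, target_pos: int) -> int:
        """Restore the newest snapshot at or before target_pos.
        Returns the position from which tokens must be re-processed."""
        usable = [p for p in self.snapshots if p <= target_pos]
        if not usable:
            # no usable snapshot: reset and re-process the whole context
            self.state, self.n_past = {}, 0
            return 0
        best = max(usable)
        self.state = copy.deepcopy(self.snapshots[best])
        self.n_past = best
        return best
```

Without something like this, edits or swipes that rewind the conversation leave stale context baked into the state, which matches the "imprint of the first conversation" symptom reported above.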
Can you debug it? What kind of memory errors?
This is because the cache is not being handled correctly, see my comment above. |
Well, it's kinda hard to reproduce... Overall, it seems to happen at very long context.
Not sure what it was. |
Very high PPL, empty TG.
Mainline has dedicated cache management for recurrent, hybrid and iSWA models. I might need to port most of the kv cache/memory related code from mainline. Are you fine with this? |
I cannot say that I particularly like what they have done. It is not that I like the current state of affairs either. Given this, you don't think it can be done without copying their unified cache management? |
|
I will just port recurrent and hybrid part then. It should be possible. |
|
I am getting the same error as @magikRUKKOLA on certain prompts, and the backtrace looks like this: |
|
Ohh the error happens with |
Yeah, that is what I was thinking. In my case it could be the bad risers. At one point I connected two of them in series (w/o retimers etc.) and that turned out to be the problem. As of now I am still not sure if some of the risers are bad, because I have this:
`lspci -vvv | grep -F -A 5 --colour 'LaneErr at lane'`
Alternatively, it could be some quirk of the motherboard or the SlimSAS risers, so I am not sure what it is. Can you check if the command above detects any Lane Errors? [EDIT]: Just got another problem: Hm ... looks like a hardware issue again. |
Yes, I also think that there may be a hardware issue. I had a few occasions where inference will simply lock up, similar to the way it behaved before you changed the risers. It is much less frequent, but it does happen from time to time. |
|
So, despite the limitations outlined above, I'll merge the PR. Proper recurrent cache management will be added later. |
|
Is the caveat related to my issue? I got random output after sending the same prompt with prompt caching enabled |
~~It loads and runs, but it does not work.~~
Adding it as a draft PR in case someone wants to try to figure out where I have gone wrong. It works now with the following CAVEAT (which, btw, applies to Qwen3-Next as well): one cannot have more than one sequence. The implementation here ended up being quite different from llama.cpp, so I cannot copy from there, and as I haven't yet fully wrapped my head around the delta-net thing, I haven't figured out how to do multiple sequences yet. This is not relevant for "normal" usage, but if you try to e.g. calculate perplexity for a context of 512 using u-batches > 512 (as one usually does for hybrid CPU/GPU inference), that will not work. Neither will pipeline parallelism.

As far as I can tell, this implementation is quite a bit faster than llama.cpp. Below is a comparison with the latest llama.cpp version as of this writing (build: 8111 (11c325c6e)). The CPU-only benchmark is on a Ryzen-3995WX CPU. The "CUDA" benchmark is on a 3090 with all MoE tensors left in RAM (full GPU offload is hopeless for this model).

Oh, given the caveat above, if you do want to run a perplexity check and want to use batch/u-batch size > 512 (because of hybrid inference), just use a context that is the same or larger than the u-batch size. I have used `-c 4096 -b 4096 -ub 4096` for my own testing.
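For reference, an invocation respecting that caveat (context >= u-batch size) might look like the following; the model path and quant name are placeholders, not the exact files used above:

```shell
# Hypothetical paths; the key point is -c >= -ub per the caveat above.
./bin/llama-perplexity \
  -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf \
  -f wiki.test.raw \
  -c 4096 -b 4096 -ub 4096
```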