Long prompt with DeepSeek crashing with tensor size mismatch #101
Comments
I've reverted the changes one by one and the crash happens every time. It's being fed a long prompt, but I don't think that should matter. Looks like a bug to me. It works on other engines. |
It seems to be caused by a mismatch between the length of the attention_mask passed in and the number of input tokens. Can you share the command you use to start the program? Let me restate the problem to confirm: after calling the DeepSeek model through the REST API, the program reports an error in the second round of dialogue? |
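The mismatch described above can be illustrated with a small guard. This is only a sketch in plain Python: `check_mask_matches_tokens` is a hypothetical helper, not part of KTransformers, and it assumes both values are simple sequences with one mask entry per input token.

```python
def check_mask_matches_tokens(input_ids, attention_mask):
    """Raise ValueError if the mask and the token sequence differ in length.

    Hypothetical helper for illustration; not a KTransformers function.
    """
    if len(input_ids) != len(attention_mask):
        raise ValueError(
            f"attention_mask has {len(attention_mask)} entries "
            f"but the input has {len(input_ids)} tokens"
        )
    return True


# A matching pair passes; a mask shorter than the input is rejected.
check_mask_matches_tokens([101, 102, 103], [1, 1, 1])
try:
    check_mask_matches_tokens([101, 102, 103, 104], [1, 1, 1])
except ValueError as err:
    print(err)
```

A check like this, placed before the forward pass, would turn a low-level tensor size error into a message that names the two lengths.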
The first time it crashed was on the second prompt. The first prompt was simple and had a relatively short response. The second prompt was long, and it crashed immediately when the request was made. No work was done. I have since found that the long prompt will crash KTransformers even if it is the first prompt after starting KTransformers. I am using the latest Continue.dev VSCode extension. I have made no code changes and followed the build instructions precisely. The command line I'm starting KTransformers with is: This is an 8-bit quantized gguf model that was pulled from the ollama library. I will try to recreate the prompt. It appears I accidentally deleted it during one of my periodic prompt cleanups. |
I had been using the quantized model available from ollama. I have now downloaded the DeepSeek model and quantized it myself to q8_0 using llama.cpp code current as of today. I have not seen this error (yet) but have opened another issue with a different error. I think there is probably some low hanging fruit with debugging increased output token counts. |
We have located the problem. We did not expose the size of the KV cache as a configurable option. The default is 4096, so conversations longer than 4096 tokens report an error. We are consolidating the configuration files and will fix this in a later version. If you want to work around the problem in the current version, you can modify |
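The limit described above can be checked before sending a request. The sketch below is illustrative only: `conversation_fits` is a hypothetical helper, and it assumes the cache has to hold every token of the conversation so far plus the room reserved for the reply. The default of 4096 is the figure given above; the turn lengths are examples.

```python
def conversation_fits(turn_token_counts, reserve_for_reply, cache_lens=4096):
    """Return True if the conversation plus the reply budget fits in the cache.

    Hypothetical helper; assumes the KV cache holds all prior tokens
    (prompts and replies) plus the tokens reserved for the next reply.
    """
    total = sum(turn_token_counts) + reserve_for_reply
    return total <= cache_lens


# Three turns of 39, 804 and 804 tokens with 1024 tokens reserved: 2671 <= 4096.
print(conversation_fits([39, 804, 804], reserve_for_reply=1024))

# A single 5000-token prompt already exceeds the 4096 default.
print(conversation_fits([5000], reserve_for_reply=0))
```

Under that assumption, raising the cache size and the maximum output length together keeps the check consistent, since a larger reply budget consumes the same cache.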
I've managed to change the KV cache to 8192. However, changing it to 16384 results in CUDA out of memory errors: it tried to allocate on GPU 0 while there was free space on the other GPUs. Unifying the configuration would help, as it's a bit confusing whether changing the GPU split option or editing the YAML files would solve the issue. |
I've also managed to change cache_lens, max_new_tokens, and max_response_tokens to 8192 and achieved longer output without error. (This might be a better discussion for the other bug, though?) However, as arthurv reports above, my attempt at 16384 yields a CUDA out of memory error. I had to resort to the multi-GPU YAML optimization file (DeepSeek-V2-Chat-multi-gpu.yaml) for the 8192 configuration to complete. I have two 24GB GPUs and the memory usage is interesting. At one of the stages prior to text generation memory usage spikes: And when text is being generated memory utilization drops: When these settings are changed there's definitely a significant temporary increase in memory requirements. If that could be reduced or eliminated we'd probably be good to go. There does appear to be a significant performance impact for the 8192/multi-GPU configuration, though (around 50% slower). |
We used DeepSeek-V2-Chat-multi-gpu.yaml to test the 16k case in both local_chat and the server, and the behavior you describe did not occur. Moreover, our program allocates VRAM at startup, so GPU memory usage should not fluctuate significantly. Can you provide the YAML file and the run command you used? |
Not OP but here's my experience: Changed: ktransformers/server/backend/args.py Starting command: I start chatting through the API - I enter a prompt with 39 tokens, it generates 804 tokens as a reply. I enter 804 more tokens as input, and it crashes with torch.outOfMemoryError: CUDA out of memory Nvidia-smi gives this:
I have 4 GPUs, and you can see that it's mostly using only GPU 0. It crashed trying to allocate 6.23 GB on GPU 0 when there was only 5GB available. Is there a way to redistribute the memory use better? |
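The failure above comes down to a request larger than the free memory on the one device it was sent to. The sketch below only illustrates that comparison; it does not change how KTransformers places tensors. `devices_with_room` is a hypothetical helper, and the free-memory figures for GPUs 1 to 3 are placeholders, since the report gives numbers only for GPU 0.

```python
def devices_with_room(free_gb_by_device, request_gb):
    """Return the indices of devices whose free memory can hold the request."""
    return [
        index
        for index, free_gb in enumerate(free_gb_by_device)
        if free_gb >= request_gb
    ]


# GPU 0 (5 GB free) and the 6.23 GB request come from the report above.
# The values for GPUs 1-3 are hypothetical placeholders.
free_gb = [5.0, 20.0, 20.0, 20.0]
print(devices_with_room(free_gb, 6.23))  # [1, 2, 3]
```

With real per-device figures from nvidia-smi in place of the placeholders, the same comparison shows whether a different device split could have satisfied the allocation.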
Adding to the info above - I reverted max_new_tokens and cache_lens to 8192, and launched ktransformers again with the same command. Submitted an 804 token prompt and it was OK. Memory usage was:
|
The multi-GPU memory usage I displayed was from the following multi-GPU configuration command line: A single-GPU configuration generating 8192 tokens is not possible. I went through my logs and believe this is from a single-GPU configuration with DeepSeek v2 modified for increased output. What other details would you like me to provide?
|
Using: KTransformers REST API
Model: DeepSeek Coder V2 236B Q8
I changed the following settings in args.py:
max_new_tokens to 16384
max_response_tokens to 16384
cache_q4 to False
I was attempting to increase the size of the response from the REST API for coding purposes. I changed the cache quantization while I was at it too. I don't think it makes much of a performance difference for me, and I have system RAM available.
(I have not tracked down, and do not understand, the difference between max_new_tokens and max_response_tokens or why both are needed. From their descriptions it sounds like they do the same thing.)
What I did:
KTransformers successfully completed a significant prompt (the first prompt provided to the server) and produced a lengthy and complete response. When I attempted a follow-up prompt (the second prompt provided to the server), I got the following messages in my logs and the KTransformers server became unresponsive: