Add batched inference #771
Silly question, does that also support parallel decoding in llama.cpp?
Does the newest version support "batched decoding" from llama.cpp? https://github.com/ggerganov/llama.cpp/pull/3228
This would be a huge improvement for production use. I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and was able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via non-batched inference.
@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example? I'm trying to do the same. Thanks!
There are two new flags in llama.cpp to add to your normal command.
Thanks, that works for me with those flags.
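For anyone trying to reproduce this, here is a minimal sketch of that setup, purely as an illustration: it assumes the flags in question are `--parallel` (`-np`) and `--cont-batching` (`-cb`) from the parallel-decoding work, the server's default listen address, and its native `/completion` endpoint.

```python
# Sketch: fire several completion requests at a llama.cpp ./server instance
# started with parallel decoding enabled, e.g. (assumed flags and paths):
#   ./server -m ./models/model.gguf -c 4096 --parallel 4 --cont-batching
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER = "http://127.0.0.1:8080"  # default llama.cpp server address (assumed)


def complete(prompt: str) -> str:
    # /completion is the server's native endpoint; n_predict caps the output length
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": prompt, "n_predict": 128},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]


prompts = [f"Write a haiku about the number {i}." for i in range(4)]

# Four requests in flight at once; with continuous batching the server
# decodes them together instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    for text in pool.map(complete, prompts):
        print(text)
```

With several requests in flight at once, the server batches the decode steps, which is where the multi-x tokens/sec gain described above comes from.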
When will this feature be available? I hope someone can help solve this problem, please.
Let me know if there are any roadblocks - I might be able to provide some insight.
Hey @ggerganov I missed this earlier. Thank you, yeah I just need some quick clarifications around the KV cache behaviour. The following is my understanding of the KV cache behaviour:
Is this correct?
Yes, all of this is correct.
This call also sets a flag that, upon the next …

Will soon add a couple of functions to the API that can be useful for monitoring the KV cache state.

One of the main applications of …
I updated the version and saw the batch configuration, but when I ran it, batching didn't take effect. When I send multiple requests, it still handles them one by one. My startup configuration is as follows:
python3 -m llama_cpp.server --model ./models/WizardLM-13B-V1.2/ggml-model-f16-Q5.gguf --n_gpu_layers 2 --n_ctx 8000 --n_batch 512 --n_threads 10 --n_threads_batch 10 --interrupt_requests False
Is there something wrong with my configuration? @abetlen
@zpzheng It’s a draft PR so it’s not complete - you can see “Add support for parallel requests” is in the todo list.
@abetlen Is there any progress on this?
+1, would be really great to have this
+1, would be so great to have this!
+1
+1
+1
+1
Guys, any other solution for this?
+1
+1
+1 I use llama-cpp-python as (presently) the sole first-class-supported backend in a project I've been developing. I'm not sure how much it would benefit from batching, as I've yet to do performance testing against other backends, but I feel like it could be a significant boon. What's the current status of this and #951? I might be interested in taking a look at this, but I'm not certain I'd bring much to the table; I'll have to review the related code more.
I would not do this. Batching is super important, and I had to move to llama.cpp's server (easy to deploy w/ Docker or Python, or even just the exe) because of the lack of features in llama-cpp-python. If you're doing CPU inference, llama.cpp is a great option; otherwise I would use something like vLLM, BentoML's OpenLLM, or Predibase's LoRAx.
This is something I was considering; appreciate the advice. I'll likely end up doing that. I had to do the same with Ollama, but I wasn't on Ollama long and by no means felt it was the right fit for the job; support for it merely started from a peer showing interest and my compulsion to explore all viable options where possible. I'm doing GPU inference, and sadly Nvidia's antics have hindered me from getting things running in a container just the way I'd like up until now... but that's another story. I haven't tried vLLM, OpenLLM, or LoRAx; llama.cpp and llama-cpp-python have generally been all I've needed up till now (and for longer, I hope -- I really appreciate the work done by all contributors to both projects; exciting that we're at least where we are today). Are those libraries any good if you're looking to do something with the perplexity of, say, q6_k on a (VRAM) budget? I'd prefer to be able to run it on my 1080 Ti, even when I have access to more VRAM in another environment.
I am dealing with this right now -- and unfortunately llama-cpp-python's server has the only completions endpoint I can find that supports logprobs properly (a random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python. vLLM is great for big GPUs; however, it doesn't support gguf-quantized models or running at full precision on smaller GPUs (I use T4s). Whereas ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vLLM can do), I am now creating instances with 1 GPU and scaling horizontally. If anyone knows of a better OpenAI-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.
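To make the logprobs point concrete, a small sketch against llama-cpp-python's OpenAI-compatible server; the host, port, and model path here are assumptions, and the response shape follows the OpenAI completions format the server mimics.

```python
import requests

# llama_cpp.server started separately, e.g. (model path is a placeholder):
#   python3 -m llama_cpp.server --model ./models/model.gguf
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",  # default uvicorn host/port (assumed)
    json={
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "temperature": 0,
        "logprobs": 5,  # OpenAI-style: return the top-5 token logprobs per position
    },
    timeout=60,
)
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice["text"])
print(choice["logprobs"]["top_logprobs"][0])  # token -> logprob for the first generated position
```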
Have you tried https://github.com/ollama/ollama?
Ollama doesn't support batched inference; what a silly suggestion.
In case this is useful to others, as a workaround until this is implemented, I wrote a tiny Python library that wraps the upstream server binary.
This was needed because the raw server binary supports batched inference. All the heavy logic is already in the upstream C server, so all I needed to do was handle the CLI and subprocess logic.
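This is not that library's actual API, just a minimal sketch of the same idea under a few assumptions: a locally built ./server binary, the upstream `--parallel`/`--cont-batching` flags for batched decoding, and a `/health` endpoint for readiness checks.

```python
# Sketch of a thin wrapper that launches the llama.cpp server binary as a
# subprocess and waits for it to come up. Flags, paths, and the /health
# endpoint are assumptions about the upstream server, not a documented API.
import subprocess
import time

import requests


class LlamaServer:
    def __init__(self, binary="./server", model="./model.gguf", port=8080, parallel=4):
        self.url = f"http://127.0.0.1:{port}"
        self.proc = subprocess.Popen(
            [
                binary, "-m", model, "--port", str(port),
                "--parallel", str(parallel), "--cont-batching",
            ],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        self._wait_ready()

    def _wait_ready(self, timeout=60.0):
        # Poll until the server answers, or give up after `timeout` seconds.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"{self.url}/health", timeout=1).ok:
                    return
            except requests.ConnectionError:
                time.sleep(0.5)
        self.close()
        raise RuntimeError("llama.cpp server did not become ready in time")

    def close(self):
        self.proc.terminate()
        self.proc.wait()
```

Completion requests then go to the subprocess's HTTP endpoint, exactly as in the earlier client sketch.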
Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.
Yes, but, no.
So, is the n_batch parameter currently useless? I am wondering what the function of n_batch is.
Yes, it is. I played around a lot with n_batch, n_threads, etc., but it's all useless. Further, I tried using the futures thread pool as well as the threading module to simulate parallelism. All failed with a crashing kernel. As long as a single thread is running, everything is fine, proving that it's not the thread process itself. However, once a second thread is started and tries to propagate, it crashes. I assume the weights are locked and cannot be accessed by two separate asynchronous processes.
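For context on that question: in llama-cpp-python, n_batch controls how many prompt tokens are evaluated per decode call; it does not make the library serve multiple requests concurrently, so tuning it will not produce parallel request handling. A minimal sketch of where it sits in the API (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_batch sets the prompt-processing chunk size (tokens per decode call);
# it does not make the model serve multiple requests at once.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_ctx=4096,
    n_batch=512,
    n_threads=8,
)

out = llm("Q: What does n_batch control? A:", max_tokens=64)
print(out["choices"][0]["text"])
```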
Any update on this?
Any updates? @abetlen I think this is highly anticipated by many...
The issue's task list:
- Use `llama_decode` instead of the deprecated `llama_eval` in the `Llama` class
- Add batched inference to the `generate` and `create_completion` methods in the `Llama` class