Add batched inference #771
Silly question, does that also support parallel decoding in llama.cpp?
Does the newest version support "batched decoding" from llama.cpp? https://github.com/ggerganov/llama.cpp/pull/3228
This would be a huge improvement for production use. I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and was able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via non-batched inference.
@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example? I'm trying to do the same. Thanks!
There are two new flags in llama.cpp to add to your normal command.
Thanks, that works for me with those flags.
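For anyone trying to reproduce this, here is a minimal sketch of that setup, purely as an illustration: it assumes the flags in question are `--parallel` (`-np`) and `--cont-batching` (`-cb`) from the parallel-decoding work, the server's default listen address, and its native `/completion` endpoint.

```python
# Sketch: fire several completion requests at a llama.cpp ./server instance
# started with parallel decoding enabled, e.g. (assumed flags and paths):
#   ./server -m ./models/model.gguf -c 4096 --parallel 4 --cont-batching
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER = "http://127.0.0.1:8080"  # default llama.cpp server address (assumed)


def complete(prompt: str) -> str:
    # /completion is the server's native endpoint; n_predict caps the output length
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": prompt, "n_predict": 128},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]


prompts = [f"Write a haiku about the number {i}." for i in range(4)]

# Four requests in flight at once; with continuous batching the server
# decodes them together instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    for text in pool.map(complete, prompts):
        print(text)
```

With several requests in flight at once, the server batches the decode steps, which is where the multi-x tokens/sec gain described above comes from.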
When will this feature be available? I hope someone can help solve this problem, please.
Let me know if there are any roadblocks - I might be able to provide some insight.
Hey @ggerganov I missed this earlier. Thank you, yeah I just need some quick clarifications around the KV cache behaviour. The following is my understanding of the KV cache behaviour:
Is this correct?
Yes, all of this is correct.
This call also sets a flag that, upon the next …

Will soon add a couple of functions to the API that can be useful for monitoring the KV cache state.

One of the main applications of …
I updated the version and saw the batch configuration, but when I ran it, batching didn't take effect. When I send multiple requests, it still handles them one by one. My startup configuration is as follows:
python3 -m llama_cpp.server --model ./models/WizardLM-13B-V1.2/ggml-model-f16-Q5.gguf --n_gpu_layers 2 --n_ctx 8000 --n_batch 512 --n_threads 10 --n_threads_batch 10 --interrupt_requests False
Is there something wrong with my configuration? @abetlen
@zpzheng It’s a draft PR so it’s not complete - you can see “Add support for parallel requests” is in the todo list.
@abetlen Is there any progress on this?
+1, would be really great to have this
+1, would be so great to have this!
+1
+1
+1
+1
Guys, any other solution for this?
+1
+1
+1 I use llama-cpp-python as (presently) the sole first-class-supported backend in a project I've been developing. I'm not sure how much it would benefit from batching, as I've yet to do performance testing against other backends, but I feel like it could be a significant boon. What's the current status of this and #951? I might be interested in taking a look at this, but I'm not certain I'd bring much to the table; I'll have to review the related code more.
I would not do this. Batching is super important, and I had to move to llama.cpp's server (easy to deploy w/ Docker or Python, or even just the exe) because of the lack of features in llama-cpp-python. If you're doing CPU inference, llama.cpp is a great option; otherwise I would use something like vLLM, BentoML's OpenLLM, or Predibase's LoRAx.
This is something I was considering; appreciate the advice. I'll likely end up doing that. I had to do the same with Ollama, but I wasn't on Ollama long and by no means felt it was the right fit for the job; support for it merely started from a peer showing interest and my compulsion to explore all viable options where possible. I'm doing GPU inference, and sadly Nvidia's antics have hindered me from getting things running in a container just the way I'd like up until now... but that's another story. I haven't tried vLLM, OpenLLM, or LoRAx; llama.cpp and llama-cpp-python have generally been all I've needed up till now (and for longer, I hope -- I really appreciate the work done by all contributors to both projects; exciting that we're at least where we are today). Are those libraries any good if you're looking to do something with the perplexity of, say, q6_k on a (VRAM) budget? I'd prefer to be able to run it on my 1080 Ti, even when I have access to more VRAM in another environment.
I am dealing with this right now -- and unfortunately llama-cpp-python's server has the only completions endpoint I can find that supports logprobs properly (a random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python. vLLM is great for big GPUs; however, it doesn't support gguf-quantized models or running at full precision on smaller GPUs (I use T4s). Whereas ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vLLM can do), I am now creating instances with 1 GPU and scaling horizontally. If anyone knows of a better OpenAI-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.
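To make the logprobs point concrete, a small sketch against llama-cpp-python's OpenAI-compatible server; the host, port, and model path here are assumptions, and the response shape follows the OpenAI completions format the server mimics.

```python
import requests

# llama_cpp.server started separately, e.g. (model path is a placeholder):
#   python3 -m llama_cpp.server --model ./models/model.gguf
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",  # default uvicorn host/port (assumed)
    json={
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "temperature": 0,
        "logprobs": 5,  # OpenAI-style: return the top-5 token logprobs per position
    },
    timeout=60,
)
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice["text"])
print(choice["logprobs"]["top_logprobs"][0])  # token -> logprob for the first generated position
```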
Have you tried https://github.com/ollama/ollama?
Ollama doesn't support batched inference; what a silly suggestion.
In case this is useful to others, as a workaround until this is implemented, I wrote a tiny Python library that wraps the upstream server binary.
This was needed because the raw server binary supports batched inference. All the heavy logic is already in the upstream C server, so all I needed to do was handle the CLI and subprocess logic.
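This is not that library's actual API, just a minimal sketch of the same idea under a few assumptions: a locally built ./server binary, the upstream `--parallel`/`--cont-batching` flags for batched decoding, and a `/health` endpoint for readiness checks.

```python
# Sketch of a thin wrapper that launches the llama.cpp server binary as a
# subprocess and waits for it to come up. Flags, paths, and the /health
# endpoint are assumptions about the upstream server, not a documented API.
import subprocess
import time

import requests


class LlamaServer:
    def __init__(self, binary="./server", model="./model.gguf", port=8080, parallel=4):
        self.url = f"http://127.0.0.1:{port}"
        self.proc = subprocess.Popen(
            [
                binary, "-m", model, "--port", str(port),
                "--parallel", str(parallel), "--cont-batching",
            ],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        self._wait_ready()

    def _wait_ready(self, timeout=60.0):
        # Poll until the server answers, or give up after `timeout` seconds.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"{self.url}/health", timeout=1).ok:
                    return
            except requests.ConnectionError:
                time.sleep(0.5)
        self.close()
        raise RuntimeError("llama.cpp server did not become ready in time")

    def close(self):
        self.proc.terminate()
        self.proc.wait()
```

Completion requests then go to the subprocess's HTTP endpoint, exactly as in the earlier client sketch.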
Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.
Yes, but, no.
So, is the n_batch parameter currently useless? I am wondering what the function of n_batch is.
Yes, it is. I played around a lot with n_batch, n_threads, etc., but it's all useless. Further, I tried using the futures thread pool as well as the threading module to simulate parallelism. All failed with a crashing kernel. As long as a single thread is running, everything is fine, proving that it's not the thread process itself. However, once a second thread is started and tries to propagate, it crashes. I assume the weights are locked and cannot be accessed by two separate asynchronous processes.
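For context on that question: in llama-cpp-python, n_batch controls how many prompt tokens are evaluated per decode call; it does not make the library serve multiple requests concurrently, so tuning it will not produce parallel request handling. A minimal sketch of where it sits in the API (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_batch sets the prompt-processing chunk size (tokens per decode call);
# it does not make the model serve multiple requests at once.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_ctx=4096,
    n_batch=512,
    n_threads=8,
)

out = llm("Q: What does n_batch control? A:", max_tokens=64)
print(out["choices"][0]["text"])
```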
Any update on this?
Any updates? @abetlen I think this is highly anticipated by many...
The issue's task list:
- Use `llama_decode` instead of the deprecated `llama_eval` in the `Llama` class
- Add batched inference to the `generate` and `create_completion` methods in the `Llama` class