
Feature request: Batched inference for llama.cpp models #261

Open
ggbetz opened this issue Nov 1, 2023 · 5 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@ggbetz commented Nov 1, 2023

Hi Luca,

llama.cpp / llama-cpp-python are apparently going to allow for batched inference [1] [2].

Is this something you have on your radar, and are you planning to reflect it in the lmql-server as well? So far, we have batch_size=1 for llama.cpp models, right?
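For reference, each llama-cpp-python call currently handles a single prompt, roughly like this (a minimal sketch; the model path and prompt are just placeholders):

```python
# Current situation: one prompt per call, i.e. effectively batch_size=1.
# Model path and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Batched inference would allow several such prompts to be evaluated together in one forward pass.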

Thanks again for creating and maintaining this great project!

Gregor

lbeurerkellner added the enhancement label on Nov 2, 2023
@lbeurerkellner (Collaborator)

Thanks for raising this, we will keep it on our radar. It should be simple to add support for this, once it is upstreamed in llama.cpp/llama-cpp-python.

@reuank (Contributor) commented Nov 28, 2023

Just a quick update on this topic. It looks like llama-cpp-python will add that feature very soon: abetlen/llama-cpp-python#951.

@lbeurerkellner (Collaborator)

Marking this as a good first issue for backend work. The llama.cpp backend lives in https://github.com/eth-sri/lmql/blob/main/src/lmql/models/lmtp/backends/llama_cpp_model.py and is currently limited to max_batch_size of 1.
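For anyone picking this up: the core change is to collect pending generation requests and run them through the model together instead of one at a time. Below is a rough sketch of the idea only (hypothetical helper names, not the actual LMTP backend interface):

```python
import asyncio

class MicroBatcher:
    """Groups pending prompts and flushes them to the model in batches of up
    to max_batch_size. Hypothetical helper for illustration, not LMQL code."""

    def __init__(self, batched_generate, max_batch_size=8, max_wait_ms=5):
        # batched_generate: callable taking a list of prompts and returning
        # a list of completions (one batched forward pass)
        self.batched_generate = batched_generate
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.pending = []  # list of (prompt, future) pairs

    async def submit(self, prompt):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((prompt, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()
        else:
            # give concurrent requests a short window to join the batch
            loop.call_later(self.max_wait_ms / 1000, self._flush)
        return await fut

    def _flush(self):
        if not self.pending:
            return
        batch = self.pending[: self.max_batch_size]
        self.pending = self.pending[self.max_batch_size :]
        results = self.batched_generate([p for p, _ in batch])
        for (_, fut), result in zip(batch, results):
            if not fut.done():
                fut.set_result(result)


# Usage sketch with a stand-in for the model:
async def main():
    batcher = MicroBatcher(lambda prompts: [p.upper() for p in prompts])
    print(await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"])))

asyncio.run(main())
```

In the real backend, batched_generate would presumably map onto llama.cpp's batched decoding once it is exposed through llama-cpp-python, and max_batch_size would become configurable instead of being fixed at 1.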

lbeurerkellner added the good first issue label on Feb 27, 2024
@Saibo-creator (Contributor)

It looks like the batch inference feature mentioned earlier is still in development. I'll hold off for now.

@balu54 commented Jul 23, 2024

Will parallel inference come with the batched inference feature?

> Marking this as a good first issue for backend work. The llama.cpp backend lives in https://github.com/eth-sri/lmql/blob/main/src/lmql/models/lmtp/backends/llama_cpp_model.py and is currently limited to max_batch_size of 1.

Is parallel inference included in batched inference?
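To make the question concrete: by parallel inference I mean several requests being in flight at the same time, roughly like this (a minimal sketch with a stand-in generate function, not the LMQL or llama.cpp API):

```python
import asyncio

async def generate(prompt: str) -> str:
    # stand-in for a request to the model server; real latency would come
    # from the model's forward pass, simulated here with a short sleep
    await asyncio.sleep(0.1)
    return f"completion for: {prompt!r}"

async def main():
    prompts = ["What is LMQL?", "What is llama.cpp?", "What is batching?"]
    # all requests are in flight at once (parallel from the client side)
    results = await asyncio.gather(*(generate(p) for p in prompts))
    for completion in results:
        print(completion)

asyncio.run(main())
```

A backend with batched inference could group such concurrent requests and run them through the model in a single forward pass.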
