
Add Llama.cpp Support #183

Closed
wants to merge 9 commits

Conversation

@bayedieng (Contributor) commented Aug 27, 2024:

This PR adds support for Llama.cpp and closes #167.

@AlexCheema (Contributor) commented Sep 5, 2024:

Hey @bayedieng just checking in. Anything I can help with to move this along?

@bayedieng (Contributor, Author) commented:

Hey @AlexCheema, I was indeed having trouble understanding the codebase at first, but it's clearer now (inheritance can be confusing). I've written a basic sharded inference engine class and will proceed with the implementation.

My plan is to largely follow the pytorch and tinygrad inference engine implementations, with the one exception of skipping the tokenizer part of the problem. The llama.cpp API's tokenizer is tied to an instantiated Llama object. Also, the tokenizer defined in the other implementations doesn't seem to tokenize inputs; rather, a chat template is applied in the handle_chat_completions function of the ChatGPT API. I will implement tokenization manually later in the call stack.

I will let you know if I have any further questions, and I'm aiming to have at least one model running inference later today.
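
To make the plan concrete, here's a rough sketch of the engine I have in mind. The class and method names (infer_prompt, encode, decode) are my assumptions based on how I read the other exo engines, not a final interface; llama_cpp.Llama is the real binding, and its tokenizer only becomes available once the model is instantiated, which is why tokenization is deferred into the engine:

```python
# Sketch only: interface names are assumptions modelled on the other engines.
import numpy as np
from llama_cpp import Llama


class LlamaCppShardedInferenceEngine:
    def __init__(self, model_path: str):
        # Loading the model also gives us its tokenizer, so tokenization
        # happens here rather than in handle_chat_completions.
        self.llm = Llama(model_path=model_path, n_ctx=4096)

    def encode(self, prompt: str) -> list[int]:
        return self.llm.tokenize(prompt.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return self.llm.detokenize(tokens).decode("utf-8", errors="ignore")

    async def infer_prompt(self, request_id: str, shard, prompt: str) -> np.ndarray:
        # Placeholder: a real sharded implementation would run only the
        # layers assigned to this node and pass the hidden state onward.
        tokens = self.encode(prompt)
        return np.array(tokens, dtype=np.int64)
```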

[Review thread on exo/models.py, marked outdated and resolved]
@AlexCheema (Contributor) commented:

How is this going @bayedieng? Anything I can help with?

@bayedieng (Contributor, Author) commented:

I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights, so the model can't be sharded through it. What, then, would the requirements be for considering this PR done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML doesn't have automatic model generation similar to pytorch.
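
For context on why the high-level API is a blocker, here's a rough illustration. The Shard below is my approximation of exo's dataclass (field names assumed from how the other engines partition layers); the llama-cpp-python calls in the comment are real, but they only run the whole model:

```python
from dataclasses import dataclass


# Approximation of exo's Shard: each node owns a contiguous range of
# transformer layers out of the model's total.
@dataclass(frozen=True)
class Shard:
    model_id: str
    start_layer: int
    end_layer: int
    n_layers: int


# With the high-level llama-cpp-python API the forward pass is opaque:
#
#   from llama_cpp import Llama
#   llm = Llama(model_path="model.gguf")
#   out = llm.create_completion("Hello")  # runs every layer internally
#
# There is no supported way to execute only layers start_layer..end_layer and
# hand the intermediate activations to the next node, which is why layer-wise
# sharding would mean dropping down to GGML.
```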

@AlexCheema (Contributor) commented Oct 3, 2024:

> I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights, so the model can't be sharded through it. What, then, would the requirements be for considering this PR done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML doesn't have automatic model generation similar to pytorch.

Let's start with Llama, which would cover Llama, Phi, and Mistral model weights since they're all based on the Llama architecture. That would be sufficient for this bounty. We can add the other models in a follow-up bounty.

@AlexCheema (Contributor) commented:

How's this going @bayedieng? Anything I can help with?

@bayedieng (Contributor, Author) commented:

I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the working week.

@AlexCheema (Contributor) commented:

> I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the working week.

Checking in again.
You can use the PR for TorchInferenceEngine as a reference: #139

@bayedieng (Contributor, Author) commented:

Thanks. I'm quite used to the exo codebase at this point; however, I've been struggling quite a bit with the GGML API, as it's very low-level. It is largely undocumented, and essentially any error leads to a crash with very little information as to why.

This will likely take much more time than I had initially anticipated. That said, I think a simpler way to support llama.cpp models would be to use candle: it also supports the GGUF weight format that llama.cpp uses, and it has Python bindings with an implementation of quantized Llama, which is essentially what I would have built with GGML but behind a much simpler API.

I fully understand if you'd still want to go forward with llama.cpp, since candle may not fill your needs, but using candle would be much simpler.

@bayedieng (Contributor, Author) commented:

I made a few simpler changes in another branch that conflicts with this one, so I'll be closing this PR and opening a new one.

@bayedieng bayedieng closed this Oct 12, 2024
Successfully merging this pull request may close these issues.

[BOUNTY - $500] Llama.cpp inference engine