Add Llama.cpp Support #183
Conversation
Hey @bayedieng just checking in. Anything I can help with to move this along?
Hey @AlexCheema, I was indeed having trouble understanding the codebase initially, but it's clearer now (inheritance can be confusing). I wrote a basic sharded inference engine class and will proceed with the implementation. My plan is to largely follow the PyTorch and tinygrad inference engine implementations, with the only exception being the tokenizer part of the problem, since the llama.cpp API's tokenizer is tied to the model. I will let you know if I have any further questions, and I'm aiming to have at least one working model running inference later today.
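(For context, a minimal sketch of what such a sharded engine class might look like, assuming exo's abstract `InferenceEngine` base class and `Shard` dataclass as used by the PyTorch and tinygrad engines; the class name, method signatures, and module paths below are approximations, not code from this PR.)

```python
# Hypothetical skeleton only: mirrors the shape of exo's existing inference
# engines, but signatures and module paths may differ from the real codebase.
from typing import Optional

import numpy as np

from exo.inference.inference_engine import InferenceEngine  # assumed path
from exo.inference.shard import Shard                       # assumed path


class LlamaCppShardedInferenceEngine(InferenceEngine):
    """Runs only the contiguous layer range assigned to this node's shard."""

    def __init__(self):
        self.shard: Optional[Shard] = None

    async def ensure_shard(self, shard: Shard):
        # Load the GGUF/GGML weights for layers [shard.start_layer, shard.end_layer]
        # once, and reuse them for subsequent requests on the same shard.
        if self.shard == shard:
            return
        # ... load weights / build the compute graph for this layer range ...
        self.shard = shard

    async def infer_tensor(self, request_id: str, shard: Shard, input_data: np.ndarray):
        await self.ensure_shard(shard)
        # ... run this node's layers on input_data and return the hidden state
        # to be forwarded to the node holding the next shard ...
        raise NotImplementedError
```

Per the plan above, tokenization would be handled outside this class.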
How is this going @bayedieng? Anything I can help with?
I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights and therefore can't be sharded. What, then, would the requirements be for this PR to be considered done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML does not have automatic model generation similar to PyTorch's.
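(For context, the sharding idea here is to load only a contiguous range of transformer layers per node. Below is a rough sketch of how that selection could be done straight from a GGUF file, using the `gguf` Python package that ships with llama.cpp; `load_shard_tensors` is a hypothetical helper, and the `blk.N.*` tensor naming follows llama.cpp's Llama GGUF layout.)

```python
# Rough sketch: pick out only the tensors for one shard's layer range from a GGUF
# file. `load_shard_tensors` is hypothetical; `gguf.GGUFReader` is the reader
# package that ships with llama.cpp.
from gguf import GGUFReader


def load_shard_tensors(gguf_path: str, start_layer: int, end_layer: int) -> dict:
    """Return raw tensor data for layers start_layer..end_layer plus shared tensors."""
    reader = GGUFReader(gguf_path)
    selected = {}
    for tensor in reader.tensors:
        name = tensor.name  # e.g. "blk.12.attn_q.weight" for Llama-style models
        if name.startswith("blk."):
            layer_idx = int(name.split(".")[1])
            if not (start_layer <= layer_idx <= end_layer):
                continue  # this layer belongs to another node's shard
        selected[name] = tensor.data  # memory-mapped view of the (quantized) data
    return selected


# Example (hypothetical file name): first 16 of 32 blocks of an 8B Llama model.
# weights = load_shard_tensors("Meta-Llama-3-8B-Q4_K_M.gguf", 0, 15)
```

This per-tensor access is what the high-level llama.cpp API doesn't expose, which is what forces the drop down to GGML (or another GGUF-aware library) for sharding.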
Let's start with Llama, which would support Llama, Phi and Mistral model weights since they're all based on Llama. That would be sufficient for this bounty. We can add the other models in a follow-up bounty.
How's this going @bayedieng? Anything I can help with?
I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the work week.
Checking in again.
Thanks. I'm quite used to the exo codebase at this point; however, I've been struggling quite a bit with the GGML API because it's very low level. It is largely undocumented, and essentially any error leads to a crash with very little information as to why. This will likely take much more time than I had initially anticipated. With that said, I think a simpler way to support llama.cpp would be to use candle: it also supports the GGUF weight format that llama.cpp uses, and it has Python bindings with a quantized Llama implementation, which would be essentially the same thing I would have implemented with GGML but with a much simpler API. I fully understand if you'd still want to go forward with llama.cpp, as candle may not fill your needs, but using candle would be much simpler.
I made a few simpler changes in another branch that conflicts with this one, so I will be closing this PR and opening a new one.
This PR adds support for Llama.cpp and closes #167.