Add Llama.cpp Support #183
Conversation
Hey @bayedieng just checking in. Anything I can help with to move this along?
Hey @AlexCheema, I was indeed having trouble understanding the codebase initially, but it's clearer now (inheritance can be confusing). I wrote a basic sharded inference engine class and will proceed with the implementation. My plan is to largely follow the PyTorch and tinygrad inference engine implementations, with the only exception being the tokenizer part of the problem, since the llama.cpp API's tokenizer is tied to the model. I will let you know if I have any further questions, and I'm aiming to have at least one working model running inference later today.
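(For context, a minimal sketch of what such a sharded engine class might look like, assuming exo's abstract `InferenceEngine` base class and `Shard` dataclass as used by the PyTorch and tinygrad engines; the class name, method signatures, and module paths below are approximations, not code from this PR.)

```python
# Hypothetical skeleton only: mirrors the shape of exo's existing inference
# engines, but signatures and module paths may differ from the real codebase.
from typing import Optional

import numpy as np

from exo.inference.inference_engine import InferenceEngine  # assumed path
from exo.inference.shard import Shard                       # assumed path


class LlamaCppShardedInferenceEngine(InferenceEngine):
    """Runs only the contiguous layer range assigned to this node's shard."""

    def __init__(self):
        self.shard: Optional[Shard] = None

    async def ensure_shard(self, shard: Shard):
        # Load the GGUF/GGML weights for layers [shard.start_layer, shard.end_layer]
        # once, and reuse them for subsequent requests on the same shard.
        if self.shard == shard:
            return
        # ... load weights / build the compute graph for this layer range ...
        self.shard = shard

    async def infer_tensor(self, request_id: str, shard: Shard, input_data: np.ndarray):
        await self.ensure_shard(shard)
        # ... run this node's layers on input_data and return the hidden state
        # to be forwarded to the node holding the next shard ...
        raise NotImplementedError
```

Per the plan above, tokenization would be handled outside this class.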
How is this going @bayedieng? Anything I can help with?
I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights and therefore can't be sharded. What, then, would the requirements be for this PR to be considered done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML does not have automatic model generation similar to PyTorch's.
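(For context, the sharding idea here is to load only a contiguous range of transformer layers per node. Below is a rough sketch of how that selection could be done straight from a GGUF file, using the `gguf` Python package that ships with llama.cpp; `load_shard_tensors` is a hypothetical helper, and the `blk.N.*` tensor naming follows llama.cpp's Llama GGUF layout.)

```python
# Rough sketch: pick out only the tensors for one shard's layer range from a GGUF
# file. `load_shard_tensors` is hypothetical; `gguf.GGUFReader` is the reader
# package that ships with llama.cpp.
from gguf import GGUFReader


def load_shard_tensors(gguf_path: str, start_layer: int, end_layer: int) -> dict:
    """Return raw tensor data for layers start_layer..end_layer plus shared tensors."""
    reader = GGUFReader(gguf_path)
    selected = {}
    for tensor in reader.tensors:
        name = tensor.name  # e.g. "blk.12.attn_q.weight" for Llama-style models
        if name.startswith("blk."):
            layer_idx = int(name.split(".")[1])
            if not (start_layer <= layer_idx <= end_layer):
                continue  # this layer belongs to another node's shard
        selected[name] = tensor.data  # memory-mapped view of the (quantized) data
    return selected


# Example (hypothetical file name): first 16 of 32 blocks of an 8B Llama model.
# weights = load_shard_tensors("Meta-Llama-3-8B-Q4_K_M.gguf", 0, 15)
```

This per-tensor access is what the high-level llama.cpp API doesn't expose, which is what forces the drop down to GGML (or another GGUF-aware library) for sharding.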
Let's start with Llama, which would support Llama, Phi and Mistral model weights since they're all based on Llama. That would be sufficient for this bounty. We can add the other models in a follow-up bounty.
How's this going @bayedieng? Anything I can help with?
I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the work week.
Checking in again.
Thanks. I'm quite used to the exo codebase at this point; however, I've been struggling quite a bit with the GGML API because it's very low level. It is largely undocumented, and essentially any error leads to a crash with very little information as to why. This will likely take much more time than I had initially anticipated. With that said, I think a simpler way to support llama.cpp would be to use candle: it also supports the GGUF weight format that llama.cpp uses, and it has Python bindings with a quantized Llama implementation, which would be essentially the same thing I would have implemented with GGML but with a much simpler API. I fully understand if you'd still want to go forward with llama.cpp, as candle may not fill your needs, but using candle would be much simpler.
I made a few simpler changes in another branch that conflicts with this one, so I will be closing this PR and opening a new one.
This PR adds support for Llama.cpp and closes #167.