
[BOUNTY - $100] Support running any model from huggingface #357

Open
AlexCheema opened this issue Oct 16, 2024 · 10 comments

@AlexCheema
Contributor

Like this: https://x.com/reach_vb/status/1846545312548360319

exo run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

This should work out of the box with #139
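For illustration, a minimal sketch of what resolving such a model spec could look like; the `hf.co/<repo>:<quant>` format and the names below are assumptions taken from the example command, not exo's actual CLI code:

```python
# Hypothetical parsing of a spec like "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0".
# The "<host>/<repo>:<quant>" layout is inferred from the example above; exo's real
# CLI may resolve these strings differently.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HFModelSpec:
    repo_id: str          # e.g. "bartowski/Llama-3.2-1B-Instruct-GGUF"
    quant: Optional[str]  # e.g. "Q8_0"; None means "use the repo default"

def parse_model_spec(spec: str) -> HFModelSpec:
    # Strip an optional Hugging Face host prefix.
    for prefix in ("hf.co/", "huggingface.co/", "https://hf.co/", "https://huggingface.co/"):
        if spec.startswith(prefix):
            spec = spec[len(prefix):]
            break
    repo_id, _, quant = spec.partition(":")
    return HFModelSpec(repo_id=repo_id, quant=quant or None)

print(parse_model_spec("hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0"))
```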

@AlexCheema AlexCheema changed the title Support running any model from huggingface [BOUNTY - $100] Support running any model from huggingface Oct 16, 2024
@komikat

komikat commented Oct 16, 2024

Hugging Face Transformers can run GGUF files, but it first dequantizes them to fp32, defeating the purpose altogether. We could run this directly on llama.cpp instead of using the hf/torch inference engine, but I'm not quite sure about that yet.

PS: #335 is still WIP, but this feature could probably be based on it. I can work on accelerating the progress as far as llama.cpp inference is concerned.
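For context on the dequantization issue, Transformers can load a GGUF checkpoint through the `gguf_file` argument, but it converts the weights to full precision on load; a minimal sketch, with the file name inside the repo assumed:

```python
# Sketch: load a GGUF file through Hugging Face Transformers.
# Note: Transformers dequantizes the GGUF weights on load, so memory use is that
# of the unquantized model - the drawback discussed in this comment.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bartowski/Llama-3.2-1B-Instruct-GGUF"
gguf_file = "Llama-3.2-1B-Instruct-Q8_0.gguf"  # assumed filename within the repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```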

@AReid987

@AlexCheema I would like to work on this. Please assign it to me.

@AlexCheema
Contributor Author

I've assigned you both, @komikat @AReid987. You will both receive the bounty for any meaningful work towards this. Feel free to work independently or together; it's up to you.

@komikat

komikat commented Oct 17, 2024

Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method dequantises the model, which is suboptimal.
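If the gguf-split route is taken, a rough sketch of driving it from Python follows; the binary name and flags are assumptions to verify against your llama.cpp build (newer builds ship it as `llama-gguf-split`):

```python
# Sketch: shard a GGUF file with llama.cpp's gguf-split tool via subprocess.
# Binary name and flags are assumptions - check `gguf-split --help` for the
# options available in your llama.cpp build.
import subprocess

def shard_gguf(input_path: str, output_prefix: str, max_tensors: int = 128) -> None:
    subprocess.run(
        [
            "gguf-split",
            "--split",
            "--split-max-tensors", str(max_tensors),
            input_path,
            output_prefix,
        ],
        check=True,
    )

shard_gguf("Llama-3.2-1B-Instruct-Q8_0.gguf", "Llama-3.2-1B-Instruct-Q8_0-shard")
```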

@AlexCheema
Contributor Author

AlexCheema commented Oct 17, 2024


exo supports multiple inference backends through the InferenceEngine interface. It's not enough to support just llama.cpp.
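For anyone new to the codebase, a simplified sketch of what a pluggable engine abstraction can look like; the class and method names below are illustrative, not copied from exo's actual InferenceEngine:

```python
# Illustrative sketch of a pluggable inference-engine abstraction; exo's real
# InferenceEngine may use different method names and signatures.
from abc import ABC, abstractmethod
import numpy as np

class EngineSketch(ABC):
    @abstractmethod
    async def infer_prompt(self, request_id: str, shard, prompt: str) -> np.ndarray:
        """Run this node's model shard on a text prompt and return its output."""

    @abstractmethod
    async def infer_tensor(self, request_id: str, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from activations handed over by another node."""

# A GGUF-capable backend (llama.cpp, MLX, or a dequantizing torch path) would be
# one more implementation of this abstraction rather than a replacement for the others.
```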

@komikat

komikat commented Oct 17, 2024

I'm not sure if there is a way to run .gguf files on PyTorch directly. Hugging Face can do it, but the weights would have to be dequantised. Since there is already a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

@AlexCheema
Contributor Author


Sure, let's start with that.

@bayedieng
Contributor


I'm using this library to parse the GGUF files; it takes the raw byte tensors and converts them to NumPy arrays. If you intend to load the weights into PyTorch, you could just convert the NumPy arrays to torch tensors.
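Since the comment above doesn't name the library, here is one possible sketch using the `gguf` package that ships with llama.cpp (an assumption, not necessarily the library referred to):

```python
# Sketch: read GGUF tensors as NumPy arrays and hand them to torch.
# Assumes the `gguf` package (pip install gguf) purely as one example; the
# comment above may be using a different parser.
import gguf
import numpy as np
import torch

reader = gguf.GGUFReader("Llama-3.2-1B-Instruct-Q8_0.gguf")  # illustrative file name

weights = {}
for tensor in reader.tensors:
    # tensor.data is a NumPy view over the raw (possibly still quantized) bytes;
    # quantized tensor types would still need dequantizing before torch can use them.
    weights[tensor.name] = torch.from_numpy(np.array(tensor.data))

print(f"loaded {len(weights)} tensors")
```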

@komikat

komikat commented Oct 24, 2024

@bayedieng there's also a llama.cpp-to-torch converter.

@komikat

komikat commented Oct 24, 2024

MLX has documentation on using GGUF files for generation; I will integrate that into exo for now.
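A minimal sketch of that MLX path, assuming `mlx.core`'s GGUF support and an illustrative file name:

```python
# Sketch: load GGUF weights and metadata with MLX.
# mx.load understands .gguf files; return_metadata=True also returns the GGUF
# key/value metadata needed to rebuild the model config.
import mlx.core as mx

weights, metadata = mx.load("Llama-3.2-1B-Instruct-Q8_0.gguf", return_metadata=True)

print(f"{len(weights)} tensors loaded")
print("architecture:", metadata.get("general.architecture"))
```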
