
[BOUNTY - $100] Support running any model from huggingface #357

Open
AlexCheema opened this issue Oct 16, 2024 · 10 comments

@AlexCheema
Contributor

Like this: https://x.com/reach_vb/status/1846545312548360319

exo run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

This should work out of the box with #139
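For illustration, a minimal sketch of what resolving such a model spec could look like; the `hf.co/<repo>:<quant>` format and the names below are assumptions taken from the example command, not exo's actual CLI code:

```python
# Hypothetical parsing of a spec like "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0".
# The "<host>/<repo>:<quant>" layout is inferred from the example above; exo's real
# CLI may resolve these strings differently.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HFModelSpec:
    repo_id: str          # e.g. "bartowski/Llama-3.2-1B-Instruct-GGUF"
    quant: Optional[str]  # e.g. "Q8_0"; None means "use the repo default"

def parse_model_spec(spec: str) -> HFModelSpec:
    # Strip an optional Hugging Face host prefix.
    for prefix in ("hf.co/", "huggingface.co/", "https://hf.co/", "https://huggingface.co/"):
        if spec.startswith(prefix):
            spec = spec[len(prefix):]
            break
    repo_id, _, quant = spec.partition(":")
    return HFModelSpec(repo_id=repo_id, quant=quant or None)

print(parse_model_spec("hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0"))
```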

@AlexCheema AlexCheema changed the title Support running any model from huggingface [BOUNTY - $100] Support running any model from huggingface Oct 16, 2024
@komikat

komikat commented Oct 16, 2024

Hugging Face Transformers can run GGUF files, but it first dequantizes them to fp32, defeating the purpose altogether. We could run this directly on llama.cpp instead of using the hf/torch inference engine, but I'm not quite sure about that yet.

PS: #335 is still WIP, but this feature could probably be based on it. I can work on accelerating the progress as far as llama.cpp inference is concerned.
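For context on the dequantization issue, Transformers can load a GGUF checkpoint through the `gguf_file` argument, but it converts the weights to full precision on load; a minimal sketch, with the file name inside the repo assumed:

```python
# Sketch: load a GGUF file through Hugging Face Transformers.
# Note: Transformers dequantizes the GGUF weights on load, so memory use is that
# of the unquantized model - the drawback discussed in this comment.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bartowski/Llama-3.2-1B-Instruct-GGUF"
gguf_file = "Llama-3.2-1B-Instruct-Q8_0.gguf"  # assumed filename within the repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```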

@AReid987

@AlexCheema I would like to work on this. Please assign it to me.

@AlexCheema
Contributor Author

I've assigned you both, @komikat @AReid987. You will both receive the bounty for any meaningful work towards this. Feel free to work independently or together; it's up to you.

@komikat

komikat commented Oct 17, 2024

Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method dequantises the model, which is suboptimal.
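If the gguf-split route is taken, a rough sketch of driving it from Python follows; the binary name and flags are assumptions to verify against your llama.cpp build (newer builds ship it as `llama-gguf-split`):

```python
# Sketch: shard a GGUF file with llama.cpp's gguf-split tool via subprocess.
# Binary name and flags are assumptions - check `gguf-split --help` for the
# options available in your llama.cpp build.
import subprocess

def shard_gguf(input_path: str, output_prefix: str, max_tensors: int = 128) -> None:
    subprocess.run(
        [
            "gguf-split",
            "--split",
            "--split-max-tensors", str(max_tensors),
            input_path,
            output_prefix,
        ],
        check=True,
    )

shard_gguf("Llama-3.2-1B-Instruct-Q8_0.gguf", "Llama-3.2-1B-Instruct-Q8_0-shard")
```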

@AlexCheema
Contributor Author

AlexCheema commented Oct 17, 2024


exo supports multiple inference backends through the InferenceEngine interface. It's not enough to support just llama.cpp.
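For anyone new to the codebase, a simplified sketch of what a pluggable engine abstraction can look like; the class and method names below are illustrative, not copied from exo's actual InferenceEngine:

```python
# Illustrative sketch of a pluggable inference-engine abstraction; exo's real
# InferenceEngine may use different method names and signatures.
from abc import ABC, abstractmethod
import numpy as np

class EngineSketch(ABC):
    @abstractmethod
    async def infer_prompt(self, request_id: str, shard, prompt: str) -> np.ndarray:
        """Run this node's model shard on a text prompt and return its output."""

    @abstractmethod
    async def infer_tensor(self, request_id: str, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from activations handed over by another node."""

# A GGUF-capable backend (llama.cpp, MLX, or a dequantizing torch path) would be
# one more implementation of this abstraction rather than a replacement for the others.
```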

@komikat

komikat commented Oct 17, 2024

I'm not sure if there is a way to run .gguf files on PyTorch directly. Hugging Face can do it, but the weights would have to be dequantised. Since there is already a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

@AlexCheema
Contributor Author


Sure, let's start with that.

@bayedieng
Contributor


I'm using this library to parse the GGUF files; it takes the raw byte tensors and converts them to NumPy arrays. If you intend to load the weights into PyTorch, you could just convert the NumPy arrays to torch tensors.
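Since the comment above doesn't name the library, here is one possible sketch using the `gguf` package that ships with llama.cpp (an assumption, not necessarily the library referred to):

```python
# Sketch: read GGUF tensors as NumPy arrays and hand them to torch.
# Assumes the `gguf` package (pip install gguf) purely as one example; the
# comment above may be using a different parser.
import gguf
import numpy as np
import torch

reader = gguf.GGUFReader("Llama-3.2-1B-Instruct-Q8_0.gguf")  # illustrative file name

weights = {}
for tensor in reader.tensors:
    # tensor.data is a NumPy view over the raw (possibly still quantized) bytes;
    # quantized tensor types would still need dequantizing before torch can use them.
    weights[tensor.name] = torch.from_numpy(np.array(tensor.data))

print(f"loaded {len(weights)} tensors")
```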

@komikat

komikat commented Oct 24, 2024

@bayedieng there's also a llama.cpp-to-torch converter.

@komikat

komikat commented Oct 24, 2024

MLX has documentation on using GGUF files for generation; I will integrate that into exo for now.
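A minimal sketch of that MLX path, assuming `mlx.core`'s GGUF support and an illustrative file name:

```python
# Sketch: load GGUF weights and metadata with MLX.
# mx.load understands .gguf files; return_metadata=True also returns the GGUF
# key/value metadata needed to rebuild the model config.
import mlx.core as mx

weights, metadata = mx.load("Llama-3.2-1B-Instruct-Q8_0.gguf", return_metadata=True)

print(f"{len(weights)} tensors loaded")
print("architecture:", metadata.get("general.architecture"))
```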
