-
Hi! Thanks for your post and for all the help so far 😄
Glad to hear I'm not the only one who saw the potential in this project! I think having the potential to build something this huge and making it a CLI app is not aiming high enough, heh.
To be fair, I haven't even tested this on Windows. Maybe it builds just fine. But I didn't know what flags to set to compile with AVX, and inference times without AVX are pretty bad (if you're telling me there's no multithreading on top of that, it's probably gonna be unusably slow anyway, unfortunately). I don't have a Windows machine to test this, so I didn't want to promise support for untested systems 😅
I thought about this! Even started with this design. But didn't want to force all the code to use a That said, the "bindings" in
Very good question! So far I was most concerned with whether I could do it 🤣 But now that the library exists, I'm thinking it would be pretty good to start improving this. A few things off the top of my head:
Yup, also considered that, I'd love to do this if possible :) Not sure how long it would take, but none of the tensor operations I ported seem too complicated. The code should be pretty straightforward to port to a different library, and I'm sure some of the Rust options achieve a more ergonomic (and safe!) API. We just have to keep an eye on performance, but having an already working ggml version means we can just benchmark.
If that change also helps us support GPU inference, that'd be pretty cool. But I don't want to add GPU support if that means people having to mess with CUDA or ROCm drivers and report all sorts of issues. Unless it's something portable that works out of the box for people, I'm not interested.
I'm not sure how crazy it would be to build a tensor library on top of wgpu compute shaders. Just throwing that out there for anyone who feels crazy and/or adventurous enough 🤔 But that would eventually mean tensors on wasm, which is pretty darn cool I guess?
Anyway, I'm happy to have someone else on board 😄 If there's anything I mentioned above you'd like to take the lead on, please say so! I'm not going to be making any big changes to the code in the next few days.
-
Just something to follow w.r.t. GPTQ quantization 👀
-
Also count me in for any future work! I've been obsessed with llama for the past few weeks and getting a solid Rust implementation of a modern machine learning model like this is really impressive. I might try tackling breaking the app into a library in the next few days (unless someone else beats me to it 😄)
-
I've been testing on Windows with my patches applied and it seems to work fine. It's probably not as fast as it could be, but it's plenty fast enough!
Yeah, I actually went back and forth on this. I started without any kind of checking, promptly got owned by accessing freed memory, fixed that, and then bolted on the Ideally, we'd still maintain the same
Huh, you're right - hadn't even thought about that. That's... pretty gnarly. I wonder if it's possible for the operands in a binary operation
Well... I'd actually started on this before you replied 😂 Here's the PR. I wrote a Discord bot to prove that it works, too. Hell of a thing to run a binary and be able to run an LLM with friends with an evening's work!
Yeah, I've also thought about this. Seems easy enough to do; I'd do it as a separate application just to keep the concerns separate, and to offer up a simple "API server" that anyone can run. My closest point of comparison is the API for the Automatic1111 Stable Diffusion web UI - it's not the best API, but it does prove that all you need to do is offer up an HTTP interface and They Will Come:tm:.
I think this could be exposed through the API, but it's not necessarily something that should be part of the API by default. I'd break apart the That being said, that sounds pretty reasonable to do for both the CLI and/or the API. Either/or could serve as a "batteries-included" example of how to ship something that's consistently fast with the library.
Oh yeah, it's pretty cool. I haven't played around with it much myself, but the folks over at the text-generation-webui have used it to get LLaMA 30B into 24GB VRAM without much quality loss. Seems like it's something that upstream is looking at, though, so I'm content to wait and see what they do first.
Yeah, I think most of the existing Rust ML libraries should be able to handle this. I was surprised at how few operations it used while porting it myself! It's certainly much simpler than Stable Diffusion.
Agreed. I have lost far too much of my time trying to set up CUDA + Torch.
Check out wonnx! I'm not sure if it can be used independently from ONNX, but it would be super cool to figure out. Worth having a chat with them at some point. You could also just run the existing CPU inference in WASM, I think - you might have to get a little clever with how you deal with memory, given the 32-bit memory space, but I think it should be totally feasible to run 7B on the web. The only reason I haven't looked into it is because the weights would have to be hosted somewhere 😅
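To put rough numbers on the 32-bit memory point, here is a back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and counts only the raw weight bits; real quantized formats add per-block scale factors, and inference needs working memory on top of the weights, so treat the results as lower bounds. The function names are made up for illustration.

```rust
/// wasm32 linear memory is addressed with 32-bit pointers, so it tops out at 4 GiB.
const WASM32_MAX_BYTES: u64 = 1 << 32;

/// Bytes needed to store `params` weights at `bits_per_weight` bits each,
/// ignoring any format overhead (an assumption made for this sketch).
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    (params.saturating_mul(bits_per_weight) + 7) / 8
}

/// Whether the raw weights alone would fit below the wasm32 address-space limit.
fn fits_in_wasm32(params: u64, bits_per_weight: u64) -> bool {
    weight_bytes(params, bits_per_weight) < WASM32_MAX_BYTES
}
```

With these assumptions, `weight_bytes(7_000_000_000, 4)` is 3.5 GB, which squeezes under the 4 GiB limit with little room to spare, while the same model at 16 bits per weight needs 14 GB and does not fit at all. That is why some cleverness (streaming layers, for instance) would likely be needed.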
I think we're in a pretty good place! I think it's just figuring out what the best "library API" would look like, and building some applications around it to test it. From there, we can figure out next steps / see what other people need.
-
Actually, I did a bit of porting too, lol. But I'm not familiar with C bindings, and I'm not brave enough to port the ggml library, so I just rewrote some of the code and left it there. It's the utils.cpp
-
@Noeda of https://github.com/Noeda/rllama might wanna tag along here. Also, a Tauri-app equivalent to https://github.com/lencx/ChatGPT would pair very well with this. Good task for anyone who wants to be involved but doesn't quite feel comfortable with the low-level internals.
-
Glad to hear it! Then I said nothing :) If you're able to test things there, we can aim for good Windows support too. This is probably going to be something where Rust makes it a lot simpler to get going than the C++ version.
Me neither, I'm not even sure it is possible 🤔 But definitely interesting! Still, I'd rather go down the route of replacing
❤️
I think that's a very good point! As for the HTTP interface, one interesting requirement the image generation APIs don't have is that with text, you generally want to stream the outputs. A good way to do this without complicating the protocol is to use something called "chunked transfer encoding", where the server sends the response one piece at a time, and a compatible client can fetch the results as they come without waiting for the end of the HTTP response. Chunked transfer is a pretty old thing and should be well supported in every HTTP client. I know @darthdeus already did a little proof of concept and this works well.
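For anyone who hasn't seen chunked transfer encoding on the wire, here is a minimal std-only sketch of the framing: each chunk is its length in hexadecimal, CRLF, the payload, CRLF, and a zero-length chunk ends the stream. The function names are hypothetical and this is not taken from the proof of concept mentioned above; a real server would also write each chunk to the socket as soon as its token is generated rather than buffering the whole response.

```rust
/// Encode one piece of the body as an HTTP/1.1 chunk:
/// payload length in hex, CRLF, payload, CRLF.
fn encode_chunk(data: &[u8]) -> Vec<u8> {
    let mut out = format!("{:x}\r\n", data.len()).into_bytes();
    out.extend_from_slice(data);
    out.extend_from_slice(b"\r\n");
    out
}

/// The zero-length chunk that tells the client the stream is finished.
const LAST_CHUNK: &[u8] = b"0\r\n\r\n";

/// Build a full chunked response from generated tokens.
/// Buffered here only to keep the sketch testable; see the note above.
fn stream_tokens<'a>(tokens: impl IntoIterator<Item = &'a str>) -> Vec<u8> {
    let mut out = b"HTTP/1.1 200 OK\r\n\
Content-Type: text/plain; charset=utf-8\r\n\
Transfer-Encoding: chunked\r\n\r\n"
        .to_vec();
    for token in tokens {
        // An empty chunk would be read as end-of-stream, so skip empty tokens.
        if token.is_empty() {
            continue;
        }
        out.extend(encode_chunk(token.as_bytes()));
    }
    out.extend_from_slice(LAST_CHUNK);
    out
}
```

The nice property for text generation is that the server never needs to know the total response length up front, so the `Content-Length` header is simply omitted.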
Yes :) I'm really interested in making this as simple as possible. What we could do on the library side, is to have the main By default, callers just pass in
Will do!! 👀 The idea of wgpu tensors is just so appealing in that it basically works anywhere with no driver issues and on any GPU.
Sounds good :)
-
Indeed! The more we are working on this, the better 😄 As a first contact, some benchmarks comparing the |
-
Howdy :) I am very happy to see LLM stuff picking up in Rust too.
Currently in terms of performance or memory use I just checked my latest commit and on CPU-only OpenCL I got 678 ms per token (with GPU, ~230 ms). The I have two ideas how to collaborate in near future:
I am currently working on removing more performance bottlenecks, which might improve my Excited for all of us :) 👍
-
I guess it depends on the CPU, but my times for the f16 models are closer to 230 ms per token, so I'd be inclined to say GPU and CPU speed is comparable. This also matches my results from when I tried another GPU implementation. On the quantized models, I do get ~100 ms/token.
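Since the numbers in this thread are all quoted in ms per token, a tiny timing helper like the following could keep measurements comparable between implementations. It is only a sketch: `generate_one` is a hypothetical stand-in for whatever produces a single token, and a fair comparison would still need the same model, prompt, and thread count on both sides.

```rust
use std::time::{Duration, Instant};

/// Average milliseconds per token, or None if no tokens were produced.
fn ms_per_token(elapsed: Duration, tokens: usize) -> Option<f64> {
    if tokens == 0 {
        return None;
    }
    Some(elapsed.as_secs_f64() * 1000.0 / tokens as f64)
}

/// Convert a ms/token figure into tokens per second.
fn tokens_per_second(ms_per_token: f64) -> f64 {
    1000.0 / ms_per_token
}

/// Time `tokens` calls of a closure that generates one token each
/// (the closure is a placeholder for a real inference step).
fn time_generation(tokens: usize, mut generate_one: impl FnMut()) -> Option<f64> {
    let start = Instant::now();
    for _ in 0..tokens {
        generate_one();
    }
    ms_per_token(start.elapsed(), tokens)
}
```

For reference, 230 ms per token works out to a bit over 4 tokens per second, and 100 ms per token to 10 tokens per second.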
That's a very good idea :) Other than setting
```rust
let dist = WeightedIndex::new(&probs).expect("WeightedIndex error");
let idx = dist.sample(rng);
```
So my guess is that as long as we're both using the rand crate, results should be comparable.
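To illustrate why a shared seed and a shared sampling routine matter for comparing outputs, here is a dependency-free sketch of weighted index sampling. It is not the `rand` crate's `WeightedIndex` implementation: the `SplitMix64` generator and `sample_index` function are stand-ins written for this example, and they will not reproduce the token sequence that `rand` would give for the same seed.

```rust
/// Small deterministic PRNG (SplitMix64), used only to keep the sketch std-only.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick an index with probability proportional to its weight.
/// Non-positive and NaN weights are treated as zero; returns None if
/// nothing can be sampled.
fn sample_index(weights: &[f64], rng: &mut SplitMix64) -> Option<usize> {
    let total: f64 = weights.iter().filter(|w| **w > 0.0).sum();
    if !(total > 0.0) || !total.is_finite() {
        return None;
    }
    let mut target = rng.next_f64() * total;
    let mut last_positive = None;
    for (i, &w) in weights.iter().enumerate() {
        if !(w > 0.0) {
            continue;
        }
        last_positive = Some(i);
        if target < w {
            return Some(i);
        }
        target -= w;
    }
    // Floating-point rounding can leave a tiny remainder; fall back to the
    // last index that had a positive weight.
    last_positive
}
```

Two runs seeded with the same value walk through the same random numbers and therefore pick the same indices, which is the property a cross-implementation comparison would rely on. If either side swaps the generator or the sampling algorithm, identical seeds no longer imply identical outputs.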
That would be amazing! :)
-
It's worth noting that quantisation affects both speed and quality, so any benchmarks should be done with the original weights (which will probably limit the maximum size that can be used). Additionally, That is to say - let's get this benchmark on the road, but I think we'll be returning slightly incorrect results until we can address those issues. |
-
Hi, everyone. I'm so excited to have found you! I'm a contributor to other communities (in the AI domain), and I am planning to write a rust-llama.cpp
-
[apologies for early send, accidentally hit enter]
Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama
Couple of differences I spotted on my quick perusal:
- `ggml` doesn't support multithreading on Windows.
- `PhantomData` with the `Tensor`s to prevent them outliving the `Context` they're spawned from.
- `llama.cpp` in so that I could track it more directly and use its `ggml.c/h`, and to make it obvious which version I was porting.
Given yours actually works, I think that it's more promising :p
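For readers unfamiliar with the `PhantomData` trick mentioned in the list above, here is a minimal sketch of the idea. The `Context` and `Tensor` types below are simplified stand-ins invented for this example (a `Vec<f32>` instead of a real ggml context), not the types from either project.

```rust
use std::marker::PhantomData;

/// Stand-in for a ggml context: it owns the memory that tensors point into.
struct Context {
    buffer: Vec<f32>,
}

/// A tensor holds only a raw pointer and a length, as it would when wrapping
/// a C library. The PhantomData field ties it to a borrow of the Context
/// without storing an actual reference.
struct Tensor<'ctx> {
    ptr: *const f32,
    len: usize,
    _ctx: PhantomData<&'ctx Context>,
}

impl Context {
    fn new(data: Vec<f32>) -> Self {
        Context { buffer: data }
    }

    /// Hand out a view of `len` elements starting at `offset`,
    /// or None if the range is out of bounds.
    fn tensor(&self, offset: usize, len: usize) -> Option<Tensor<'_>> {
        let end = offset.checked_add(len)?;
        let slice = self.buffer.get(offset..end)?;
        Some(Tensor { ptr: slice.as_ptr(), len, _ctx: PhantomData })
    }
}

impl<'ctx> Tensor<'ctx> {
    fn as_slice(&self) -> &[f32] {
        // SAFETY: ptr and len came from a valid slice of the Context's buffer,
        // and the 'ctx borrow keeps that Context alive and unmodified for as
        // long as this Tensor exists.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}
```

With this in place, dropping or moving the `Context` while one of its `Tensor`s is still in use is rejected by the borrow checker at compile time, instead of turning into a use-after-free at runtime.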
What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.