This repository has been archived by the owner on Jun 24, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Sync to GGML version as of 2023-04-23 02:17 UTC
- Loading branch information
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's helpful:
It's supposed to use workflows and periodic actions to automatically sync with
llama.cpp
's GGML and publish the crate. (The last bit may or may not be there currently, I'll have to wait around 6 hours to see.)This won't help you with the llama.cpp side but it's a possibly an easier way to sync than having to actually do it manually like this.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, neat. I don't mind having to do it manually - especially as it forces me to check over the code to see if anything broke - but setting up a script to pull it in and automatically make a commit might be nice.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think you would want to just use it directly since there are no guarantees about any random build.
However, it could simplify synchronizing: You could, for example, just change the
ggml-sys
crate here to point at a specific known good version and then upgrading would just involve bumping that to point at the next version you wanted to use instead of manually having to check out the GGML or llama.cpp repo and copy files, update text files, etc.Also, it might make comparing the changes between GGML release easier also because you could just do a comparison between the current + proposed release and just see the GGML differences. Looking at a comparison with the GGML repo is getting longer and more complicated since there are so many other files.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that seems reasonable. Can you open a PR?
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, was just checking if that was something you'd be interested in using.
I will probably wait a day or so just to make sure all my workflows and scripts are running automatically without any issues.
I also just added some stuff to try to copy relevant commit messages across to the automatically generated synchronization ones.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, how do you feel about making
ggml-sys
here its own crate underrustformers
or whatever? Then it would be possible for other projects to use a known-working version and it could also be published on crates.io to that same end.Obviously that would be more work for you (or whoever) because you'd want to have it set up with the automated building, passing tests, etc. I guess as a test you'd probably also want to build a llama-rs using the changes too and be sure all its tests pass successfully.
Unfortunately, I'm not really volunteering to do that part because I'm not really that familiar with setting up workflows/automated testing and that kind of thing. I got some simple stuff working with this, but I wouldn't really trust myself setting up something like I proposed.
Regarding the
ggml-sys-bleedingedge
crate/repo, I can also add you, setzer as co-owners or collaborators if you want. I'm also not really opposed to transferring ownership entirely.6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually going to suggest that as well - we could just adopt
ggml-sys-bleedingedge
asggml-sys
proper. The published crate should be sufficient for us if it continues to work. Especially if we haven't publishedggml-sys
yet.That being said, I just thought of a potential problem: we can/may/will patch ggml with additional operations that can't be described by the unary/binary operations (e.g. the new operations for BLOOM). How would we handle that?
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about that part. I think it makes more sense to have
ggml-sys-bleedingedge
separate and generating automated release and something likeggml-sys
pointing to a currated/known-good version. From what I've seen about other-sys
crates, they're low level but they're supposed to actually work.ggml-sys-bleedingedge
doesn't have that guarantee.You could possibly add more automated testing, but just for example when I did my own synchronizations I actually made sure that a
llama-rs
built with the new version could run inference. That would be a hard thing to test automatically, since you'd need an actual model available.If you're pretty sure you will need to do that, then the change I'm talking about might not make sense. Or perhaps it would be better to make a fork of
ggml-sys-bleedingedge
which also tries to automatically apply your set of custom patches and only succeeds if that can be done without conflicts (if not, then you'd have to resolve it manually, probably by changing the patches).If it's just something that you might have to do at some point in the future, then going back to the old method wouldn't really be much work. You could basically just grab a frozen version of
ggml-sys-bleedingedge
that was the last one you could use, copy it into the tree and start synchronizing it manually as before.Of course, trying to get your changes merged into GGML proper is also a potential way to deal with the problem. By own experience with getting my map operations merged was that they'd be open to merging reasonable, generally useful functions if people asked for them and submitted a pull. That would certainly be the best approach whenever possible, in my opinion.
Also, about the Bloom thing, are you positive it couldn't be done with the map operations? I glanced at it and it basically seems like a unary map op, but there may be some complications since it seems like it needs state passed to it (possibly could get around that) but more importantly it seems like it works on a 3d tensor and needs to know one of the indices (while the map operations just give you an arbitrary slice of the tensor to perform some operation on but you don't necessarily know where it is). It's possible there still could be a way to work around that, by using views maybe.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it stop working? It's just the raw C + autogenerated bindings. As long as it compiles, it should be fine for what it is. If there is some kind of regression, that can be handled by using an older release until it's sorted out.
I'd like to merge the BLOOM changes sooner rather than later, so yeah, likely.
Noted. I was thinking automated patching might be sufficient.
Yeah that's certainly doable, would need to check if there are any existing plans to do that. Hurdle to cross once we get there.
Not really, no - I just had a quick lookover and thought it looked more complex than basic function application. If it can be reformulated in terms of the existing operations, great!
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not concerned about the bindings generation failing. It's just that the llama.cpp project moves so fast and often has several releases a day.
Usually if I pull in a Rust crate, I expect it to generally work correct. However, that assurance doesn't exist when just automatically grabbing releases from a repo like the llama.cpp one. Also, an upstream like llama-rs would be taking more time to make sure a specific release actually works and since that work of determining a curated set of releases is usable already gets done here, why not let other people take advantage of it?
Anyway, that's my line of thinking.
I guess you could clone the repo in and then apply a patch? Or does cargo actually let you patch the actual source code? I know there's the
[patch]
block but it seems like it only lets you "patch" the version or source of a crate, not the literal files.Or did I misunderstand and you're talking about something else entirely?
Wouldn't you already be there at the point you're contemplating maintaining a patched version of GGML?
I looked at it more closely. At least as the code is currently, I think it might be technically possible but it certain wouldn't be easy and there would be limitations/tradeoffs.
Right now, since the "handler" for the map has to be
extern "C"
, you can't use a closure. That means the only way to get state to the function is to use something like a global static. (But I think that would make running multiple inference sessions simultaneously impossible... Maybe there would be a way to index it by thread id or something, but that's getting kind of crazy.)It also seems like it does stuff that relies on the indices of the elements. A possible workaround for that would be to make it a binary map operation and put the values of the indices into another tensor. Obviously that's going to require more memory and have a performance cost as well.
It's possible there's a way to refactor the actual algorithm, I'm not really talking from a position of deep understanding. Just looking at the existing code. I'm going to mention this over in the Bloom support pull request.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I think people know that
-sys
crates offer none of Rust's guarantees. What people would be more likely to use is ourggml
crate, which does offer (most of) those guarantees and targets a specific version of-sys
.Yes, which seems like it'll be sooner rather than later. Oops.
Yeah you're right, it would be preferable to upstream these instead of maintaining our own patches. I suppose we could always have a fork of
-sys
when we need to patch, and then upstream as soon as possible.6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm weird, but I'd expect to find a
-sys
crate based on a known working version of whatever other software it was interfacing with. To me,-sys
means "this is low level", not "this is unstable and may simply not work at all even if you do everything right".I mean, if you're going through the work of curating a specific working version of
ggml-sys-bleedingedge
to use for that, why not just just publish it asggml-sys
? It wouldn't really need anything more thanpub use ggml_sys_bleedingedge::*
, right?I think there's space for the "bleeding edge" version that gets changes immediately so people can try them if they want as long as they're willing to take risks and also a more stable version, and you already need the stable version for
llama-rs
or its associated higher level GGML binding.By the way, somewhat related, but I've been working on my own my Rust-idiomatic bindings for GGML: https://github.com/KerfuffleV2/rusty-ggml
Using it looks like: https://github.com/KerfuffleV2/smolrsrwkv/blob/4e9980f9fa59fa891353a318c7308c4e803aff24/smolrwkv/src/ggml/graph.rs#L35 (I repurposed the XOR operator
^
for matrix multiplication).Since you want to keep it close to the C interface to make porting changes easier, that's probably not something you'd want to use directly but there could be some useful ideas. For example, I use const generics to statically verify that tensor dimensions are correct. For example: https://github.com/KerfuffleV2/rusty-ggml/blob/4cfa241c88db7d992b08231ee9703a9120e64654/src/tensor.rs#L280
That means when you do
tensor.conv_1d(othertensor, false)
,tensor
must a 2 or 3 dimensional tensor,othertensor
must be at least 2 dimensions and you get a 2 dimensional tensor as the result.I also used a different approach to keeping the context alive: I just wrap it in an
Arc<Mutex>
and it can't be dropped while there are any other references. (Operations involving multiple tensors also verify they all come from the same context.)Stuff like
view_1d
,view_2d
,view_3d
can just be expressed as taking a variable number of items in an array with the length as a const generic, so there's no need for multiple functions.I actually think it might not be necessary to add a new operation at all.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, at work, so my responses will be terser than usual...
Yeah, fair point. Not sure if there are any automatically published
-sys
crates - we should look into what the going standard is here.Agreed. Here's another potential solution: one
ggml-sys
crate that automatically updates on new pushes, asbleedingedge
does now, but with prereleases that won't be selected, and then a human goes in every once in a while and cuts specific releases that are known-good.That gives us three solutions:
-sys
with a new version, which we think is a bad ideaggml-sys
crate that just re-exports a known-good versionWhich one do you think is best?
Very nice! I'm starting to be less and less sure that we need to maintain exact compat - it seems like a lot of the more chaotic development has slowed down. Will keep an eye on it.
Agree with the bulk of your changes. Don't have much to add - the const generics are nice to have (I believe
dfdx
does the same thing).Do you have any plans to abstract over the backend for smolrsrwkv, or are you keeping the implementations separate for now? Just wondering if you're going to end up implicitly tackling #137 😅
Yep. For reader's context: #141 (comment)
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As expected, I don't like option 1 at all. I strongly prefer option 2, but I could live with option 3 I guess.
Having a separate automatic release crate seems like it's the most flexible way to deal with this and doesn't prevent any other option. For example, you could implement options 2 and 3 on top of
ggml-sys-bleedingedge
.One thing I don't really like about option 3 is that there's basically no way to respect semver.
ggml-sys-bleedingedge
makes every release a major one (and also has a crazy long major version) but ideally a crate that's suitable for general use would have versioning that respects semver, otherwise even if a specific version works just fine there's still no way for the user to know if there are breaking changes or not.(I only dislike option 3 as a replacement for something like
ggml-sys-bleedingedge
. I think it's okay as a separate crate.)Thanks. Quite a bit of the interface is still sketchy, or temporary stuff just to make it usable that should be replaced later. I'd like to reduce the unsafe bits quite a lot, but I needed to get it to the point where I could port smolrsrwkv over and see it actually do stuff. (It does work nicely right now, and I can auto-quantize to any of the formats easily, even the just-added
q5_0
andq5_1
.)Also, I don't like letting GGML panic or panicking myself. However, stuff like
std::ops::Add
can't really be fallible. The way I'm planning to deal with that is to set a failed state in the context and just let most operations proceed (but not do anything). Then the user can ensure everything is okay before actually running the graph (or be forced to deal with it by those methods returning aResult
).It's something that's been in my mind. The answer is kind-of. The interface that exists in
rusty-ggml
right now is what I'd consider the low level version. What I want to do is have a higher level interface that just lets you build up the structure of the graph without actually doing anything and can then build the GGML graph later on. Even if it's GGML-specific, it shouldn't be too hard to repurpose that kind of approach once it's working.I also want to abstract the loading stuff as well. I could still use help with multi-part GGML models, hint hint. All I really need is someone to run a tool against them and show me the output and I can probably get some generic loader code going that works with all existing formats (it currently can read every format as far as I know, I just don't know what the multi-part ones look like after being loaded since GGML doesn't directly include information in the format that says "This is a multi-part tensor").
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 2 it is. I'm not that picky about it.
Yes, that's more or less what I was thinking of as a solution.
They don't have anything particularly special. The tensors are split up according to one of two schemes, and loading of each part resumes from after the prelude, with each tensor being reconstructed in accordance with the schemes.
See
That being said, I would strongly advise not bothering. Multipart models are rare enough to not justify spending time on them - I only have one, and that's because I converted it myself.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know, and I've already looked at the source. It would still be really helpful to see an example of what it actually looks like.
I mean, I already have a tool that will scan a GGML/GGMF/GGJT file and dump information about the tensors. I just need someone to take 10 minutes to clone a repo and run it against some files and shove the output in a gist or something.
I'm not really too motivated spend a bunch of time scouring the internet for multi-part GGML models and then download huge files to do it myself. A loader that can handle those files would also facilitate converting them without forcing users to just download the original model and convert it from scratch, which seems like it wouldn't be useless.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link me to the repo and I'll see if I can do it in the next few days
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. I won't bring this subject up again.
6b2765d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not me brushing you off - I will actually do it, I just need the code to run and some free time at my own computer, which I haven't had much of lately. Just let me know where.