This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Commit 6b2765d
Sync to GGML version as of 2023-04-23 02:17 UTC
philpax committed Apr 23, 2023
1 parent 2e62a18 commit 6b2765d
Showing 4 changed files with 1,780 additions and 766 deletions.
4 changes: 2 additions & 2 deletions ggml-sys/ggml/CREDITS.txt
@@ -1,6 +1,6 @@
-Vendored version: https://github.com/ggerganov/llama.cpp/commit/74f5899df4a6083fc467b620baa1cf821e37799d
+Vendored version: https://github.com/ggerganov/llama.cpp/commit/0e018fe008eacebdbcfa2d61b6c988c245c961cd
 For convenience, changes involved in this sync:
-https://github.com/ggerganov/llama.cpp/compare/f4d277ae17247ee51129ef1a9ff74d377cc90b1b..74f5899df4a6083fc467b620baa1cf821e37799d
+https://github.com/ggerganov/llama.cpp/compare/74f5899df4a6083fc467b620baa1cf821e37799d..0e018fe008eacebdbcfa2d61b6c988c245c961cd

 The ggml.c and ggml.h files are distributed under the terms of the MIT license.
 Credit goes to the original authors: Copyright (c) 2023 Georgi Gerganov

19 comments on commit 6b2765d

@KerfuffleV2
Contributor

If it's helpful: I set up the ggml-sys-bleedingedge crate/repo for this.

It's supposed to use workflows and periodic actions to automatically sync with llama.cpp's GGML and publish the crate. (The last bit may or may not be there currently; I'll have to wait around 6 hours to see.)

This won't help you with the llama.cpp side, but it's possibly an easier way to sync than having to actually do it manually like this.

@philpax
Collaborator Author

Oh, neat. I don't mind having to do it manually - especially as it forces me to check over the code to see if anything broke - but setting up a script to pull it in and automatically make a commit might be nice.

@KerfuffleV2
Contributor

I don't mind having to do it manually - especially as it forces me to check over the code to see if anything broke

Yeah, I don't think you would want to just use it directly since there are no guarantees about any random build.

However, it could simplify synchronizing: you could, for example, just change the ggml-sys crate here to point at a specific known-good version, and upgrading would then just involve bumping that to point at the next version you wanted, instead of manually having to check out the GGML or llama.cpp repo, copy files, update text files, etc.
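
Just to sketch what I mean (the URL and rev here are made-up placeholders, not the real crate's coordinates):

    # llama-rs Cargo.toml (sketch; URL and rev are hypothetical)
    [dependencies]
    ggml-sys = { git = "https://github.com/example/ggml-sys-bleedingedge", rev = "abc1234" }
    # Upgrading is then just bumping `rev` to the next known-good commit.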

Also, it might make comparing the changes between GGML releases easier, because you could just do a comparison between the current and proposed releases and see only the GGML differences. Looking at a comparison with the GGML repo is getting longer and more complicated since there are so many other files.

@philpax
Collaborator Author

Sure, that seems reasonable. Can you open a PR?

@KerfuffleV2
Contributor

Sure, was just checking if that was something you'd be interested in using.

I will probably wait a day or so just to make sure all my workflows and scripts are running automatically without any issues.

I also just added some stuff to try to copy relevant commit messages across to the automatically generated synchronization ones.

@KerfuffleV2
Contributor

Actually, how do you feel about making ggml-sys here its own crate under rustformers or whatever? Then it would be possible for other projects to use a known-working version and it could also be published on crates.io to that same end.

Obviously that would be more work for you (or whoever), because you'd want to have it set up with automated building, passing tests, etc. I guess as a test you'd probably also want to build llama-rs using the changes and be sure all its tests pass successfully.

Unfortunately, I'm not really volunteering to do that part, because I'm not that familiar with setting up workflows/automated testing and that kind of thing. I got some simple stuff working with this, but I wouldn't really trust myself to set up something like what I proposed.

Regarding the ggml-sys-bleedingedge crate/repo, I can also add you and setzer as co-owners or collaborators if you want. I'm also not really opposed to transferring ownership entirely.

@philpax
Collaborator Author

I was actually going to suggest that as well - we could just adopt ggml-sys-bleedingedge as ggml-sys proper. The published crate should be sufficient for us if it continues to work. Especially if we haven't published ggml-sys yet.


That being said, I just thought of a potential problem: we can/may/will patch ggml with additional operations that can't be described by the unary/binary operations (e.g. the new operations for BLOOM). How would we handle that?

@KerfuffleV2
Contributor

I was actually going to suggest that as well - we could just adopt ggml-sys-bleedingedge as ggml-sys proper.

I'm not so sure about that part. I think it makes more sense to have ggml-sys-bleedingedge separate, generating automated releases, and something like ggml-sys pointing to a curated/known-good version. From what I've seen of other -sys crates, they're low-level, but they're supposed to actually work. ggml-sys-bleedingedge doesn't have that guarantee.

You could possibly add more automated testing, but just for example when I did my own synchronizations I actually made sure that a llama-rs built with the new version could run inference. That would be a hard thing to test automatically, since you'd need an actual model available.

That being said, I just thought of a potential problem: we can/may/will patch ggml with additional operations that can't be described by the unary/binary operations (e.g. the new operations for BLOOM). How would we handle that?

If you're pretty sure you will need to do that, then the change I'm talking about might not make sense. Or perhaps it would be better to make a fork of ggml-sys-bleedingedge which also tries to automatically apply your set of custom patches and only succeeds if that can be done without conflicts (if not, then you'd have to resolve it manually, probably by changing the patches).

If it's just something that you might have to do at some point in the future, then going back to the old method wouldn't really be much work. You could basically just grab a frozen version of ggml-sys-bleedingedge that was the last one you could use, copy it into the tree and start synchronizing it manually as before.

Of course, trying to get your changes merged into GGML proper is also a potential way to deal with the problem. My own experience with getting my map operations merged was that they're open to merging reasonable, generally useful functions if people ask for them and submit a pull request. That would certainly be the best approach whenever possible, in my opinion.

Also, about the BLOOM thing, are you positive it couldn't be done with the map operations? I glanced at it, and it basically seems like a unary map op, but there may be some complications: it seems like it needs state passed to it (you could possibly get around that), and, more importantly, it seems like it works on a 3D tensor and needs to know one of the indices, while the map operations just give you an arbitrary slice of the tensor to perform some operation on without necessarily telling you where it is. It's possible there's still a way to work around that, maybe by using views.

@philpax
Collaborator Author

I'm not so sure about that part. I think it makes more sense to have ggml-sys-bleedingedge separate, generating automated releases, and something like ggml-sys pointing to a curated/known-good version. From what I've seen of other -sys crates, they're low-level, but they're supposed to actually work. ggml-sys-bleedingedge doesn't have that guarantee.

Why would it stop working? It's just the raw C + autogenerated bindings. As long as it compiles, it should be fine for what it is. If there is some kind of regression, that can be handled by using an older release until it's sorted out.

If you're pretty sure you will need to do that, then the change I'm talking about might not make sense.

I'd like to merge the BLOOM changes sooner rather than later, so yeah, likely.

Or perhaps it would be better to make a fork of ggml-sys-bleedingedge which also tries to automatically apply your set of custom patches and only succeeds if that can be done without conflicts (if not, then you'd have to resolve it manually, probably by changing the patches).

If it's just something that you might have to do at some point in the future, then going back to the old method wouldn't really be much work. You could basically just grab a frozen version of ggml-sys-bleedingedge that was the last one you could use, copy it into the tree and start synchronizing it manually as before.

Noted. I was thinking automated patching might be sufficient.

Of course, trying to get your changes merged into GGML proper is also a potential way to deal with the problem. My own experience with getting my map operations merged was that they're open to merging reasonable, generally useful functions if people ask for them and submit a pull request. That would certainly be the best approach whenever possible, in my opinion.

Yeah that's certainly doable, would need to check if there are any existing plans to do that. Hurdle to cross once we get there.

Also, about the BLOOM thing, are you positive it couldn't be done with the map operations? I glanced at it, and it basically seems like a unary map op, but there may be some complications: it seems like it needs state passed to it (you could possibly get around that), and, more importantly, it seems like it works on a 3D tensor and needs to know one of the indices, while the map operations just give you an arbitrary slice of the tensor to perform some operation on without necessarily telling you where it is. It's possible there's still a way to work around that, maybe by using views.

Not really, no - I just had a quick look over it and thought it looked more complex than basic function application. If it can be reformulated in terms of the existing operations, great!

@KerfuffleV2
Contributor
commented on 6b2765d Apr 25, 2023

Why would it stop working? It's just the raw C + autogenerated bindings. As long as it compiles, it should be fine for what it is.

I'm not concerned about the bindings generation failing. It's just that the llama.cpp project moves so fast and often has several releases a day.

Usually if I pull in a Rust crate, I expect it to generally work correctly. However, that assurance doesn't exist when just automatically grabbing releases from a repo like the llama.cpp one. Also, an upstream like llama-rs would be taking the time to make sure a specific release actually works anyway, and since that work of determining a curated set of usable releases already gets done here, why not let other people take advantage of it?

Anyway, that's my line of thinking.

I was thinking automated patching might be sufficient.

I guess you could clone the repo into the tree and then apply a patch? Or does cargo actually let you patch the actual source code? I know there's the [patch] block, but it seems like it only lets you "patch" the version or source of a crate, not the literal files.
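
As far as I understand it, what [patch] does cover is redirecting where a crate comes from, not editing its files - something like this (the path and branch here are made up):

    # Cargo.toml (sketch)
    [patch.crates-io]
    ggml-sys = { path = "../ggml-sys-patched" }
    # or: ggml-sys = { git = "https://github.com/example/ggml-sys", branch = "patched" }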

Or did I misunderstand and you're talking about something else entirely?

Hurdle to cross once we get there.

Wouldn't you already be there at the point you're contemplating maintaining a patched version of GGML?

If it can be reformulated in terms of the existing operations, great!

I looked at it more closely. At least as the code currently is, I think it might be technically possible, but it certainly wouldn't be easy and there would be limitations/tradeoffs.

Right now, since the "handler" for the map has to be extern "C", you can't use a closure. That means the only way to get state to the function is to use something like a global static. (But I think that would make running multiple inference sessions simultaneously impossible... Maybe there would be a way to index it by thread id or something, but that's getting kind of crazy.)
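
A minimal sketch of the problem, assuming a callback shape along the lines of GGML's unary map op, i.e. (n, dst, src) - the names here are invented:

    use std::sync::atomic::{AtomicU32, Ordering};

    // An `extern "C"` fn can't capture environment the way a closure can, so
    // the only way to get state in is something global like this:
    static SCALE_BITS: AtomicU32 = AtomicU32::new(0);

    extern "C" fn scale_op(n: std::os::raw::c_int, dst: *mut f32, src: *const f32) {
        // Two simultaneous inference sessions wanting different scales would
        // stomp on each other's state right here.
        let scale = f32::from_bits(SCALE_BITS.load(Ordering::Relaxed));
        for i in 0..n as usize {
            unsafe { *dst.add(i) = *src.add(i) * scale };
        }
    }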

It also seems like it does stuff that relies on the indices of the elements. A possible workaround for that would be to make it a binary map operation and put the values of the indices into another tensor. Obviously that's going to require more memory and have a performance cost as well.

It's possible there's a way to refactor the actual algorithm, I'm not really talking from a position of deep understanding. Just looking at the existing code. I'm going to mention this over in the Bloom support pull request.

@philpax
Collaborator Author

I'm not concerned about the bindings generation failing. It's just that the llama.cpp project moves so fast and often has several releases a day.

Usually if I pull in a Rust crate, I expect it to generally work correctly. However, that assurance doesn't exist when just automatically grabbing releases from a repo like the llama.cpp one. Also, an upstream like llama-rs would be taking the time to make sure a specific release actually works anyway, and since that work of determining a curated set of usable releases already gets done here, why not let other people take advantage of it?

Eh, I think people know that -sys crates offer none of Rust's guarantees. What people would be more likely to use is our ggml crate, which does offer (most of) those guarantees and targets a specific version of -sys.

Wouldn't you already be there at the point you're contemplating maintaining a patched version of GGML?

Yes, which seems like it'll be sooner rather than later. Oops.


Yeah, you're right - it would be preferable to upstream these instead of maintaining our own patches. I suppose we could always have a fork of -sys for when we need to patch, and then upstream as soon as possible.

@KerfuffleV2
Contributor
commented on 6b2765d Apr 27, 2023

Eh, I think people know that -sys crates offer none of Rust's guarantees.

Maybe I'm weird, but I'd expect to find a -sys crate based on a known working version of whatever other software it was interfacing with. To me, -sys means "this is low level", not "this is unstable and may simply not work at all even if you do everything right".

What people would be more likely to use is our ggml crate, which does offer (most of) those guarantees and targets a specific version of -sys.

I mean, if you're going through the work of curating a specific working version of ggml-sys-bleedingedge to use for that, why not just publish it as ggml-sys? It wouldn't really need anything more than pub use ggml_sys_bleedingedge::*, right?
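
Spelled out, the entire crate could plausibly be just a pinned dependency plus the re-export (the exact version here is made up):

    # Cargo.toml
    [dependencies]
    ggml-sys-bleedingedge = "=2023.4.23"   # hypothetical known-good release

    // src/lib.rs
    pub use ggml_sys_bleedingedge::*;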

I think there's space for both: a "bleeding edge" version that gets changes immediately, so people who are willing to take the risks can try them, and a more stable version. You already need the stable version for llama-rs or its associated higher-level GGML binding anyway.

By the way, somewhat related, but I've been working on my own Rust-idiomatic bindings for GGML: https://github.com/KerfuffleV2/rusty-ggml

Using it looks like: https://github.com/KerfuffleV2/smolrsrwkv/blob/4e9980f9fa59fa891353a318c7308c4e803aff24/smolrwkv/src/ggml/graph.rs#L35 (I repurposed the XOR operator ^ for matrix multiplication).

Since you want to keep it close to the C interface to make porting changes easier, it's probably not something you'd want to use directly, but there could be some useful ideas in it. For instance, I use const generics to statically verify that tensor dimensions are correct. For example: https://github.com/KerfuffleV2/rusty-ggml/blob/4cfa241c88db7d992b08231ee9703a9120e64654/src/tensor.rs#L280

    pub fn conv_1d<const RDIMS: usize, const ODIMS: usize, T: AsRef<GTensor<RDIMS>>>(
        &self,
        rhs: T,
        is_2s: bool,
    ) -> Self
    where
        Dim<RDIMS>: DimValid,
        Dim<ODIMS>: DimValid,
        DimPair<DIMS, 2>: DimGtE,
        DimPair<DIMS, 4>: DimLt,
        DimPair<RDIMS, 2>: DimGtE,
        DimPair<ODIMS, 2>: DimEq,
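    // ...remaining bounds and the function body omitted in this excerpt; see the link above.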

That means when you do tensor.conv_1d(othertensor, false), tensor must be a 2- or 3-dimensional tensor, othertensor must have at least 2 dimensions, and you get a 2-dimensional tensor as the result.

I also used a different approach to keeping the context alive: I just wrap it in an Arc<Mutex> and it can't be dropped while there are any other references. (Operations involving multiple tensors also verify they all come from the same context.)
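
A bare-bones sketch of that scheme (types simplified and names invented, no real GGML pointers involved):

    use std::sync::{Arc, Mutex};

    struct ContextInner; // would hold the raw ggml_context pointer

    #[derive(Clone)]
    struct Context(Arc<Mutex<ContextInner>>);

    struct Tensor {
        ctx: Context, // every tensor keeps the context alive via this clone
    }

    impl Tensor {
        fn check_same_context(&self, other: &Tensor) {
            // Multi-tensor operations can verify both tensors came from the
            // same context by comparing the Arc allocations.
            assert!(Arc::ptr_eq(&self.ctx.0, &other.ctx.0));
        }
    }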

Stuff like view_1d, view_2d, view_3d can just be expressed as taking a variable number of items in an array with the length as a const generic, so there's no need for multiple functions.
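
Roughly like this (signature invented for illustration):

    struct GTensor<const DIMS: usize>; // stand-in for the real tensor type

    impl<const DIMS: usize> GTensor<DIMS> {
        // One method covers view_1d/view_2d/view_3d: the array length N is
        // both the number of shape arguments and the rank of the result.
        pub fn view<const N: usize>(&self, shape: [usize; N], offset: usize) -> GTensor<N> {
            // A real implementation would dispatch to the matching
            // ggml_view_* call based on N.
            let _ = (shape, offset);
            GTensor::<N>
        }
    }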

Yes, which seems like it'll be sooner rather than later.

I actually think it might not be necessary to add a new operation at all.

@philpax
Collaborator Author

Sorry, at work, so my responses will be terser than usual...

Maybe I'm weird, but I'd expect to find a -sys crate based on a known working version of whatever other software it was interfacing with. To me, -sys means "this is low level", not "this is unstable and may simply not work at all even if you do everything right".

Yeah, fair point. Not sure if there are any automatically published -sys crates - we should look into what the going standard is here.

I mean, if you're going through the work of curating a specific working version of ggml-sys-bleedingedge to use for that, why not just publish it as ggml-sys? It wouldn't really need anything more than pub use ggml_sys_bleedingedge::*, right?

I think there's space for both: a "bleeding edge" version that gets changes immediately, so people who are willing to take the risks can try them, and a more stable version. You already need the stable version for llama-rs or its associated higher-level GGML binding anyway.

Agreed. Here's another potential solution: one ggml-sys crate that automatically updates on new pushes, as bleedingedge does now, but published as prereleases that won't be selected by default, and then a human goes in every once in a while and cuts specific releases that are known-good.
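
Sketch of how that would resolve, with made-up version numbers - Cargo only matches prerelease versions if the requirement names one explicitly:

    [dependencies]
    ggml-sys = "0.3"                    # only ever resolves to curated releases
    # ggml-sys = "0.3.1-sync.20230423"  # opting in to an automated prerelease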

That gives us three solutions:

  • YOLO it and just publish -sys with a new version, which we think is a bad idea
  • Publish a separate ggml-sys crate that just re-exports a known-good version
  • Publish a single crate that gets automatically updated, with certain builds being certified with a non-prerelease release.

Which one do you think is best?

By the way, somewhat related, but I've been working on my own Rust-idiomatic bindings for GGML: https://github.com/KerfuffleV2/rusty-ggml

Very nice! I'm starting to be less and less sure that we need to maintain exact compat - it seems like a lot of the more chaotic development has slowed down. Will keep an eye on it.

Agree with the bulk of your changes. Don't have much to add - the const generics are nice to have (I believe dfdx does the same thing).

Do you have any plans to abstract over the backend for smolrsrwkv, or are you keeping the implementations separate for now? Just wondering if you're going to end up implicitly tackling #137 😅

I actually think it might not be necessary to add a new operation at all.

Yep. For reader's context: #141 (comment)

@KerfuffleV2
Contributor

That gives us three solutions:

As expected, I don't like option 1 at all. I strongly prefer option 2, but I could live with option 3, I guess.

Having a separate automatic release crate seems like it's the most flexible way to deal with this and doesn't prevent any other option. For example, you could implement options 2 and 3 on top of ggml-sys-bleedingedge.

One thing I don't really like about option 3 is that there's basically no way to respect semver. ggml-sys-bleedingedge makes every release a major one (and also has a crazy long major version), but ideally a crate that's suitable for general use would have versioning that respects semver. Otherwise, even if a specific version works just fine, there's still no way for the user to know whether there are breaking changes or not.

(I only dislike option 3 as a replacement for something like ggml-sys-bleedingedge. I think it's okay as a separate crate.)

Agree with the bulk of your changes.

Thanks. Quite a bit of the interface is still sketchy, or temporary stuff just to make it usable that should be replaced later. I'd like to reduce the unsafe bits quite a lot, but I needed to get it to the point where I could port smolrsrwkv over and see it actually do stuff. (It does work nicely right now, and I can auto-quantize to any of the formats easily, even the just-added q5_0 and q5_1.)

Also, I don't like letting GGML panic or panicking myself. However, stuff like std::ops::Add can't really be fallible. The way I'm planning to deal with that is to set a failed state in the context and just let most operations proceed (but not do anything). Then the user can ensure everything is okay before actually running the graph (or be forced to deal with it by those methods returning a Result).
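
Something along these lines (all names invented; just the shape of the idea):

    use std::ops::Add;
    use std::sync::{Arc, Mutex};

    #[derive(Default)]
    struct CtxState {
        failed: Option<String>, // set on the first error, checked later
    }

    #[derive(Clone)]
    struct Ctx(Arc<Mutex<CtxState>>);

    #[derive(Clone)]
    struct Tensor {
        ctx: Ctx, // plus the actual tensor data in a real implementation
    }

    impl Add for Tensor {
        type Output = Tensor;

        // `Add::add` can't return a Result, so on failure we just flag the
        // context and let the operation proceed as a no-op.
        fn add(self, _rhs: Tensor) -> Tensor {
            let mut state = self.ctx.0.lock().unwrap();
            if state.failed.is_none() {
                // ...do the real op here; on error:
                // state.failed = Some("shape mismatch".to_string());
            }
            drop(state);
            self
        }
    }

    impl Ctx {
        // The user calls this (or a graph-running method that returns
        // Result) before actually computing anything.
        fn check(&self) -> Result<(), String> {
            match &self.0.lock().unwrap().failed {
                Some(err) => Err(err.clone()),
                None => Ok(()),
            }
        }
    }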

Do you have any plans to abstract over the backend for smolrsrwkv, or are you keeping the implementations separate for now? Just wondering if you're going to end up implicitly tackling #137

It's something that's been on my mind. The answer is kind-of. The interface that exists in rusty-ggml right now is what I'd consider the low-level version. What I want to do is have a higher-level interface that just lets you build up the structure of the graph without actually doing anything, and can then build the GGML graph later on. Even if it's GGML-specific, it shouldn't be too hard to repurpose that kind of approach once it's working.

I also want to abstract the loading stuff. I could still use help with multi-part GGML models, hint hint. All I really need is someone to run a tool against them and show me the output, and I can probably get some generic loader code going that works with all existing formats. (It can currently read every format as far as I know; I just don't know what the multi-part ones look like after being loaded, since GGML doesn't directly include information in the format that says "this is a multi-part tensor".)

@philpax
Collaborator Author

Option 2 it is. I'm not that picky about it.

It's something that's been on my mind. The answer is kind-of. The interface that exists in rusty-ggml right now is what I'd consider the low-level version. What I want to do is have a higher-level interface that just lets you build up the structure of the graph without actually doing anything, and can then build the GGML graph later on. Even if it's GGML-specific, it shouldn't be too hard to repurpose that kind of approach once it's working.

Yes, that's more or less what I was thinking of as a solution.

I also want to abstract the loading stuff. I could still use help with multi-part GGML models, hint hint. All I really need is someone to run a tool against them and show me the output, and I can probably get some generic loader code going that works with all existing formats. (It can currently read every format as far as I know; I just don't know what the multi-part ones look like after being loaded, since GGML doesn't directly include information in the format that says "this is a multi-part tensor".)

They don't have anything particularly special. The tensors are split up according to one of two schemes, and loading of each part resumes from after the prelude, with each tensor being reconstructed in accordance with the schemes.

That being said, I would strongly advise not bothering. Multipart models are rare enough to not justify spending time on them - I only have one, and that's because I converted it myself.

@KerfuffleV2
Contributor

They don't have anything particularly special. The tensors are split up according to one of two schemes

Yes, I know, and I've already looked at the source. It would still be really helpful to see an example of what it actually looks like.

That being said, I would strongly advise not bothering. Multipart models are rare enough to not justify spending time on them - I only have one, and that's because I converted it myself.

I mean, I already have a tool that will scan a GGML/GGMF/GGJT file and dump information about the tensors. I just need someone to take 10 minutes to clone a repo and run it against some files and shove the output in a gist or something.

I'm not really too motivated to spend a bunch of time scouring the internet for multi-part GGML models and then downloading huge files to do it myself. A loader that can handle those files would also facilitate converting them without forcing users to download the original model and convert it from scratch, which seems like it wouldn't be useless.

@philpax
Collaborator Author

Link me to the repo and I'll see if I can do it in the next few days.

@KerfuffleV2
Contributor

Never mind. I won't bring this subject up again.

@philpax
Collaborator Author

That's not me brushing you off - I will actually do it, I just need the code to run and some free time at my own computer, which I haven't had much of lately. Just let me know where.
