This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Non-ggml backend #31

Open
philpax opened this issue Mar 17, 2023 · 28 comments
Labels
issue:enhancement New feature or request meta:help-wanted Extra attention is needed topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features

Comments

@philpax
Collaborator

philpax commented Mar 17, 2023

This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.

We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).

Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).

Unfortunately, it doesn't support the two formats we need: f16 (original weights) and q4_0/q4_1 (quantized weights). Adding these to the ndarray backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
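To make the f16 part concrete, here is a minimal sketch of the "convert to f32 for every operation" fallback using only the standard library. The function names are hypothetical and are not part of burn or ndarray; the weights are assumed to be stored as raw IEEE 754 binary16 bit patterns in `u16`.

```rust
/// Convert one IEEE 754 binary16 value, given as raw bits, to f32.
/// Handles zeros, subnormals, normal numbers, infinities and NaN.
pub fn f16_bits_to_f32(bits: u16) -> f32 {
    let negative = bits & 0x8000 != 0;
    let exp = u32::from((bits >> 10) & 0x1f);
    let frac = u32::from(bits & 0x03ff);
    let magnitude = match exp {
        // Zero or subnormal: value = frac * 2^-24 (exact in f32).
        0 => frac as f32 / 16_777_216.0,
        0x1f if frac == 0 => f32::INFINITY,
        0x1f => f32::NAN,
        // Normal: re-bias the exponent (127 - 15 = 112) and widen the fraction.
        _ => f32::from_bits(((exp + 112) << 23) | (frac << 13)),
    };
    if negative { -magnitude } else { magnitude }
}

/// Dot product of f16 weights (raw bits) with f32 activations,
/// widening each weight on the fly instead of materialising an f32 copy.
pub fn dot_f16_f32(weights: &[u16], activations: &[f32]) -> f32 {
    weights
        .iter()
        .zip(activations)
        .map(|(&w, &a)| f16_bits_to_f32(w) * a)
        .sum()
}
```

This is correct but not fast: a competitive implementation would presumably use SIMD conversion instructions, which is the optimisation work referred to above.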

Torch supports f16, but only on the GPU, and burn's Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB in f16, which is difficult to fit on most consumer GPUs.
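The 14GB figure follows from a parameter count times bytes per weight. A small sketch of that arithmetic, assuming a round 7 billion parameters (the helper name is hypothetical):

```rust
/// Approximate size in bytes of a weight file, ignoring headers and metadata.
pub fn weight_bytes(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0
}
```

With 16 bits per weight this gives 14e9 bytes for 7e9 parameters. As a purely illustrative assumption, a 4-bit format that stores one f32 scale per block of 32 weights costs (32 * 4 + 32) / 32 = 5 bits per weight, or roughly 4.4e9 bytes for the same model.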

So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:

  1. Quantize the model to standard uint8 and use ndarray/torch backends. This is the least work (at least in theory), but uint8 quantization performs worse than either f16 or q4, from what I've heard.
  2. Add support for f16 to burn's ndarray backend. The torch backend should already work, but it will be very hard to test with most of our machines. Adding support to ndarray for CPU inference shouldn't be impossible either (especially if we just convert to f32 for every operation), but it will be difficult to make it performance-optimal.
  3. Add support for q4_0/q4_1 to burn's ndarray backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using q4 weights), but it has the same performance-optimality issue as f16 on the CPU (every cumulative operation, such as matrix multiplication, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that ndarray and rustc will remap [[bool; 4]; N] to [u8; N/2]).
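The storage concern in option 3 can be sidestepped by packing two 4-bit values into each `u8` by hand. The sketch below shows that packing together with a simplified symmetric block quantiser. It is written in the spirit of q4_0 but is not ggml's exact layout or rounding, and all names are hypothetical.

```rust
/// Pack 4-bit values (each in 0..=15) two per byte, low nibble first.
/// An odd trailing value is padded with a zero high nibble.
pub fn pack_nibbles(values: &[u8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0f;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0f;
            lo | (hi << 4)
        })
        .collect()
}

/// Inverse of `pack_nibbles`; `len` is the number of original values.
pub fn unpack_nibbles(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&b| [b & 0x0f, b >> 4])
        .take(len)
        .collect()
}

/// Quantise one block to 4 bits per value with a single shared scale.
/// Values are mapped to integers in -8..=7 and stored with an offset of 8.
pub fn quantize_block(xs: &[f32]) -> (f32, Vec<u8>) {
    let amax = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if amax > 0.0 { amax / 7.0 } else { 1.0 };
    let q: Vec<u8> = xs
        .iter()
        .map(|x| ((x / scale).round().clamp(-8.0, 7.0) + 8.0) as u8)
        .collect();
    (scale, pack_nibbles(&q))
}

/// Recover approximate f32 values from a packed block.
pub fn dequantize_block(scale: f32, packed: &[u8], len: usize) -> Vec<f32> {
    unpack_nibbles(packed, len)
        .into_iter()
        .map(|q| (f32::from(q) - 8.0) * scale)
        .collect()
}
```

The space problem is solved this way, but the performance problem remains: every kernel that touches these weights has to unpack nibbles, which is why each operation would need a specialised implementation.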

This is summarised in the following table:

| | uint8 | f16 | q4 |
|---|---|---|---|
| ndarray | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| torch | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |

An idea that I briefly floated was porting ggml itself to Rust using c2rust and some cleanup work, but that's likely to be quite time-consuming and it locks us out of the relatively-free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.


I believe the other Rust ML crates also do not support f16 or q4, but that's from a cursory exploration. Happy to be proven wrong!

@katopz
Contributor

katopz commented Mar 18, 2023

Not sure if this one is related (as an alternative to pickle)? https://github.com/huggingface/safetensors#yet-another-format-
Sorry if it's not, I'm trying to catch up here and am not quite on the same page yet 🤯.

@philpax
Collaborator Author

philpax commented Mar 18, 2023

safetensors are cool, but that's probably more applicable for #21. (Note that the weights would still need to be converted to safetensors format.)

@katopz
Contributor

katopz commented Mar 19, 2023

So it looks something like this, am I right? Maybe we should add this to some md file for newcomers (me!). 🤔

graph TD;
  A("PyTorch") --"<pre>1️⃣/2️⃣&nbsp;export_state_dict_checkpoint.py</pre>PyTorch model checkpoints (pth)"--> B(Python) --"<pre>3️⃣&nbsp;convert-pth-to-ggml.py</pre>Geometric Deep Learning Markup Language (ggml)"--> C(C++)--"<pre>4️⃣&nbsp;quantize.cpp</pre>Quantized ggml (bin)"-->D(Rust);

1️⃣ tloen/alpaca-lora/export_state_dict_checkpoint.py (llama-7b-hf)
2️⃣ jankais3r/LLaMA_MPS/export_state_dict_checkpoint.py (llama-13b-hf)
3️⃣ llama.cpp/convert-pth-to-ggml.py
4️⃣ llama.cpp/quantize.cpp

@philpax
Collaborator Author

philpax commented Mar 26, 2023

Also worth keeping an eye on: @Narsil's https://github.com/Narsil/smelte-rs.

@KerfuffleV2
Contributor

This is another one that could possibly be worth looking at: https://github.com/coreylowman/dfdx

One thing about it is that it seems pretty hard to load models where things like the array dimensions or structure are dynamic.

I looked at smelte for other stuff too, but one big con at the moment is it says it's single threaded. So I don't think it would even be able to get close to the current approach on CPU at least.

@Narsil

Narsil commented Mar 27, 2023

So I don't think it would even be able to get close to the current approach on CPU at least.

You'd be surprised :) matmul is still linked against mkl, which is multi-threaded and makes the overall thing fast enough. Even ggml uses threading only for a few select ops, not for all of them.

@jasonviipers

Look into Rust's tch crate, which is a high-level deep learning library built on top of PyTorch. PyTorch has built-in support for f16 and q4, so tch may be able to support those formats.

@philpax
Collaborator Author

philpax commented Mar 31, 2023

tch works great, but it requires Torch to be installed at the system level, which is non-ideal for us (we want using llama-rs to be as easy as any other Rust crate).

@Narsil

Narsil commented Mar 31, 2023

Hey I've started seeing if the code from ggml couldn't be done in pure Rust, here's the first draft:

https://github.com/Narsil/rblas

It's x86_64, avx-only right now and I'm getting 2x slower than intel-mkl on my old personal computer.

RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
# don't forget to build for native to get all avx thing supported.

Not sure if I screwed something up in the translation, the f32 matmul of ggml isn't as good as intel-mkl, or my threading policy sucks (Using simple threadpool which isn't using spinlocks under the hood afaik)

Also threadpool and num_cpus can be removed as dependencies, they just make my life and the code easier.

Still, I'm sharing it in case people find it interesting to work on.

@KerfuffleV2
Contributor

I'm not sure if it's the same for GPT (I assume it would be) but at least with RWKV the vast, vast majority of the time was spent just in the matrix multiplication. The rest was basically insignificant.

You probably can just do simple/unoptimized versions of the other ops and come very close to equal performance as long as the MM part is fast.

@Narsil

Narsil commented Mar 31, 2023

the vast, vast majority of the time was spent just in the matrix multiplication. The rest was basically insignificant.

The softmax and layer norm can start to take up some time when not threaded.
Still not horrifying and beating torch, but still much more significant than one would expect.
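For reference, the "simple/unoptimized" versions of these two ops are short. A minimal single-threaded sketch, assuming plain `f32` slices and a layer norm without learned scale or bias (names are hypothetical and not taken from smelte-rs or ggml):

```rust
/// Numerically stable softmax over one row, computed in place.
pub fn softmax_in_place(xs: &mut [f32]) {
    // Subtract the maximum so that exp() cannot overflow.
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in xs.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

/// Layer normalisation of one row: zero mean and unit variance.
pub fn layer_norm(xs: &[f32], eps: f32) -> Vec<f32> {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    xs.iter().map(|x| (x - mean) * inv_std).collect()
}
```

Both make several passes over the row, which is where the time goes once the matrix multiplication itself is fast.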

Not sure if I screwed something up in the translation, the f32 matmul of ggml isn't as good as intel-mkl, or my threading policy sucks (Using simple threadpool which isn't using spinlocks under the hood afaik)

I'm beating cblas by ~30% using this code on M1.. I guess it's not that bad.

@kazord

kazord commented May 1, 2023

Did you take a look at how https://github.com/Noeda/rllama works?
It doesn't have q4 yet, but I got decent results on CPU and with OpenCL on my Radeon.
It supports splitting the job between CPU and GPU (but its implementation has too many barriers, so I had to rewrite it: https://github.com/kazord/rllama/tree/oclsplit2. I'm currently stuck at 200ms/token on f16 because of too many memory moves between CPU and GPU, each costing 18ms).

@mert-kurttutan

mert-kurttutan commented May 8, 2023

Just tossing another idea around, another choice of backend would be to use faer-core.

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

One caveat is that I was not able to run it on a Mac M1 machine.

@Narsil

Narsil commented May 9, 2023

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

This is indeed true and a testament to its creator.
However it isn't as fast on matmul on non contiguous blas calls (like A.matmul(B.T))

@mert-kurttutan

mert-kurttutan commented May 9, 2023

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

This is indeed true and a testament to its creator. However it isn't as fast on matmul on non contiguous blas calls (like A.matmul(B.T))

@Narsil Are you sure about the statement on noncontiguous calls? In my experiments, which use your ggblas bench script but with faer-core version 0.8.0 and a change in matrix dims, faer-core still gives the same performance as intel mkl for both matmul and matmul_t.

The result logs

test bench_faer_rs_n ... bench:   3,245,234 ns/iter (+/- 815,495)
test bench_faer_rs_t ... bench:   3,393,976 ns/iter (+/- 785,863)
test bench_mkl_n     ... bench:   3,042,964 ns/iter (+/- 677,965)
test bench_mkl_t     ... bench:   3,573,910 ns/iter (+/- 894,360)

I also got similar results for gemm backend (instead of faer-core backend).

I also checked that they are indeed calculating AB^T
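For readers unfamiliar with the distinction being benchmarked, here is a naive sketch of the two variants on row-major `f32` buffers. It only illustrates the memory access patterns; it is not how faer-core, gemm, mkl or ggblas implement them, and the function names are hypothetical.

```rust
/// C = A * B, with A stored as m x k and B as k x n (both row-major).
pub fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            // Walks one row of B and one row of C contiguously.
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}

/// C = A * B^T, with A stored as m x k and B as n x k (both row-major).
/// Each output element is a dot product of two contiguous rows.
pub fn matmul_t(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        let a_row = &a[i * k..(i + 1) * k];
        for j in 0..n {
            let b_row = &b[j * k..(j + 1) * k];
            c[i * n + j] = a_row.iter().zip(b_row).map(|(x, y)| x * y).sum();
        }
    }
    c
}
```

A library that only accepts a contiguous, non-transposed B would have to either copy B^T first or read B with a stride, which is one plausible source of the difference being discussed.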

@Narsil

Narsil commented May 9, 2023

Interesting numbers (they seem pretty high, are you modifying shapes ?)

test bench_faer_rs_n ... bench:     432,096 ns/iter (+/- 73,060)
test bench_faer_rs_t ... bench:     721,426 ns/iter (+/- 200,362)
test bench_ggblas_n  ... bench:     571,520 ns/iter (+/- 199,552)
test bench_ggblas_t  ... bench:     393,694 ns/iter (+/- 200,837)

Interestingly mkl is performing much worse today on my computer (not sure if I updated since).

In any case, I thought there was potentially some nice upgrade possible on faer-rs to get the best of both worlds, ideally.

@mert-kurttutan

mert-kurttutan commented May 9, 2023

Yep, I did modify them:

const M: usize = 128;
const N: usize = 768 * 3;
const K: usize = 768;

This seemed to decrease the variance/mean ratio (though not by much).
I think a more reliable, realistic and lower-variance test would be to try it as the matmul backend of some NN model.
I will try it in your smelte-rs project to see how it behaves
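Because the two sets of benchmark numbers use different shapes, converting ns/iter to throughput makes them easier to compare. A small sketch, assuming each benchmark iteration performs exactly one M x K by K x N multiplication (an assumption about the bench script, which is not shown here):

```rust
/// Floating point operations in one (m x k) * (k x n) matrix product:
/// one multiply and one add per inner-loop step.
pub fn matmul_flops(m: usize, n: usize, k: usize) -> u64 {
    2 * m as u64 * n as u64 * k as u64
}

/// Throughput in GFLOP/s; FLOPs per nanosecond is numerically the same unit.
pub fn gflops(flops: u64, ns_per_iter: u64) -> f64 {
    flops as f64 / ns_per_iter as f64
}
```

For the shapes above, `matmul_flops(128, 768 * 3, 768)` is 452,984,832, so the 3,245,234 ns/iter result corresponds to a little under 140 GFLOP/s under that assumption.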

@Narsil

Narsil commented May 9, 2023

I will try it in your smelte-rs project to see how it behaves

Thanks. I don't have much time to add new features on it. In general mkl used to be the best runtime for all engines I tried.

@coreylowman

coreylowman commented May 9, 2023

Hey this is the dfdx author. I recently added f16 support for both cpu & gpu (on main currently waiting for new release from half-rs). I actually just fully implemented llama using dfdx, which you can find here https://github.com/coreylowman/llama-dfdx. Notably it supports GPU, and can lazily load tensors stored on disk at runtime when they are needed.

I also moved the CPU matrix multiplication to using gemm (which is the backend of faer-rs) literally this morning 😅

Happy to add anything that you guys would need to use it!

Edit: I was getting ~30ms/token on an A10 GPU with llama-7b

@philpax
Collaborator Author

philpax commented May 10, 2023

Awesome! The main things we'd want that I don't believe dfdx has are the ability to use memory-mapped tensors (so not just lazy-loading, but not copying at all) and 4/5/8-bit quantisation support.

Our current thinking on this is to implement a computation graph abstraction (#137), and then have that shell out to alternate backends as required or as available. I'd love to see dfdx as either the provider of the abstraction or a backend in itself :)

@Narsil

Narsil commented May 10, 2023

a computation graph abstraction

I feel forced to say that this approach has major drawbacks, the biggest of all being that it's hard to implement efficient runtimes.
onnxruntime has fixed primitives to rewrite graphs on the fly for gpt2, and those do not work for most other models (which use slightly different attention code). And in order to implement anything of the kind you're basically implementing a graph compiler.
And no one so far has managed to pull it off. (PyTorch has 10 different kinds, and it doesn't work most of the time; XLA has very severe limitations in terms of control; onnxruntime is also limited.)
And what seems to happen in the optimization space right now is that users will try 10 different "out-of-the-box" compiling solutions, and keep the most performant (even if it's super bad and just happens to be the best). And usually there is no consistent winner across the board for different hardware and different models.

And if someone has a super clever idea to do computations more efficiently, well it's much harder to implement in those graphs, since you have to talk an entire new language (the graph structure).

My very personal opinion is that we shouldn't have "graph computation" models, but real code, as real code is a full descriptor of your graph (no need to reimplement the wheel there). It's fully qualified already, there are already great ways to modify any part you want without having to understand the graph structure.

Like if I want to reimplement a given model with new kernel X that can replace existing ops, there's an obvious way to do it (rewrite the corresponding code). It's very much not easy to do on computation graphs.

Ofc, for training, in order to get automatic differentiation, we need a computation graph. PyTorch seemed to have made it work correctly without having a "real computation graph", it ends up being classic python code, which is where it wins imo.

@coreylowman

Yeah, both of those are accurate.

4/5/8-bit quantisation support.

We have a tracking issue for 4 bit quantization, and someone actually was working on a draft PR for this, but it has kind of stalled out given how specific of a use case it is. So there is a fairly complex way forward, but it'll take a not insignificant amount of time. Luckily llama doesn't use all the possible tensor operations, so the MVP is probably just implementing kernels for the specific ops we'd need.

Has anyone done 4 bit quantization on CUDA? Or is this specifically for Cpu optimizations?

memory-mapped tensors (so not just lazy-loading, but not copying at all)

I was thinking about memory-mapped tensors yesterday (probably for a similar use case, where CPU tensors for llama can just use memory mapping for data storage). There might be a way to do this on top of dfdx, similar to how I did the lazy tensor stuff; however, it'd be really unsafe/sketchy. Basically we'd have to construct a Vec<T> from the mmap'd &[u8] and then ensure that we can mem::forget the tensor so the Rust allocator doesn't try to free the vec. This may not be possible without causing undefined behavior. I'll experiment with this and let y'all know.
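One alternative that avoids the allocator question altogether is to borrow the mapped bytes as a typed slice instead of building a `Vec<T>`. The sketch below assumes the mapping is already available as a `&[u8]` (for example from a memory-mapping crate, not shown) and that the data is stored in native byte order. Whether dfdx tensors can be backed by borrowed storage is not established in this thread, and the function name is hypothetical.

```rust
use std::mem::{align_of, size_of};

/// View a byte buffer as `&[f32]` without copying.
/// Returns `None` if the buffer is misaligned or not a whole number of f32s.
pub fn as_f32_slice(bytes: &[u8]) -> Option<&[f32]> {
    let misaligned = bytes.as_ptr() as usize % align_of::<f32>() != 0;
    let ragged = bytes.len() % size_of::<f32>() != 0;
    if misaligned || ragged {
        return None;
    }
    // SAFETY: the pointer is aligned for f32, the length covers whole f32
    // values inside `bytes`, every bit pattern is a valid f32, and the
    // returned slice borrows `bytes`, so it cannot outlive the mapping.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / size_of::<f32>())
    })
}
```

Since nothing is owned, nothing is ever freed through the Rust allocator, and the borrow checker keeps the view from outliving the buffer it points into.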

a computation graph abstraction

+1 on narsil's response.

This seems like a very complex way to gain access to running on multiple devices. The easiest/hackiest way to do it would be with feature flags, something like:

#[cfg(feature = "cuda")]
// call e.g. dfdx backend
#[cfg(feature = "cpu")]
// call existing ggml backend

While you have to maintain two separate pieces of code that do the same thing, I think it's probably simpler than creating/impl'ing a graph abstraction.
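Spelled out as compilable code, the feature-flag approach could look like the sketch below. The `cuda` feature name follows the snippet above; the function names and returned strings are placeholders, and a real crate would declare the feature in its Cargo.toml and call into the actual backends.

```rust
/// Selected when the crate is built with `--features cuda`.
#[cfg(feature = "cuda")]
pub fn backend_name() -> &'static str {
    // Placeholder for a call into e.g. a dfdx-based implementation.
    "dfdx (cuda)"
}

/// Fallback when the `cuda` feature is not enabled.
#[cfg(not(feature = "cuda"))]
pub fn backend_name() -> &'static str {
    // Placeholder for a call into the existing ggml-based implementation.
    "ggml (cpu)"
}

/// Code shared by both configurations only sees one `backend_name`.
pub fn describe() -> String {
    format!("running on {}", backend_name())
}
```

Using `not(feature = "cuda")` for the fallback, rather than a separate `cpu` feature, guarantees that exactly one definition exists for every feature combination.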

Thoughts?

@Narsil

Narsil commented May 10, 2023

Has anyone done 4 bit quantization on CUDA? Or is this specifically for Cpu optimizations?

GPTQ does it : https://github.com/qwopqwop200/GPTQ-for-LLaMa (Triton backed, so you could steal their ptx file ! )

@coreylowman

coreylowman commented May 10, 2023

Also I tried out some sketch mmap stuff, and it seems like you can create a Vec from an mmap buffer. I have no idea how safe it is, but it seems to work (it produces the same generations as the regular copy version) 🤷 Was able to "load" all the 13gb of weights in 10ms on my dev laptop

Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Loaded weights in 9.5947ms

PR is here coreylowman/llama-dfdx#15

@philpax
Collaborator Author

philpax commented May 11, 2023

My very personal opinion is that we shouldn't have "graph computation" models, but real code, as real code is a full descriptor of your graph (no need to reimplement the wheel there). It's fully qualified already, there are already great ways to modify any part you want without having to understand the graph structure.

My thinking on this is that we already use a computation graph, through ggml: https://github.com/rustformers/llm/blob/main/crates/models/llama/src/lib.rs#L141-L326

Replicating this graph would be no worse than the current state of affairs, and it would allow us to directly "compile" our graph to the GGML graph in a way that would let us maintain compatibility.
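As a rough illustration of what such an abstraction might involve, here is a deliberately tiny sketch: a list of nodes in topological order, a backend trait, and an evaluator that walks the graph. Every type and method name is hypothetical; this is not the design proposed in #137 and it does not use ggml's API.

```rust
/// One operation in the graph. Operands are indices of earlier nodes.
#[derive(Debug, Clone)]
pub enum Node {
    Input(usize), // index into the `inputs` slice
    Add(usize, usize),
    Mul(usize, usize),
    Scale(usize, f32),
}

/// What a backend has to provide to execute the graph.
pub trait Backend {
    type Tensor: Clone;
    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn mul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn scale(&self, a: &Self::Tensor, s: f32) -> Self::Tensor;
}

/// Evaluate nodes in order and return the value of the last one.
/// Panics if a node refers to an input or node that does not exist yet.
pub fn evaluate<B: Backend>(backend: &B, nodes: &[Node], inputs: &[B::Tensor]) -> Option<B::Tensor> {
    let mut values: Vec<B::Tensor> = Vec::with_capacity(nodes.len());
    for node in nodes {
        let value = match *node {
            Node::Input(i) => inputs[i].clone(),
            Node::Add(a, b) => backend.add(&values[a], &values[b]),
            Node::Mul(a, b) => backend.mul(&values[a], &values[b]),
            Node::Scale(a, s) => backend.scale(&values[a], s),
        };
        values.push(value);
    }
    values.pop()
}

/// Reference backend operating element-wise on `Vec<f32>`.
pub struct CpuBackend;

impl Backend for CpuBackend {
    type Tensor = Vec<f32>;
    fn add(&self, a: &Vec<f32>, b: &Vec<f32>) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
    fn mul(&self, a: &Vec<f32>, b: &Vec<f32>) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x * y).collect()
    }
    fn scale(&self, a: &Vec<f32>, s: f32) -> Vec<f32> {
        a.iter().map(|x| x * s).collect()
    }
}
```

A second backend would implement the same trait, or translate the node list into its own graph representation. The objections raised earlier in the thread still apply: fusing or rewriting ops for speed is much harder at this level than in hand-written model code.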

I would need to read more into the state of affairs here before making a decision.

While you have to maintain two separate pieces of code that do the same thing, I think it's probably simpler than creating/impl'ing a graph abstraction.

At present, we have five models (with a sixth hopefully coming soon). Multiplying the maintenance work by the number of backends seems intractable over time. I'd like to avoid that as much as possible, for as long as possible.

Also I tried out some sketch mmap stuff, and it seems like you can create a Vec from an mmap buffer. I have no idea how safe it is, but it seems to work (it produces the same generations as the regular copy version) 🤷 Was able to "load" all the 13gb of weights in 10ms on my dev laptop

Very cool! I'll have to give this more of a look soon 🙂

@hhamud hhamud added the meta:help-wanted Extra attention is needed label May 13, 2023
@philpax
Collaborator Author

philpax commented May 17, 2023

I've opened an issue with wonnx regarding GPU inference: webonnx/wonnx#169

I imagine it will be non-trivial for them to implement a more freeform interface (if they're interested in doing so), so it may not be done/could take a long time. That being said, I would love to see non-CUDA GPU inference!

@philpax
Collaborator Author

philpax commented May 22, 2023

Just listing all potential backends that come to mind, feel free to suggest more:

  • ggml
  • burn
  • dfdx
  • wonnx
  • smelte-rs
  • faer
  • onnxruntime
  • MLC
  • cuDNN
  • OpenVINO
  • ROCm
  • Torch

Note that some of these overlap and/or are at different abstraction levels. I'm just listing them out for general reference.

@wsxiaoys

wsxiaoys commented May 22, 2023

https://github.com/OpenNMT/CTranslate2 is another solid choice (cpu / gpu (cuda) support, wide model support matrix)

@philpax philpax added the topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features label Jun 15, 2023