CUDA: mul_mat_vec_q for batch sizes > 1 #5351
Conversation
Force-pushed from 742cf0d to dbb795b
This is looking very good! I'm doing some V100 tests and seeing similar gains
Yes, I think 8 should be fine for most purposes
I was thinking of writing a …
I've tried some improved implementations for both. I think one issue with FP16 arithmetic will be shared memory limits. In principle, if you can fit the entire hidden state into shared memory, you should be able to write a kernel that needs to load it only once per streaming multiprocessor. Even for the smallest hidden state of 4096 values, a batch size of 16 would need 16*4096*2 = 131072 bytes to store it as FP16. But you would only need 73728 bytes to store it as q8_1, and then it would fit into the 102400 bytes of shared memory per SM on Ampere (the A100 has 167936 bytes per SM).

There are also issues with tail effects. Ideally you would distribute the rows as evenly as possible between SMs to maximize GPU utilization, but tensor cores restrict you to multiples of 8/16/32 (depending on whether there are at least 32/16/8 hidden state columns). If you just use …
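For reference, a minimal sketch of the shared-memory arithmetic above, assuming q8_1 blocks of 32 values plus a 4-byte scale pair (36 bytes per block); the variable names are illustrative, not taken from the PR:

```cpp
// Back-of-the-envelope check of the numbers above, not code from this PR.
#include <cstdio>

int main() {
    const int ne00       = 4096; // smallest hidden state size discussed above
    const int batch_size = 16;
    const int nvals      = ne00*batch_size;

    const size_t bytes_f16  = (size_t) nvals*2;     // FP16: 2 bytes per value
    const size_t bytes_q8_1 = (size_t) nvals/32*36; // q8_1: 36 bytes per block of 32 values

    printf("FP16 hidden state: %zu bytes\n", bytes_f16);  // 131072
    printf("q8_1 hidden state: %zu bytes\n", bytes_q8_1); // 73728
    printf("shared memory per SM: 102400 bytes (consumer Ampere), 167936 bytes (A100)\n");
    return 0;
}
```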
Do you have some 13B models handy to see what the results are (I have just 7B on this V100 machine and it will take me some time to set up)? I suspect that the results might not be so good for larger models.

LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-q8_0.gguf 4800 0 99 0 50 100 1,2,3,4,5,6,7,8,16,32,64
Before this PR I got:
I'm thinking the larger the model, the stronger this effect would be, maybe.
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9392ebd (2061)
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9392ebd (2061)
This is using:

LLAMA_CUBLAS=1 make -j && ./llama-bench -ngl 99 -m models/openllama-7b-v2/ggml-model-q8_0.gguf -p 512 -b 1,2,3,4,5,6,7,8,16 -n 0

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
build: dbb795b (2075)

Before this PR:

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
build: 8a79c59 (2078)

Btw @slaren, you might be running the wrong commit? 9392ebd is on …
You are right, I forgot to pull before running the test. This is with this PR:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 2e9c0bd (2080)

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 2e9c0bd (2080)
The size of the model should not make a difference. That only increases the number of blocks that the GPU works on, but not the relative speed of the kernels. I suspect it's an issue related to GPU architecture where on Volta the number of registers per thread ends up too high; I'll look into adding launch bounds.
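For context, launch bounds are the standard CUDA mechanism for this: they cap the threads per block and let the compiler budget registers so that a minimum number of blocks stays resident per SM. A minimal, hypothetical example (not the actual mul_mat_vec_q kernel or its configuration):

```cuda
// Illustrative only: cap the kernel at 128 threads per block and ask the
// compiler to keep register usage low enough for at least 4 blocks per SM.
__global__ void __launch_bounds__(128, 4) scale_kernel(const float * x, float * y, const int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f*x[i];
    }
}
```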
I also see a performance drop with batch sizes 6-8 with 13B Q8_0 compared to the previous master with a 3090 Ti.
The model size does seem to matter somehow, because on the 3090 it's faster for 7B Q8_0 but slower for 13B Q8_0 at batch sizes 6-8.
The regression I'm seeing in my data and slaren's is for batch sizes 4->5 and 6->7; 4->6, 6->8, and 4->8 are consistently faster. I'm currently profiling a few runs to make sure there are no other weird things going on at those batch sizes.
I ran:
Then I looked at the runtime for …
There is a performance regression for 4->5 and 6->7 that is not caused by any other kernels. This is maybe related to pointer arithmetic, because multiplications and divisions by powers of 2 can be replaced with bit shifts, which makes them much faster. In addition to that, there may be GPU architecture related issues that cause problems on Volta (if I had to guess, the compiler assigns different numbers of registers per thread, so the impact of tail effects is different). More generally, I think the issue is that the current implementation in …
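To illustrate the pointer arithmetic point, a generic sketch (not the actual indexing code in the kernel; the helper names are made up):

```cuda
// When the number of columns is a power of 2, index -> (row, col) can be
// computed with a shift and a mask instead of an integer division and modulo,
// which are comparatively expensive on the GPU.
__device__ __forceinline__ int row_generic(const int idx, const int ncols) {
    return idx / ncols;                   // general case: integer division
}

__device__ __forceinline__ int row_pow2(const int idx, const int ncols_log2) {
    return idx >> ncols_log2;             // power-of-2 case: single shift
}

__device__ __forceinline__ int col_pow2(const int idx, const int ncols_log2) {
    return idx & ((1 << ncols_log2) - 1); // modulo replaced by a mask
}
```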
Yup, let's limit the …
On master the mul_mat_vec_q kernel only supports a batch size of 1. This PR implements support for batch sizes up to 8 in a minimally invasive way. In this range mul_mat_vec_q is universally faster; for larger batch sizes it depends on the quantization format. To keep things simple I am therefore only enabling the new implementation for batch sizes <= 8 (which should be enough for techniques like speculative decoding). I think the optimal solution would be to rewrite the mul_mat_vec_q kernel in a way that maximizes memory bandwidth (and also more optimization for the competing mul_mat_q kernels), but that is a larger project. As part of this PR I have also deduplicated the code for calling mul_mat_vec_q.

On my systems the performance changes as follows:

I did not test the new XS/XSS quants because I did not yet get around to setting them up.
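As a rough sketch of the batch-size cutoff described above (the constant and function names here are assumptions for illustration, not the actual structure of ggml-cuda.cu):

```cpp
// Hypothetical illustration of the dispatch decision, not code from this PR.
#include <cstdint>

#define MMVQ_MAX_BATCH_SIZE 8 // assumed constant name, chosen for illustration

static void dispatch_quantized_mat_mul(const int64_t ne11 /* src1 columns = batch size */) {
    if (ne11 <= MMVQ_MAX_BATCH_SIZE) {
        // new path: mul_mat_vec_q, now handling up to 8 columns at once
    } else {
        // larger batches: fall back to mul_mat_q or cuBLAS depending on build/format
    }
}
```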