Speed up inference ~4x for 7B model without introducing too much complexity #95
base: master
Conversation
@krzysztof-jusiak Hey there - could you please explain how this works: "unroll the loop in matmul to perform 4 operations in parallel with SIMD." Presumably each line is done in parallel, but are there more details? I'm super-curious, thanks! |
I think this patch is quite nice, as it adds a minimal number of lines.
The loop is unrolled four times; see the generated assembly comparison. The inner loop now iterates in steps of 4 (j += 4). This patch is similar to #55 and #94, which use the same technique but unroll 16x. A minimal before/after sketch of the inner loop is shown below.
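For illustration, here is a minimal before/after sketch of the matmul inner loop (the fully commented version appears later in the thread; the assumption that n is a multiple of 4 holds for the model dimensions discussed here):

// Before: one multiply-accumulate per iteration.
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}

// After: the body is unrolled 4x (assumes n is a multiple of 4). With
// -Ofast/-ffast-math the compiler may reassociate the four accumulations
// into val and map them to SIMD instructions.
for (int j = 0; j < n; j += 4) {
    val += w[i * n + j]     * x[j];
    val += w[i * n + j + 1] * x[j + 1];
    val += w[i * n + j + 2] * x[j + 2];
    val += w[i * n + j + 3] * x[j + 3];
}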
Below are some of my results comparing this with the current main branch, using the smaller model44m listed in README.md. One run each, with two tok/s lines: the first line is the original and the second the updated version.

- run (model44m)
- runfast (model44m)
- runomp (model44m)

And here is the result of running hyperfine, 10 runs:

- run
- runfast
- runomp
Here are the specs of the machine/env I'm running on
|
Very cool stuff, perhaps we can integrate your loop unrolling with my fused matrix multiplies. Here is what I am getting with your PR on the same box I was testing with in my PR #94.

Fast:
f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
Once upon a time, there was a little peanut. The peanut was very small and lived in a big garden. One day, the peanut met a big, tall tree. The tree was very kind and let the peanut live with it.
One day, it was very cold outside. The peanut started to shiver. The big, tall tree saw the squirrel shivering too. The tree said to the peanut, "Come, sit with me. I will keep you warm." The peanut was polite and said, "Thank you, tree."
They became good friends. The peanut, the tree, and the tree were always together. They played and talked every day. The peanut was happy and warm. The big, tall tree was happy too. And they all lived happily ever after.
<s>
Once upon a time, there was a little boy named Tim. Tim was very excited because he found a big gear in his toy box. He wanted to show it to his friend, Sue.
At school, Tim met Sue and said, "Look at my big gear!" Sue looked at the gear and said, "Wow!
achieved tok/s: 47.832586

OMP:
f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
Once upon a time, there was a little red car. The car had a dream. It wanted to gain something special. One day, the car went on a long trip. It had to leave its friends behind. The car was very happy.
But on the trip, the car saw a big mess. There was a terrible mess everywhere. The car was sad. It thought, "I wanted to gain something special today, but there was no." It did not like the mess.
Then, the car saw a big tree. The tree was full of pretty flowers. The car had a good idea. It started to pick the flowers. The flowers made the terrible mess go away. The car gained something special after all. It gained the pretty flowers. The car was very happy.
<s>
Once upon a time, there was a loud dog named Max. Max loved to bark all day. He barked at his toys, at the flowers, and even at the people walking by.
One day, Max found a magazine on the ground. It had many fun pictures in it. Max thought it would be fun to bark at the pictures in the magazine, too. So, he barked and barked, and the pictures in
achieved tok/s: 175.222450

These are very impressive numbers for such a small change. It makes me even more confident that by combining the strategies there could be a huge potential win. |
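(For context: the "fused matrix multiplies" of #94 are not shown in this thread, so the following is only a hypothetical sketch of one common meaning of fusing: computing several matmuls that share the same input x in a single pass, so x is streamed through the cache once instead of three times. The function name and q/k/v naming are illustrative, not taken from #94.)

// Hypothetical fused version of three matmuls that share the input x,
// e.g. the q/k/v projections. x[j] is loaded once and reused three times.
void matmul_fused3(float* q, float* k, float* v, const float* x,
                   const float* wq, const float* wk, const float* wv,
                   int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float vq = 0.0f, vk = 0.0f, vv = 0.0f;
        const int i_n = i * n;
        for (int j = 0; j < n; j++) {
            const float xj = x[j];  // single load of x[j]
            vq += wq[i_n + j] * xj;
            vk += wk[i_n + j] * xj;
            vv += wv[i_n + j] * xj;
        }
        q[i] = vq;
        k[i] = vk;
        v[i] = vv;
    }
}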
You mentioned
Your question regarding the potential for performance improvements through techniques such as quantization for larger, memory-bound models is certainly intriguing. As you correctly pointed out, llama.cpp's q4 quantization does lead to significant speed improvements. However, as you said, perhaps this strays from the vision and could fall outside the scope of this project, which, as we've discussed, strives to strike a balance between simplicity and performance.

Reflecting on @karpathy's work with nanoGPT and minGPT, it's clear that he has already explored the spectrum from baseline models to more sophisticated implementations. In many ways, this project feels like a step up, pushing the envelope while still keeping the educational value high.

It's incredibly fun seeing how far we can take things, though. By examining which optimizations can be applied and understanding their impact, we're really pushing what can be achieved with CPU-bound models, while keeping the complexity at a manageable level (fingers crossed). Really looking forward to seeing where the project goes next. |
It's quite past my bedtime, but I was finally able to produce a couple of llama2-7b benchmarks.

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin
<s>
Here you can discover the Ukrainian brides, that can be found for a wedding in Kiev. There are thousands of Ukrainian women who are very dreamy in regards to the probabilities of getting to know the perfect man on the earth.
Ukrainian girls are very frank, so do not be afraid to ask her how a lot she costs for a date. There are many reasons why Ukrainian brides are so fashionable among men from the United States.
</s>
#94
<s>
SMART GOALS (Set Goals, Make a Plan, Accept Responsibility, Track Progress, and Achieve Success)
Great idea and not so great idea checklist for new habits
Prioritize: Can you imagine not prioritizing? Even if we don’t write it down, we do prioritize in life, setting goals and committing to them. Goal setting isn’t new, but it is powerful. What we write down is powerful and then taking concrete action steps to meet our goals.
Write goals for the future: write both long-term goals (five years) and short-term goals (one year).
When writing goals, the SMART checklist helps:
–
achieved tok/s: 2.055086

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin
<s>
Tags: linux, bash, shell, makefile
Question: Bash script to remove specific files that I don’t know their names
I’m trying to make a bash script where I have a list files to remove and at the same time I have a list files that I don’t want removed.
It can be like this:
\begin{code}
filesCommon = /home/user/somelongname.zip /home/user/somelongname.txt
filesNontoRemove = /home/user/genre.json /home/user/tags.csv
filesToRemove ?= $filesCommon
remove = cp $filesNontoRemove $filesCommon
\end{code}
I don’t know how to solve this problem. I would really appreciate your help.
Answer: If the common files are simply named `foo/something` and `bar/something`, then you could do `filesCommon=(`. But you'd better to find a solution that does not depend on names.
The fundamental problem is that a bash variable is the value of a variable, not the variable itself. A string that contains a list of strings can be
achieved tok/s: 2.369427

I'll take a look tomorrow and see if it is possible to merge the loop unrolling and the broader work. It certainly was a good start. Have a good day, and I look forward to more soon. |
Wow, way cleaner than #94 |
Absolutely, individual preferences can sway toward solutions requiring fewer changes, especially in the context of projects like this one where simplicity is a key factor. However, it's important to recognize that to realize substantial performance improvements, certain fundamental alterations may be unavoidable.

In my experience, it's indeed common to attain significant performance gains of up to 100% with relatively minor adjustments. But when striving for even greater enhancements, one often has to delve deeper and be prepared for more extensive modifications. The introduction of fused matrix multiplication, for instance, isn't something that can be achieved with just a few lines of code; that's intrinsic to its nature. Consequently, I believe that making these more complex changes earlier, when possible, sets a stronger foundation for future improvements, all while keeping in mind the delicate balance between optimization and maintainability.

In the end, our shared passion for maximizing the potential of this project is what unites us. Despite it being in its early stages (merely two days old), it's awesome to see the diverse range of ideas and approaches being explored. It underscores the importance of evaluating all possibilities to truly optimize what can be achieved. Passion is certainly a feature. |
Nice patch. I suggest adding comments to the code to retain its instructive nature and to provide novice code readers with an indication of what it does.

#pragma omp parallel for
for (int i = 0; i < d; i++) {
float val = 0.0f;
const int i_n = i * n;
// Loop is incremented by 4 to perform four calculations at once. This is known as loop unrolling.
for (int j = 0; j < n; j+=4) {
// Four calculations are conducted per iteration, which the compiler can map to SIMD (e.g. AVX2) instructions to speed up processing.
val += w[i_n + j] * x[j];
val += w[i_n + j + 1] * x[j + 1];
val += w[i_n + j + 2] * x[j + 2];
val += w[i_n + j + 3] * x[j + 3];
}
xout[i] = val;
} |
Problem:
- inference for the 7B model is slow.

Solution:
- unroll the loop in matmul to perform 4 operations in parallel with SIMD.

Result (with float16):
- before: 16 tok/s
- after: 71 tok/s
@clebert Thanks, added the comments; agree that they are very useful in this case, as there is additional complexity to deal with, but hopefully not too much. |
I'm not very familiar with this, but: why don't we just parallelize the for loop? I.e. add another |
Why not use |
I am curious: have you considered using the OpenMP SIMD directive? In my case this didn't do much, as I suspect the compiler was already taking care of SIMD automatically.

#pragma omp simd
for (int j = 0; j < n; j++) {
val += w[i * n + j] * x[j];
} |
@aegkmq There are defo ways to optimize it much more whilst keeping the simplicity. I just explored a bit after verifying that the matmul is the bottleneck, noticed the improvement, and created an MR as, IMHO, a nice step forward without introducing much complexity, but I wanted to verify whether that's the overall consensus. Godbolt link for the solutions: https://godbolt.org/z/Gb3dbxz6W |
@kris-jusiak Weirdly enough, I get around a 20% speedup by doing this. I specify the number of iterations to be vectorized as 4 instead of letting omp decide it. I guess this happens because 128-bit instructions have less latency and/or more instructions per cycle on my machine.

#pragma omp simd simdlen(4)
for (int j = 0; j < n; j++) {
val += w[i * n + j] * x[j];
}

Edit: this also seems to use xmm registers and do 4 iterations at a time because of the specified simdlen(4). Could you have a look if you have time? Thanks. |
This is a good discussion. Here are some additional results:
I think the way to go for simplicity, portability, and performance would be to get aligned memory (which @Foundation42 has already worked on) and use Vector Extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). That would simplify the code a bit and would probably perform better too. Something worth exploring, IMHO; a sketch of the idea is below. |
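A minimal sketch of what a vector-extensions matmul could look like, assuming w and x are 32-byte aligned (hence the aligned-memory prerequisite) and n is a multiple of 8; the typedef name v8f is illustrative, and the signature mirrors the matmul in run.c:

// 8 packed floats, i.e. 32 bytes: one AVX register on x86.
typedef float v8f __attribute__((vector_size(32)));

void matmul(float* xout, const float* x, const float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        const float* row = w + i * n;
        v8f acc = {0};
        // Element-wise multiply-add, 8 floats per step; requires aligned
        // pointers and n divisible by 8.
        for (int j = 0; j < n; j += 8) {
            acc += *(const v8f*)(row + j) * *(const v8f*)(x + j);
        }
        // Horizontal sum of the 8 lanes.
        float val = 0.0f;
        for (int k = 0; k < 8; k++) val += acc[k];
        xout[i] = val;
    }
}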
@krzysztof-jusiak I don't mind complexifying matmul a little bit because it is the place where 90% of the FLOPS go, and I think it's a good tradeoff. So I'm happy to merge something like this. That said, I'm not able to reproduce this speedup. Both master and this branch run at ~4.5 tok/s on my cloud machine with OMP 48 threads. Can you say a bit more about where you run this, and how it was compiled? |
The improvement has been tested with the fp16 model (#93) on a machine with SSE3/AVX/AVX2 support.
|
@karpathy One question: is your cloud machine a single-CPU machine, or does it have multiple CPUs? In the latter case, your machine might be a non-uniform memory access (NUMA) system, in which case multithreading can cause slowdowns because of data locality issues. Essentially, data allocated in the memory of one node is harder to access from other nodes, which causes latency. |
@krzysztof-jusiak oops, I missed #93, will def take a look after work. @Ea0011 my
|
@karpathy Ok. You have a single node and a single CPU, so you should not worry about any of that :). But using 48 threads can be a bit too much, I think. I wonder if you could achieve a speedup using fewer threads. |
If you add |