-
In theory we should be able to generate these tuning files using Tensile. In practice, getting Tensile to run seems to be a bit of a challenge; rocBLAS can do it, but if I try invoking it from its own directory it doesn't find its dependencies. It's also not clear what configs (problem sizes) are needed to output the required tuning files. The docs mention a sample config file, but this is probably not what rocBLAS uses; we may need to reverse engineer it from the existing results. Which is crazy, because there's no reason AMD shouldn't just provide these files or a script that generates them, but hey. The files in

Ah, I see this is also discussed in #53. Never mind the overlap. :P When we have a working setup for generating full asm for a configuration, I can run it for gfx1036. Though, if the issues in the rocBLAS repo are to be believed, AMD is still leaving a lot of performance on the table for common cases.
-
I have done the same steps outlined here. Of note is that, to build the benchmark, I had to specify

Interestingly, while the new kernel is there and gives identical performance in the rocBLAS benchmark:

The same does not apply to the llama.cpp benchmark, which is still slow:

I can confirm the custom kernel is present in

There is no gfx1030- or gfx1036-specific logic in llama.cpp, other than marking both of them as RDNA2, so I find it very hard to explain this result. And yes, I did rebuild it in between, and ensured that

Unfortunately I didn't keep a backup of the old build without the custom kernel, because I'm an idiot. However, if I remove the logic files for Raphael from

Note that in this benchmark lower is better/faster, so without the copied kernel the rocBLAS benchmark is indeed up to 4x slower. llama.cpp does not benefit at all, which is odd. At a minimum you would expect performance no worse than what we now get for gfx1030, since we ought to be using the same routines now, yet there's still a >6x slowdown.
-
Good findings, I will do the llama testing later today. As a side note, there still seem to be a couple of places in the code where some checks are hardcoded to gfx1030. For example, in MIOpen:

And rocFFT also seems to have a gfx1030-specific check.
-
Before I spend a lot of time tuning for gfx1036, I'll first need to know where the perf difference is coming from, because copying the gfx1030 files should mean we get the same performance, but we don't. Tuning therefore cannot be the issue, at least not on its own. llama.cpp does compile its own kernels, so maybe something is going wrong in LLVM/clang there before it even gets to the rocBLAS kernels. I'm a bit skeptical, though. I have no experience tracking this sort of thing down, so it might take a while.
-
I am mostly busy today, but could you try out whether either this rocFFT or MIOpen patch gives a speedup?

0001-MIOpen-experimental-patch.patch.txt
-
I'm going to spend some time teaching myself how to trace these issues all the way down to the hardware: profiling, benchmarking, actual GPU assembly, etc., since I am in no particular personal hurry to get better support for the gfx1036. Obviously this will take some time, but it will hopefully increase my value to this project (and just in general). This means I will also take a break from endlessly recompiling this repo, which is nice. ;)
-
That's fine. I think the code could now be pretty much ready to be tagged as a release. I would also like to explore a couple of things a little deeper, and maybe do some coding to get the benchmarking started. I also started on the build command for a single binfo file but did not finish it yet.
-
Yeah, I don't think the release (or any particular release, for that matter) should be contingent on getting GPU support without an override, because for the most part the override works well. The only scenario where it won't do is when you want a multi-GPU setup that 1) involves an iGPU that 2) is of a different generation than the other GPUs and 3) you want to use both from one application, so
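For anyone wanting to check what they're dealing with, here is a minimal sketch (plain HIP API, nothing specific to this repo) that lists the architecture each device reports; this is where a single override value falls short when the iGPU and the dGPUs are of different generations:

```cpp
// Sketch: list each HIP device and the architecture it reports.
// In the multi-GPU scenario above, one HSA_OVERRIDE_GFX_VERSION value
// cannot be right for both an iGPU and a dGPU of different generations.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t props;
        hipGetDeviceProperties(&props, i);
        std::printf("device %d: %s (%s)\n", i, props.name, props.gcnArchName);
    }
    return 0;
}
```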
-
Well, the llama.cpp regression has been explained -- I don't know how I missed this, it was the second match in the file. There is a check in there that inappropriately does

Of course llama.cpp isn't the only project that checks for specific architectures in places, seeing as how they're not usually used with iGPUs, so we may have more things to patch. Hopefully not including anything that isn't part of the repo, since that would be inconvenient for end users.
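Purely as an illustration of the pattern (the actual llama.cpp check isn't quoted above, so this is a hypothetical stand-in): an exact-match architecture test silently excludes the iGPU members of a family:

```cpp
// Hypothetical example of the kind of check that bites iGPUs; this is
// NOT the actual llama.cpp code. An exact string match on the arch name
// excludes family members such as gfx1035/gfx1036, even though they are
// RDNA2 just like gfx1030.
#include <hip/hip_runtime.h>
#include <cstring>
#include <cstdio>

static bool is_rdna2(int device) {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, device);
    // Too strict: a family check, e.g.
    //   strncmp(props.gcnArchName, "gfx103", 6) == 0
    // would cover the whole gfx103x family instead.
    return std::strcmp(props.gcnArchName, "gfx1030") == 0;
}

int main() {
    std::printf("device 0 treated as RDNA2: %s\n", is_rdna2(0) ? "yes" : "no");
    return 0;
}
```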
-
I have now added your llama-server patch to the build, and I can confirm that it improved token generation 5x on gfx1035 as well.
The biggest thing missing is the xformers integration. It's a big task, as it would need adding Radeon support to
I found a Radeon Navi 31 branch for flash_attention and Composable Kernel that could help, but merging that code looked like a big job if we want to do it in a maintainable way.
-
In addition to the earlier-added gfx1035, rocm-sdk-builder now also has support for gfx1036 and gfx1103.
The original pull request contains quite a lot of discussion about testing and benchmarking with llama.cpp and pytorch-benchmark, so I am moving the discussion here, as the pull request will be closed.
#111
In addition to adding build support for those GPUs, we have also added initial rocBLAS logic files for them, after jeroen-mostert detected that gfx1036 was running faster by using the HSA_OVERRIDE method.
Even the initial logic files for gfx1035, gfx1036 and gfx1103 indicate pretty good improvements on some benchmarks, although they are not necessarily optimal for these GPUs. They were not made from scratch by tuning the kernels; instead, the existing logic files for gfx1030 and gfx1102 were modified so that they also get loaded for gfx1035, gfx1036 and gfx1103, so they are not ideal.
Initial testing has been done by running a simplified version of
https://github.com/LeiWang1999/rocblas-benchmark
with the test code modified to contain only this test:
std::make_tuple(8192, 8192, 8192, false, false, enable_tune),
Results (each test has been run a couple of times to verify that the numbers are at about the same level on each run):
So the current logic files already reduce the execution times for the last two tasks significantly, while the fp32 tasks do not see any benefit.
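For reference, the operation being timed boils down to a single large GEMM. Below is a minimal standalone sketch using the public rocBLAS API -- not the benchmark's actual code, shown with fp32 for simplicity (the benchmark also covers other precisions), error checking omitted, and the header path may be plain <rocblas.h> on older installs:

```cpp
// Sketch: time one 8192x8192x8192 SGEMM through rocBLAS.
// Needs ~768 MiB of device memory; matrices are zero-filled because
// their contents don't matter for timing.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <chrono>
#include <cstdio>

int main() {
    const int n = 8192;
    const size_t bytes = sizeof(float) * n * n;
    float *a, *b, *c;
    hipMalloc((void**)&a, bytes);
    hipMalloc((void**)&b, bytes);
    hipMalloc((void**)&c, bytes);
    hipMemset(a, 0, bytes);
    hipMemset(b, 0, bytes);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so kernel selection/loading isn't part of the measurement.
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, a, n, b, n, &beta, c, n);
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, a, n, b, n, &beta, c, n);
    hipDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    std::printf("sgemm 8192^3: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    rocblas_destroy_handle(handle);
    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}
```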
Jeroen has also worked with llama.cpp for benchmarking.