-
In theory we should be able to generate these tuning files using Tensile. In practice, getting Tensile to run seems to be a bit of a challenge; rocBLAS can do it, but if I try invoking it from its own directory it doesn't find its dependencies. It's also not clear what configs (problem sizes) are needed to output the required tuning files. The docs mention a sample config file, but this is probably not what rocBLAS uses; we may need to reverse engineer it from the existing results. Which is crazy, because there's no reason AMD shouldn't just provide these files or a script that generates them, but hey. The files in

Ah, I see this is also discussed in #53. Never mind the overlap. :P When we have a working setup for generating full asm for a configuration, I can run it for gfx1036. Though, if the issues in the rocBLAS repo are to be believed, AMD is still leaving a lot of performance on the table for common cases.
-
I have done the same steps outlined here. Of note is that, to build the benchmark, I had to specify

Interestingly, while the new kernel is there and gives identical performance in the rocBLAS benchmark:

The same does not apply to the llama.cpp benchmark, which is still slow:

I can confirm the custom kernel is present in

There is no gfx1030- or gfx1036-specific logic in llama.cpp, other than marking both of them as RDNA2, so I find it very hard to explain this result. And yes, I did rebuild it in between, and ensured that

Unfortunately I didn't keep a backup of the old build without the custom kernel, because I'm an idiot. However, if I remove the logic files for Raphael from

Note that in this benchmark lower is better/faster, so without the copied kernel the rocBLAS benchmark is indeed up to 4x slower. llama.cpp does not benefit at all, which is odd. At a minimum you would expect performance no worse than what we now get for gfx1030, since we ought to be using the same routines now, yet there's still a >6x slowdown.
-
Good findings, I will do the llama testing later today. As a side note, there still seem to be a couple of places in the code where some checks are hardcoded to gfx1030. For example, in MIOpen:

And rocFFT also seems to have a gfx1030-specific check.
-
Before I spend a lot of time tuning for gfx1036, I'll first need to know where the perf difference is coming from, because copying the gfx1030 files should mean we get the same performance, but we don't. Tuning therefore cannot be the issue, at least not on its own. llama.cpp does compile its own kernels, so maybe something is going wrong in LLVM/clang there before it even gets to the rocBLAS kernels. I'm a bit skeptical, though. I have no experience tracking this sort of thing down, so it might take a while.
-
I am mostly busy today, but could you try out whether either this rocFFT or MIOpen patch gives a speedup?

0001-MIOpen-experimental-patch.patch.txt
-
I'm going to spend some time teaching myself how to trace these issues all the way down to the hardware: profiling, benchmarking, actual GPU assembly, etc., since I am in no particular personal hurry to get better support for the gfx1036. Obviously this will take some time, but it will hopefully increase my value to this project (and just in general). This means I will also take a break from endlessly recompiling this repo, which is nice. ;)
-
That's fine. I think the code could now be pretty much ready to be tagged as a release. I would also like to explore a couple of things a little deeper, and maybe do some coding to get the benchmarking started. I also started on the build command for a single binfo file but did not finish it yet.
-
Yeah, I don't think the release (or any particular release, for that matter) should be contingent on getting GPU support without an override, because for the most part the override works well. The only scenario where it won't do is when you want a multi-GPU setup that 1) involves an iGPU that 2) is of a different generation than the other GPUs and 3) you want to use both from one application, so
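For anyone wanting to check what they're dealing with, here is a minimal sketch (plain HIP API, nothing specific to this repo) that lists the architecture each device reports; this is where a single override value falls short when the iGPU and the dGPUs are of different generations:

```cpp
// Sketch: list each HIP device and the architecture it reports.
// In the multi-GPU scenario above, one HSA_OVERRIDE_GFX_VERSION value
// cannot be right for both an iGPU and a dGPU of different generations.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t props;
        hipGetDeviceProperties(&props, i);
        std::printf("device %d: %s (%s)\n", i, props.name, props.gcnArchName);
    }
    return 0;
}
```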
-
Well, the llama.cpp regression has been explained -- I don't know how I missed this, it was the second match in the file. There is a check in there that inappropriately does

Of course llama.cpp isn't the only project that checks for specific architectures in places, seeing as how they're not usually used with iGPUs, so we may have more things to patch. Hopefully not including anything that isn't part of the repo, since that would be inconvenient for end users.
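Purely as an illustration of the pattern (the actual llama.cpp check isn't quoted above, so this is a hypothetical stand-in): an exact-match architecture test silently excludes the iGPU members of a family:

```cpp
// Hypothetical example of the kind of check that bites iGPUs; this is
// NOT the actual llama.cpp code. An exact string match on the arch name
// excludes family members such as gfx1035/gfx1036, even though they are
// RDNA2 just like gfx1030.
#include <hip/hip_runtime.h>
#include <cstring>
#include <cstdio>

static bool is_rdna2(int device) {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, device);
    // Too strict: a family check, e.g.
    //   strncmp(props.gcnArchName, "gfx103", 6) == 0
    // would cover the whole gfx103x family instead.
    return std::strcmp(props.gcnArchName, "gfx1030") == 0;
}

int main() {
    std::printf("device 0 treated as RDNA2: %s\n", is_rdna2(0) ? "yes" : "no");
    return 0;
}
```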
-
I have now added your llama-server patch to the build, and I can confirm that it improved token generation 5x on gfx1035 as well.
The biggest thing missing is the xformers integration. It's a big task, as it would need adding Radeon support to
I found a Radeon Navi 31 branch for flash_attention and Composable Kernel that could help, but merging that code looked like a big job if we want to do it in a maintainable way.
-
In addition to the earlier-added gfx1035, rocm-sdk-builder now also has support for gfx1036 and gfx1103.
The original pull request contains quite a lot of discussion about testing and benchmarking with llama.cpp and pytorch-benchmark, so I am moving the discussion here, as the pull request will be closed.
#111
In addition to adding build support for those GPUs, we have also added initial rocBLAS logic files for them, after jeroen-mostert detected that gfx1036 was running faster by using the HSA_OVERRIDE method.
Even the initial logic files for gfx1035, gfx1036 and gfx1103 indicate pretty good improvements on some benchmarks, although they are not necessarily optimal for these GPUs. They were not made from scratch by tuning the kernels; instead, the existing logic files for gfx1030 and gfx1102 were modified so that they also get loaded for gfx1035, gfx1036 and gfx1103, so they are not ideal.
Initial testing has been done by running a simplified version of
https://github.com/LeiWang1999/rocblas-benchmark
with the test code modified to contain only this test:
std::make_tuple(8192, 8192, 8192, false, false, enable_tune),
Results (each test has been run a couple of times to verify that the numbers are at about the same level on each run):
So the current logic files already reduce the execution times for the last two tasks significantly, while the fp32 tasks do not see any benefit.
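For reference, the operation being timed boils down to a single large GEMM. Below is a minimal standalone sketch using the public rocBLAS API -- not the benchmark's actual code, shown with fp32 for simplicity (the benchmark also covers other precisions), error checking omitted, and the header path may be plain <rocblas.h> on older installs:

```cpp
// Sketch: time one 8192x8192x8192 SGEMM through rocBLAS.
// Needs ~768 MiB of device memory; matrices are zero-filled because
// their contents don't matter for timing.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <chrono>
#include <cstdio>

int main() {
    const int n = 8192;
    const size_t bytes = sizeof(float) * n * n;
    float *a, *b, *c;
    hipMalloc((void**)&a, bytes);
    hipMalloc((void**)&b, bytes);
    hipMalloc((void**)&c, bytes);
    hipMemset(a, 0, bytes);
    hipMemset(b, 0, bytes);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so kernel selection/loading isn't part of the measurement.
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, a, n, b, n, &beta, c, n);
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, a, n, b, n, &beta, c, n);
    hipDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    std::printf("sgemm 8192^3: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    rocblas_destroy_handle(handle);
    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}
```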
Jeroen has also worked with llama.cpp for benchmarking.