[SYCL][Intel GPU] Long Term Features & Issues Tracking #5277

airMeng · 2024-02-02T08:19:09Z

Collaborator

Feel free to drop a note and let us know if you have any feature requests or bugs (even unconfirmed).

Multi-card Support. Issue #5282, PR #5806
Multi-batch Support #5272 Low performance with Sycl Backend #5480
CI test error when more than one GPU is detected and used.
The current code returns all SYCL devices, including CPU, GPU (level-zero, opencl), and FPGA, but the SYCL backend only supports GPUs. As a result, CI tests fail when they run on other devices.
Support the no-mmap parameter in other applications.
There is a known SYCL issue: memcpy() from host (mmap) to device hangs in some cases. It is not resolved yet. A workaround is to not use mmap. I have handled it in llama-bench (added the --mmap parameter). We need to add it to more applications in examples.
Clean up code for warnings and unused macros and variables.
Suggest handling this after multi-card support is finished. A lot of the currently unused code will be useful for the multi-card feature.
Support SYCL build for Nvidia and AMD targets #5357
Improve first token performance.

Also let us know if you have taken any tasks here.

cc @NeoZhangJianyu @luoyu-intel @abhilash1910
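
To illustrate the device-enumeration item above, here is a minimal sketch of filtering an enumerated device list down to GPUs on a single backend. `DeviceInfo`, `DeviceKind` and `select_gpus` are hypothetical stand-ins and are not types from SYCL or llama.cpp; in real SYCL code, `sycl::device::get_devices(sycl::info::device_type::gpu)` would be the likely starting point.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an enumerated device; not a real SYCL or llama.cpp type.
enum class DeviceKind { cpu, gpu, accelerator };

struct DeviceInfo {
    std::string backend; // e.g. "level_zero" or "opencl"
    DeviceKind  kind;
    std::string name;
};

// Keep only GPU devices exposed through the requested backend, so that a GPU
// visible through two backends is not counted twice.
std::vector<DeviceInfo> select_gpus(const std::vector<DeviceInfo> & all,
                                    const std::string & backend) {
    std::vector<DeviceInfo> out;
    for (const auto & d : all) {
        if (d.kind == DeviceKind::gpu && d.backend == backend) {
            out.push_back(d);
        }
    }
    return out;
}
```

With a list containing an FPGA emulator, a CPU, and one GPU visible through both OpenCL and Level Zero, `select_gpus(all, "level_zero")` would keep a single entry.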

NeoZhangJianyu · 2024-02-02T08:38:14Z

Collaborator

I'd like to handle: Multi-card Support, and the CI test error when more than one GPU is detected and used.

abhilash1910 · 2024-02-02T12:21:39Z

Collaborator

For code cleanup and sanitization of the compiler runtime, I will add a patch after the previous changes land.
I will also take SYCL builds for other vendor targets.

DatCaptainHorse · 2024-02-02T19:15:35Z

Would like to see support for SOTA 2-bit quantized models (GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ3_XXS).

I've been trying to do this myself for the past hour or two; using dpct isn't as trouble-free as it makes you think.

characharm · 2024-02-02T20:22:04Z

before airMeng:sycl_fix_max_alloc_size

I don't know if these numbers can be considered when the model produces useless output.

Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | GPU BLAS | 99 | pp 512 | 430.52 ± 38.83 |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | GPU BLAS | 99 | tg 128 | 16.14 ± 0.13 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | GPU BLAS | 99 | pp 512 | 734.90 ± 149.16 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | GPU BLAS | 99 | tg 128 | 22.00 ± 0.17 |

after

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | SYCL | 99 | pp 512 | 392.43 ± 33.81 |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | SYCL | 99 | tg 128 | 9.75 ± 0.06 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | SYCL | 99 | pp 512 | 703.17 ± 122.82 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | SYCL | 99 | tg 128 | 19.07 ± 0.15 |

vulkan latest

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | Vulkan | 99 | pp 512 | 33.99 ± 0.53 |
| llama 13B Q5_K - Medium | 8.60 GiB | 13.02 B | Vulkan | 99 | tg 128 | 8.04 ± 0.01 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | pp 512 | 74.76 ± 3.49 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | tg 128 | 21.26 ± 0.10 |

qnixsynapse · 2024-02-10T16:52:07Z

I wonder why the build fails with -DLLAMA_SYCL_F16=ON for my Intel Arc A750.

I think this GPU supports f16.

But anyway, I'm getting 16 tokens/sec for q4_k_m for 7B with all layers on GPU.

With batch bench, Mistral-7B Q4_K_M:

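| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 1 | 256 | 0.497 | 257.31 | 8.129 | 15.75 | 8.626 | 29.68 |
| 128 | 128 | 2 | 512 | 0.539 | 475.29 | 43.357 | 5.90 | 43.896 | 11.66 |
| 128 | 128 | 4 | 1024 | 0.799 | 641.06 | 30.104 | 17.01 | 30.903 | 33.14 |
| 128 | 128 | 8 | 2048 | 1.546 | 662.36 | 30.739 | 33.31 | 32.285 | 63.44 |
| 128 | 128 | 16 | 4096 | 3.261 | 627.93 | 32.109 | 63.78 | 35.371 | 115.80 |
| 128 | 256 | 1 | 384 | 0.338 | 378.75 | 15.331 | 16.70 | 15.669 | 24.51 |
| 128 | 256 | 2 | 768 | 0.452 | 566.45 | 86.797 | 5.90 | 87.249 | 8.80 |
| 128 | 256 | 4 | 1536 | 0.759 | 674.89 | 60.522 | 16.92 | 61.281 | 25.06 |
| 128 | 256 | 8 | 3072 | 1.548 | 661.63 | 61.703 | 33.19 | 63.251 | 48.57 |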

4 replies

NeoZhangJianyu Feb 12, 2024
Collaborator

@akarshanbiswas
It should be fixed by PR #5411.
The FP16 build has been added to CI, so there shouldn't be FP16 build issues in the future.

By the way, if you have more issues, please create or update them under "Issues" instead of "Discussions".

qnixsynapse Feb 12, 2024

Oh okay. The title of this discussion has "issues tracking" so I mentioned that issue here. :)

airMeng Feb 13, 2024
Collaborator Author

@akarshanbiswas Could you report the full performance here, like in the comments above? We don't have an A750 on our side, so a performance baseline on the A750 would be quite helpful. Thank you for your effort in helping improve the SYCL backend.

You can find the standard performance measurements here

qnixsynapse Feb 13, 2024

@airMeng Thank you. I was able to test a little and I have updated my comment here.

kunger97 · 2024-03-01T13:17:57Z

I would like to know if there are plans to support quantization types that are not currently supported like iq3, iq4?

characharm · 2024-03-05T15:06:47Z

With the recent changes, the model nous-hermes-2-34b-2.69 has seen a significant speedup. From an unusable 3-4 tokens per second, it now reaches 7-8.

detect 1 SYCL GPUs: [0,] with Max compute units:512

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 30B Q2_K - Medium | 10.76 GiB | 34.39 B | SYCL | 99 | pp 512 | 159.76 ± 8.53 |
| llama 30B Q2_K - Medium | 10.76 GiB | 34.39 B | SYCL | 99 | tg 128 | 8.52 ± 0.02 |

build: 21b0867 (2345)

for comparison

Vulkan0: Intel(R) Arc(TM) A770 Graphics | uma: 0 | fp16: 1 | warp size: 32

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 30B Q2_K - Medium | 10.76 GiB | 34.39 B | Vulkan | 99 | pp 512 | 33.27 ± 0.29 |
| llama 30B Q2_K - Medium | 10.76 GiB | 34.39 B | Vulkan | 99 | tg 128 | 5.05 ± 0.05 |

build: 82cb31e (2348)

I hope other quantization methods will also see an improvement; for now, they perform more or less the same as Vulkan.

qnixsynapse · 2024-03-08T13:06:55Z

I have a question. During prompt processing or generation, llama.cpp's SYCL backend seems to use only one of the (I am assuming XMX) engines of my GPU, although they are tagged 'unknown' in intel_gpu_top. Does anyone know why? And is it possible to parallelize across all of them?

On Linux, I can't even monitor the VRAM usage or the temps, which is surprising because it has been a while since this GPU was launched. All my hopes are on the new Xe driver.

airMeng · 2024-03-10T11:55:56Z

Collaborator Author

> I have a question. During prompt processing or generation, llama.cpp's SYCL backend seems to use only one of the (I am assuming XMX) engines of my GPU, although they are tagged 'unknown' in intel_gpu_top. Does anyone know why? And is it possible to parallelize across all of them?
>
> On Linux, I can't even monitor the VRAM usage or the temps, which is surprising because it has been a while since this GPU was launched. All my hopes are on the new Xe driver.

@akarshanbiswas can you try https://github.com/intel/xpumanager to monitor the usage?

3 replies

qnixsynapse Mar 10, 2024

It is not yet available for Arch Linux. I think I will have to compile it from source. Also, it is for datacenter GPUs. Will it support Intel Arc?

p.s I didn't find any sysfs interface for monitoring temps or vram usage. Will this work with currently stable mainline kernel?

airMeng Mar 11, 2024
Collaborator Author

> will it support Intel Arc

Yes, it will.

> I didn't find any sysfs interface for monitoring temps or vram usage

you can get temps and vram usage here https://github.com/intel/xpumanager/blob/fdcb817c0dfadf9423da4dfe6d99fa712f62f5b6/doc/smi_user_guide.md?plain=1#L637

qnixsynapse May 11, 2024

I ended up successfully building the package and it shows this output (with root):

+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 0                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         | N/A                                                                |
| EU Array Active (%)         | N/A                                                                |
| EU Array Stall (%)          | N/A                                                                |
| EU Array Idle (%)           | N/A                                                                |
|                             |                                                                    |
| Compute Engine Util (%)     | N/A                                                                |
| Render Engine Util (%)      | Engine 0: 0                                                        |
| Media Engine Util (%)       | N/A                                                                |
| Decoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Encoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Copy Engine Util (%)        | Engine 0: 0                                                        |
| Media EM Engine Util (%)    | Engine 0: 0, Engine 1: 0                                           |
| 3D Engine Util (%)          | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 37                                                                 |
| GPU Frequency (MHz)         | 2400                                                               |
| Media Engine Freq (MHz)     | N/A                                                                |
| GPU Core Temperature (C)    | N/A                                                                |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | N/A                                                                |
| GPU Memory Write (kB/s)     | N/A                                                                |
| GPU Memory Bandwidth (%)    | N/A                                                                |
| GPU Memory Used (MiB)       | 6804                                                               |
| GPU Memory Util (%)         | 84                                                                 |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+

Got the memory usage now but temps are still N/A.

qnixsynapse · 2024-03-21T14:46:36Z

Adding this here. This may add to the list of todos and fixes. Credit goes to Gemini 1.5 Pro 1 million. :)

Device-to-device memory copy across GPUs: The current implementation uses a workaround (copying data to host and then back to the other device) for device-to-device memory copies across different GPUs. (This goes along with multi card support)
[NeoZhangJianyu] Yes, it's a known limitation. A more efficient implementation depends on another library. This function won't impact performance because it is called rarely. ✅
Missing support for certain data types: Some data types, like GGML_TYPE_IQ4_NL, GGML_TYPE_IQ2_S, and GGML_TYPE_IQ4_XS, are not supported in the SYCL backend. Issue: [SYCL] Support newer non linear quantization #5674, PR for GGML_TYPE IQ2_S : [SYCL] iq2_s #6052, A fix (?): IQ1_S: attempt to fix SYCL #6014
Solved in support/fix more IQ OPs #6521 ✅
Limited use of SYCL extensions: The code could potentially benefit from utilizing more SYCL extensions for optimization. For example, the ext_oneapi_usm_device_read_only extension could be used for constant memory, and the ~~ext_intel_free_memory extension could provide more accurate memory usage information.~~
[AkarshanBiswas] ext_intel_free_memory is used but not sure about ext_oneapi_usm_device_read_only
[NeoZhangJianyu] Already use them, like ext_intel_free_memory. ✅
Thread safety: Some parts of the code, like the buffer type map in ggml_backend_sycl_split_buffer_type, might not be thread-safe. This could lead to issues when using multiple models or contexts simultaneously.
[NeoZhangJianyu] Is there a test case for this issue? I haven't tested the multiple-model case.
[AkarshanBiswas] Yes. Test it with continuous batching with slots > 1 and see if the driver is stable.
~~Memory management: Implementing a buffer pool for SYCL similar to the CUDA version could improve memory allocation and reuse efficiency.~~
[NeoZhangJianyu] The buffer pool is used in SYCL backend. ✅
Flash attention SYCL implementation after ggml : add Flash Attention #5021

Will update when I find anything else with my limited knowledge.
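
For the thread-safety item above, one common pattern is to guard a lazily populated map with a mutex. The sketch below shows that pattern only; `SplitBufferType` and `get_split_buffer_type` are hypothetical names, and this is not the actual code behind ggml_backend_sycl_split_buffer_type.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical placeholder for a split buffer type; not the real ggml struct.
struct SplitBufferType {
    std::vector<float> tensor_split;
};

// Returns one shared instance per distinct split, creating it on first use.
// The mutex serializes lookup and insertion, so two threads asking for the
// same split at the same time cannot both insert or observe a half-built entry.
const SplitBufferType * get_split_buffer_type(const std::vector<float> & tensor_split) {
    static std::mutex mtx;
    static std::map<std::vector<float>, std::unique_ptr<SplitBufferType>> cache;

    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(tensor_split);
    if (it == cache.end()) {
        auto entry = std::make_unique<SplitBufferType>();
        entry->tensor_split = tensor_split;
        it = cache.emplace(tensor_split, std::move(entry)).first;
    }
    return it->second.get();
}
```

Storing each entry behind a `std::unique_ptr` keeps the returned pointer stable even when the map grows later.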

2 replies

airMeng Apr 7, 2024
Collaborator Author

#6521

piDack Sep 16, 2024

Is there any progress on flash attention? I found that the Hugging Face Intel backend has implemented it: https://github.com/intel/intel-extension-for-deepspeed/tree/774453156c4ad6f4b831fb985494254645f5fe1d/intel_extension_for_deepspeed/op_builder/csrc/flash_attn

slaren · 2024-03-21T14:55:45Z

Collaborator

The SYCL backend should be updated to adopt these changes in ggml-backend:

Remove the code in ggml.c and implement the offload_op interface: backend : offload large batches to GPU #6083
Remove the code that copies GGML_BACKEND_TYPE_CPU tensors automatically to VRAM. After adopting the previous change, this will not be necessary, all the tensors received by the SYCL backend will always be allocated in a SYCL buffer.
Remove all usage of ggml_tensor::backend, as this will be removed in the future. To support split buffer types, use ggml_tensor::buffer to identify the storage type of the tensor instead.

5 replies

NeoZhangJianyu Mar 22, 2024
Collaborator

Yes, we plan to fix them in 2 PRs.

airMeng Mar 22, 2024
Collaborator Author

@slaren May I know more context about offload_op? It seems only a ggml_backend_sycl_offload_op is needed?

https://github.com/ggerganov/llama.cpp/pull/6217/files

slaren Mar 22, 2024
Collaborator

The purpose of offload_op is to offload computation to the GPU when the batch size is large enough, even if the weights are not stored in VRAM and have to be copied. For small batch sizes the cost of copying the weights is higher than the cost of the computation, but for large enough batch sizes, it is often faster to offload the computation to the GPU since it is usually much faster than the CPU.

Previously, this was implemented by hooking into the CPU backend in ggml.c in the function ggml_compute_forward, and doing the computation in a call to ggml_sycl_compute_forward. Now, this is handled in ggml_backend_sched entirely, and the backends only need to implement the offload_op function to choose what operations they wish to run even when the weights are on system memory. The weights are copied to VRAM by ggml_backend_sched, so backends do not need to check if the tensors are in RAM and copy them, they can assume that all the tensors they receive in a call to graph_compute will always be allocated in the backend-specific buffer.

Your implementation is a good first step, but you should also remove the code of SYCL in ggml.c. To test this, you can try prompt processing without fully offloading the model. For example, llama-bench -n 0 -ngl 0. For a backend such as SYCL, it may be good to consider the type of accelerator being used, and possibly disable for slow devices. This will also allow you to remove a lot of the code in the SYCL backend that deals with copying tensors between CPU and GPU, and generally simplify this logic.

You would also need to modify llama.cpp in llama_new_context_with_model to always create an instance of the SYCL backend, even when model->n_gpu_layers == 0.
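
As a rough illustration of the decision described above, a backend's offload_op could reduce to a batch-size check. This is a sketch under assumptions: `OpInfo`, `should_offload` and the threshold value are hypothetical and do not reflect the real ggml-backend interface, which works on ggml_tensor objects.

```cpp
#include <cstdint>

// Hypothetical, simplified view of an operation; the real interface receives a ggml_tensor.
struct OpInfo {
    int64_t batch_size; // number of tokens processed by this op
    bool    is_matmul;  // whether the op is a matrix multiplication
};

// Assumed threshold, chosen for illustration only.
constexpr int64_t kMinBatchForOffload = 32;

// Decide whether an op whose weights live in system memory should still run on
// the accelerator: only when the batch is large enough to amortize the weight
// copy, and only when the device is considered fast.
bool should_offload(const OpInfo & op, bool device_is_fast) {
    if (!device_is_fast) {
        return false; // slow devices: copying weights is unlikely to pay off
    }
    return op.is_matmul && op.batch_size >= kMinBatchForOffload;
}
```

The `device_is_fast` flag reflects the suggestion above to consider the accelerator type and possibly disable offloading for slow devices.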

airMeng Mar 24, 2024
Collaborator Author

@slaren for point 2 and 3, can we refer to #6170 ?

slaren Mar 24, 2024
Collaborator

Sure, if you find that useful you can use it. However, there will be further refactoring of the CUDA backend in #6269.

qnixsynapse · 2024-04-19T05:35:37Z

Just an update here: I did not use llama.cpp for a few days because I was busy. I ran it today to test llama-3 and found that it hangs every time, with every model, right here:

......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  SYCL_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   266.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    45.01 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 66

I am running with --no-mmap .

In the logs I found out:

kernel: Fence expiration time out i915-0000:03:00.0:server[5816]:14c!

Not sure if this is because of an update that I received on Arch Linux or not; not a single model runs with the same binary that used to run them before.
I have intel-compute-runtime version 23.48.27912.11-2, on lts kernel 6.6.27, intel-oneapi-base-toolkit is 2024.0.0.49564 and level-zero-loader is 1.15.1-1.
Not opening an issue right now because I am not sure if it is a bug in llama.cpp or not.

Update: Not related to llama.cpp, JAX with intel-extension-for-openxla hangs too. (now confirmed)

Update 2: Came across this: intel/compute-runtime#497
Update 3: linux > 6.6.25 is broken ~~(suspicious commit saving it here for bug reporting upstream.)~~

13 replies

qnixsynapse Apr 22, 2024

Yes. Here is the output of sycl-ls on my PC (I am still trying to debug):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100F OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [23.48.27912]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A750 Graphics 1.3 [1.3.27912]

I have reproduced it with llama2-7b-q4, and that's why I said earlier that it is not related to llama.cpp. Even JAX hangs (with the Intel OpenXLA PJRT plugin).

qnixsynapse Apr 22, 2024

Update: Fixed it!

There is a reason I did not report an issue at first: I was not sure what was causing the problem. Now I know. It is the new ~~linux-firmware version: 20240409.1addd7dc-1~~ linux version > 6.6.25.

I downgraded to version 6.6.25 and now it works.

Thinking of switching to debian on this system, I really hate updates breaking my important things around.

airMeng Apr 22, 2024
Collaborator Author

Can you help report the issue to https://github.com/oneapi-src/level-zero? Many thanks for your help!

qnixsynapse Apr 22, 2024

No problem!
Yes, I will open an issue tomorrow.

qnixsynapse May 7, 2024

Update: The regression happened in the i915 kernel driver. Potential culprit commits (according to Matt):

a7ff84a6fe5a drm/i915/gt: Enable only one CCS for compute workload
726ff623869d drm/i915/gt: Do not generate the command streamer for all the CCS
c1f7ce2a11a9 drm/i915/gt: Disable HW load balancing for CCS

Quoting him, "That workaround requires that we disable load balancing on the compute engines, assign all underlying hardware compute resources to a single engine, and then hide all except for one of the engines from userspace (which is also why the "physical engines" count goes from 4 down to 1 as you noted). It's a pretty complicated and invasive hardware workaround, so it has the highest likelihood of accidentally introducing issues."

Fix is 4cfca03f7641 ("drm/i915/gt: Automate CCS Mode setting during engine resets") which will hopefully land on stable kernels by next week.

Adding these here so that people who are experiencing similar issues might find these helpful.

airMeng · 2024-06-12T07:30:04Z

Collaborator Author

Latest SYCL is broken due to #7640 (comment), I am looking into it and hope to fix it soon.

cc @NeoZhangJianyu @AidanBeltonS

1 reply

airMeng Jun 17, 2024
Collaborator Author

fixed in #7710

mistrjirka · 2024-06-23T21:31:33Z

I am not sure if it is expected or not, but attempting to use the Intel GPU (Intel® Iris® Xe Graphics) on an Intel® Core™ i7-1165G7 results in much slower performance than on the CPU alone. I tried just the example prompt:

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
I get:

So around 1.6t/s for prompt eval and 3.55t/s for generation.
The GPU is used according to intel_gpu_top:

When I try it on CPU I get much better performance:
command: ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 0 -sm none -mg 0

It also seems that the output from the GPU is garbled and generally wrong.

3 replies

airMeng Jun 24, 2024
Collaborator Author

hi @mistrjirka, we have observed the extremely slow prompt eval issue and we are working on it. As for token generation, as you might know, the iGPU and CPU share the same DDR memory, so performance should be at the same level. We are working on it too, but don't expect large gains.

cc @luoyu-intel
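
The shared-memory argument can be made concrete with a back-of-the-envelope bound: if generating each token has to stream all model weights from memory once, throughput cannot exceed memory bandwidth divided by model size, regardless of whether the iGPU or the CPU does the math. The helper below is a sketch of that arithmetic; the function name is hypothetical and any inputs are assumptions, not measurements from this thread.

```cpp
// Upper bound on tokens/s when every generated token must read all weights once.
// bandwidth_gb_s: sustained memory bandwidth in GB/s (assumed input)
// model_size_gb:  size of the model weights in GB (assumed input)
double max_tokens_per_second(double bandwidth_gb_s, double model_size_gb) {
    if (model_size_gb <= 0.0) {
        return 0.0; // avoid dividing by zero on invalid input
    }
    return bandwidth_gb_s / model_size_gb;
}
```

For example, with a purely illustrative 50 GB/s of shared bandwidth and a 5 GB model, the bound is 10 tokens/s for either device.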

mistrjirka Jun 26, 2024

> hi @mistrjirka, we have observed the extremely slow prompt eval issue and we are working on it. As for token generation, as you might know, the iGPU and CPU share the same DDR memory, so performance should be at the same level. We are working on it too, but don't expect large gains.
>
> cc @luoyu-intel

Thanks for the info and thank you for working on this.
I am honestly more concerned about the weird repeating output.
I tried the prompt multiple times and the output is generally just nonsense. Sometimes it prints random characters. It shows up even when using fewer GPU layers. That could point to a problem unrelated to performance. I observed it on both Intel Xe graphics and Intel 620m. I am not sure if someone can replicate my problems on a dedicated card.

airMeng Jun 26, 2024
Collaborator Author

We are aware of that too; it seems the recent update of rope and mul_mat_id broke accuracy. We are working on it, sorry for any inconvenience.

ColonelPhantom · 2024-06-26T11:23:55Z

Hi, I'm trying to compile the SYCL backend but the compiler seems to almost get stuck on unicode-data.cpp? It's already taking 3 minutes and counting. Is this normal?

I do have an older, slower CPU (i5-8350U).

/opt/intel/oneapi/compiler/2024.0/bin/icpx -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_LLAMAFILE -DGGML_USE_SYCL -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -I/home/quinten/src/llama.sycl/. -Wno-narrowing -O3 -fsycl -L/lib -O3 -DNDEBUG -std=gnu++17 -I./ -I//opt/intel/oneapi/compiler/2024.0/include -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -march=native -MD -MT CMakeFiles/llama.dir/unicode-data.cpp.o -MF CMakeFiles/llama.dir/unicode-data.cpp.o.d -o CMakeFiles/llama.dir/unicode-data.cpp.o -c /home/quinten/src/llama.sycl/unicode-data.cpp

20 replies

mistrjirka Jul 4, 2024

> llama3 8B FP16 needs 18+ GB of memory. You could download the llama3 8B Q4 GGUF model from Hugging Face.

Yeah, that is what I am already testing, but I think the issue could be with performance on quantized models, which I cannot really check while using only a quantized model.

mistrjirka Jul 4, 2024

> llama3 8B FP16 needs 18+ GB of memory. You could download the llama3 8B Q4 GGUF model from Hugging Face.

I saw that you have merged some performance-related PRs. I will test whether performance improved in my case.

mistrjirka Jul 4, 2024

> llama3 8B FP16 needs 18+ GB of memory. You could download the llama3 8B Q4 GGUF model from Hugging Face.

I tested master because of #8286, but it did not really improve anything. There is no performance improvement.

ColonelPhantom Jul 4, 2024

> > llama3 8B FP16 needs 18+ GB of memory. You could download the llama3 8B Q4 GGUF model from Hugging Face.
>
> Yeah, that is what I am already testing, but I think the issue could be with performance on quantized models, which I cannot really check while using only a quantized model.

You could consider testing another model, e.g. Phi-3-mini, which is <8 GB unquantized.

mistrjirka Jul 6, 2024

> > > llama3 8B FP16 needs 18+ GB of memory. You could download the llama3 8B Q4 GGUF model from Hugging Face.
> >
> > Yeah, that is what I am already testing, but I think the issue could be with performance on quantized models, which I cannot really check while using only a quantized model.
>
> You could consider testing another model, e.g. Phi-3-mini, which is <8 GB unquantized.

No, it is not the quantization, never mind. It is much slower than the quantized versions. So I guess that SYCL on integrated GPUs is just bad.

qnixsynapse · 2024-07-01T12:24:21Z

Is performance supposed to get worse when quantizing the K cache to either q8 or q4? Normally f16 gives 20 tokens/sec and q8_0 gives 5 tokens/sec on my Intel Arc.

Gemma 9B Q4_K_S:
with f16 k-cache:

With q8_0 k-cache:

12 replies

slaren Jul 6, 2024
Collaborator

That's another issue, but these copies are caused by the src1 being non contiguous. With f16 kv there is the mul_mat_vec_p021 kernel, but that doesn't work with quantized src0.

JohannesGaessler Jul 6, 2024
Collaborator

Ah okay, I see what you mean. Yes, for batch size 1 that would be an issue.

qnixsynapse Jul 7, 2024

Which kernel works with quantized src0 if you don't mind me asking?

zhentaoyu Jul 9, 2024

It's ggml_sycl_op_mul_mat, which needs a tensor copy to make it contiguous. fp16 has some other specific kernels that can handle non-contiguous q. See the kernel dispatcher here: https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-sycl.cpp#L3661

qnixsynapse Jul 9, 2024

I am guessing the equivalent in cuda is ggml_cuda_op_mul_mat.

I came across this:

if (src0_is_contiguous) {
    dev[id].src0_dd = split ? (char *) src0_extra->data_device[id] : (char *) src0->data;
} else {
    dev[id].src0_dd = dev[id].src0_dd_alloc.alloc(ctx.pool(id), ggml_nbytes(src0));
}
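
For context on what "contiguous" means in the branch above, here is a simplified sketch of a stride-based contiguity check. `TensorLayout` and `is_contiguous` are hypothetical and only cover non-quantized element types; this is not the real ggml helper, which also has to account for block-quantized types.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduced tensor layout: sizes in elements, strides in bytes.
struct TensorLayout {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  elem_size; // size of one element in bytes
};

// A layout is contiguous when elements are densely packed: the innermost stride
// equals the element size and every outer stride equals the previous stride
// multiplied by the previous dimension's length.
bool is_contiguous(const TensorLayout & t) {
    if (t.nb[0] != t.elem_size) {
        return false;
    }
    for (int i = 1; i < 4; ++i) {
        if (t.nb[i] != t.nb[i - 1] * static_cast<size_t>(t.ne[i - 1])) {
            return false;
        }
    }
    return true;
}
```

A transposed or permuted view keeps the same data but swaps strides, so it fails this check and would take the copy path.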

joeatodd · 2024-07-03T09:02:01Z

Hello - a heads up from the Codeplay side - we found that recent changes (#6408) introduced a lot of expensive device info queries:

        const int work_group_size = get_work_group_size(stream->get_device());

Unfortunately these queries aren't cached, and due to the way the DPCT headers are currently designed, this actually makes 15 different device info queries each time it's called 🫠. @OuadiElfarouki from our side is working on a caching mechanism which should fix this. On Nvidia hardware we found this created a significant performance drop (down from ~10 T/s to ~2 T/s).

Like I say, we are working on a fix, and I am just posting for info. @airMeng @NeoZhangJianyu

Maybe these queries are cheap on Intel drivers, but I am seeing above a lot of discussions about performance regressions. Could be related? @zhentaoyu
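
A minimal sketch of the kind of caching discussed above: run the expensive query once per device and serve later calls from a map. `WorkGroupSizeCache`, the integer device id and the query callback are hypothetical stand-ins; this is not the implementation from any of the PRs mentioned in this thread.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical cache: the integer id stands in for a real device handle and the
// callback stands in for the expensive device info query.
class WorkGroupSizeCache {
public:
    explicit WorkGroupSizeCache(std::function<int(int)> query) : query_(std::move(query)) {}

    // Returns the cached value, running the expensive query only on first use per device.
    int get(int device_id) {
        std::lock_guard<std::mutex> lock(mtx_);
        auto it = cache_.find(device_id);
        if (it != cache_.end()) {
            return it->second;
        }
        const int value = query_(device_id);
        cache_.emplace(device_id, value);
        return value;
    }

private:
    std::function<int(int)>      query_;
    std::mutex                   mtx_;
    std::unordered_map<int, int> cache_;
};
```

This assumes the queried property does not change during the lifetime of the process, which is what makes caching safe.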

7 replies

NeoZhangJianyu Jul 4, 2024
Collaborator

@joeatodd
I had the same concern when I saw this code.
I will fix it with a local cache.

NeoZhangJianyu Jul 4, 2024
Collaborator

Recently there have been more PRs for the SYCL backend, and some were not tested for performance.
I will check and fix them one by one.
Some changes may be reverted if doing so has no functional impact.

NeoZhangJianyu Jul 4, 2024
Collaborator

I provided a similar solution in a previous PR, but it was not merged and has since been canceled.
I will create a new PR for it.

NeoZhangJianyu Jul 4, 2024
Collaborator

Fixed by PR: #8286

OuadiElfarouki Jul 4, 2024

Fix I've been working on : #8301

joeatodd · 2024-07-24T11:02:29Z

A bug that I introduced with #8644 reveals that we don't have any CI which tests static builds (-DBUILD_SHARED_LIBS=OFF). On #8667 @airMeng suggested to update one of the CI Dockerfiles to test this. I merged that PR since it was fixing a bug on master, and I am opening this comment to track/discuss how best to approach this.

I suggested a couple of approaches here. What do you think is best?

kwaa · 2024-07-28T08:17:24Z

Hello, here is a Feature Request: Can SYCL build be supported in llama.cpp's Nix Flakes?

I want to use it in NixOS but struggle to package oneAPI.

4 replies

airMeng Jul 29, 2024
Collaborator Author

I am not familiar with Nix, but I saw someone working on it for ollama in NixOS/nixpkgs#327999 (comment). Could you give it a try?

NeoZhangJianyu Jul 29, 2024
Collaborator

You could refer to how the Windows release package is built in https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/build.yml: windows-latest-cmake-sycl.

In fact, we aren't familiar with NixOS.
We welcome NixOS experts to join this work, and we can provide support.

kwaa Aug 17, 2024

I tried to package intel-llvm, but its buildbot script doesn't work (and there's almost no documentation on it).

I've also looked at MordragT's packaging approach, which feels a bit too cumbersome.

MordragT Dec 12, 2024

Hey kwaa if you still need help setting it up, I could maybe create something with cachix if the build time was your problem.

qnixsynapse · 2024-08-12T04:46:53Z

I just found out that the SYCL backend doesn't use joint_matrix, so probably no Tensor cores or XMX.

This is confirmed by the comment here:

https://github.com/ggerganov/llama.cpp/blob/4134999e01f31256b15342b41c4de9e2477c4a6c/ggml/src/ggml-sycl/common.hpp#L59C1-L63C7

which means that backend can get much faster with proper implementation on discrete GPUs. It is not even utilizing the hardware properly! I need to forget about flash attention (the issue was closed for being stale)

We have many examples here and I am not sure how to implement this.

5 replies

airMeng Aug 12, 2024
Collaborator Author

hi @qnixsynapse for most GEMV cases, there will be no benefit from XMX. For first token GEMM, we rely on MKL which should utilize XMX already

qnixsynapse Aug 12, 2024

> first token GEMM

You mean (batched) prompt processing?

airMeng Aug 12, 2024
Collaborator Author

either bs!=1 or seq!=1

qnixsynapse Aug 12, 2024

Got it! Thanks.

qnixsynapse Aug 26, 2024

Would it be preferable to switch all the kernels to oneDNN, not just matmul?
The API for oneDNN primitives looks very clean to me and is said to have better compatibility.

0xDEADFED5 · 2024-11-28T05:56:02Z

0xDEADFED5
Nov 28, 2024

thank you SYCL crew, all the work is appreciated.

0 replies

qnixsynapse · 2024-12-02T12:43:55Z

qnixsynapse
Dec 2, 2024

Hi. I have some questions.

Is this intentional?

We can simply call the functions with the required parameters, which should be better, right?

5 replies

airMeng Dec 2, 2024
Collaborator Author

of course it is doable.

qnixsynapse Dec 2, 2024

I think we should simplify this so that the function for each operation can choose its parameters independently.

qnixsynapse Dec 13, 2024

Thoughts?
Also, CC @Rbiessy and maybe @abhilash1910

Rbiessy Dec 13, 2024
Collaborator

On the first question about refactoring ggml_sycl_compute_forward: it seems that func is used so that ggml_sycl_set_peer_access can be called in some cases before calling func. Assuming we want to keep this behavior, I would suggest keeping func but changing the type of ggml_sycl_func_t so that it does not contain the src0 and src1 arguments. They can all be deduced from the last ggml_tensor object.
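
As a rough illustration of that suggestion, the sketch below uses stand-in types rather than the real ggml ones. `tensor`, `context`, `op_func_t`, `op_add`, `set_peer_access` and `compute_forward` are all hypothetical names; the only thing taken from the discussion is the idea that the callback receives just the context and the destination tensor and reads its operands from `dst->src`.

```cpp
#include <cassert>

// Stand-ins for ggml_tensor and ggml_backend_sycl_context. Only the fields
// needed for the illustration are modelled.
struct tensor {
    float    value  = 0.0f;
    tensor * src[2] = {nullptr, nullptr};  // operands, reachable from dst
};

struct context {
    int peer_access_calls = 0;
};

// Proposed callback shape: no src0/src1 parameters, only ctx and dst.
using op_func_t = void (*)(context & ctx, tensor * dst);

// Example op that deduces its operands from dst->src.
inline void op_add(context &, tensor * dst) {
    assert(dst->src[0] != nullptr && dst->src[1] != nullptr);
    dst->value = dst->src[0]->value + dst->src[1]->value;
}

// Placeholder for ggml_sycl_set_peer_access; here it only counts calls.
inline void set_peer_access(context & ctx) { ++ctx.peer_access_calls; }

// The dispatcher keeps the "do something before calling func" behaviour.
inline void compute_forward(context & ctx, tensor * dst, op_func_t func,
                            bool needs_peer_access) {
    if (needs_peer_access) {
        set_peer_access(ctx);
    }
    func(ctx, dst);
}
```

Each op can then pick whatever operands it needs from `dst`, and the dispatcher no longer has to forward arguments that some ops never use.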

On the second question about using mmq, we could help review this PR. It has some large implications:

We would need to verify the correctness of such a pass. Unfortunately, running the tests is not always enough; internally we also run the model and verify that the output is not gibberish.
We would need to verify the impact on performance for some models and some Nvidia and Intel devices. The PR would need to show evidence that it has an overall positive impact. We should be able to help with that.

qnixsynapse Dec 13, 2024

Thank you for sharing your insights. My original plan of refactoring the compute_forward function was shelved because I decided to use the GGML_UNUSED() macro for unused variables.

My current plan is to migrate this backend to tensor->buffer and write optimized mmq kernels for Intel to speed up PP (prompt processing). But I am not an expert in writing SYCL kernels, so I don't have enough confidence to do so.

The screenshot shows that Q4_K, Q5_K and Q6_K kernels are broken, hence I have commented them out because we did this in the last PR which got merged:

llama.cpp/ggml/src/ggml-sycl/mmq.cpp

Lines 816 to 817 in 64ae065

    
    constexpr int blocks_per_tile_x_row = QI4_K > WARP_SIZE ? 1 : WARP_SIZE / QI4_K; // == 1 if QK_K == 256
    const int kbxd = k % blocks_per_tile_x_row;          // == 0 if QK_K == 256

Also, no support for IQ_XX data types because these kernels were never updated.

Edit: According to this, the Xe-HPG architecture has a register size of 32 KB. The RDNA architecture, whose tiling settings I found in SYCL's mmq kernels, has a register size of 128 KB. I suspect that register overflows may have been causing the accuracy issues.

llama.cpp/ggml/src/ggml-sycl/mmq.cpp

Lines 1785 to 1803 in 9f35e44

    
    if (compute_capability >= VER_GEN13) {
        mmq_x  =  MMQ_X_Q4_0_RDNA2;
        mmq_y  =  MMQ_Y_Q4_0_RDNA2;
        nwarps = NWARPS_Q4_0_RDNA2;
    } else if (compute_capability >= VER_GEN12) {
        mmq_x  =  MMQ_X_Q4_0_RDNA1;
        mmq_y  =  MMQ_Y_Q4_0_RDNA1;
        nwarps = NWARPS_Q4_0_RDNA1;
    } else if (compute_capability >= VER_GEN9) {
        mmq_x  =  MMQ_X_Q4_0_AMPERE;
        mmq_y  =  MMQ_Y_Q4_0_AMPERE;
        nwarps = NWARPS_Q4_0_AMPERE;
    } else if (compute_capability >= VER_4VEC) {
        mmq_x  =  MMQ_X_Q4_0_PASCAL;
        mmq_y  =  MMQ_Y_Q4_0_PASCAL;
        nwarps = NWARPS_Q4_0_PASCAL;
    } else {
        GGML_ABORT("fatal error");
    }
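
To sanity-check the register-pressure suspicion, one could estimate how much accumulator storage a given tile setting needs. The sketch below is only a back-of-the-envelope helper under stated assumptions: it assumes each work-item keeps a (mmq_y / warp_size) x (mmq_x / nwarps) array of float accumulators, which is how the CUDA-derived mmq kernels appear to be laid out, and it ignores every other register the kernel uses. `tile_config` and the function names are hypothetical, and since it is not clear what unit the 32 KB and 128 KB figures are measured per, the budget is left as a parameter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile configuration; field names follow mmq_x / mmq_y / nwarps.
struct tile_config {
    int mmq_x;
    int mmq_y;
    int nwarps;
};

// Assumption: every work-item holds (mmq_y / warp_size) * (mmq_x / nwarps)
// float accumulators. Other register usage is not counted.
inline std::size_t accumulator_bytes_per_work_item(const tile_config & t,
                                                   int warp_size) {
    assert(warp_size > 0 && t.nwarps > 0);
    assert(t.mmq_y % warp_size == 0 && t.mmq_x % t.nwarps == 0);
    return static_cast<std::size_t>(t.mmq_y / warp_size) *
           static_cast<std::size_t>(t.mmq_x / t.nwarps) * sizeof(float);
}

// Accumulator bytes for one sub-group ("warp") of warp_size work-items.
inline std::size_t accumulator_bytes_per_sub_group(const tile_config & t,
                                                   int warp_size) {
    return accumulator_bytes_per_work_item(t, warp_size) *
           static_cast<std::size_t>(warp_size);
}

// True if the accumulators alone fit in the given per-sub-group budget.
inline bool accumulators_fit(const tile_config & t, int warp_size,
                             std::size_t budget_bytes) {
    return accumulator_bytes_per_sub_group(t, warp_size) <= budget_bytes;
}
```

For an illustrative tile of mmq_x = 64, mmq_y = 128, nwarps = 8 with a sub-group size of 32 (placeholder numbers, not values read from mmq.cpp), this gives 32 floats per work-item. A check like this only gives a lower bound on register use; a profiler or the compiler's spill report would be needed to confirm an actual overflow.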

Mar2ck · 2024-12-17T15:36:30Z

Mar2ck
Dec 17, 2024

I'd really like to see SYCL support added to the RPC server.

11 replies

qnixsynapse Dec 20, 2024

In PR #10840 I removed tensor->extra from init tensor and kept it only on split buffers to mirror the implementation in the CUDA backend.

NeoZhangJianyu Dec 20, 2024
Collaborator

@rgerganov OK, I think the SYCL backend could follow the CUDA style.

NeoZhangJianyu Dec 20, 2024
Collaborator

@qnixsynapse
I tested PR #10840 on 2 GPUs and it passed.
It's merged now.

So it looks like the RPC server could try to add the SYCL backend.

qnixsynapse Dec 21, 2024

@NeoZhangJianyu Thank you!

I believe that now the SYCL backend can be enabled in RPC. If it still doesn't work even with the latest changes, please inform me.
I will take the task to fix it after the holidays. :)

rgerganov Dec 21, 2024
Collaborator

I have submitted #10934 for adding SYCL in the rpc-server

NeoZhangJianyu · 2024-12-19T09:10:36Z

NeoZhangJianyu
Dec 19, 2024
Collaborator

I think internal refactoring should serve the user-facing features.
I suggest moving the focus to optimizing kernel performance on Intel GPUs.

0 replies
