ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869

zhouwg · 2024-04-24T08:29:34Z

Self Reported Review Complexity

Review Complexity : Low
Review Complexity : Medium
Review Complexity : High
I have read the contributing guidelines

Purpose

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with a market share of 70.1 percent .

Qualcomm is No.1 mobile SoC semiconductor company in our planet currently(MediaTek's market share is No.1 in Q1 2024 but I personally think Qualcomm is the real No.1 mobile SoC vendor in our planet). Hexagon NPU in Qualcomm Snapdragon 8 Gen 3 was designed for generative AI and delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, it make the Hexagon NPU the leading processor for on-device AI inferencing.

QNN(Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

TensorFlow: tf-1.15.0, or tf-2.10.1
TFLite: tflite-2.3.0
PyTorch: torch-1.13.1
ONNX: onnx-1.11.0

As a very compact/highly well-designed/highly optimization/highly performance C/C++ machine learning framework/library, this PR aims to add Qualcomm's QNN backend for ggml and focus on this accordingly:how to utilize the Hexagon NPU maximally with the highly well-designed/highly compact ggml machine learning framework.

Status

Data path works fine as expected with whisper.cpp and llama.cpp using QNN backend and verified on both low-end and high-end Android phones based on Qualcomm mobile SoC.

4x performance gains for GGML_OP_MUL_MAT using QNN CPU backend with 1 thread on a Qualcomm mobile SoC equipped high-end Android phone(a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC which released on Oct 2023). The performance of GGML_OP_MUL_MAT might/should/would be improved much more using QNN NPU(aka Hexagon Tensor Processor) backend after we know the secrets(QNN RPC, multithreading in NPU backend......) of Qualcomm's NPU(aka Hexagon Tensor Processor).

A dedicated Android command line program (for purpose of UT) works fine as expected on Qualcomm SM8650-AB Snapdragon 8 Gen 3 equipped high-end Android phone and other Qualcomm's low-end mobile SoC equipped low-end Android phone(QNN NPU backend not works on Qualcomm low-end Android phone).

/data/local/tmp//libQnnCpu.so
QNN libs already exist on Android phone
ggml-qnn-test: 1 file pushed. 16.3 MB/s (4567168 bytes in 0.267s)
[main, 344]: enter qnn_ggml_op

[main, 345]: ggml op:2(ADD)
[main, 359]: Allocating Memory of size 33554432 bytes, 32 MB

[ggml_backend_qnn_init, 3955]: device 0
[ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
[qnn_init, 2172]: enter qni_init

[load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so

[load_system, 2082]: find a valid qnn system interface

[load_system, 2092]: initialize qnn system successfully

[qnn_init, 2180]: load QNN system lib successfully

[load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so

[load_backend, 1935]: num_providers=1

[load_backend, 1960]: find a valid qnn interface

[load_backend, 2005]: saver_initialize is null

[qnn_init, 2213]: initialize qnn log successfully

[qnn_init, 2224]: initialize qnn backend successfully

[qnn_init, 2230]: device property is not supported

[qnn_init, 2241]: create device successfully

[qnn_init, 2245]: profiling turned on; level = 2
[qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object

[qnn_init, 2262]: initialize qnn profile successfully

[qnn_init, 2272]: load rpcmem lib successfully

[qnn_init, 2299]: initialize qnn context successfully

[qnn_init, 2302]: leave qni_init

[ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
[init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a2a43bcdc2f

[main, 395]: creating new tensors

[main, 396]: ggml_blck_size(f32) 1
[main, 397]: ggml_type_size(f32) 4
[main, 436]: creating backend buffer

[main, 448]: creating compute graph

[ggml_qnn_can_handle_op, 2458]: op name:ADD, tensor type:f32
[ggml_qnn_can_handle_op, 2460]: src0 type:f32
[ggml_qnn_can_handle_op, 2463]: src1 type:f32
[ggml_qnn_add, 2574]: call ggml_qnn_add

[ggml_qnn_add, 2578]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_add, 2582]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_add, 2586]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_add, 2587]: 4, 4, 1, 1
[ggml_qnn_add, 2588]: tensor0 name tensor_0
[ggml_qnn_add, 2589]: tensor1 name tensor_1
[ggml_qnn_add, 2590]: tensor2 name tensor_2
[ggml_qnn_add, 2617]: graph name ggml_op_qnn_add_1tensor_0_tensor_1
[ggml_qnn_logcallback, 2165]:     11.5ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseAdd 
[ggml_qnn_logcallback, 2165]:     11.5ms [VERBOSE] validate	Node-Type : ElementWiseAdd	Node-Name : ggml_op_add 
[ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::finalize 
[ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 1 
[ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 2 
[ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 3 
[ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::execute 
[get_tensor_rank, 210]: tensor->rank 4

[get_tensor_rank, 211]: get_tensor_rank 2

[get_tensor_data_size, 223]: get_tensor_data_size 64
[get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
[main, 464]: dump:

[tensor_dump, 191]: dump ggml tensor src0(tensor_0)
[tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:    -0.84     0.23    -0.07    -0.25 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.10    -0.32    -0.96     0.28 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.63    -0.59     0.29    -1.00 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.01     0.10     0.92     0.54 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor src1(tensor_1)
[tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:     0.99    -0.43    -0.41    -0.44 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.06     0.64    -0.61    -0.98 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.86    -0.11     0.41     0.27 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.54    -0.70    -0.90    -0.13 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor dst(tensor_2)
[tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:     0.15    -0.19    -0.48    -0.69 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.04     0.32    -1.57    -0.70 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -1.49    -0.70     0.70    -0.73 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.53    -0.60     0.02     0.42 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
[ggml_backend_qnn_free, 3764]: graph type:ADD
[qnn_finalize, 2318]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free

[ggml_backend_qnn_init, 3955]: device 0
[ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
[qnn_init, 2172]: enter qni_init

[load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so

[load_system, 2082]: find a valid qnn system interface

[load_system, 2092]: initialize qnn system successfully

[qnn_init, 2180]: load QNN system lib successfully

[load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so

[load_backend, 1935]: num_providers=1

[load_backend, 1960]: find a valid qnn interface

[load_backend, 2005]: saver_initialize is null

[qnn_init, 2213]: initialize qnn log successfully

[qnn_init, 2224]: initialize qnn backend successfully

[qnn_init, 2230]: device property is not supported

[qnn_init, 2241]: create device successfully

[qnn_init, 2245]: profiling turned on; level = 2
[qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object

[qnn_init, 2262]: initialize qnn profile successfully

[qnn_init, 2272]: load rpcmem lib successfully

[qnn_init, 2299]: initialize qnn context successfully

[qnn_init, 2302]: leave qni_init

[ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
[init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a5b40bcdc2f

[main, 395]: creating new tensors

[main, 396]: ggml_blck_size(f32) 1
[main, 397]: ggml_type_size(f32) 4
[main, 436]: creating backend buffer

[main, 448]: creating compute graph

[ggml_qnn_can_handle_op, 2458]: op name:MUL, tensor type:f32
[ggml_qnn_can_handle_op, 2460]: src0 type:f32
[ggml_qnn_can_handle_op, 2463]: src1 type:f32
[ggml_qnn_hanlde_op, 2993]: call ggml_qnn_hanlde_op

[ggml_qnn_hanlde_op, 2997]:        tensor_0: type = 0 (  f32)  ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_hanlde_op, 3001]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_hanlde_op, 3005]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_hanlde_op, 3006]: 4, 4, 1, 1
[ggml_qnn_hanlde_op, 3007]: tensor0 name tensor_0
[ggml_qnn_hanlde_op, 3008]: tensor1 name tensor_1
[ggml_qnn_hanlde_op, 3009]: tensor2 name tensor_2
[ggml_qnn_hanlde_op, 3033]: qnn graph name ggml_qnn_graph_MUL1tensor_0_tensor_1
[ggml_qnn_hanlde_op, 3034]: qnn op_config name ggml_qnn_op_config_MUL1tensor_0_tensor_1
[ggml_qnn_logcallback, 2165]:     17.7ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseMultiply 
[ggml_qnn_logcallback, 2165]:     17.8ms [VERBOSE] validate	Node-Type : ElementWiseMultiply	Node-Name : ggml_qnn_op_config_MUL1tensor_0_tensor_1 
[ggml_qnn_logcallback, 2165]:     18.0ms [  INFO ] CpuGraph::finalize 
[ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 1 
[ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 2 
[ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 3 
[ggml_qnn_logcallback, 2165]:     18.1ms [  INFO ] CpuGraph::execute 
[ggml_qnn_hanlde_op, 3134]: duration of ggml_qnn_MUL : 0 milliseconds

[ggml_qnn_hanlde_op, 3135]: call ggml_qnn_hanlde_op done

[get_tensor_rank, 210]: tensor->rank 4

[get_tensor_rank, 211]: get_tensor_rank 2

[get_tensor_data_size, 223]: get_tensor_data_size 64
[get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
[main, 464]: dump:

[tensor_dump, 191]: dump ggml tensor src0(tensor_0)
[tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:    -0.62     0.59    -0.34     0.40 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.81     0.33     0.52     0.01 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.37     0.43     0.97     0.06 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.28     0.09    -0.57    -0.02 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor src1(tensor_1)
[tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:     0.24    -0.57    -0.17     0.36 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.83    -0.64     0.23    -0.87 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.25    -0.31     0.55     0.64 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.42     0.42     0.96     0.88 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor dst(tensor_2)
[tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:    -0.15    -0.34     0.06     0.14 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.67    -0.21     0.12    -0.01 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.09    -0.13     0.53     0.04 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.12     0.04    -0.55    -0.01 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
[ggml_backend_qnn_free, 3764]: graph type:MUL
[qnn_finalize, 2318]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free

/data/local/tmp//libQnnCpu.so
QNN libs already exist on Android phone
ggml-qnn-test: 1 file pushed. 20.3 MB/s (4567168 bytes in 0.215s)
[main, 344]: enter qnn_ggml_op

[main, 345]: ggml op:23(MUL_MAT)
[main, 359]: Allocating Memory of size 33554432 bytes, 32 MB

[ggml_backend_qnn_init, 3955]: device 0
[ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
[qnn_init, 2172]: enter qni_init

[load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so

[load_system, 2082]: find a valid qnn system interface

[load_system, 2092]: initialize qnn system successfully

[qnn_init, 2180]: load QNN system lib successfully

[load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so

[load_backend, 1935]: num_providers=1

[load_backend, 1960]: find a valid qnn interface

[load_backend, 2005]: saver_initialize is null

[qnn_init, 2213]: initialize qnn log successfully

[qnn_init, 2224]: initialize qnn backend successfully

[qnn_init, 2230]: device property is not supported

[qnn_init, 2241]: create device successfully

[qnn_init, 2245]: profiling turned on; level = 2
[qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object

[qnn_init, 2262]: initialize qnn profile successfully

[qnn_init, 2272]: load rpcmem lib successfully

[qnn_init, 2299]: initialize qnn context successfully

[qnn_init, 2302]: leave qni_init

[ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
[init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a50a2049bcdc2f

[main, 395]: creating new tensors

[main, 396]: ggml_blck_size(f32) 1
[main, 397]: ggml_type_size(f32) 4
[main, 436]: creating backend buffer

[main, 448]: creating compute graph

[ggml_qnn_can_handle_op, 2458]: op name:MUL_MAT, tensor type:f32
[ggml_qnn_can_handle_op, 2460]: src0 type:f32
[ggml_qnn_can_handle_op, 2463]: src1 type:f32
[ggml_qnn_can_handle_op, 2467]: GGML_OP_MUL_MAT
[ggml_qnn_can_handle_op, 2472]: src0        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_can_handle_op, 2477]: src1        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_can_handle_op, 2483]:             tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_mul_mat, 2785]: call ggml_qnn_mul_mat

[ggml_qnn_mul_mat, 2789]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_mul_mat, 2793]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_mul_mat, 2797]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)

[ggml_qnn_mul_mat, 2798]: 4, 4, 1, 1
[ggml_qnn_mul_mat, 2799]: tensor0 name tensor_0
[ggml_qnn_mul_mat, 2800]: tensor1 name tensor_1
[ggml_qnn_mul_mat, 2801]: tensor2 name tensor_2
[ggml_qnn_mul_mat, 2828]: graph name ggml_op_qnn_mul_mat_1tensor_0_tensor_1
[ggml_qnn_logcallback, 2165]:     16.9ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : MatMul 
[ggml_qnn_logcallback, 2165]:     17.0ms [VERBOSE] validate	Node-Type : MatMul	Node-Name : ggml_op_mul_mat 
[ggml_qnn_logcallback, 2165]:     17.1ms [  INFO ] CpuGraph::finalize 
[ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 1 
[ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 2 
[ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 3 
[ggml_qnn_logcallback, 2165]:     17.2ms [  INFO ] CpuGraph::execute 
[ggml_qnn_mul_mat, 2927]: duration of ggml_qnn_mul_mat : 10 milliseconds

[ggml_qnn_mul_mat, 2928]: call ggml_qnn_mul_mat done

[get_tensor_rank, 210]: tensor->rank 4

[get_tensor_rank, 211]: get_tensor_rank 2

[get_tensor_data_size, 223]: get_tensor_data_size 64
[get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
[main, 464]: dump:

[tensor_dump, 191]: dump ggml tensor src0(tensor_0)
[tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:     0.05     0.68    -0.27    -0.28 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.47     0.77     0.41     0.14 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.69    -0.71    -0.81    -0.23 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.37     0.36    -0.26     0.61 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor src1(tensor_1)
[tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:    -0.48    -0.81    -0.61     0.53 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.04     0.87     0.64     0.17 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.22     0.94    -0.38    -0.78 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.97    -0.94    -0.35     0.94 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[tensor_dump, 191]: dump ggml tensor dst(tensor_2)
[tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_sum_elements, 151]:     0.97    -0.79    -0.47     0.98 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:    -0.33     0.24     0.56    -0.80 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.16    -0.20     0.95    -0.08 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 151]:     0.48     0.09    -0.20     0.80 
[tensor_sum_elements, 155]: 

[tensor_sum_elements, 185]: 

[tensor_dump, 198]: 

[ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
[ggml_backend_qnn_free, 3764]: graph type:ADD
[qnn_finalize, 2318]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free

QNN's RPC feature(which useful for QNN NPU(aka HTP/DSP) backend) was used in this PR and it works fine as expected.there are 2+GBytes ion memory could be used for offload ggml tensors in cgraph to NPU on Qualcomm Snapdragon 8 Gen 3 equipped Android phone.

This PR is a Minimum Viable PR style and functional PR in ggml community. it'll be great helpful for other community programmer/developer/AI expert to contribute codes/ideas to GGML QNN backend if this PR can be approved and merged to master branch. Together we might/should/could reach the final target: utilize the Hexagon NPU maximally with the highly well-designed/highly compact ggml machine learning framework. this is might be the exact GGML way in GGML community.

Todo

Qualcomm's QNN backend for GGML has some todo tasks to make this backend can be used in real commercial application:

Lack of implementation of other GGML-OPs using QNN API. I provide a GENERAL approach try to fix this problem in a standalone PR of refine ggml backend subsystem for mixed inference between CPU&GPU / CPU&NPU easily for ANY ggml backends(which the backend's ggml_backend_xxx_buffer_is_host return true) . this approach works as expected with whisper inference and llama inference in my personal ggml learning&study project.
Add more quantize data type supportive(AI expert should be here)
Peformance fine-tunning: the performance of the existing ggml qnn backend is weaker/poor then the original ggml because there are some sophisticated Qualcomm's dedicated technologies not used in this PR and the power of state-of-the-art Qualcomm's NPU(Hexagon Tensor Processor) was not utilized currently in this PR(I know the direction but limited by my knowledge of real/hardcore AI tech). The performance fine-tunning in ggml gnn-npu backend is a long-term task. the following is an example:

[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.16     0.85    -0.80    -0.25 
   -0.28     0.66     0.98     0.67 
   -0.15     0.78    -0.45    -0.50 
    0.92     0.31    -0.72    -0.46 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.53     0.86    -0.91    -0.27 
    0.62     0.35    -0.27     0.43 
    0.73     0.42    -0.81    -0.24 
    0.49     0.81    -0.88     0.64 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.69     1.70    -1.70    -0.52 
    0.34     1.02     0.71     1.10 
    0.58     1.19    -1.26    -0.74 
    1.41     1.12    -1.60     0.18 

[ggml_backend_qnn_free, 3286]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3288]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 3300]: graph type:ADD
[qnn_finalize, 1258]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3313]: leave ggml_backend_qnn_free
[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend QNN-NPU: 532 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test

[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.96     0.64     0.75     0.27 
   -0.10     0.59    -0.70     0.20 
    0.78     0.98    -0.46     0.33 
   -0.01     0.72     0.78     0.79 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.87     0.89     0.76     0.94 
    0.22    -0.88    -0.63     0.80 
   -0.32     0.16     0.53     0.53 
   -0.78     0.13    -0.04    -0.34 

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6330]: error = 0

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6333]: output matrix:
[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -1.83     1.53     1.52     1.20 
    0.12    -0.29    -1.33     1.00 
    0.45     1.14     0.07     0.86 
   -0.80     0.85     0.75     0.45 

[test-qnn-npu.cpp, qnn_finalize, 4886]: succeed to close rpcmem lib

[info, 161]: duration of qnn_nputest_2_ADD : 233 milliseconds
[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6357]: leave qnn_rpc_test

[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 

[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend ggml: 3 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test

How to verify QNN backend or participate in development activity of GGML QNN backend

I provide a dedicated Android command line program and scripts in this PR for purpose of UT on Android device.

The QNN SDK should be downloaded and installed accordingly from https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct . pls attention the latest QNN SDK provided/released by Qualcomm should be used.
use the dedicated program and scripts in directory tests/ggml-qnn to verify/validate QNN backend in command line mode on Qualcomm SoC equipped Android phone. This Android command line UT program has NO unnecessary external dependencies and easily to use, it is the recommended method to verify GGML QNN backend or further development activity of GGML QNN backend in project llama.cpp/ggml community:


 cd tests/ggml-qnn/
./ggml-qnn-ut-build-run.sh  -h              (show usage)
./ggml-qnn-ut-build-run.sh  help            (show usage)
./ggml-qnn-ut-build-run.sh  build           (build Android command line UT program)
./ggml-qnn-ut-build-run.sh  updateqnnlibs   (upload the latest QNN libs to Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  0  (run UT program and verfiy QNN CPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  1  (run UT program and verfiy QNN GPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  2  (run UT program and verfiy QNN NPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  3  (compare performance between QNN backend and original ggml on Android phone)

A suitable/qualified reviewer should/might be familiar with source code of ggml and Qualcomm QNN(Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or other Qualcomm's AI software stack; skillsets including real/hardcore AI tech is more better(add more quantize data type and implement more GGML OPs(or kernels) require the AI skillset) but is not an essential skillset in this PR. some notes for potential qualified reviewer:

Programming language detail is not the key-point in this PR also language detail is really important and I will handle it properly as much as possible(this PR follow the coding style in upstream llama.cpp strictly/as much as possible), pls do NOT spent too much time on these language details : such as code format, code align, variable name, function name, unused variable, unused function, compiler warning, C++ grammar/syntax in so-called modern C++11/14/17/20...).
PR should/could be submitted in upstream llama.cpp if the PR is for fix issues/bugs in upstream llama.cpp(this is the reason why familiar with source code of ggml is an essential prerequisite for a suitable reviewer).
Don't bring too much complex new features in this PR, a MVP(Minimum Viable PR) style PR might be accepted by the maintainers of ggml community.
Pls focus on the real keypoint in this PR:how to utilize the Hexagon NPU maximally with the highly well-designed/highly compact ggml machine learning framework.

Any GGML community programmer/developer/AI expert who interesting with the topic of GGML QNN backend can use/extend the dedicated Android command line program to verify GGML QNN backend, review are greatly welcomed and appreciated.

… Direct) backend

github-actions · 2024-04-24T10:58:13Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8677.33ms p(95)=20035.75ms fails=, finish reason: stop=492 truncated=48
Prompt processing (pp): avg=95.63tk/s p(95)=443.17tk/s
Token generation (tg): avg=47.46tk/s p(95)=47.64tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=qualcomm_qnn_backend_for_ggml commit=a98a4e999000105b81b472c7b36ff80131d68ef1

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 540 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1718010789 --> 1718011423
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 593.29, 593.29, 593.29, 593.29, 593.29, 747.27, 747.27, 747.27, 747.27, 747.27, 756.49, 756.49, 756.49, 756.49, 756.49, 776.61, 776.61, 776.61, 776.61, 776.61, 836.27, 836.27, 836.27, 836.27, 836.27, 841.05, 841.05, 841.05, 841.05, 841.05, 838.87, 838.87, 838.87, 838.87, 838.87, 859.25, 859.25, 859.25, 859.25, 859.25, 865.08, 865.08, 865.08, 865.08, 865.08, 858.89, 858.89, 858.89, 858.89, 858.89, 883.85, 883.85, 883.85, 883.85, 883.85, 891.89, 891.89, 891.89, 891.89, 891.89, 873.3, 873.3, 873.3, 873.3, 873.3, 893.13, 893.13, 893.13, 893.13, 893.13, 909.48, 909.48, 909.48, 909.48, 909.48, 910.98, 910.98, 910.98, 910.98, 910.98, 911.31, 911.31, 911.31, 911.31, 911.31, 910.69, 910.69, 910.69, 910.69, 910.69, 914.6, 914.6, 914.6, 914.6, 914.6, 928.83, 928.83, 928.83, 928.83, 928.83, 927.37, 927.37, 927.37, 927.37, 927.37, 921.49, 921.49, 921.49, 921.49, 921.49, 925.25, 925.25, 925.25, 925.25, 925.25, 928.15, 928.15, 928.15, 928.15, 928.15, 942.74, 942.74, 942.74, 942.74, 942.74, 924.43, 924.43, 924.43, 924.43, 924.43, 923.95, 923.95, 923.95, 923.95, 923.95, 915.03, 915.03, 915.03, 915.03, 915.03, 911.66, 911.66, 911.66, 911.66, 911.66, 909.5, 909.5, 909.5, 909.5, 909.5, 914.04, 914.04, 914.04, 914.04, 914.04, 911.98, 911.98, 911.98, 911.98, 911.98, 910.75, 910.75, 910.75, 910.75, 910.75, 916.72, 916.72, 916.72, 916.72, 916.72, 926.62, 926.62, 926.62, 926.62, 926.62, 924.55, 924.55, 924.55, 924.55, 924.55, 927.08, 927.08, 927.08, 927.08, 927.08, 921.68, 921.68, 921.68, 921.68, 921.68, 920.82, 920.82, 920.82, 920.82, 920.82, 921.7, 921.7, 921.7, 921.7, 921.7, 922.98, 922.98, 922.98, 922.98, 922.98, 930.8, 930.8, 930.8, 930.8, 930.8, 921.59, 921.59, 921.59, 921.59, 921.59, 897.51, 897.51, 897.51, 897.51, 897.51, 894.98, 894.98, 894.98, 894.98, 894.98, 893.03, 893.03, 893.03, 893.03, 893.03, 895.37, 895.37, 895.37, 895.37, 895.37, 897.77, 897.77, 897.77, 897.77, 897.77, 896.81, 896.81, 896.81, 896.81, 896.81, 899.61, 899.61, 899.61, 899.61, 899.61, 898.83, 898.83, 898.83, 898.83, 898.83, 901.17, 901.17, 901.17, 901.17, 901.17, 890.73, 890.73, 890.73, 890.73, 890.73, 888.87, 888.87, 888.87, 888.87, 888.87, 889.05, 889.05, 889.05, 889.05, 889.05, 889.17, 889.17, 889.17, 889.17, 889.17, 888.29, 888.29, 888.29, 888.29, 888.29, 887.41, 887.41, 887.41, 887.41, 887.41, 888.05, 888.05, 888.05, 888.05, 888.05, 888.97, 888.97, 888.97, 888.97, 888.97, 889.62, 889.62, 889.62]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 540 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1718010789 --> 1718011423
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 41.93, 41.93, 41.93, 41.93, 41.93, 35.06, 35.06, 35.06, 35.06, 35.06, 27.87, 27.87, 27.87, 27.87, 27.87, 30.27, 30.27, 30.27, 30.27, 30.27, 31.24, 31.24, 31.24, 31.24, 31.24, 31.49, 31.49, 31.49, 31.49, 31.49, 32.61, 32.61, 32.61, 32.61, 32.61, 33.52, 33.52, 33.52, 33.52, 33.52, 33.92, 33.92, 33.92, 33.92, 33.92, 34.15, 34.15, 34.15, 34.15, 34.15, 34.26, 34.26, 34.26, 34.26, 34.26, 33.88, 33.88, 33.88, 33.88, 33.88, 33.24, 33.24, 33.24, 33.24, 33.24, 33.26, 33.26, 33.26, 33.26, 33.26, 31.54, 31.54, 31.54, 31.54, 31.54, 31.03, 31.03, 31.03, 31.03, 31.03, 29.95, 29.95, 29.95, 29.95, 29.95, 29.72, 29.72, 29.72, 29.72, 29.72, 29.96, 29.96, 29.96, 29.96, 29.96, 29.84, 29.84, 29.84, 29.84, 29.84, 29.65, 29.65, 29.65, 29.65, 29.65, 29.74, 29.74, 29.74, 29.74, 29.74, 29.88, 29.88, 29.88, 29.88, 29.88, 30.09, 30.09, 30.09, 30.09, 30.09, 30.2, 30.2, 30.2, 30.2, 30.2, 30.22, 30.22, 30.22, 30.22, 30.22, 30.51, 30.51, 30.51, 30.51, 30.51, 30.47, 30.47, 30.47, 30.47, 30.47, 30.42, 30.42, 30.42, 30.42, 30.42, 30.68, 30.68, 30.68, 30.68, 30.68, 30.77, 30.77, 30.77, 30.77, 30.77, 30.87, 30.87, 30.87, 30.87, 30.87, 31.02, 31.02, 31.02, 31.02, 31.02, 31.2, 31.2, 31.2, 31.2, 31.2, 31.05, 31.05, 31.05, 31.05, 31.05, 31.03, 31.03, 31.03, 31.03, 31.03, 30.8, 30.8, 30.8, 30.8, 30.8, 30.35, 30.35, 30.35, 30.35, 30.35, 30.3, 30.3, 30.3, 30.3, 30.3, 30.55, 30.55, 30.55, 30.55, 30.55, 30.68, 30.68, 30.68, 30.68, 30.68, 30.79, 30.79, 30.79, 30.79, 30.79, 30.73, 30.73, 30.73, 30.73, 30.73, 30.24, 30.24, 30.24, 30.24, 30.24, 29.97, 29.97, 29.97, 29.97, 29.97, 29.37, 29.37, 29.37, 29.37, 29.37, 29.03, 29.03, 29.03, 29.03, 29.03, 29.04, 29.04, 29.04, 29.04, 29.04, 29.1, 29.1, 29.1, 29.1, 29.1, 29.14, 29.14, 29.14, 29.14, 29.14, 29.24, 29.24, 29.24, 29.24, 29.24, 29.27, 29.27, 29.27, 29.27, 29.27, 29.28, 29.28, 29.28, 29.28, 29.28, 29.11, 29.11, 29.11, 29.11, 29.11, 29.13, 29.13, 29.13, 29.13, 29.13, 29.15, 29.15, 29.15, 29.15, 29.15, 29.24, 29.24, 29.24, 29.24, 29.24, 29.36, 29.36, 29.36, 29.36, 29.36, 29.45, 29.45, 29.45, 29.45, 29.45, 29.56, 29.56, 29.56, 29.56, 29.56, 29.62, 29.62, 29.62]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 540 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1718010789 --> 1718011423
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19, 0.19, 0.19, 0.19, 0.19, 0.39, 0.39, 0.39, 0.39, 0.39, 0.22, 0.22, 0.22, 0.22, 0.22, 0.18, 0.18, 0.18, 0.18, 0.18, 0.24, 0.24, 0.24, 0.24, 0.24, 0.17, 0.17, 0.17, 0.17, 0.17, 0.1, 0.1, 0.1, 0.1, 0.1, 0.12, 0.12, 0.12, 0.12, 0.12, 0.09, 0.09, 0.09, 0.09, 0.09, 0.16, 0.16, 0.16, 0.16, 0.16, 0.22, 0.22, 0.22, 0.22, 0.22, 0.21, 0.21, 0.21, 0.21, 0.21, 0.26, 0.26, 0.26, 0.26, 0.26, 0.28, 0.28, 0.28, 0.28, 0.28, 0.36, 0.36, 0.36, 0.36, 0.36, 0.28, 0.28, 0.28, 0.28, 0.28, 0.29, 0.29, 0.29, 0.29, 0.29, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.22, 0.22, 0.22, 0.22, 0.22, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.38, 0.38, 0.38, 0.38, 0.38, 0.14, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.28, 0.28, 0.28, 0.28, 0.28, 0.12, 0.12, 0.12, 0.12, 0.12, 0.07, 0.07, 0.07, 0.07, 0.07, 0.17, 0.17, 0.17, 0.17, 0.17, 0.21, 0.21, 0.21, 0.21, 0.21, 0.16, 0.16, 0.16, 0.16, 0.16, 0.17, 0.17, 0.17, 0.17, 0.17, 0.31, 0.31, 0.31, 0.31, 0.31, 0.28, 0.28, 0.28, 0.28, 0.28, 0.4, 0.4, 0.4, 0.4, 0.4, 0.22, 0.22, 0.22, 0.22, 0.22, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.1, 0.1, 0.1, 0.1, 0.1, 0.32, 0.32, 0.32, 0.32, 0.32, 0.51, 0.51, 0.51, 0.51, 0.51, 0.61, 0.61, 0.61, 0.61, 0.61, 0.48, 0.48, 0.48, 0.48, 0.48, 0.27, 0.27, 0.27, 0.27, 0.27, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.22, 0.22, 0.22, 0.22, 0.22, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.21, 0.21, 0.21, 0.21, 0.21, 0.27, 0.27, 0.27, 0.27, 0.27, 0.08, 0.08, 0.08, 0.08, 0.08, 0.28, 0.28, 0.28, 0.28, 0.28, 0.15, 0.15, 0.15, 0.15, 0.15, 0.19, 0.19, 0.19, 0.19, 0.19, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17, 0.17, 0.17, 0.15, 0.15, 0.15]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 540 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1718010789 --> 1718011423
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0]

Dampfinchen · 2024-04-24T12:05:59Z

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

zhouwg · 2024-04-24T12:56:53Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 198 iterations 🚀

Expand details for performance related PR only

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

thanks for your comment. this PR is a very initial implementation and could/might/should be a good starting point of Qualcomm's QNN backend for GGML. it's better some domain technical experts from Qualcomm involved in this effort after it's accepted by community. I personally think this PR is also an example of GGML way: try crazy ideas, build wild demos, and push the edge of what’s possible.

another thing, a small and standalone Android example(or re-use the existing Android example in llama.cpp) is needed for purpose of facilitate community developers to participate in develop/verify QNN backend.

ggerganov · 2024-04-25T11:47:22Z

another thing, a small and standalone Android example(or re-use the existing Android example in llama.cpp) is needed for purpose of facilitate community developers to participate in develop/verify QNN backend.

Yes, it would be useful to have an example or instructions how to run this. In the meantime, simply setting up the test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators

zhouwg · 2024-04-25T11:55:08Z

another thing, a small and standalone Android example(or re-use the existing Android example in llama.cpp) is needed for purpose of facilitate community developers to participate in develop/verify QNN backend.

Yes, it would be useful to have an example or instructions how to run this. In the meantime, simply setting up the test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators

thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate QNN backend.

slaren · 2024-04-25T12:24:46Z

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

llama.cpp/ggml-backend.c

Line 411 in 5477041

GGML_CALL static void ggml_backend_registry_init(void) {

zhouwg · 2024-04-25T13:16:20Z

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

llama.cpp/ggml-backend.c

Line 411 in 5477041

GGML_CALL static void ggml_backend_registry_init(void) {

thanks for your help, it's really helpful. I'm working on adapt to test-backend-ops.cpp with QNN backend on Android.

zhouwg · 2024-04-25T15:29:39Z

@ggerganov, @slaren, I'm sorry to interrupt you. adapt to test-backend-ops.cpp using QNN backend already done and it works fine as expected on Xiaomi14(Qualcomm SM8650-AB Snapdragon 8 Gen 3).

Could you take a moment to look at it? thanks.

BTW, the design and implementation of test-backend-ops.cpp is really excellent. I never noticed this file/feature before.

BTW, should the README-qnn.md be removed?

tests/test-backend-ops.cpp

zhouwg

this review comment is very useful and I had been modified codes accordingly.
thanks too much.

tests/test-backend-ops.cpp

chraac · 2024-06-17T04:21:31Z

ggml-qnn.cpp

+    qnn_instance *    instance          = nullptr;
+    std::string       graph_name        = "ggml_op_qnn_add";
+    Qnn_GraphHandle_t graph_handle      = nullptr;
+    Qnn_Tensor_t *    tensor_0          = nullptr;


Created a PR on your fork, to simpilify the binding from Qnn_Tensor_t to ggml_tensor, please have look if have time: zhouwg#2

chraac · 2024-06-17T04:40:59Z

ggml-qnn.cpp

+ * mul_mat_f16_f32: src0 is F16 and src1 is F32.
+ * mul_mat_q_f32:   src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
+ */
+static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,


Also find a maybe bug on this branch when trying to do mulmat with gpu backend on my 8 Gen2 phone, commandline:
ggml-qnn-ut -t GGML_OP_MUL_MAT -b 1

As you can see it generate a wrong dst matrix.

When running with cpu backend, the result is correct:

looks the graphExecute failed with error 6004. maybe we can use it to find the root cause here

to reproduce, you could use my patch to constant initialize the test tensor:

llama.cpp-5e18cdc-init the test array with const values.patch

just change the tensor init in the unit test so that we can reproduce it more easily

myan-o · 2024-06-18T06:18:43Z

problem 1

i'm tred build in termux.
Can't you change the path of /data/local/tmp?
The Skel.so path cannot be changed in NPU and loading fails.

problem 2

qnnsdk cannot be obtained without an account.
In other words, it cannot be built using termux alone.

slaren · 2024-06-20T22:59:17Z

ggml-qnn.cpp

+GGML_CALL static bool ggml_backend_qnn_offload_op(ggml_backend_t backend,const ggml_tensor * tensor) {
+    ggml_backend_qnn_context * ctx = (ggml_backend_qnn_context *) backend->context;
+
+    return ggml_qnn_compute_forward(ctx, nullptr, (ggml_tensor *) tensor);
+}


This function only needs to return true or false, but it must not execute the operation. The purpose of this function is to determine if an operation should be executed in this backend, even if it would require copying weights to the backend memory. As it is, this will either prevent the backend from working entirely, or it will cause many operations to be run twice.

Yeah, actually I've tried to make some improvement regarding those comments on my branch, also created a PR in this fork and ask for review several days before, but looks there's no responding on the original author so far.
Will spend some time on my fork next few weeks, gonna add more operators then.

ao-zz · 2024-06-28T10:18:59Z

Sad to see such a great PR being blocked by endless complaints.

@slaren @chraac Can't we just focus on the correctness of this groundbreaking PR? If this PR could make a correct result, then there is NO reason to block it. All other problems should be discussed in new PR or issues.

I hold this is the very first step we need to take. Then users around the world could have a chance to engage it and improve for the efficiency and other things you guys worried.

chraac · 2024-06-28T10:56:59Z

@slaren @chraac Can't we just focus on the correctness of this groundbreaking PR? If this PR could make a correct result, then there is NO reason to block it. All other problems should be discussed in new PR or issues.

Hi @ao-zz, thank you for your comment, for the PR not getting approval, as i said before:

MVP (Minimum Viable Pull Request) is good. however, thought we first need to establish a consensus with the community on the criteria that define a PR as viable.

And as you can see in my pervious post, there're some work we should do before merge, from my point of view:

Add more tensor ops to support to load a model (currently only have add and matmul implemented).
Some bug on gpu backend should be resolved (can check out my comment above).

Also have created a small refactoring PR on this fork, you can have a look.

hans00 · 2024-07-02T08:43:03Z

qnnsdk cannot be obtained without an account.

currently is public for everyone.
https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.22.6.240515.zip?query=aiesdk (copy from official page

Ther-nullptr · 2024-07-07T13:21:22Z

This work has finished the SnapDragon CPU/GPU/NPU overhead:
https://arxiv.org/abs/2406.06282

ao-zz · 2024-07-08T01:15:06Z

This work has finished the SnapDragon CPU/GPU/NPU overhead: https://arxiv.org/abs/2406.06282

code not yet released, as you asked last week

chraac · 2024-07-15T16:00:40Z

Hi, it looks like this PR has been inactive for a while. I've made some changes on my local fork based on this PR, including:

Splited ggml-qnn.cpp into several separate files for better modularity.
Supported more operations, include: GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV, GGML_OP_SQRT and GGML_OP_LOG.
Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.
Fixed the GPU backend error mentioned above (6869#discussion_r1642152991)

For anyone interested in this PR, please take a look at my fork. Comments and feedback are appreciated!
Also, I will spend more time continuing iteration on my fork, aiming to support more operators and run on more platforms other than Android.

slaren · 2024-07-15T16:15:53Z

Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.

myan-o · 2024-07-16T02:21:34Z

please support in termux.

chraac · 2024-07-16T04:45:58Z

Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.

After a brief review of the test-backend-ops, thought remove the QNN unit tests and reuse the test-backend-ops instead. I'll work on implementing this change in the next iteration.

chraac · 2024-07-16T04:50:02Z

problem 1

i'm tred build in termux. Can't you change the path of /data/local/tmp? The Skel.so path cannot be changed in NPU and loading fails.

problem 2

qnnsdk cannot be obtained without an account. In other words, it cannot be built using termux alone.

Hi @myan-o ,

For problem 1, I propose implementing a CMake parameter that allows users to customize the default path for dependent libraries.

For 2, maybe you can refer to this comment:

qnnsdk cannot be obtained without an account.

currently is public for everyone. https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.22.6.240515.zip?query=aiesdk (copy from official page

yeonseok-zeticai · 2024-07-16T08:10:00Z

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

chraac · 2024-07-16T08:21:30Z

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

Hi @yeonseok-zeticai, this branch has been inactive for some time. In recent weeks, I've undertaken some refactoring on my own branch. If you're interested, please have a look. My branch is also based on this PR.:
https://github.com/chraac/llama.cpp/tree/dev-refactoring

yeonseok-zeticai · 2024-07-16T08:59:48Z

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

chraac · 2024-07-16T09:17:30Z

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.

My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

zhouwg · 2024-07-18T08:17:32Z

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

Thanks so much and thanks to a real QNN expert is coming.

We can see that ggml-rpc is came from rgerganov, ggml-sycl is came from regular employee of Intel, ggml-cann is came from regular employee of Huawei.
As I said in the very beginning of this PoC(PoC: Add Qualcomm mobile SoC native backend for GGML zhouwg/kantv#121), this work(add Qualcomm QNN backend for ggml) should/could/might be initiated/done by permanent/regular employee from Qualcomm but not be initiated/done by a standalone/independent programmer.
I'm a streaming media expert and good at Linux/Android system software development but an AI beginner/know nothing about real/hard-core AI tech.

This PR can not be accepted by the maintainer/author of ggml backend subsystem although I begged for PR approval again and again before 06/15/2024. I understand this decision according to above reasons. but, I'm not sure whether there is double-standard in PR approval consideration although I has sincerely thanks for the help from the maintainer/author of ggml backend subsystem and has fully/100% positive opinion for this great/compact/awesome/excellent/high-performance device-side AI inference framework:

#7844:

One more thing, I feel a little disappointment on 06/15/2024 that the maintainers of this great opensource AI project can't understand what GFW brings to the programmers/developers in mainland China and has some misunderstandings about what I think about it although I really/100% love my country:

zhouwg · 2024-07-18T08:50:45Z

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.

My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts of this PR(BTW, I had read source codes in your personal/forked llama.cpp although I think put everything relative things in one single source file might be more better idea).

Your PR in my personal/forked llama.cpp is not make sense:

That's the reason why your PR in my personal/forked llama.cpp was not merged to my personal/forked llama.cpp.thanks for your understanding.

zhouwg · 2024-07-18T09:03:35Z

Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.

Your test-backend-ops.cpp is good and highly-designed but not good/robust/easy-understanding enough and there is unknown issue for the ggml-qnn.cpp. that's the reason why I provide a standalone/easy-understanding UT(some codes borrows from your test-backend-ops.cpp) for ggml-qnn.cpp.thanks for your understanding.

chraac · 2024-07-18T09:10:14Z

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.
My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts of this PR(I had read source codes in your personal/forked llama.cpp).

Your PR in my personal/forked llama.cpp is not make sense:

That's the reason why your PR in my personal/forked llama.cpp was not merged to my personal/forked llama.cpp.thanks for your understanding.

As i said before,

4000+ lines of code in single file is unreviewable and unmaintainable, so I've split them into separated files.
see so many duplicated source code in your original branch, the object bounding of qnn tensor and ggml tensor is vague, that may leads to more error in feature development, that's why i split them into files and also added some object to better manage them (My PR).
ran your test show no error on result as i mention above, so I thought its able to replace your old implementation
what do you mean by 'doesn't make sense', if there're some misunderstanding, you can leave comment on my PR, then we can have further discussion.

also you can have a look on my new refactoring branch, next will utilize the existing test-backend-ops and remove your unit test.

zhouwg · 2024-07-18T09:25:45Z

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.
My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts of this PR(I had read source codes in your personal/forked llama.cpp).
Your PR in my personal/forked llama.cpp is not make sense:
That's the reason why your PR in my personal/forked llama.cpp was not merged to my personal/forked llama.cpp.thanks for your understanding.

As i said before,

4000+ lines of code in single file is unreviewable and unmaintainable, so I've split them into separated files.

I do not want to argue this opinion with you again and pls see my opinion in this PR although I feel a little surprise of/thanks for your continued efforts(which I personally think it's exactly same to this PR but with more/advanced C++ language grammars) of this PR.

see so many duplicated source code in your original branch, the object bounding of qnn tensor and ggml tensor is vague, that may leads to more error in feature development, that's why i split them into files and also added some object to better manage them (My PR).

ran your test show no error on result as i mention above, so I thought its able to replace your old implementation

what do you mean by 'doesn't make sense', if there're some misunderstanding, you can leave comment on my PR, then we can have further discussion.

I'm sorry for that because I feel great disappointment and has no positive attention for this PR after 06/15/2024.

also you can have a look on my new refactoring branch, next will utilize the existing test-backend-ops and remove your unit test.

chraac · 2024-07-18T09:44:19Z

I'm sorry for that because I feel great disappointment and has no positive attention for this PR after 06/15/2024.

No worries, your effort on adding the QNN backend won't be wasted. You've done excellent work. I'll continue iterating on my branch, and as this backend garners more attention, we're hopeful it can be integrated into the upstream project in the future.

ggerganov · 2024-07-19T08:22:51Z

I'm not sure whether there is double-standard in PR approval consideration

@zhouwg Such comments are completely inappropriate. As I already mentioned in #6210 (comment), this will not be tolerated. Therefore I’ve decided to block you from the projects.

ggml: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine…

b0c3013

… Direct) backend

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 59e42f8 to b0c3013 Compare April 24, 2024 10:26

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 5 times, most recently from 5abb2e4 to 7a420e1 Compare April 25, 2024 08:11

zhouwg changed the title ~~ggml: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend~~ ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend Apr 25, 2024

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 95a980a to b0c3013 Compare April 25, 2024 09:03

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from eff9669 to 180ab5f Compare April 25, 2024 15:47

slaren reviewed Apr 26, 2024

View reviewed changes

tests/test-backend-ops.cpp Outdated Show resolved Hide resolved

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from 992cf05 to 67beeb6 Compare April 26, 2024 02:12

zhouwg commented Apr 26, 2024

View reviewed changes

tests/test-backend-ops.cpp Outdated Show resolved Hide resolved

zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch from 8240376 to f20e281 Compare April 26, 2024 03:19

review: make a MVP(Minimum Viable PR) style PR in upstream

5598fbd

chraac mentioned this pull request Jun 17, 2024

Refactoring: add helper class to bind qnn tensor -> ggml tensor zhouwg/llama.cpp#2

Open

4 tasks

chraac reviewed Jun 17, 2024

View reviewed changes

slaren reviewed Jun 20, 2024

View reviewed changes

AndreasKunar mentioned this pull request Jul 17, 2024

Improvements for running on Windows with Snapdragon X #8531

Merged

3 tasks

ggerganov closed this Jul 19, 2024

Repository owner locked and limited conversation to collaborators Jul 19, 2024

ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869

ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869

Conversation

zhouwg commented Apr 24, 2024 • edited Loading

Purpose

Status

Todo

How to verify QNN backend or participate in development activity of GGML QNN backend

github-actions bot commented Apr 24, 2024 • edited Loading

Dampfinchen commented Apr 24, 2024

zhouwg commented Apr 24, 2024 • edited Loading

ggerganov commented Apr 25, 2024

zhouwg commented Apr 25, 2024 • edited Loading

slaren commented Apr 25, 2024

zhouwg commented Apr 25, 2024 • edited Loading

zhouwg commented Apr 25, 2024 • edited Loading

zhouwg left a comment

Choose a reason for hiding this comment

chraac Jun 17, 2024

Choose a reason for hiding this comment

chraac Jun 17, 2024 • edited Loading

Choose a reason for hiding this comment

chraac Jun 17, 2024 • edited Loading

Choose a reason for hiding this comment

chraac Jun 17, 2024 • edited Loading

Choose a reason for hiding this comment

myan-o commented Jun 18, 2024 • edited Loading

problem 1

problem 2

slaren Jun 20, 2024

Choose a reason for hiding this comment

chraac Jun 24, 2024 • edited Loading

Choose a reason for hiding this comment

ao-zz commented Jun 28, 2024

chraac commented Jun 28, 2024 • edited Loading

hans00 commented Jul 2, 2024

Ther-nullptr commented Jul 7, 2024

ao-zz commented Jul 8, 2024

chraac commented Jul 15, 2024 • edited Loading

slaren commented Jul 15, 2024

myan-o commented Jul 16, 2024

chraac commented Jul 16, 2024

chraac commented Jul 16, 2024 • edited Loading

problem 1

problem 2

yeonseok-zeticai commented Jul 16, 2024

chraac commented Jul 16, 2024

yeonseok-zeticai commented Jul 16, 2024

chraac commented Jul 16, 2024 • edited Loading

zhouwg commented Jul 18, 2024 • edited Loading

zhouwg commented Jul 18, 2024 • edited Loading

zhouwg commented Jul 18, 2024 • edited Loading

chraac commented Jul 18, 2024 • edited Loading

zhouwg commented Jul 18, 2024 • edited Loading

chraac commented Jul 18, 2024 • edited Loading

ggerganov commented Jul 19, 2024

zhouwg commented Apr 24, 2024 •

edited

Loading

github-actions bot commented Apr 24, 2024 •

edited

Loading

zhouwg commented Apr 24, 2024 •

edited

Loading

zhouwg commented Apr 25, 2024 •

edited

Loading

zhouwg commented Apr 25, 2024 •

edited

Loading

zhouwg commented Apr 25, 2024 •

edited

Loading

chraac Jun 17, 2024 •

edited

Loading

chraac Jun 17, 2024 •

edited

Loading

chraac Jun 17, 2024 •

edited

Loading

myan-o commented Jun 18, 2024 •

edited

Loading

chraac Jun 24, 2024 •

edited

Loading

chraac commented Jun 28, 2024 •

edited

Loading

chraac commented Jul 15, 2024 •

edited

Loading

chraac commented Jul 16, 2024 •

edited

Loading

chraac commented Jul 16, 2024 •

edited

Loading

zhouwg commented Jul 18, 2024 •

edited

Loading

zhouwg commented Jul 18, 2024 •

edited

Loading

zhouwg commented Jul 18, 2024 •

edited

Loading

chraac commented Jul 18, 2024 •

edited

Loading

zhouwg commented Jul 18, 2024 •

edited

Loading

chraac commented Jul 18, 2024 •

edited

Loading