
Adreno gpu run crash #4973

Closed
java63940 opened this issue Jan 16, 2024 · 9 comments

Comments

@java63940

Hello everyone,
I followed this page to compile llama.cpp on Termux: #2169
When I run a Qwen 1.8B model on a Snapdragon 8 Gen 3 device with -ngl specified, the program crashes.
The full log is:
~/.../llama.cpp/build-gpu $ GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./bin/main -m ../../../1.8b-ggml-model-q4_0.gguf -p 'I am a boy' -ngl 1
Log start
main: build = 1882 (a0b3ac8)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed = 1705405161
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: device FP16 support: true
llama_model_loader: loaded meta data with 19 key-value pairs and 195 tensors from ../../../1.8b-ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen
llama_model_loader: - kv 1: general.name str = Qwen
llama_model_loader: - kv 2: qwen.context_length u32 = 8192
llama_model_loader: - kv 3: qwen.block_count u32 = 24
llama_model_loader: - kv 4: qwen.embedding_length u32 = 2048
llama_model_loader: - kv 5: qwen.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: qwen.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 7: qwen.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: qwen.attention.head_count u32 = 16
llama_model_loader: - kv 9: qwen.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 14: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 16: tokenizer.ggml.unknown_token_id u32 = 151643
llama_model_loader: - kv 17: general.quantization_version u32 = 2
llama_model_loader: - kv 18: general.file_type u32 = 2
llama_model_loader: - type f32: 73 tensors
llama_model_loader: - type q4_0: 121 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.84 B
llm_load_print_meta: model size = 1.04 GiB (4.85 BPW)
llm_load_print_meta: general.name = Qwen
llm_load_print_meta: BOS token = 151643 '[PAD151643]'
llm_load_print_meta: EOS token = 151643 '[PAD151643]'
llm_load_print_meta: UNK token = 151643 '[PAD151643]'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors: CPU buffer size = 1062.67 MiB
llm_load_tensors: OpenCL buffer size = 27.18 MiB
...............................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 96.00 MiB
llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model: CPU compute buffer size = 300.75 MiB
Segmentation fault


@ggerganov
Owner

Can you provide a stack trace?

@java63940
Author

> Can you provide a stack trace?

Thread 6 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x1557 (LWP 5463)]
0x0000005555831f20 in ggml_vec_add_f32 (n=6144, z=<optimized out>, x=<optimized out>, y=<optimized out>) at /data/data/com.termux/files/home/code/llama.cpp/ggml.c:1162
1162 inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] + y[i]; }
(gdb) bt
#0 0x0000005555831f20 in ggml_vec_add_f32 (n=6144, z=<optimized out>, x=<optimized out>, y=<optimized out>) at /data/data/com.termux/files/home/code/llama.cpp/ggml.c:1162
#1 ggml_compute_forward_add_f32 (params=<optimized out>, src0=<optimized out>, src1=<optimized out>, dst=<optimized out>) at /data/data/com.termux/files/home/code/llama.cpp/ggml.c:7206
#2 ggml_compute_forward_add (params=<optimized out>, src0=<optimized out>, src1=<optimized out>, dst=<optimized out>) at /data/data/com.termux/files/home/code/llama.cpp/ggml.c:7452
#3 0x0000005555824214 in ggml_graph_compute_thread (data=0x7fffffca38) at /data/data/com.termux/files/home/code/llama.cpp/ggml.c:16654
#4 0x0000007ff457b138 in __pthread_start(void*) () from /apex/com.android.runtime/lib64/bionic/libc.so
#5 0x0000007ff4514ae8 in __start_thread () from /apex/com.android.runtime/lib64/bionic/libc.so
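
For reference, a trace like this can be captured under Termux roughly as follows (assuming gdb is installed, e.g. pkg install gdb, and the binary was built with debug symbols, e.g. -DCMAKE_BUILD_TYPE=RelWithDebInfo):

GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 gdb --args ./bin/main -m ../../../1.8b-ggml-model-q4_0.gguf -p 'I am a boy' -ngl 1
(gdb) run
(gdb) bt        # once the SIGSEGV is reported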

@java63940
Author

> Can you provide a stack trace?

Thank you for your reply.
I tried another device, with a Snapdragon 778G Plus chip, and it also crashes there. Can you give some advice? Thanks.

@nonetrix

nonetrix commented Jan 19, 2024

I'm having this too, but on a Google Tensor chip, which is Mali based, so perhaps this isn't specific to Adreno.
[screenshot attached: Screenshot_20240118-210746]

Log start
main: build = 1917 (57e2a7a)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed  = 1705633757
ggml_opencl: selecting platform: 'ARM Platform'
ggml_opencl: selecting device: 'Mali-G78 r0p1'
ggml_opencl: device FP16 support: true
llama_model_loader: loaded meta data with 22 key-value pairs and 325 tensors from ./models/dolphin-2_6-phi-2.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50295
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 50256
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {{ bos_token }}{%- set ns = namespace...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  195 tensors
llama_model_loader: - type q8_0:  130 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 2.75 GiB (8.51 BPW)
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50295 '<|im_end|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: PAD token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:     OpenCL buffer size =  2686.46 MiB
.............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model:        CPU compute buffer size =   105.00 MiB
Segmentation fault

@ggerganov
Owner

Yes, I suspect it's related to OpenCL

@slaren
Collaborator

slaren commented Jan 19, 2024

I believe that what is happening is that the bias weights are being offloaded to OpenCL, but then the OpenCL backend is not able to use these weights because it does not implement the ggml_add operation. So the operation is run on the CPU backend, but these weights are in GPU memory, so it crashes when it tries to access them.
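
A minimal sketch of the placement rule that diagnosis implies (illustrative C, not llama.cpp's actual offloading code): a weight should only be moved to device memory if the device backend implements every operation that reads it; otherwise a CPU fallback ends up dereferencing memory it cannot access, which is what ggml_vec_add_f32 hits in the backtrace above.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum device { DEV_CPU, DEV_GPU };

/* Hypothetical capability check for this sketch: per the comment above, the
 * OpenCL backend lacks ggml_add; matrix multiplication is assumed supported. */
static bool gpu_supports(const char *op_name) {
    return strcmp(op_name, "mul_mat") == 0;
}

/* Keep a weight on the host unless every op that consumes it can run on the
 * GPU, so a CPU fallback never has to read device memory. */
static enum device place_weight(const char **consumer_ops, size_t n_ops) {
    for (size_t i = 0; i < n_ops; ++i) {
        if (!gpu_supports(consumer_ops[i])) {
            return DEV_CPU;
        }
    }
    return DEV_GPU;
}

int main(void) {
    const char *bias_ops[]   = { "add" };     /* the bias is consumed by an add */
    const char *matrix_ops[] = { "mul_mat" }; /* the weight matrix by a matmul  */

    printf("bias   -> %s\n", place_weight(bias_ops,   1) == DEV_GPU ? "GPU" : "CPU");
    printf("matrix -> %s\n", place_weight(matrix_ops, 1) == DEV_GPU ? "GPU" : "CPU");
    return 0;
}

Under a rule like this the bias tensors would stay in host memory and the CPU add would keep working; the crash reported here is consistent with the offload path not making that distinction for the bias weights.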

@nonetrix

nonetrix commented Feb 2, 2024

When can this be fixed then?

Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024