server stops processing requests after the empty prompt #5246

Closed
z80maniac opened this issue Jan 31, 2024 · 2 comments
Comments

@z80maniac
Contributor

Current Behavior

After passing an empty prompt to the server, it stops processing any further requests.

Environment and Context

Commit: d3bac7d

OS: Kubuntu 23.10

❯ lscpu | grep -P 'Model name|Flags'

Model name: AMD Ryzen 9 7900 12-Core Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d

❯ uname -a
Linux comp 6.5.0-15-generic #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan  9 17:03:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
❯ make --version | head -1
GNU Make 4.3
❯ g++ --version | head -1
g++ (Ubuntu 13.2.0-4ubuntu3) 13.2.0

Steps to Reproduce

  1. I used this model:
    https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/blob/main/llama-2-13b-chat.Q4_K_M.gguf

  2. The server is built with plain make, no other parameters.

  3. Start the server:

./server -m /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf
Startup log:
{"timestamp":1706720633,"level":"INFO","function":"main","line":2427,"message":"build info","build":2036,"commit":"d3bac7d5"}
{"timestamp":1706720633,"level":"INFO","function":"main","line":2430,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1706720633,"level":"INFO","function":"main","line":2534,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size =  7500.85 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    11.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    81.40 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1706720634,"level":"INFO","function":"main","line":2555,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
  4. Call the API without specifying the prompt:
curl --data '{"n_predict": 0}' http://127.0.0.1:8080/completion

It completes OK. The server output:

slot 0 is processing [task id: 0]

print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
print_timings:        eval time =       0.00 ms /     0 runs   (    -nan ms per token,     -nan tokens per second)
print_timings:       total time =       0.00 ms
{"timestamp":1706720709,"level":"INFO","function":"log_server_request","line":2368,"message":"request","remote_addr":"127.0.0.1","remote_port":37752,"status":200,"method":"POST","path":"/completion","params":{"{\"n_predict\": 0}":""}}
  5. Call the API again, with a prompt (or without; it doesn't matter):
curl --data '{"n_predict": 8, "prompt": "This is"}' http://127.0.0.1:8080/completion

The server does not respond and no logs are produced.
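
For reference, both requests back to back, as I run them (--max-time is added here only so the second request fails visibly instead of blocking the terminal; it was not part of the original reproduction):

# first request: empty prompt, returns 200
curl --data '{"n_predict": 0}' http://127.0.0.1:8080/completion
# second request: never answered, curl times out and the server logs nothing
curl --max-time 10 --data '{"n_predict": 8, "prompt": "This is"}' http://127.0.0.1:8080/completion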

Additional info

  1. git bisect showed that the offending commit is 48c857a (a rough sketch of the bisect procedure is included after this list).

  2. I used the {"n_predict": 0} trick to get the current context size from the server without clearing the current cache (see the probe sketch after this list). Ideally, though, there should be an API endpoint that returns this info (/props maybe).

  3. The docs don't say anything about an empty prompt, but I guess with n_predict: 0 it should be allowed (the server does handle it correctly for the first request). At the very least, it shouldn't block the entire server forever.
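
A rough sketch of the bisect procedure mentioned in item 1 (the "good" commit placeholder is illustrative; any commit known to keep responding after an empty prompt works):

git bisect start
git bisect bad HEAD              # current build hangs after the empty-prompt request
git bisect good <older-commit>   # a build that still responds afterwards
# at each step: make, start ./server, run the two curl requests above,
# then mark the commit:
git bisect good                  # or: git bisect bad
# repeat until git reports the first bad commit (48c857a in this case)
git bisect reset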

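The probe from item 2, as I use it (treat the generation_settings.n_ctx field name as an assumption about this build's response shape and verify it against your version):

curl -s --data '{"n_predict": 0}' http://127.0.0.1:8080/completion | jq '.generation_settings.n_ctx'
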
github-actions bot commented:

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
@z80maniac
Contributor Author

No longer reproducible.
