
Bug: llama-box crashes when setting --ctx-size to 0 #21

@n00b001

Description


Hello

Summary of issue:
I've made a Python program that has quite a lot of complexity, and from time to time I see llama-box crash. I'm trying to narrow down the reason why this happens, and this may be it.

Expectation:
When I set "-c" to "0", I expect the model's full trained context window to be used.
When a prompt larger than the context window is sent, I expect it to be truncated to the context window size, with 'older' tokens discarded (unless "--no-context-shift" is used).

What actually happens:
I set "-c" to "0", and send a large prompt, llama-box crashes

System specs:

  • 64 GB RAM
  • RTX 4090
  • 7950x3D
  • llama-box.exe --version: v0.0.103 (568736f)
  • llama-box.exe --version: vendor : llama.cpp b56f079e (4418), stable-diffusion.cpp 01fec2a (197)

When this happens, CPU/RAM/VRAM usage all look fine, so it doesn't appear to be an out-of-memory (OOM) error.

Here is the minimum needed to reproduce the issue:

I am running llama-box with this command:
llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m "models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf" --mmproj "models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf"

model: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf
mmproj: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf

I am sending the server this request:
curl http://localhost:8082/v1/chat/completions -H "Content-Type: application/json" -d "@lots_of_ones.txt"

And the file 'lots_of_ones.txt' contains 1,638,400 occurrences of the character '1' (along with a little JSON):
{"model": "hermes2", "messages": [{"role":"user", "content": "1[...]1"}]}

Output from llama-box when it crashes:

0.00.024.794 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.024.799 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.024.799 I ggml_cuda_init: found 1 CUDA devices:
0.00.024.808 I   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
0.00.025.977 I
0.00.025.991 I arguments  : .\llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf --mmproj models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf --temp 0
0.00.025.992 I version    : v0.0.103 (568736f)
0.00.025.992 I compiler   : unknown
0.00.025.992 I target     : unknown
0.00.025.993 I vendor     : llama.cpp b56f079e (4418), stable-diffusion.cpp 01fec2a (197)
0.00.026.017 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.00.026.018 I
0.00.026.815 I srv                      main: listening, hostname = 0.0.0.0, port = 8082, n_threads = 4 + 2
0.00.038.864 I srv                      main: loading model
0.00.038.872 I srv                load_model: loading model 'models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf'
0.00.038.887 W srv                load_model: n_ctx is too small for multimodal projection, setting to 2048
0.00.039.576 I clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf
0.00.039.582 I clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.039.586 I clip_model_load: - kv   0:                       general.architecture str              = clip
0.00.039.587 I clip_model_load: - kv   1:                        general.description str              = image encoder for Qwen2VL
0.00.039.590 I clip_model_load: - kv   2:                          general.file_type u32              = 1
0.00.039.590 I clip_model_load: - kv   3:                      clip.has_text_encoder bool             = false
0.00.039.591 I clip_model_load: - kv   4:                    clip.has_vision_encoder bool             = true
0.00.039.591 I clip_model_load: - kv   5:                    clip.has_qwen2vl_merger bool             = true
0.00.039.592 I clip_model_load: - kv   6:                        clip.projector_type str              = qwen2vl_merger
0.00.039.592 I clip_model_load: - kv   7:                              clip.use_silu bool             = false
0.00.039.592 I clip_model_load: - kv   8:                              clip.use_gelu bool             = false
0.00.039.593 I clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
0.00.039.593 I clip_model_load: - kv  10:                     clip.vision.image_size u32              = 560
0.00.039.593 I clip_model_load: - kv  11:               clip.vision.embedding_length u32              = 1280
0.00.039.594 I clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 3584
0.00.039.594 I clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
0.00.039.605 I clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
0.00.039.605 I clip_model_load: - kv  15:                    clip.vision.block_count u32              = 32
0.00.039.606 I clip_model_load: - kv  16:            clip.vision.feed_forward_length u32              = 0
0.00.039.607 I clip_model_load: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct-abliterated
0.00.039.617 I clip_model_load: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
0.00.039.620 I clip_model_load: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
0.00.039.620 I clip_model_load: - type  f32:  325 tensors
0.00.039.620 I clip_model_load: - type  f16:  196 tensors
0.00.040.113 I clip_model_load: CLIP using CUDA backend
0.00.040.114 W clip_model_load: Main model doesn't offload, fallback to CPU backend
0.00.040.116 I clip_model_load: params backend buffer size =  1289.95 MB (521 tensors)
0.00.814.700 E key clip.vision.image_grid_pinpoints not found in file
0.00.814.751 E key clip.vision.mm_patch_merge_type not found in file
0.00.814.757 E key clip.vision.image_crop_resolution not found in file
0.00.815.734 I clip_model_load: compute allocated memory: 198.93 MiB
0.00.881.376 I llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
0.00.902.013 I llama_model_loader: loaded meta data with 38 key-value pairs and 339 tensors from models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf (version GGUF V3 (latest))
0.00.902.023 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.902.025 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
0.00.902.025 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.902.026 I llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct Abliterated
0.00.902.027 I llama_model_loader: - kv   3:                           general.finetune str              = Instruct-abliterated
0.00.902.027 I llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
0.00.902.028 I llama_model_loader: - kv   5:                         general.size_label str              = 7B
0.00.902.028 I llama_model_loader: - kv   6:                            general.license str              = apache-2.0
0.00.902.029 I llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/huihui-ai/Qwen...
0.00.902.030 I llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
0.00.902.030 I llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2 VL 7B Instruct
0.00.902.031 I llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
0.00.902.032 I llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-...
0.00.902.041 I llama_model_loader: - kv  12:                               general.tags arr[str,4]       = ["chat", "abliterated", "uncensored",...
0.00.902.042 I llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
0.00.902.043 I llama_model_loader: - kv  14:                        qwen2vl.block_count u32              = 28
0.00.902.043 I llama_model_loader: - kv  15:                     qwen2vl.context_length u32              = 32768
0.00.902.044 I llama_model_loader: - kv  16:                   qwen2vl.embedding_length u32              = 3584
0.00.902.044 I llama_model_loader: - kv  17:                qwen2vl.feed_forward_length u32              = 18944
0.00.902.044 I llama_model_loader: - kv  18:               qwen2vl.attention.head_count u32              = 28
0.00.902.045 I llama_model_loader: - kv  19:            qwen2vl.attention.head_count_kv u32              = 4
0.00.902.048 I llama_model_loader: - kv  20:                     qwen2vl.rope.freq_base f32              = 1000000.000000
0.00.902.050 I llama_model_loader: - kv  21:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
0.00.902.050 I llama_model_loader: - kv  22:                          general.file_type u32              = 18
0.00.902.051 I llama_model_loader: - kv  23:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
0.00.902.052 I llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
0.00.902.052 I llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
0.00.923.788 I llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.932.652 I llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.954.468 I llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.954.471 I llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
0.00.954.471 I llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151645
0.00.954.472 I llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
0.00.954.474 I llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
0.00.954.475 I llama_model_loader: - kv  33:               general.quantization_version u32              = 2
0.00.954.476 I llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-abli...
0.00.954.477 I llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
0.00.954.477 I llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 196
0.00.954.478 I llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 128
0.00.954.478 I llama_model_loader: - type  f32:  141 tensors
0.00.954.479 I llama_model_loader: - type q8_0:    2 tensors
0.00.954.479 I llama_model_loader: - type q6_K:  196 tensors
0.01.041.321 I llm_load_vocab: special tokens cache size = 14
0.01.060.136 I llm_load_vocab: token to piece cache size = 0.9309 MB
0.01.060.148 I llm_load_print_meta: format           = GGUF V3 (latest)
0.01.060.148 I llm_load_print_meta: arch             = qwen2vl
0.01.060.148 I llm_load_print_meta: vocab type       = BPE
0.01.060.149 I llm_load_print_meta: n_vocab          = 152064
0.01.060.149 I llm_load_print_meta: n_merges         = 151387
0.01.060.149 I llm_load_print_meta: vocab_only       = 0
0.01.060.150 I llm_load_print_meta: n_ctx_train      = 32768
0.01.060.150 I llm_load_print_meta: n_embd           = 3584
0.01.060.150 I llm_load_print_meta: n_layer          = 28
0.01.060.159 I llm_load_print_meta: n_head           = 28
0.01.060.160 I llm_load_print_meta: n_head_kv        = 4
0.01.060.160 I llm_load_print_meta: n_rot            = 128
0.01.060.161 I llm_load_print_meta: n_swa            = 0
0.01.060.161 I llm_load_print_meta: n_embd_head_k    = 128
0.01.060.161 I llm_load_print_meta: n_embd_head_v    = 128
0.01.060.162 I llm_load_print_meta: n_gqa            = 7
0.01.060.163 I llm_load_print_meta: n_embd_k_gqa     = 512
0.01.060.164 I llm_load_print_meta: n_embd_v_gqa     = 512
0.01.060.165 I llm_load_print_meta: f_norm_eps       = 0.0e+00
0.01.060.166 I llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
0.01.060.167 I llm_load_print_meta: f_clamp_kqv      = 0.0e+00
0.01.060.167 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.060.167 I llm_load_print_meta: f_logit_scale    = 0.0e+00
0.01.060.168 I llm_load_print_meta: n_ff             = 18944
0.01.060.169 I llm_load_print_meta: n_expert         = 0
0.01.060.169 I llm_load_print_meta: n_expert_used    = 0
0.01.060.169 I llm_load_print_meta: causal attn      = 1
0.01.060.169 I llm_load_print_meta: pooling type     = 0
0.01.060.170 I llm_load_print_meta: rope type        = 8
0.01.060.170 I llm_load_print_meta: rope scaling     = linear
0.01.060.171 I llm_load_print_meta: freq_base_train  = 1000000.0
0.01.060.172 I llm_load_print_meta: freq_scale_train = 1
0.01.060.172 I llm_load_print_meta: n_ctx_orig_yarn  = 32768
0.01.060.172 I llm_load_print_meta: rope_finetuned   = unknown
0.01.060.172 I llm_load_print_meta: ssm_d_conv       = 0
0.01.060.173 I llm_load_print_meta: ssm_d_inner      = 0
0.01.060.173 I llm_load_print_meta: ssm_d_state      = 0
0.01.060.173 I llm_load_print_meta: ssm_dt_rank      = 0
0.01.060.173 I llm_load_print_meta: ssm_dt_b_c_rms   = 0
0.01.060.174 I llm_load_print_meta: model type       = 7B
0.01.060.174 I llm_load_print_meta: model ftype      = Q6_K
0.01.060.175 I llm_load_print_meta: model params     = 7.62 B
0.01.060.176 I llm_load_print_meta: model size       = 6.06 GiB (6.84 BPW)
0.01.060.176 I llm_load_print_meta: general.name     = Qwen2 VL 7B Instruct Abliterated
0.01.060.177 I llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
0.01.060.177 I llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
0.01.060.177 I llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
0.01.060.178 I llm_load_print_meta: PAD token        = 151645 '<|im_end|>'
0.01.060.178 I llm_load_print_meta: LF token         = 148848 'ÄĬ'
0.01.060.178 I llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
0.01.060.178 I llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
0.01.060.179 I llm_load_print_meta: max token length = 256
0.01.393.720 I llm_load_tensors: offloading 0 repeating layers to GPU
0.01.393.724 I llm_load_tensors: offloaded 0/29 layers to GPU
0.01.393.731 I llm_load_tensors:   CPU_Mapped model buffer size =  6210.54 MiB
.....................................................................................
0.01.400.120 I common_init_from_params: model requires M-RoPE, increasing batch size by 4x
0.01.400.128 I llama_new_context_with_model: n_seq_max     = 2
0.01.400.129 I llama_new_context_with_model: n_ctx         = 2048
0.01.400.129 I llama_new_context_with_model: n_ctx_per_seq = 1024
0.01.400.129 I llama_new_context_with_model: n_batch       = 2048
0.01.400.130 I llama_new_context_with_model: n_ubatch      = 512
0.01.400.130 I llama_new_context_with_model: flash_attn    = 0
0.01.400.131 I llama_new_context_with_model: freq_base     = 1000000.0
0.01.400.132 I llama_new_context_with_model: freq_scale    = 1
0.01.400.134 W llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.01.400.142 I llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
0.01.409.585 I llama_kv_cache_init:        CPU KV buffer size =   112.00 MiB
0.01.409.590 I llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
0.01.409.759 I llama_new_context_with_model:        CPU  output buffer size =     1.16 MiB
0.01.414.110 I llama_new_context_with_model:      CUDA0 compute buffer size =   856.23 MiB
0.01.414.114 I llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
0.01.414.115 I llama_new_context_with_model: graph nodes  = 986
0.01.414.115 I llama_new_context_with_model: graph splits = 396 (with bs=512), 1 (with bs=1)
0.01.414.117 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
0.01.414.117 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.406.722 I srv                load_model: prompt caching disabled
0.02.407.284 I srv                load_model: chat template, built_in: true, alias: chatml, tool call: supported, example:
<|im_start|>system
You are a helpful assistant.

## Tools

You CAN call functions to assist with the user query. Do not make assumptions about what values to plug into functions.

You are provided with following function tools:

### get_weather

get_weather:  Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

### get_temperature

get_temperature: Return the temperature according to the location. Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

When you can reply with your internal knowledge, reply directly without any function calls. Otherwise, for each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": The name of the function to use, "arguments": The input of the function, must be a JSON object in compact format}
</tool_call>
<tool_result>
The function results.
</tool_result>
Reply based on the function results.<|im_end|>
<|im_start|>user
Hello.<|im_end|>
Hi there.<|im_end|>
<|im_start|>user
What's the weather like in Paris today?<|im_end|>
<|im_start|>assistant

0.02.407.287 I srv                      main: initializing server
0.02.407.289 I srv                      init: initializing slots, n_slots = 2
0.02.407.414 I srv                      main: starting server
0.35.185.155 I srv        log_server_request: rid 34369410 | POST /v1/chat/completions 127.0.0.1:60400
0.35.205.825 I srv oaicompat_completions_req: rid 34369410 | {"messages":"[...]","model":"hermes2"}
0.36.236.689 W slt              update_slots: rid 34369410 | id 00 | task 0 | input truncated, n_ctx = 1024, n_keep = 0, n_left = 1024, n_prompt_tokens = 520
1.23.432.412 W slt              update_slots: rid 34369410 | id 00 | task 0 | slot context shift, n_keep = 0, n_left = 1023, n_discard = 511
D:\a\llama-box\llama-box\llama.cpp\ggml\src\ggml-cpu\ggml-cpu.c:9441: GGML_ASSERT(sections[0] > 0 || sections[1] > 0 || sections[2] > 0) failed
