Status: Closed
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features, but still usable)
Description
What happened?
The latest llama.cpp produces bad output for CodeShell, a model that worked well when its support was first merged into llama.cpp. After updating convert-hf-to-gguf.py and convert-hf-to-gguf-update.py, I converted CodeShell-7b, a checkpoint that works well with an older version (5d55b0c), to GGUF. Running inference with it on the latest version, however, produces poor output.
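For reference, the conversion was done along these lines (a minimal sketch; the checkpoint path, output name, and <hf_token> are placeholders):
python convert-hf-to-gguf-update.py <hf_token>
python convert-hf-to-gguf.py /path/to/CodeShell-7b --outfile codeshell-7b.gguf --outtype f16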
Tested command:
./llama-simple -m codeshell-7b.gguf -p "def merge_sort(array, start, end):" -n 100
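For comparison, the known-good behavior can be reproduced by checking out the old commit and running a GGUF converted with that tree's scripts (a sketch; the old GGUF name is a placeholder, and this assumes the old tree still builds the main example via make):
git checkout 5d55b0c && make clean && make
./main -m codeshell-7b-old.gguf -p "def merge_sort(array, start, end):" -n 100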
Name and Version
version: 3281 (023b880)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
# ./llama-simple -m cd7b.gguf -p "def merge_sort(array, start, end):" -n 100
llama_model_loader: loaded meta data with 23 key-value pairs and 508 tensors from cd7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = codeshell
llama_model_loader: - kv  1: general.name str = CodeShell
llama_model_loader: - kv  2: codeshell.context_length u32 = 8192
llama_model_loader: - kv  3: codeshell.embedding_length u32 = 4096
llama_model_loader: - kv  4: codeshell.feed_forward_length u32 = 16384
llama_model_loader: - kv  5: codeshell.block_count u32 = 42
llama_model_loader: - kv  6: codeshell.attention.head_count u32 = 32
llama_model_loader: - kv  7: codeshell.attention.head_count_kv u32 = 8
llama_model_loader: - kv  8: codeshell.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv  9: general.file_type u32 = 1
llama_model_loader: - kv 10: codeshell.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: codeshell.rope.scaling.type str = linear
llama_model_loader: - kv 12: codeshell.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = codeshell
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,70144] = ["æ½»", "æ¶ģ", "ïĴĻ", "amily...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,70144] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,72075] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 70000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 70000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 70000
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 70000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 338 tensors
llama_model_loader: - type f16: 170 tensors
llm_load_vocab: special tokens cache size = 144
llm_load_vocab: token to piece cache size = 0.2985 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = codeshell
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 70144
llm_load_print_meta: n_merges = 72075
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 42
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 0.1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.98 B
llm_load_print_meta: model size = 14.86 GiB (16.00 BPW)
llm_load_print_meta: general.name = CodeShell
llm_load_print_meta: BOS token = 70000 '<|endoftext|>'
llm_load_print_meta: EOS token = 70000 '<|endoftext|>'
llm_load_print_meta: UNK token = 70000 '<|endoftext|>'
llm_load_print_meta: PAD token = 70000 '<|endoftext|>'
llm_load_print_meta: LF token = 28544 'ÄĬ'
llm_load_print_meta: EOT token = 70000 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: CPU buffer size = 15215.58 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1344.00 MiB
llama_new_context_with_model: KV self size = 1344.00 MiB, K (f16): 672.00 MiB, V (f16): 672.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.27 MiB
llama_new_context_with_model: CPU compute buffer size = 564.01 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 1
main: n_predict = 100, n_ctx = 8192, n_kv_req = 100
def merge_sort(array, start, end):
if start < end:
mid = (start + end) // 2
merge_sort(array, start, mid)
merge_sort(array, mid + 1, end)
merge(array, start, mid, end)
def merge(array, start, mid, end):
left = array[start:mid + 1]
right = mid + 1
if array[
main: decoded 89 tokens in 14.22 s, speed: 6.26 t/s
llama_print_timings: load time = 1487.96 ms
llama_print_timings: sample time = 9.78 ms / 90 runs ( 0.11 ms per token, 9206.22 tokens per second)
llama_print_timings: prompt eval time = 260.92 ms / 11 tokens ( 23.72 ms per token, 42.16 tokens per second)
llama_print_timings: eval time = 14146.74 ms / 89 runs ( 158.95 ms per token, 6.29 tokens per second)
llama_print_timings: total time = 15708.60 ms / 100 tokens
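As a sanity check, the reported KV cache size matches the hyperparameters above: with f16 K/V, each of the K and V caches takes n_layer * n_ctx * n_embd_k_gqa * 2 bytes, e.g.:
echo $((42 * 8192 * 1024 * 2 / 1024 / 1024))   # 672 MiB each for K and V, 1344 MiB total as logged
so the model geometry itself loads as expected; only the generated text is wrong.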