Closed as not planned
Labels: wontfix (This will not be worked on)
Description
Name and Version
build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4945 (9b169a4)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 4090
Models
GPT2LMHeadModel
Problem description & steps to reproduce
When I convert the basic GPT2LMHeadModel with the convert_hf_to_gguf.py script, the conversion finishes without errors. But when I then load the resulting file with:
from llama_cpp import Llama
llama = Llama("model.path.gguf")
I get this error:
llama_model_load: error loading model: missing tensor 'output.weight'
Apparently the output layer was not converted, but why? Does anyone have an idea?
First Bad Commit
No response
Relevant log output
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1024
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 0.1B
print_info: model params = 124.44 M
print_info: general.name = Gpt2
print_info: vocab type = BPE
print_info: n_vocab = 50257
print_info: n_merges = 50000
print_info: BOS token = 50256 '<|endoftext|>'
print_info: EOS token = 50256 '<|endoftext|>'
print_info: EOT token = 50256 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 50256 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CUDA0
load_tensors: layer 1 assigned to device CUDA0
load_tensors: layer 2 assigned to device CUDA0
load_tensors: layer 3 assigned to device CUDA0
load_tensors: layer 4 assigned to device CUDA0
load_tensors: layer 5 assigned to device CUDA0
load_tensors: layer 6 assigned to device CUDA0
load_tensors: layer 7 assigned to device CUDA0
load_tensors: layer 8 assigned to device CUDA0
load_tensors: layer 9 assigned to device CUDA0
load_tensors: layer 10 assigned to device CUDA0
load_tensors: layer 11 assigned to device CUDA0
load_tensors: layer 12 assigned to device CUDA0
llama_model_load: error loading model: missing tensor 'output.weight'
llama_model_load_from_file_impl: failed to load model