Llama2 #488

Closed
4 of 5 tasks
Tracked by #487
abetlen opened this issue Jul 18, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@abetlen
Owner

abetlen commented Jul 18, 2023

  • Test Llama2 models
    • 7b
    • 13b
    • 70b (WIP)
  • Add download links to README
@abetlen abetlen mentioned this issue Jul 18, 2023
9 tasks
@antonkulaga

antonkulaga commented Jul 20, 2023

I can confirm that the Llama 2 13b model works. However, I get bizarre errors when I try to use it together with langchain to create embeddings: it reports inf tokens per second and runs for ages. @abetlen I suggest also testing that embedding creation with a 4K context works smoothly, as I often get "too many tokens" errors even when splitting the input into 2K-token chunks.
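
For reference, a minimal embedding call through llama-cpp-python alone (without langchain) looks roughly like the sketch below; the model path, context size, and input text are placeholders:

from llama_cpp import Llama

# enable embedding mode at construction time; path and n_ctx are placeholders
llm = Llama(
    model_path="./models/llama-2-13b.ggmlv3.q4_0.bin",
    embedding=True,
    n_ctx=4096,
)

# create_embedding returns an OpenAI-style dict with the vector under "data"
result = llm.create_embedding("The quick brown fox jumps over the lazy dog.")
vector = result["data"][0]["embedding"]
print(len(vector))

Running something like this directly would help tell apart a langchain chunking problem from an issue in llama-cpp-python itself.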

@abetlen
Owner Author

abetlen commented Jul 20, 2023

@antonkulaga I'll investigate. Does this issue only come up with the embeddings? It could be an upstream issue in llama.cpp.

@bretello
Contributor

bretello commented Jul 24, 2023

Testing out 70b (quantized) on an M1 max with 64GB of RAM:

[ins] In [2]: from llama_cpp import Llama

[ins] In [3]: MODEL_PATH = "./llama2/70b-v2-q4_0.bin"

[ins] In [4]: model = Llama(model_path=MODEL_PATH)
llama.cpp: loading model from ./llama2/70b-v2-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 24576
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = Llama(model_path=MODEL_PATH)

File .venv/lib/python3.9/site-packages/llama_cpp/llama.py:305, in Llama.__init__(self, model_path, n_ctx, n_parts, n_gpu_layers, seed, f16_kv, logits_all, vocab_only, use_mmap, use_mlock, embedding, n_threads, n_batch, last_n_tokens_size, lora_base, lora_path, low_vram, tensor_split, rope_freq_base, rope_freq_scale, verbose)
    300     raise ValueError(f"Model path does not exist: {model_path}")
    302 self.model = llama_cpp.llama_load_model_from_file(
    303     self.model_path.encode("utf-8"), self.params
    304 )
--> 305 assert self.model is not None
    307 self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
    309 assert self.ctx is not None

AssertionError:

Seems to be expecting the wrong shape:

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024

Note that this works just fine with llama.cpp directly. Below are details on how the model was converted and quantized:

# convert model to ggml
python convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
# quantize it to q4_0
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
# inference runs fine
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8 -gqa 8

Extra info

Installed with CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
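
The shape mismatch on layers.0.attention.wk.weight is characteristic of the grouped-query-attention (GQA) layout of the 70B model; the llama.cpp run above works because it passes -gqa 8. A minimal sketch of the Python-side equivalent, assuming the installed llama-cpp-python version exposes an n_gqa constructor parameter (not all releases do):

from llama_cpp import Llama

# Llama-2 70B uses grouped-query attention, so the loader needs the GQA hint
# (the llama.cpp CLI equivalent is the -gqa 8 flag used above).
# n_gqa is assumed to be available in the installed llama-cpp-python version.
model = Llama(model_path="./llama2/70b-v2-q4_0.bin", n_gqa=8)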

@gjmulder gjmulder added the enhancement New feature or request label Jul 30, 2023
@rlleshi

rlleshi commented Aug 15, 2023

Currently I'm not getting good results with llama2-13b. The responses are quite off and frequently devolve into rambling. See #596

The model exposed through FastAPI gives better responses, though still not as good as the responses from llama.cpp itself or from integrating it with LocalAI.

I'm not exactly sure which hyperparameters would need to be tuned.
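
For what it's worth, rambling output from the chat-tuned Llama 2 models is often a prompt-template issue rather than a hyperparameter one. A minimal sketch, assuming a 13B chat model file (the path and sampling values are placeholders):

from llama_cpp import Llama

# placeholder path to a 13B *chat* model
llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", n_ctx=4096)

# Llama-2 chat models expect the [INST] / <<SYS>> template; plain prompts
# often produce rambling completions.
prompt = (
    "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "What is the capital of France? [/INST]"
)
out = llm(prompt, max_tokens=128, temperature=0.7, repeat_penalty=1.1)
print(out["choices"][0]["text"])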

@JohanAR

JohanAR commented Aug 23, 2023

llama2 13b models are working for me, but 22b models (e.g. llama2-22b-gplatty.ggmlv3.q5_K_M.bin and others) segfault with "ggml_new_object: not enough space in the context's memory pool (needed 13798672, available 12747472)" when I send more than just a handful of tokens. No issues with older 30b (n_ctx 2048) models, only Llama 2 22b. Any debug info that could be useful? I'm using llama-cpp-python through text-generation-webui.
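
As a first debugging step, a minimal sketch that reproduces the call outside text-generation-webui might help isolate whether the memory-pool error comes from llama-cpp-python itself (model path, context size, and prompt are placeholders):

from llama_cpp import Llama

# verbose=True keeps the llama.cpp load/eval log, which is the useful debug info here
llm = Llama(
    model_path="./models/llama2-22b-gplatty.ggmlv3.q5_K_M.bin",
    n_ctx=2048,
    verbose=True,
)

# a longer prompt, to mirror "more than just a handful of tokens"
out = llm("word " * 500, max_tokens=16)
print(out["choices"][0]["text"])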

@abetlen abetlen closed this as completed Sep 12, 2023