Llama2 #488

Closed
4 of 5 tasks
Tracked by #487
abetlen opened this issue Jul 18, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@abetlen
Owner

abetlen commented Jul 18, 2023

  • Test Llama2 models
    • 7b
    • 13b
    • 70b (WIP)
  • Add download links to README
@abetlen abetlen mentioned this issue Jul 18, 2023
9 tasks
@antonkulaga

antonkulaga commented Jul 20, 2023

I can confirm that the Llama 2 13b model works. However, I get bizarre errors when I try to use it together with langchain to create embeddings: it reports inf tokens per second and runs for ages. @abetlen I suggest also testing that embedding creation with a 4K context works smoothly, as I often get "too many tokens" errors even when splitting the input into 2K-token chunks.
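
For reference, a minimal embedding call through llama-cpp-python alone (without langchain) looks roughly like the sketch below; the model path, context size, and input text are placeholders:

from llama_cpp import Llama

# enable embedding mode at construction time; path and n_ctx are placeholders
llm = Llama(
    model_path="./models/llama-2-13b.ggmlv3.q4_0.bin",
    embedding=True,
    n_ctx=4096,
)

# create_embedding returns an OpenAI-style dict with the vector under "data"
result = llm.create_embedding("The quick brown fox jumps over the lazy dog.")
vector = result["data"][0]["embedding"]
print(len(vector))

Running something like this directly would help tell apart a langchain chunking problem from an issue in llama-cpp-python itself.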

@abetlen
Owner Author

abetlen commented Jul 20, 2023

@antonkulaga I'll investigate. Does this issue only come up with the embeddings? It could be an upstream issue in llama.cpp.

@bretello
Contributor

bretello commented Jul 24, 2023

Testing out 70b (quantized) on an M1 max with 64GB of RAM:

[ins] In [2]: from llama_cpp import Llama

[ins] In [3]: MODEL_PATH = "./llama2/70b-v2-q4_0.bin"

[ins] In [4]: model = Llama(model_path=MODEL_PATH)
llama.cpp: loading model from ./llama2/70b-v2-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 24576
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = Llama(model_path=MODEL_PATH)

File .venv/lib/python3.9/site-packages/llama_cpp/llama.py:305, in Llama.__init__(self, model_path, n_ctx, n_parts, n_gpu_layers, seed, f16_kv, logits_all, vocab_only, use_mmap, use_mlock, embedding, n_threads, n_batch, last_n_tokens_size, lora_base, lora_path, low_vram, tensor_split, rope_freq_base, rope_freq_scale, verbose)
    300     raise ValueError(f"Model path does not exist: {model_path}")
    302 self.model = llama_cpp.llama_load_model_from_file(
    303     self.model_path.encode("utf-8"), self.params
    304 )
--> 305 assert self.model is not None
    307 self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
    309 assert self.ctx is not None

AssertionError:

Seems to be expecting the wrong shape:

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024

Note that this works just fine with llama.cpp directly. Below are details on how the model was converted and quantized:

# convert model to ggml
python convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
# quantize it to q4_0
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
# inference runs fine
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8 -gqa 8

Extra info

Installed with CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
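
The shape mismatch on layers.0.attention.wk.weight is characteristic of the grouped-query-attention (GQA) layout of the 70B model; the llama.cpp run above works because it passes -gqa 8. A minimal sketch of the Python-side equivalent, assuming the installed llama-cpp-python version exposes an n_gqa constructor parameter (not all releases do):

from llama_cpp import Llama

# Llama-2 70B uses grouped-query attention, so the loader needs the GQA hint
# (the llama.cpp CLI equivalent is the -gqa 8 flag used above).
# n_gqa is assumed to be available in the installed llama-cpp-python version.
model = Llama(model_path="./llama2/70b-v2-q4_0.bin", n_gqa=8)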

@gjmulder gjmulder added the enhancement New feature or request label Jul 30, 2023
@rlleshi

rlleshi commented Aug 15, 2023

Currently I'm not getting good results with llama2-13b. The responses are quite off and frequently devolve into rambling. See #596

The model exposed through FastAPI gives better responses, though still not as good as the responses from llama.cpp itself or from integrating it with LocalAI.

I'm not exactly sure which hyperparameters would need to be tuned.
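
For what it's worth, rambling output from the chat-tuned Llama 2 models is often a prompt-template issue rather than a hyperparameter one. A minimal sketch, assuming a 13B chat model file (the path and sampling values are placeholders):

from llama_cpp import Llama

# placeholder path to a 13B *chat* model
llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", n_ctx=4096)

# Llama-2 chat models expect the [INST] / <<SYS>> template; plain prompts
# often produce rambling completions.
prompt = (
    "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "What is the capital of France? [/INST]"
)
out = llm(prompt, max_tokens=128, temperature=0.7, repeat_penalty=1.1)
print(out["choices"][0]["text"])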

@JohanAR

JohanAR commented Aug 23, 2023

llama2 13b models are working for me, but 22b models (e.g. llama2-22b-gplatty.ggmlv3.q5_K_M.bin and others) segfault with "ggml_new_object: not enough space in the context's memory pool (needed 13798672, available 12747472)" when I send more than just a handful of tokens. No issues with older 30b (n_ctx 2048) models, only Llama 2 22b. Any debug info that could be useful? I'm using llama-cpp-python through text-generation-webui.
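
As a first debugging step, a minimal sketch that reproduces the call outside text-generation-webui might help isolate whether the memory-pool error comes from llama-cpp-python itself (model path, context size, and prompt are placeholders):

from llama_cpp import Llama

# verbose=True keeps the llama.cpp load/eval log, which is the useful debug info here
llm = Llama(
    model_path="./models/llama2-22b-gplatty.ggmlv3.q5_K_M.bin",
    n_ctx=2048,
    verbose=True,
)

# a longer prompt, to mirror "more than just a handful of tokens"
out = llm("word " * 500, max_tokens=16)
print(out["choices"][0]["text"])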

@abetlen abetlen closed this as completed Sep 12, 2023