Hi! I'm trying to run the Q4_K_M quantization of Meta-Llama-3-8B-Instruct on my Mac (M2 Pro, 16 GB unified memory) using llama-cpp-python, with the following test code:
from llama_cpp import Llama

llm4 = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")
response = llm4.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful dietologist.",
        },
        {
            "role": "user",
            "content": "Can I eat oranges after 7 pm?",
        },
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
print(response)
However, the output is consistently empty:

{'id': 'chatcmpl-d6b4c8ae-0f0a-4112-bb32-3c567f383d13', 'object': 'chat.completion', 'created': 1724142021, 'model': 'path/to/model/Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{} '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 51, 'completion_tokens': 2, 'total_tokens': 53}}

Everything works fine when using llama-cli in the terminal, and I've reinstalled llama-cpp-python and rebuilt llama.cpp as per the instructions, but it didn't help. The same happens with the Q8 and F16 quantizations (F16 gives an insufficient-memory error when run through llama-cli, but empty output when run through llama-cpp-python). Is there anything obvious I may be missing here?
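A plausible explanation for the '{}' in that response: with response_format set to {"type": "json_object"}, llama-cpp-python constrains decoding with a JSON grammar, and a bare {} is about the shortest output that satisfies it; if the prompt never tells the model to answer in JSON, generation can stop as soon as the grammar is satisfied. A minimal sketch of the JSON-schema variant described in the llama-cpp-python README, which keeps the constraint but forces actual fields (the "answer" field and the reworded system prompt are made up for illustration):

from llama_cpp import Llama

llm = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")
response = llm.create_chat_completion(
    messages=[
        # Asking for JSON in the system prompt matters: the grammar can only
        # constrain tokens, it cannot make the model volunteer content.
        {
            "role": "system",
            "content": "You are a helpful dietologist that outputs JSON.",
        },
        {"role": "user", "content": "Can I eat oranges after 7 pm?"},
    ],
    response_format={
        "type": "json_object",
        # Hypothetical schema for illustration; llama-cpp-python converts it
        # to a grammar, so the model must emit these fields.
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])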
I encountered a similar issue. After fine-tuning an LLM and quantizing it with llama.cpp, the model works perfectly when accessed from the terminal via llama-cli. However, when I use the high-level API of the llama-cpp-python library, I get no errors, but the assistant's content in the response is always empty.
Has anyone experienced this issue or found a solution?
@JHH11 If you're running code similar to what I posted above, try removing the response_format argument entirely; that fixed it for me, though I still have no idea why.
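For reference, a minimal sketch of the same call with response_format dropped, assuming everything else stays as in the original post; without the JSON grammar constraint the model answers in free-form text:

from llama_cpp import Llama

llm = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful dietologist."},
        {"role": "user", "content": "Can I eat oranges after 7 pm?"},
    ],
    # response_format omitted: no grammar constraint, plain text generation
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])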