Empty output when running Q4_K_M quantization of Llama-3-8B-Instruct with llama-cpp-python #1696

Open
smolraccoon opened this issue Aug 22, 2024 · 3 comments

Comments

@smolraccoon

Hi! I'm trying to run the Q4_K_M quantization of Meta-Llama-3-8B-Instruct on my Mac (M2 Pro, 16GB VRAM) using llama-cpp-python, with the following test code:

from llama_cpp import Llama

# Load the Q4_K_M GGUF and use the ChatML chat format
llm4 = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")

response = llm4.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful dietologist."},
        {"role": "user", "content": "Can I eat oranges after 7 pm?"},
    ],
    # Constrain the output to a JSON object
    response_format={"type": "json_object"},
    temperature=0.7,
)

print(response)

However, the output is consistently empty:

{'id': 'chatcmpl-d6b4c8ae-0f0a-4112-bb32-3c567f383d13', 'object': 'chat.completion', 'created': 1724142021, 'model': 'path/to/model/Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{} '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 51, 'completion_tokens': 2, 'total_tokens': 53}}

Everything works fine when running the model with llama-cli in the terminal, and I've reinstalled llama-cpp-python and rebuilt llama.cpp as per the instructions, but that didn't help. The same happens with the Q8 and F16 quantizations (F16 gives an insufficient-memory error when run through llama-cli, but empty output when run through llama-cpp-python). Is there anything obvious I may be missing here?

@JHH11

JHH11 commented Sep 27, 2024

I encountered a similar issue. After fine-tuning an LLM and quantizing it with llama.cpp, the model works perfectly when accessed from the terminal with llama-cli. However, when I use the high-level API from the llama-cpp-python library, I get no errors, but the assistant's content in the response is always empty.

Has anyone experienced this issue or found a solution?

@smolraccoon
Author

@JHH11 If you're running code similar to what I posted above, try deleting the response_format block entirely; that fixed it for me, though I still have no idea why.
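
For illustration, a minimal sketch of that workaround: the same call as above with the response_format argument removed (placeholder model path, parameters otherwise unchanged).

from llama_cpp import Llama

llm4 = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")

# Same request as before, but without response_format, so the model is not
# constrained to emit a JSON object and can answer in plain text.
response = llm4.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful dietologist."},
        {"role": "user", "content": "Can I eat oranges after 7 pm?"},
    ],
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])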

@JHH11

JHH11 commented Oct 4, 2024

@smolraccoon Thanks for sharing, but the method didn't work for me. By the way, should the chat_format be set to llama-3?
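
For reference, a hedged sketch of the two options being discussed. llama-cpp-python registers a "llama-3" chat format, and when chat_format is omitted the library falls back to the chat template stored in the GGUF metadata if one is present (both behaviors assumed from the library's defaults, not verified against this particular model):

from llama_cpp import Llama

# Option 1: explicitly request the Llama 3 prompt template instead of ChatML.
llm = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="llama-3")

# Option 2: omit chat_format so the chat template embedded in the GGUF
# metadata is used, when the file provides one.
# llm = Llama(model_path="/path/to/model/Q4_K_M.gguf")

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Can I eat oranges after 7 pm?"}],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])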
