
OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens #6859

Closed
K-Mistele opened this issue Apr 24, 2024 · 11 comments


@K-Mistele
Contributor

K-Mistele commented Apr 24, 2024

Commit: 4e96a81 (origin/master)

Expected Behavior: Chat completions from /v1/chat/completions should not include the stop token in the text returned to the client

Actual Behavior: The stop token is included when using Mistral 7B Instruct v0.2 with either no chat template or the llama2 chat template.

Example of Broken Behavior

When I run inference with the server and mistral-7b-instruct-v0.2, I use the following command:

./server -m ~/Documents/AI/models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -cb -np 1 -ngl -1 --host 0.0.0.0

The result of using the /v1/chat/completions OpenAI-compatible endpoint with TheBloke's quant of the model includes the EOS string </s> in the output:

[Screenshot, 2024-04-23: chat completion response ending with the literal </s> token]

This happens both when I omit the --chat-template option and when I use --chat-template llama2 as indicated in this repository's wiki.
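For a text-only reproduction: a request along these lines (default port 8080 assumed, payload purely illustrative) returns choices[0].message.content ending with the literal </s>:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Say hello"}]}'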

In the past, when I used ChatML fine-tunes of Mistral, I did not see a stop token at the end of the generated text.

However, now, running the ChatML-tuned Hermes 2 Pro Mistral 7B with:

./server -m ~/Documents/AI/models/optimal/Hermes-2-Pro-Mistral-7B.Q8_0.gguf -cb -np 1 -c 8096 --host 0.0.0.0

I see the <|im_end|> stop token in the output:

[Screenshot, 2024-04-23: chat completion response ending with the literal <|im_end|> token]

I am confident that I never saw stop tokens included in chat completion responses from the OpenAI-compatible endpoint with older versions of llama.cpp.

@K-Mistele K-Mistele changed the title Server output includes EOS token for llama2 chat template Sever OpenAI Chat Completions Endpoint Responses include EOS / stop tokens Apr 24, 2024
@K-Mistele K-Mistele changed the title Sever OpenAI Chat Completions Endpoint Responses include EOS / stop tokens OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens Apr 24, 2024
@QueryType

Reported the same here too: #6847
And here also: #6837

@K-Mistele
Contributor Author

Reported the same here too: #6847 And here also: #6837

Hmm - this is unrelated to #6837, since that pertains to the default server UI and the /completions API endpoint. This issue is about the /v1/chat/completions endpoint, which previously did not include stop / EOS tokens in responses.

In fact, I'm running a ChatML instruct-tuned LLM (Nous Hermes 2 Solar 10.7B) in production on an older version of llama.cpp, and it works fine and doesn't include the stop token in responses; running the same model on a more recent version of llama.cpp does include the stop token. So the behavior of the /v1/chat/completions endpoint has changed (broken).

I did see #6847, but I opted to open this one anyway. They may or may not be the same issue (#6847 uses different models and no chat template), and your title also indicates that it's about "old models", which isn't the case for the models I'm using - all of them are recent. I also think it's inaccurate to frame it as "gibberish", since these aren't gibberish tokens (usually a separate problem); it's an issue with prompt-template-related tokens, which suggests the chat template isn't being applied or parsed properly in the chat completions endpoint.

Given this, I think it's best to either merge that issue into this, or to leave them both open.

@QueryType

Quoting the above: "Given this, I think it's best to either merge that issue into this, or to leave them both open."

I agree to keep them separate. Almost all models, old or new, are impacted by the change. Reverting to older versions seems to work.

@K-Mistele
Contributor Author

@QueryType in your issue you said

I was able to run mythomax-13B and other models perfectly before these changes; in fact, I can run them even now on older releases

Do you know the most recent release where you can use the server without this issue? If so, we can work backwards to figure out the source of the problem, and I can try to create a PR to fix it.

@QueryType

Quoting the above: "Do you know the most recent release where you can use the server without this issue?"

Yes, good idea, will do. My hunch is that b2707 or b2702 introduced it, so b2700 should be fine. But honestly, I need to check - I will do it this evening once I have access to the machine.

@K-Mistele
Contributor Author

K-Mistele commented Apr 24, 2024

Hmm, it looks like it could have been either of these two merged PRs:

https://github.com/ggerganov/llama.cpp/pull/6745/files - this PR deals with parsing EOG tokens, so it could be this one.

The other one, which is my guess, is this:
https://github.com/ggerganov/llama.cpp/pull/6807/files

It adds a bool special parameter to llama_token_to_piece that is set to true in some places and false in others. The PR is explicitly meant to "add an option to render special/control tokens", and when the flag is true, special/control tokens (presumably including EOS/EOG tokens) are rendered.

This seems like the most likely culprit - cc @ggerganov, who was the author. I'm checking now whether changing this value stops the issue.
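For reference, here is a minimal sketch of how that flag behaves against the post-#6807 C API. The exact signature and the negative-return/retry convention are from my reading of llama.h, so treat this as illustrative rather than authoritative:

#include <string>
#include <vector>
#include "llama.h"

// Convert a single token to text. With special = true, control tokens such as
// </s> or <|im_end|> are rendered as literal text; with special = false they
// produce no visible output, which is what the chat completions endpoint
// previously relied on.
static std::string token_to_piece(const llama_model * model, llama_token token, bool special) {
    std::vector<char> buf(8, 0);
    int n = llama_token_to_piece(model, token, buf.data(), (int) buf.size(), special);
    if (n < 0) { // buffer too small: the API returns the negated required size
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(), (int) buf.size(), special);
    }
    return std::string(buf.data(), n);
}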

@K-Mistele
Copy link
Contributor Author

Yep, that fixed it. cc @ggerganov - merging #6807 changed this behavior such that the server now renders EOS/stop tokens.

Changing true to false in common.hpp on lines 2331 and 2335 fixes the issue. I'm happy to create a PR, but since showing the control tokens seems to have been an intentional option (and perhaps only making it the default behavior was unintended), there may be a different way you would prefer to resolve this.
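For illustration only, the change I tested amounts to the following - the call site and variable names here are simplified stand-ins, not the actual server code:

// Hypothetical sketch of the server-side detokenization call, not the real code path.
// After #6807 the common helper renders control tokens by default, so passing
// false explicitly restores the old behavior of suppressing EOS / stop tokens
// in the text returned to chat clients.
std::string piece = llama_token_to_piece(ctx, token, /* special = */ false);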

@slavonnet

also #6872

@slavonnet

also #6873

@slavonnet

Not only the server but also main returns the end-of-sequence token.

@ggerganov
Owner

Fixed in #6860
