
OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens #6859

Closed
K-Mistele opened this issue Apr 24, 2024 · 11 comments


@K-Mistele
Contributor

K-Mistele commented Apr 24, 2024

Commit: 4e96a81 (origin/master)

Expected Behavior: Chat completions from /v1/chat/completions should not include the stop token in the text returned to the client

Actual Behavior: The stop token is included when using Mistral 7B Instruct v0.2 with either no chat template or the llama2 chat template.

Example of Broken Behavior

When I run inference with the server and mistral-7b-instruct-v0.2, I use the following command:

./server -m ~/Documents/AI/models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -cb -np 1 -ngl -1 --host 0.0.0.0

The result of using the /v1/chat/completions OpenAI-compatible endpoint with TheBloke's quant of the model includes the EOS string </s> in the output:

[Screenshot, 2024-04-23: chat completion response ending with the literal </s> token]

This happens both when I omit the --chat-template option and when I use --chat-template llama2 as indicated in this repository's wiki.
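For a text-only reproduction: a request along these lines (default port 8080 assumed, payload purely illustrative) returns choices[0].message.content ending with the literal </s>:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Say hello"}]}'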

In the past, when I used ChatML fine-tunes of Mistral, I did not see a stop token at the end of the generated text.

However, now, running the ChatML-tuned Hermes 2 Pro Mistral 7B with:

./server -m ~/Documents/AI/models/optimal/Hermes-2-Pro-Mistral-7B.Q8_0.gguf -cb -np 1 -c 8096 --host 0.0.0.0

I see the <|im_end|> stop token in the output:

[Screenshot, 2024-04-23: chat completion response ending with the literal <|im_end|> token]

I am confident that I never saw stop tokens included in chat completion responses from the OpenAI-compatible endpoint with older versions of llama.cpp.

@K-Mistele K-Mistele changed the title Server output includes EOS token for llama2 chat template Sever OpenAI Chat Completions Endpoint Responses include EOS / stop tokens Apr 24, 2024
@K-Mistele K-Mistele changed the title Sever OpenAI Chat Completions Endpoint Responses include EOS / stop tokens OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens Apr 24, 2024
@QueryType

Reported the same here too: #6847
And here also: #6837

@K-Mistele
Contributor Author

Reported the same here too: #6847 And here also: #6837

Hmm - this is unrelated to #6837, since that pertains to the default server UI and the /completions API endpoint. This issue is about the /v1/chat/completions endpoint, which previously did not include stop / EOS tokens in responses.

In fact, I'm running a ChatML instruct-tuned LLM (Nous Hermes 2 Solar 10.7B) in production on an older version of llama.cpp, and it works fine and doesn't include the stop token in responses; running the same model on a more recent version of llama.cpp does include the stop token. So the behavior of the /v1/chat/completions endpoint has changed (broken).

I did see #6847, but I opted to open this one anyway. They may or may not be the same issue (#6847 uses different models and no chat template), and your title also indicates that it's about "old models", which isn't the case for the models I'm using - all of them are recent. I also think it's inaccurate to frame it as "gibberish", since these aren't gibberish tokens (usually a separate problem); it's an issue with prompt-template-related tokens, which suggests the chat template isn't being applied or parsed properly in the chat completions endpoint.

Given this, I think it's best to either merge that issue into this, or to leave them both open.

@QueryType

Quoting the above: "Given this, I think it's best to either merge that issue into this, or to leave them both open."

I agree to keep them separate. Almost all models, old or new, are impacted by the change. Reverting to older versions seems to work.

@K-Mistele
Contributor Author

@QueryType in your issue you said

I was able to run mythomax-13B and other models perfectly before these changes; in fact, I can run them even now on older releases

Do you know the most recent release where you can use the server without this issue? If so, we can work backwards to figure out the source of the problem, and I can try to create a PR to fix it.

@QueryType

Quoting the above: "Do you know the most recent release where you can use the server without this issue?"

Yes, good idea, will do. My hunch is that b2707 or b2702 introduced it, so b2700 should be fine. But honestly, I need to check - I will do it this evening once I have access to the machine.

@K-Mistele
Contributor Author

K-Mistele commented Apr 24, 2024

Hmm, it looks like it could have been either of these two merged PRs:

https://github.com/ggerganov/llama.cpp/pull/6745/files - this PR deals with parsing EOG tokens, so it could be this one.

The other one, which is my guess, is this:
https://github.com/ggerganov/llama.cpp/pull/6807/files

It adds a bool special parameter to llama_token_to_piece that is set to true in some places and false in others. The PR is explicitly meant to "add an option to render special/control tokens", and when the flag is true, special/control tokens (presumably including EOS/EOG tokens) are rendered.

This seems like the most likely culprit - cc @ggerganov, who was the author. I'm checking now whether changing this value stops the issue.
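For reference, here is a minimal sketch of how that flag behaves against the post-#6807 C API. The exact signature and the negative-return/retry convention are from my reading of llama.h, so treat this as illustrative rather than authoritative:

#include <string>
#include <vector>
#include "llama.h"

// Convert a single token to text. With special = true, control tokens such as
// </s> or <|im_end|> are rendered as literal text; with special = false they
// produce no visible output, which is what the chat completions endpoint
// previously relied on.
static std::string token_to_piece(const llama_model * model, llama_token token, bool special) {
    std::vector<char> buf(8, 0);
    int n = llama_token_to_piece(model, token, buf.data(), (int) buf.size(), special);
    if (n < 0) { // buffer too small: the API returns the negated required size
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(), (int) buf.size(), special);
    }
    return std::string(buf.data(), n);
}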

@K-Mistele
Copy link
Contributor Author

Yep, that fixed it. cc @ggerganov - merging #6807 changed this behavior such that the server now renders EOS/stop tokens.

Changing true to false in common.hpp on lines 2331 and 2335 fixes the issue. I'm happy to create a PR, but since showing the control tokens seems to have been an intentional option (and perhaps only making it the default behavior was unintended), there may be a different way you would prefer to resolve this.
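For illustration only, the change I tested amounts to the following - the call site and variable names here are simplified stand-ins, not the actual server code:

// Hypothetical sketch of the server-side detokenization call, not the real code path.
// After #6807 the common helper renders control tokens by default, so passing
// false explicitly restores the old behavior of suppressing EOS / stop tokens
// in the text returned to chat clients.
std::string piece = llama_token_to_piece(ctx, token, /* special = */ false);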

@slavonnet

also #6872

@slavonnet

also #6873

@slavonnet

Not only the server but also main returns the end-of-sequence token.

@ggerganov
Owner

Fixed in #6860
