
Fix: Revert showing control tokens by default for server OpenAI Chat completions #6860

Conversation

K-Mistele
Contributor

In #6807, @ggerganov added the ability to toggle showing control tokens (e.g. EOS tokens). In common.cpp this was set to true by default in two places, which broke the /v1/chat/completions endpoint as described in #6859: in short, the OpenAI chat completions response now includes the EOS / stop token, which differs from both past and expected behavior.

I have confirmed that reverting the booleans to false in the two places in common.cpp restores the previous behavior.

While this PR fixes the breaking change, it may affect behavior elsewhere that depends on #6807's new default of true. That may need further investigation, but I propose reverting the change for now to fix the broken /v1/chat/completions behavior.
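
For reference, here is a minimal sketch of the proposed revert, assuming the "two places" are the two calls inside the llama_token_to_piece wrapper in common.cpp (the exact upstream diff may differ):

// Sketch of the proposed revert: pass `false` for the `special` argument at both
// call sites so that control tokens (e.g. EOS) are not rendered into the output.
std::string llama_token_to_piece(const struct llama_context * ctx, llama_token token) {
    std::vector<char> result(8, 0);
    const int n_tokens = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), false);  // was true after #6807
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        const int check = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), false); // was true after #6807
        GGML_ASSERT(check == -n_tokens);
    } else {
        result.resize(n_tokens);
    }
    return std::string(result.data(), result.size());
}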

Shout-out to @QueryType for opening #6847 as well, which was caused by the same underlying issue.

API Response before the change (ChatML model):

[screenshot]

API Response before the change (Mistral model / llama2 template):

[screenshot]
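
(In both "before" responses the assistant message ends with the template's stop token, typically <|im_end|> for ChatML and </s> for llama2-style templates.)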

Correct API response after this change:

(note the absence of control tokens)

[screenshot]

…vide overridden declaration to receive "bool special" param to toggle showing control tokens
…mon/common.cpp to specify "false" so that control tokens are not shown in chat completion responses
@K-Mistele
Contributor Author

I have provided an alternative solution that reverts my earlier change to common.cpp (restoring the #6807 default there), so that nothing which depends on the change in #6807 is broken.

Instead, I added an overloaded declaration of llama_token_to_piece in common.cpp and common.h that accepts a boolean to toggle whether control tokens should be shown:

common.h overloaded declaration

std::string llama_token_to_piece(const struct llama_context * ctx, llama_token token, bool special);

common.cpp definition for the declaration

// duplicate with ability to specify whether to use special token
std::string llama_token_to_piece(const struct llama_context * ctx, llama_token token, bool special) {
    std::vector<char> result(8, 0);
    const int n_tokens = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), special);
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        int check = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), special);
        GGML_ASSERT(check == -n_tokens);
    } else {
        result.resize(n_tokens);
    }

    return std::string(result.data(), result.size());
}

Then, I edited server.cpp to use the overloaded version of the function and to pass false explicitly instead of relying on the true default from common.cpp.

The net result is that the broken /v1/chat/completions behavior on the server is still fixed, while anything elsewhere in the project that depends on the changes from #6807 is unaffected.
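
To illustrate (a rough sketch only, not the actual server.cpp code; `sampled_tokens` and `response_text` are made-up names), the chat completion path now detokenizes with the special flag set to false:

// Illustrative sketch: detokenize generated tokens with special=false so that
// EOS / other control tokens never end up in the chat completion text.
std::string response_text;
for (const llama_token tok : sampled_tokens) {
    response_text += llama_token_to_piece(ctx, tok, /*special=*/ false);
}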

You'll have to forgive me because my C++ is a little rusty (no pun intended), but as best as I can remember from my C++ class in college, this is the correct way to overload a function.

Bonus: everything compiles correctly, and I confirmed that this is still a working fix for the problem with the server -- no stop tokens in the server's chat completions response:

[screenshot]

(sense of humor not included)

@K-Mistele
Contributor Author

I am open to comments, concerns, and/or complaints about whether this is the correct way to fix this problem.

K-Mistele changed the title from "Fix: Revert showing control tokens by default" to "Fix: Revert showing control tokens by default for server OpenAI Chat completions" on Apr 24, 2024
@K-Mistele
Contributor Author

K-Mistele commented May 6, 2024

Hmm, I am not sure what happened, but on the most up-to-date llama.cpp master branch I am observing this issue again, even though the default appears to remain unchanged and server.cpp still explicitly passes false to llama_token_to_piece. It is possibly only related to Llama 3: Mistral works fine, but with --chat-template llama3 I get a stop token at the end of every response.
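
One way to sanity-check this (a rough sketch, not code from this PR) is to confirm that a token the vocabulary marks as a control token renders as an empty piece when special is false; if a Llama 3 stop token is not flagged as a control token in the model's vocabulary, special=false would not strip it, which could explain the behavior above:

// Rough sanity check: with special=false, a control token such as EOS should
// render as an empty piece and therefore never appear in the response text.
const llama_token eos = llama_token_eos(llama_get_model(ctx));
const std::string piece = llama_token_to_piece(ctx, eos, /*special=*/ false);
GGML_ASSERT(piece.empty());  // if this fires, the token is not being treated as a control token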
