
Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160

Conversation

@kir-gadjello (Contributor) commented Nov 22, 2023

This PR allows one to use OpenAI ChatGPT API-compatible software right away after building just the server binary, without wrappers like "LocalAI" or "api_like_OAI.py". I think merging this will remove friction around one of the most popular llama.cpp use cases, making wrappers like LocalAI less necessary, which might drive adoption up.

It works: I use this code with several projects, including https://github.com/talkingwallace/ChatGPT-Paper-Reader and third-party ChatGPT UIs. Both streaming and synchronous API modes are supported and tested by yours truly in real applications.

For now, only ChatML-based models are supported, but other formats are just a matter of adding different prompt formatters, and ChatML is becoming the de facto standard anyway.

No breaking change is made to the existing codebase: it's just a separate, optional API endpoint plus a few state variables; only server.cpp was modified.

After merging this, we could work on replicating other OpenAI APIs, such as vision.
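For illustration, a minimal client sketch using cpp-httplib (which the server already bundles); this snippet is not part of the PR, and the host, port, and model name below are placeholders:

// Illustrative client sketch, not part of this PR. Adjust host/port to match the
// server's --host/--port settings; the model name is a placeholder.
#include <cstdio>
#include "httplib.h"
#include "json.hpp"

using json = nlohmann::json;

int main() {
    httplib::Client cli("localhost", 8080);

    json body;
    body["model"]    = "local-model"; // placeholder; the server answers with whatever model it loaded
    body["messages"] = json::array({
        {{"role", "system"}, {"content", "You are a helpful assistant."}},
        {{"role", "user"},   {"content", "Hello!"}}
    });

    auto res = cli.Post("/v1/chat/completions", body.dump(), "application/json");
    if (res && res->status == 200) {
        printf("%s\n", res->body.c_str());
    }
    return 0;
}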

@ggerganov (Owner) left a comment

Nice addition!

  • Use 4 space indentation
  • Column width is 120 characters

@ggerganov requested a review from tobi on November 22, 2023 at 10:28
Comment on lines 2229 to 2246

std::string format_chatml(std::vector<json> messages) {
    std::ostringstream chatml_msgs;

    // iterate the array
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}
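For context (not part of the diff), this helper turns a single user message into the usual ChatML layout:

// Illustrative only: what format_chatml() above produces for one user turn.
std::vector<json> messages = { json{{"role", "user"}, {"content", "Hello"}} };
// format_chatml(messages) returns:
//   "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"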

@ggerganov (Owner)

I wonder if we should enable tokenization of special tokens by default in the server example:

std::vector<llama_token> tokenize(const json & json_prompt, bool add_bos) const
{
    // If `add_bos` is true, we only add BOS, when json_prompt is a string,
    // or the first element of the json_prompt array is a string.
    std::vector<llama_token> prompt_tokens;

    if (json_prompt.is_array())
    {
        bool first = true;
        for (const auto & p : json_prompt)
        {
            if (p.is_string())
            {
                auto s = p.template get<std::string>();
                std::vector<llama_token> p;
                if (first)
                {
                    p = ::llama_tokenize(ctx, s, add_bos);
                    first = false;
                }
                else
                {
                    p = ::llama_tokenize(ctx, s, false);
                }
                prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
            }
            else
            {
                if (first)
                {
                    first = false;
                }
                prompt_tokens.push_back(p.template get<llama_token>());
            }
        }
    }
    else
    {
        auto s = json_prompt.template get<std::string>();
        prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
    }

    return prompt_tokens;
}

Here, the calls to llama_tokenize currently default to special == false because there are some edge cases where the user could be using HTML tags such as <s>, and these would be tokenized incorrectly as special tokens.

Technically, the correct solution is the one we have in main:

const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp).c_str());

But this seems a bit tricky to implement in server.

I think there is no point in sacrificing ChatML compat just for that extreme edge case, so maybe we should use special == true in server.

@staviq and others, any thoughts?

@kir-gadjello (Author) commented Nov 23, 2023

  • Can we enable it only in the context of /v1/chat/completions (and maybe other future) OpenAI compatible APIs? Note that I added an oaicompat field to the slot, so it should be easy to do a tokenize(..., special = slot.oaicompat).
  • Can we update tokenize to be able to give it a whitelist of allowed special tokens? ChatML tokens are special due to the "|" character and won't clash with HTML tags, so allowing them, and them only, should do no harm.

Preferably, I would leave this decision & implementation to core contributors.

@ggerganov (Owner)

There are many options and we can do anything we like, though it's a bit difficult for me to make the decision. There are so many variations of these templates and tokenization is incredibly over-complicated in general.

Can we enable it only in the context of /v1/chat/completions (and maybe other future) OpenAI compatible APIs?

Yes, let's do that.

Can we update tokenize to be able to give it a whitelist of allowed special tokens?

Not sure if this is needed. What issue would it fix?
The model already knows internally which are the special tokens.
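For illustration, a hypothetical sketch of that direction; the extra special parameter on the server's tokenize() helper is an assumption here, not something already in the code:

// Hypothetical sketch, not code from this PR: parse special tokens only for
// OpenAI-compatible requests, driven by the oaicompat flag this PR adds to the slot.
// Assumes tokenize() gains a `special` parameter that it forwards to
// ::llama_tokenize(ctx, s, add_bos, special).
std::vector<llama_token> prompt_tokens = tokenize(json_prompt, add_bos, /*special=*/ slot.oaicompat);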

@kir-gadjello (Author)

I fixed the code style, though I'm not sure about some places. I hope clang-format will eventually solve this; right now it produces a global patch, so a core contributor will have to apply it instead of me.
If there are no other problems, you can merge it.

@shibe2 (Contributor) commented Nov 23, 2023

A disadvantage of processing special tokens this way is that the model then cannot distinguish between, say, the special token im_start and the literal text "<|im_start|>". Even with ChatML, special tokens should not be recognized inside the message body. A proper way to format prompts is to obtain the identifiers of the needed special tokens in a separate step and put only plain text through the tokenizer.

    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
A contributor commented on the quoted lines above:

I think that content should not be mixed with message prefixes and suffixes. If the content happens to contain the strings "<|im_start|>" or "<|im_end|>", they should not be converted to special tokens. A better way to format to ChatML is to create an array as described in the documentation of the /completion endpoint.
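A rough sketch of the approach described in these two comments (illustrative only, not code from this PR): resolve the ChatML markers to token ids once with special-token parsing enabled, and run role/content text through the tokenizer with it disabled, so a literal "<|im_start|>" inside the content stays literal.

// Illustrative sketch, not code from this PR.
// Resolve the ChatML markers once, with special-token parsing enabled.
const auto tok_im_start = ::llama_tokenize(ctx, "<|im_start|>", false, true);
const auto tok_im_end   = ::llama_tokenize(ctx, "<|im_end|>",   false, true);
const auto tok_nl       = ::llama_tokenize(ctx, "\n",           false, false);

std::vector<llama_token> prompt_tokens;
for (const auto & msg : messages) {
    const auto role    = json_value(msg, "role",    std::string("user"));
    const auto content = json_value(msg, "content", std::string(""));

    // role and content go through the tokenizer as plain text (special == false)
    const auto txt = ::llama_tokenize(ctx, role + "\n" + content, false, false);

    prompt_tokens.insert(prompt_tokens.end(), tok_im_start.begin(), tok_im_start.end());
    prompt_tokens.insert(prompt_tokens.end(), txt.begin(),          txt.end());
    prompt_tokens.insert(prompt_tokens.end(), tok_im_end.begin(),   tok_im_end.end());
    prompt_tokens.insert(prompt_tokens.end(), tok_nl.begin(),       tok_nl.end());
}

// Open the assistant turn.
const auto tok_assistant = ::llama_tokenize(ctx, "assistant\n", false, false);
prompt_tokens.insert(prompt_tokens.end(), tok_im_start.begin(),  tok_im_start.end());
prompt_tokens.insert(prompt_tokens.end(), tok_assistant.begin(), tok_assistant.end());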

@tobi (Collaborator) commented Nov 23, 2023

Thank you. Yes, this will make a huge difference.

One idea for the chat format: we could look into baking the chat templates into the gguf file metadata. Hugging Face Transformers now stores the chat format as metadata in the tokenizer_config.json file, according to https://huggingface.co/docs/transformers/chat_templating.

Sadly it's some god-awful Jinja syntax, but since we have a lot of EBNF experts here, we could have a go at parsing a subset of it. That would definitely make things future-proof. There are a lot of goofy template styles floating around, although ChatML is definitely the right standard.

@Azeirah (Contributor) commented Nov 23, 2023

Hey, this is a really cool addition!

I tried using it with a client in ObsidianMD, but I'm getting CORS issues. I know that api_like_OAI.py configures CORS with Flask.

httplib doesn't offer explicit CORS support, but it is possible to configure an OPTIONS handler: yhirose/cpp-httplib#62 (comment)
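For example, a rough sketch of a preflight handler with cpp-httplib (illustrative only, not part of this PR; svr stands for the server's httplib::Server instance, and the header values are permissive placeholders):

// Illustrative sketch: answer preflight requests so browser-based clients
// (e.g. an Obsidian plugin) can call the API cross-origin.
svr.Options(R"(/v1/.*)", [](const httplib::Request &, httplib::Response & res) {
    res.set_header("Access-Control-Allow-Origin",  "*");
    res.set_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
    res.set_header("Access-Control-Allow-Headers", "Content-Type, Authorization");
    res.status = 204; // preflight responses need headers, not a body
});
// The same Access-Control-Allow-Origin header also has to be set on the actual
// POST /v1/chat/completions responses.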

@tobi (Collaborator) commented Nov 24, 2023

I threw a few different local software pieces and libraries at it and it works really well.

Two things I think need doing:

  • For sanity's sake, let's mount the endpoints both with the v1/ prefix and without it, to reduce the potential for manual error.
  • Let's implement the v1/models endpoint. A bunch of tools hit it to check whether the endpoint is working. (A rough sketch of both points follows this comment.)

Thanks again!
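A rough sketch of both points (illustrative only, not the eventual implementation; handle_chat_completions is a placeholder name for the PR's existing handler, and the model id is made up):

// Illustrative sketch. Mount the chat endpoint both with and without the v1/ prefix.
svr.Post("/v1/chat/completions", handle_chat_completions);
svr.Post("/chat/completions",    handle_chat_completions);

// Minimal /v1/models response so clients can probe whether the server is up.
svr.Get("/v1/models", [](const httplib::Request &, httplib::Response & res) {
    const json models = {
        {"object", "list"},
        {"data", json::array({
            {{"id", "local-model"}, {"object", "model"}, {"owned_by", "llama.cpp"}}
        })}
    };
    res.set_content(models.dump(), "application/json");
});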

@kir-gadjello (Author)

@tobi @shibe2 Thank you for trying this out and proposing improvements. The thing is, I'm very busy with my main project right now, so I won't be able to return to this PR to implement a proper ChatML templating engine and the /models endpoint within the next week. If we need to mainline this sooner, I'd be happy to accept commits into this branch 🙏

@ggerganov (Owner)

I continued this in a follow-up branch: #4198

I propose that we enable special tokens by default for now, because I think handling ChatML is more important than the few edge cases where the input text contains the literal string "<|im_start|>". We will fix this later as discussed above.

I will create a new issue to list the rest of the proposed improvements and look for contributors to help with the implementation.

@cebtenzzre (Collaborator)

Closing in favor of #4198
