Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160
Nice addition!
- Use 4 space indentation
- Column width is 120 characters
examples/server/server.cpp
```cpp
std::string format_chatml(std::vector<json> messages) {
    std::ostringstream chatml_msgs;

    // iterate the array
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}
```
I wonder if we should enable tokenization of special tokens by default in the server example:

llama.cpp/examples/server/server.cpp, lines 610 to 653 in 8e672ef:
```cpp
std::vector<llama_token> tokenize(const json & json_prompt, bool add_bos) const
{
    // If `add_bos` is true, we only add BOS, when json_prompt is a string,
    // or the first element of the json_prompt array is a string.
    std::vector<llama_token> prompt_tokens;
    if (json_prompt.is_array())
    {
        bool first = true;
        for (const auto& p : json_prompt)
        {
            if (p.is_string())
            {
                auto s = p.template get<std::string>();
                std::vector<llama_token> p;
                if (first)
                {
                    p = ::llama_tokenize(ctx, s, add_bos);
                    first = false;
                }
                else
                {
                    p = ::llama_tokenize(ctx, s, false);
                }
                prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
            }
            else
            {
                if (first)
                {
                    first = false;
                }
                prompt_tokens.push_back(p.template get<llama_token>());
            }
        }
    }
    else
    {
        auto s = json_prompt.template get<std::string>();
        prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
    }
    return prompt_tokens;
}
```
Here, the calls to `llama_tokenize` currently default to `special == false`, because there are some edge cases where the user could be using HTML tags such as `<s>`, and these would be tokenized incorrectly as special tokens.

Technically, the correct solution is the one we have in `main`:

llama.cpp/examples/main/main.cpp, lines 787 to 792 in 8e672ef:
```cpp
const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp).c_str());
```
But this seems a bit tricky to implement in `server`.

I think there is no point in sacrificing ChatML compatibility just for that extreme edge case, so maybe we should use `special == true` in `server`.

@staviq and others, any thoughts?
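For concreteness, here is a rough sketch of what that could look like in the server's tokenize() helper, simply forwarding a `special` flag to `::llama_tokenize` the same way main.cpp does; this is an illustration, not the final patch:

```cpp
// Sketch (not the actual patch): same logic as the tokenize() helper quoted above,
// but with a `special` flag that is forwarded to ::llama_tokenize, mirroring main.cpp.
std::vector<llama_token> tokenize(const json & json_prompt, bool add_bos, bool special) const
{
    std::vector<llama_token> prompt_tokens;

    if (json_prompt.is_array())
    {
        bool first = true;
        for (const auto & elem : json_prompt)
        {
            if (elem.is_string())
            {
                const auto s = elem.template get<std::string>();
                // BOS only for the very first element; `special` applies to every string piece
                const auto toks = ::llama_tokenize(ctx, s, first && add_bos, special);
                prompt_tokens.insert(prompt_tokens.end(), toks.begin(), toks.end());
            }
            else
            {
                prompt_tokens.push_back(elem.template get<llama_token>());
            }
            first = false;
        }
    }
    else
    {
        const auto s = json_prompt.template get<std::string>();
        prompt_tokens = ::llama_tokenize(ctx, s, add_bos, special);
    }

    return prompt_tokens;
}
```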
Can we enable it only in the context of /v1/chat/completions (and maybe other future) OpenAI-compatible APIs? Note that I added an `oaicompat` field to the slot, so it should be easy to do a `tokenize(..., special = slot.oaicompat)`.
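A minimal sketch of that wiring, assuming tokenize() grows a `special` parameter as sketched above; `slot.prompt` and the exact call site are placeholders for illustration:

```cpp
// Only OpenAI-compatible requests (/v1/chat/completions) opt in to special-token
// parsing; plain /completion requests keep the current behaviour.
// `slot.oaicompat` is the field mentioned above, `slot.prompt` is a placeholder.
std::vector<llama_token> prompt_tokens = tokenize(slot.prompt,
                                                  /*add_bos=*/ true,
                                                  /*special=*/ slot.oaicompat);
```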
Can we update tokenize to accept a whitelist of allowed special tokens? ChatML tokens are special due to the "|" character and won't clash with HTML tags, so allowing them, and only them, should do no harm.
Preferably, I would leave this decision & implementation to core contributors.
There are many options and we can do anything we like, though it's a bit difficult for me to make the decision. There are so many variations of these templates and tokenization is incredibly over-complicated in general.
> Can we enable it only in the context of /v1/chat/completions (and maybe other future) OpenAI compatible APIs?

Yes, let's do that.

> Can we update tokenize to be able to give it a whitelist of allowed special tokens?

Not sure if this is needed. What issue would it fix?
The model already knows internally which are the special tokens.
I fixed the code style, though I'm not sure about some places. I hope clang-format will eventually solve this; right now it produces a global patch, so some core contributor will have to do it instead of me.
A disadvantage of processing special tokens in this way is that the model then cannot distinguish between, say, a special token such as <|im_start|> and the same character sequence appearing literally in the user-supplied text.
```cpp
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
```
I think that content should not be mixed with message prefixes and suffixes. If content happens to contain the strings "<|im_start|>" or "<|im_end|>", they should not be converted to special tokens. A better way to format to ChatML is to create an array as described in the documentation of the `/completion` endpoint.
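For illustration, a sketch of building such a mixed prompt array with nlohmann::json, in the spirit of the tokenize() snippet above; `json_value` is the helper already used in server.cpp, while `im_start_id` / `im_end_id` are placeholder token ids that would have to come from the model's vocabulary:

```cpp
// Sketch: emit ChatML as a mixed array of explicit special-token ids and plain
// strings, so user content is never scanned for "<|im_start|>" / "<|im_end|>".
json build_chatml_prompt(const std::vector<json> & messages,
                         llama_token im_start_id, llama_token im_end_id) {
    json prompt = json::array();
    for (const auto & msg : messages) {
        prompt.push_back(im_start_id);                                  // <|im_start|> as a token id
        prompt.push_back(json_value(msg, "role", std::string("user")) + "\n");
        prompt.push_back(json_value(msg, "content", std::string("")));  // tokenized without specials
        prompt.push_back(im_end_id);                                    // <|im_end|> as a token id
        prompt.push_back(std::string("\n"));
    }
    prompt.push_back(im_start_id);
    prompt.push_back(std::string("assistant\n"));
    return prompt;
}
```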
Thank you. Yes, this will make a huge difference. One idea for the chat format is that we could see if we could bake the chat templates into the gguf file metadata. Huggingface Transformers now ships the chat format as metadata in the tokenizer_config.json file, according to https://huggingface.co/docs/transformers/chat_templating. Sadly it's some god awful Jinja syntax, but since we have a lot of EBNF experts here we may have a go at parsing a subset of it. That would definitely make things future-proof; there are a lot of goofy template styles floating around, although ChatML is definitely the right standard.
Hey, this is a really cool addition! I tried using it with a client in ObsidianMD, but I'm getting CORS issues. I know api_like_OAI.py configures CORS with Flask. httplib doesn't offer explicit CORS support, but it is possible to configure an OPTIONS handler: yhirose/cpp-httplib#62 (comment)
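For what it's worth, here is a minimal sketch of how the server could answer CORS preflight requests with cpp-httplib; the allowed methods and headers below are assumptions for illustration, not a vetted policy:

```cpp
#include "httplib.h"

// Sketch: attach permissive CORS headers to every response and answer
// preflight OPTIONS requests, roughly what api_like_OAI.py does via Flask.
static void enable_cors(httplib::Server & svr) {
    svr.set_default_headers({
        { "Access-Control-Allow-Origin",  "*" },
        { "Access-Control-Allow-Methods", "GET, POST, OPTIONS" },
        { "Access-Control-Allow-Headers", "Content-Type, Authorization" },
    });
    svr.Options(R"(.*)", [](const httplib::Request &, httplib::Response & res) {
        res.status = 204; // preflight is satisfied by the default headers above
    });
}
```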
I threw a few different local software pieces and libraries at it and it works really great. Two things I think need doing: a proper chat-template engine (rather than hard-coded ChatML), and a /models endpoint.
Thanks again!
@tobi @shibe2 Thank you for trying this out and proposing improvements. Thing is, I'm very busy with my main project right now, so I won't be able to return to this PR to implement a proper ChatML templating engine and a /models endpoint within the next week. If we need to mainline this sooner, I'd be happy to accept commits into this branch 🙏
I continued this into a local branch: #4198. I propose that we enable special tokens by default for now, because I think handling ChatML is more important than the few edge cases where the input text would contain the string "<|im_start|>". We will fix this later as discussed above. I will create a new issue to list the rest of the proposed improvements and look for contributors to help out with the implementation.
Closing in favor of #4198
This PR allows one to use OpenAI ChatGPT API-compatible software right away after building just the server binary, without wrappers like LocalAI or api_like_OAI.py. I think merging this will remove friction around one of the most popular llama.cpp use cases and might drive adoption up.

It works: I use this code with several projects, including https://github.com/talkingwallace/ChatGPT-Paper-Reader and third-party ChatGPT UIs. Both streaming and synchronous API modes are supported and tested by yours truly in real applications.

For now, only ChatML-based models are supported, but it's just a matter of adding different prompt formatters, and ChatML is becoming the de facto standard anyway.

No breaking change is made to the existing codebase: it adds a completely separate, optional API endpoint and a few state variables; only server.cpp was modified.
After merging this we could work on replicating other OAI APIs such as vision.
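For reference, a sketch of the request shape the endpoint is meant to accept (the standard OpenAI chat-completions body, built here with nlohmann::json); which optional fields are actually honored is up to the implementation, and the model name is just a placeholder:

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: an OpenAI-style body for POST /v1/chat/completions. Existing ChatGPT
// clients already send this shape; "stream": true would select SSE streaming.
json make_chat_request() {
    return json{
        { "model", "gpt-3.5-turbo" },   // placeholder; the server uses whatever model it loaded
        { "messages", json::array({
            { { "role", "system" }, { "content", "You are a helpful assistant." } },
            { { "role", "user" },   { "content", "Hello!" } },
        }) },
        { "stream", false },
    };
}
```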