feat: OpenAI-Compatible models, completions, chat/completions #1894
@@ -442,6 +442,217 @@ class EmbeddingsResponse(BaseModel):
    embeddings: List[List[float]]


@json_schema_type
class OpenAIUserMessageParam(BaseModel):
    """A message from the user in an OpenAI-compatible chat completion request.

    :param role: Must be "user" to identify this as a user message
    :param content: The content of the message, which can include text and other media
    :param name: (Optional) The name of the user message participant.
    """

    role: Literal["user"] = "user"
    content: InterleavedContent
    name: Optional[str] = None


@json_schema_type
class OpenAISystemMessageParam(BaseModel):
    """A system message providing instructions or context to the model.

    :param role: Must be "system" to identify this as a system message
    :param content: The content of the "system prompt". If multiple system messages are provided, they are concatenated. The underlying Llama Stack code may also add other system messages (for example, for formatting tool definitions).
    :param name: (Optional) The name of the system message participant.
    """

    role: Literal["system"] = "system"
    content: InterleavedContent
    name: Optional[str] = None


@json_schema_type
class OpenAIAssistantMessageParam(BaseModel):
    """A message containing the model's (assistant) response in an OpenAI-compatible chat completion request.

    :param role: Must be "assistant" to identify this as the model's response
    :param content: The content of the model's response
    :param name: (Optional) The name of the assistant message participant.
    :param tool_calls: List of tool calls. Each tool call is a ToolCall object.
    """

    role: Literal["assistant"] = "assistant"
    content: InterleavedContent
    name: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = Field(default_factory=list)


@json_schema_type
class OpenAIToolMessageParam(BaseModel):
    """A message representing the result of a tool invocation in an OpenAI-compatible chat completion request.

    :param role: Must be "tool" to identify this as a tool response
    :param tool_call_id: Unique identifier for the tool call this response is for
    :param content: The response content from the tool
    """

    role: Literal["tool"] = "tool"
    tool_call_id: str
    content: InterleavedContent


@json_schema_type
class OpenAIDeveloperMessageParam(BaseModel):
    """A message from the developer in an OpenAI-compatible chat completion request.

    :param role: Must be "developer" to identify this as a developer message
    :param content: The content of the developer message
    :param name: (Optional) The name of the developer message participant.
    """

    role: Literal["developer"] = "developer"
    content: InterleavedContent
    name: Optional[str] = None


OpenAIMessageParam = Annotated[
    Union[
        OpenAIUserMessageParam,
        OpenAISystemMessageParam,
        OpenAIAssistantMessageParam,
        OpenAIToolMessageParam,
        OpenAIDeveloperMessageParam,
    ],
    Field(discriminator="role"),
]
register_schema(OpenAIMessageParam, name="OpenAIMessageParam")
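These message classes form a discriminated union keyed on `role`. As a quick illustration (not part of the PR diff), here is how raw dicts would be routed to the right class under pydantic v2; it assumes `InterleavedContent` accepts a plain string, as it does elsewhere in Llama Stack.

```python
# Hypothetical usage sketch, not from the PR: validating plain dicts against the
# OpenAIMessageParam union. Assumes pydantic v2's TypeAdapter and that
# InterleavedContent accepts a bare string.
from pydantic import TypeAdapter

adapter = TypeAdapter(OpenAIMessageParam)

system = adapter.validate_python({"role": "system", "content": "You are helpful."})
user = adapter.validate_python({"role": "user", "content": "Hello!"})

assert isinstance(system, OpenAISystemMessageParam)  # selected via role="system"
assert isinstance(user, OpenAIUserMessageParam)      # selected via role="user"
```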
@json_schema_type
class OpenAITopLogProb(BaseModel):
    """The top log probability for a token from an OpenAI-compatible chat completion response.

    :param token: The token
    :param bytes: (Optional) The bytes for the token
    :param logprob: The log probability of the token
    """

    token: str
    bytes: Optional[List[int]] = None
    logprob: float


@json_schema_type
class OpenAITokenLogProb(BaseModel):
    """The log probability for a token from an OpenAI-compatible chat completion response.

    :param token: The token
    :param bytes: (Optional) The bytes for the token
    :param logprob: The log probability of the token
    :param top_logprobs: The top log probabilities for the token
    """

    token: str
    bytes: Optional[List[int]] = None
    logprob: float
    top_logprobs: List[OpenAITopLogProb]
@json_schema_type
class OpenAIChoiceLogprobs(BaseModel):
    """The log probabilities for the tokens in the message from an OpenAI-compatible chat completion response.

    :param content: (Optional) The log probabilities for the tokens in the message
    :param refusal: (Optional) The log probabilities for the tokens in the refusal message
    """

    content: Optional[List[OpenAITokenLogProb]] = None
    refusal: Optional[List[OpenAITokenLogProb]] = None
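To make the nesting concrete, here is a small, made-up instance (not from the PR) showing how a per-token entry and its top-k candidates fit inside `OpenAIChoiceLogprobs`:

```python
# Illustrative only; the tokens and log probabilities are invented values.
logprobs = OpenAIChoiceLogprobs(
    content=[
        OpenAITokenLogProb(
            token="Hello",
            logprob=-0.05,
            top_logprobs=[
                OpenAITopLogProb(token="Hello", logprob=-0.05),
                OpenAITopLogProb(token="Hi", logprob=-3.2),
            ],
        )
    ]
)
```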
@json_schema_type
class OpenAIChoice(BaseModel):
    """A choice from an OpenAI-compatible chat completion response.

    :param message: The message from the model
    :param finish_reason: The reason the model stopped generating
    :param index: The index of the choice
    :param logprobs: (Optional) The log probabilities for the tokens in the message
    """

    message: OpenAIMessageParam
    finish_reason: str
    index: int
    logprobs: Optional[OpenAIChoiceLogprobs] = None


@json_schema_type
class OpenAIChatCompletion(BaseModel):
    """Response from an OpenAI-compatible chat completion request.

    :param id: The ID of the chat completion
    :param choices: List of choices
    :param object: The object type, which will be "chat.completion"
    :param created: The Unix timestamp in seconds when the chat completion was created
    :param model: The model that was used to generate the chat completion
    """

    id: str
    choices: List[OpenAIChoice]
    object: Literal["chat.completion"] = "chat.completion"
    created: int
    model: str
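For reference, a minimal response that validates against these models could look like the following sketch; the id, timestamp, and model name are placeholders, not values produced by this PR.

```python
# Hypothetical example of the response shape; all field values are made up.
example = OpenAIChatCompletion(
    id="chatcmpl-123",
    created=1712345678,
    model="example-model-id",
    choices=[
        OpenAIChoice(
            index=0,
            finish_reason="stop",
            message=OpenAIAssistantMessageParam(content="Hello! How can I help?"),
        )
    ],
)
assert example.object == "chat.completion"  # pinned by the Literal default
```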
@json_schema_type
class OpenAICompletionLogprobs(BaseModel):
    """The log probabilities for the tokens in the message from an OpenAI-compatible completion response.

    :param text_offset: (Optional) The offset of the token in the text
    :param token_logprobs: (Optional) The log probabilities for the tokens
    :param tokens: (Optional) The tokens
    :param top_logprobs: (Optional) The top log probabilities for the tokens
    """

    text_offset: Optional[List[int]] = None
    token_logprobs: Optional[List[float]] = None
    tokens: Optional[List[str]] = None
    top_logprobs: Optional[List[Dict[str, float]]] = None


@json_schema_type
class OpenAICompletionChoice(BaseModel):
    """A choice from an OpenAI-compatible completion response.

    :param finish_reason: The reason the model stopped generating
    :param text: The text of the choice
    :param index: The index of the choice
    :param logprobs: (Optional) The log probabilities for the tokens in the choice
    """

    finish_reason: str
    text: str
    index: int
    logprobs: Optional[OpenAIChoiceLogprobs] = None


@json_schema_type
class OpenAICompletion(BaseModel):
    """Response from an OpenAI-compatible completion request.

    :param id: The ID of the completion
    :param choices: List of choices
    :param created: The Unix timestamp in seconds when the completion was created
    :param model: The model that was used to generate the completion
    :param object: The object type, which will be "text_completion"
    """

    id: str
    choices: List[OpenAICompletionChoice]
    created: int
    model: str
    object: Literal["text_completion"] = "text_completion"


class ModelStore(Protocol):
    async def get_model(self, identifier: str) -> Model: ...
@@ -564,3 +775,105 @@ async def embeddings(
        :returns: An array of embeddings, one for each content. Each embedding is a list of floats. The dimensionality of the embedding is model-specific; you can check model metadata using /models/{model_id}
        """
        ...

    @webmethod(route="/openai/v1/completions", method="POST")
    async def openai_completion(
Contributor
I wonder if we should have this under apis/openai/ so that OpenAI-related things are in one place.

Collaborator (Author)
That's reasonable, and I went back and forth a bit here myself. I put the OpenAI models API endpoint under our models.py file and the OpenAI inference endpoints under our inference.py file simply because they mapped nicely onto existing constructs. But I don't have a strong preference there.
        self,
        # Standard OpenAI completion parameters
        model: str,
        prompt: Union[str, List[str], List[int], List[List[int]]],
        best_of: Optional[int] = None,
        echo: Optional[bool] = None,
        frequency_penalty: Optional[float] = None,
        logit_bias: Optional[Dict[str, float]] = None,
        logprobs: Optional[bool] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = None,
        presence_penalty: Optional[float] = None,
        seed: Optional[int] = None,
        stop: Optional[Union[str, List[str]]] = None,
        stream: Optional[bool] = None,
        stream_options: Optional[Dict[str, Any]] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        user: Optional[str] = None,
        # vLLM-specific parameters
        guided_choice: Optional[List[str]] = None,
        prompt_logprobs: Optional[int] = None,
    ) -> OpenAICompletion:
        """Generate an OpenAI-compatible completion for the given prompt using the specified model.

        :param model: The identifier of the model to use. The model must be registered with Llama Stack and available via the /models endpoint.
        :param prompt: The prompt to generate a completion for
        :param best_of: (Optional) The number of completions to generate
        :param echo: (Optional) Whether to echo the prompt
        :param frequency_penalty: (Optional) The penalty for repeated tokens
        :param logit_bias: (Optional) The logit bias to use
        :param logprobs: (Optional) The log probabilities to use
        :param max_tokens: (Optional) The maximum number of tokens to generate
        :param n: (Optional) The number of completions to generate
        :param presence_penalty: (Optional) The penalty for repeated tokens
        :param seed: (Optional) The seed to use
        :param stop: (Optional) The stop tokens to use
        :param stream: (Optional) Whether to stream the response
        :param stream_options: (Optional) The stream options to use
        :param temperature: (Optional) The temperature to use
        :param top_p: (Optional) The top p to use
        :param user: (Optional) The user to use
        :param guided_choice: (Optional) vLLM-specific parameter to constrain the output to one of the given choices
        :param prompt_logprobs: (Optional) vLLM-specific parameter to include log probabilities for prompt tokens
        """
        ...
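Because this route mirrors the OpenAI REST shape, a stock OpenAI client can target it directly. The sketch below is illustrative only; the base URL, API key, and model id are placeholders rather than values defined by this PR.

```python
# Hypothetical client-side call against the /openai/v1/completions route above.
# base_url, api_key, and model are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/openai/v1", api_key="none")

response = client.completions.create(
    model="example-model-id",
    prompt="Say hello in one short sentence.",
    max_tokens=32,
)
print(response.choices[0].text)
```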
    @webmethod(route="/openai/v1/chat/completions", method="POST")
    async def openai_chat_completion(
        self,
        model: str,
        messages: List[OpenAIMessageParam],
        frequency_penalty: Optional[float] = None,
        function_call: Optional[Union[str, Dict[str, Any]]] = None,
        functions: Optional[List[Dict[str, Any]]] = None,
        logit_bias: Optional[Dict[str, float]] = None,
        logprobs: Optional[bool] = None,
        max_completion_tokens: Optional[int] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = None,
        parallel_tool_calls: Optional[bool] = None,
        presence_penalty: Optional[float] = None,
        response_format: Optional[Dict[str, str]] = None,
        seed: Optional[int] = None,
        stop: Optional[Union[str, List[str]]] = None,
        stream: Optional[bool] = None,
        stream_options: Optional[Dict[str, Any]] = None,
        temperature: Optional[float] = None,
        tool_choice: Optional[Union[str, Dict[str, Any]]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
Contributor
Should we define a tool type for this?
        top_logprobs: Optional[int] = None,
        top_p: Optional[float] = None,
        user: Optional[str] = None,
    ) -> OpenAIChatCompletion:
Contributor
Is this correctly typed for streaming?

Collaborator (Author)
No, it's not. The type doesn't cover the streaming case at all, so even though streaming works in practice when this API is used from OpenAI clients, the typing for streaming isn't handled yet.
        """Generate an OpenAI-compatible chat completion for the given messages using the specified model.

        :param model: The identifier of the model to use. The model must be registered with Llama Stack and available via the /models endpoint.
        :param messages: List of messages in the conversation
        :param frequency_penalty: (Optional) The penalty for repeated tokens
        :param function_call: (Optional) The function call to use
        :param functions: (Optional) List of functions to use
        :param logit_bias: (Optional) The logit bias to use
        :param logprobs: (Optional) The log probabilities to use
        :param max_completion_tokens: (Optional) The maximum number of tokens to generate
        :param max_tokens: (Optional) The maximum number of tokens to generate
        :param n: (Optional) The number of completions to generate
        :param parallel_tool_calls: (Optional) Whether to parallelize tool calls
        :param presence_penalty: (Optional) The penalty for repeated tokens
        :param response_format: (Optional) The response format to use
        :param seed: (Optional) The seed to use
        :param stop: (Optional) The stop tokens to use
        :param stream: (Optional) Whether to stream the response
        :param stream_options: (Optional) The stream options to use
        :param temperature: (Optional) The temperature to use
        :param tool_choice: (Optional) The tool choice to use
        :param tools: (Optional) The tools to use
        :param top_logprobs: (Optional) The top log probabilities to use
        :param top_p: (Optional) The top p to use
        :param user: (Optional) The user to use
        """
        ...
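As with completions, the endpoint can be exercised with the stock OpenAI client. The sketch below (placeholders for URL and model id) also shows `stream=True`, which works in practice even though, as noted in the review thread above, the declared return type does not yet model the streaming case.

```python
# Hypothetical client-side usage of /openai/v1/chat/completions; placeholders only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/openai/v1", api_key="none")

# Non-streaming: the response is shaped like OpenAIChatCompletion.
chat = client.chat.completions.create(
    model="example-model-id",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Name one fact about llamas."},
    ],
)
print(chat.choices[0].message.content)

# Streaming: iterate over chunks as they arrive.
for chunk in client.chat.completions.create(
    model="example-model-id",
    messages=[{"role": "user", "content": "Count to three."}],
    stream=True,
):
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```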
Should we just import from openai.types.chat as we did in openai_compat.py?

Collaborator (Author)
I actually started with that. However, the API codegen wasn't able to run successfully with those types. I don't recall the exact errors now, but I can try an example out in a bit just to document what the actual issue was. A secondary concern is whether we want direct control over the public-facing API of Llama Stack, or whether we want to let new versions of the OpenAI Python client impact our API surface.

Collaborator (Author)
Here's an example of the kinds of errors the API spec codegen throws when using any of the OpenAI Python client's types in our API:

It's probably solvable, but something about how the OpenAI types use ClassVar isn't liked by the strong_typing code in Llama Stack.
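For context, the import-based alternative discussed above would look roughly like the sketch below (assuming a recent `openai` Python package). It is not what this PR does, precisely because of the codegen limitation described in the last comment.

```python
# Hypothetical sketch of reusing the OpenAI client's own types instead of the
# Llama Stack copies defined in this PR. Not the approach taken here.
from typing import List, Optional, Protocol

from openai.types.chat import ChatCompletion, ChatCompletionMessageParam


class Inference(Protocol):
    async def openai_chat_completion(
        self,
        model: str,
        messages: List[ChatCompletionMessageParam],
        stream: Optional[bool] = None,
    ) -> ChatCompletion: ...
```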