feat: OpenAI-Compatible models, completions, chat/completions #1894
Conversation
This is just a draft that I'll keep pushing on a bit to see how hard it would be to get OpenAI-compatible completion and chat completion endpoints across all existing providers. Most of the work will come in the translation layer for the providers that don't use OpenAI clients internally. This also paves the way to implementing OpenAI Responses API compatibility, but starting with Models, Completions, and Chat Completions first for the sake of maximum compatibility with existing clients in the wild.
Thanks for the head start! Please keep me posted on how this goes
I'll start on some basic tests here, but note that I can't really write the typical inference integration tests until there's a release of the llama stack client python library with these API changes, since our integration tests use that to hit the inference endpoint.
Nevermind my comment about not being able to write integration tests yet. It dawned on me that I can of course just use an OpenAI client to test these new APIs, so I've started adding those.
Is it possible to support extra_body so that we can pass parameters that are not part of the OpenAI API but are supported by others like vLLM? For example, vLLM has a sampling parameter called prompt_logprobs that returns logprobs for prompt tokens (the 'logprobs' parameter only returns them for generated tokens). OpenAI has no 'prompt_logprobs' parameter, so I'm using extra_body to pass it today. This would require changes to the response class to support prompt_logprobs, since vLLM returns these prompt logprobs under a 'prompt_logprobs' key in the choices list.
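For context on the mechanism being asked about: the OpenAI python client lets callers pass non-standard fields through to the request body via extra_body. A minimal sketch of that pattern against a plain vLLM server (the model name and port are assumptions for illustration, not taken from the comment):

```python
# Sketch: pass the vLLM-specific "prompt_logprobs" parameter through the
# OpenAI client's extra_body escape hatch; vLLM then includes a
# "prompt_logprobs" entry alongside each choice in the response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    prompt="Hello, my name is",
    max_tokens=16,
    extra_body={"prompt_logprobs": 0},
)
print(response.choices[0].text)
```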
This stubs in some OpenAI server-side compatibility with three new endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like http://localhost:8321/v1/openai/v1 . The two "v1" instances in there aren't awesome, but the thinking is that Llama Stack's API is v1 and our OpenAI compatibility layer is compatible with OpenAI v1. And, since some OpenAI clients implicitly assume the URL ends with "v1", this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer and just returns all the models Llama Stack knows about. The chat endpoints are only actually implemented for the remote-vllm provider right now, and it just proxies the completion and chat completion requests to the backend vLLM. The goal is to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses.
Importing the models from the OpenAI client library required a top-level dependency on the openai python package, and was also incompatible with our API generation code due to some quirks in how the OpenAI pydantic models are defined. So, this creates our own stubs of those pydantic models so that we're in more direct control of our API surface for this OpenAI-compatible API, it works with our code generation, and the openai python client isn't a hard requirement of Llama Stack's API.
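As an illustration only (not the PR's actual definitions), a stripped-down stub in this style might look like the following, mirroring the shape of an OpenAI completion response:

```python
# Hypothetical, trimmed-down sketch of the "own pydantic stubs" approach;
# the real models in the PR carry more fields (logprobs, usage, etc.).
from typing import List, Optional

from pydantic import BaseModel


class OpenAICompletionChoice(BaseModel):
    index: int
    text: str
    finish_reason: Optional[str] = None


class OpenAICompletion(BaseModel):
    id: str
    object: str = "text_completion"
    created: int
    model: str
    choices: List[OpenAICompletionChoice]
```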
This adds OpenAI-compatible completions and chat completions support for the native Together provider as well as all providers implemented with litellm.
Signed-off-by: Ben Browning <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
The OpenAI completion prompt field can be a string or an array, so update things to use and pass that properly. This also stubs in a basic conversion of OpenAI non-streaming completion requests to Llama Stack completion calls, for those providers that don't actually have an OpenAI backend to allow them to still accept requests via the OpenAI APIs. Signed-off-by: Ben Browning <[email protected]>
The OpenAI completion API supports strings, array of strings, array of tokens, or array of token arrays. So, expand our type hinting to support all of these types. Signed-off-by: Ben Browning <[email protected]>
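A sketch of the resulting type hint (the alias name here is an assumption, used only for illustration):

```python
from typing import List, Union

# The OpenAI completions "prompt" field: a string, a list of strings,
# a list of tokens, or a list of token lists.
OpenAICompletionPrompt = Union[str, List[str], List[int], List[List[int]]]
```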
Signed-off-by: Ben Browning <[email protected]>
This starts to stub in some integration tests for the OpenAI-compatible server APIs using an OpenAI client. Signed-off-by: Ben Browning <[email protected]>
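A hedged sketch of what one of these tests can look like; the model name and assertions below are assumptions for illustration rather than the PR's actual test code:

```python
# Illustrative integration test: point an OpenAI client at the Llama Stack
# OpenAI-compatible base URL and exercise the completions endpoint.
from openai import OpenAI


def test_openai_completion_non_streaming():
    client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
    response = client.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        prompt="Which planet do humans live on?",
        max_tokens=50,
    )
    assert len(response.choices) > 0
    assert response.choices[0].text.strip() != ""
```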
When called via the OpenAI API, ollama is responding with more brief responses than when called via its native API. This adjusts the prompting for its OpenAI calls to ask it to be more verbose.
This adds the vLLM-specific extra_body parameters of prompt_logprobs and guided_choice to our openai_completion inference endpoint. The plan here would be to expand this to support all common optional parameters of any of the OpenAI providers, allowing each provider to use or ignore these parameters based on whether their server supports them. Signed-off-by: Ben Browning <[email protected]>
@saichandrapandraju I've wired in initial support for extra_body params for prompt_logprobs and guided_choice. Just an implementation note: on the server side these are top-level parameters I have to add to the openai_completion endpoint signature, since the OpenAI client merges extra_body fields directly into the request body.
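A rough sketch of what that looks like on the signature side; the parameter list and the names of the surrounding fields are assumptions, not the PR's actual code:

```python
from typing import List, Optional, Union


# Hypothetical shape of the server-side method once the vLLM-specific extras
# are surfaced as top-level optional parameters; providers that don't support
# them can simply ignore them. (In the real code this lives on the provider class.)
async def openai_completion(
    self,
    model: str,
    prompt: Union[str, List[str], List[int], List[List[int]]],
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    # vLLM-specific extra_body parameters:
    prompt_logprobs: Optional[int] = None,
    guided_choice: Optional[List[str]] = None,
) -> "OpenAICompletion": ...
```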
Signed-off-by: Ben Browning <[email protected]>
This adjusts the vllm openai_completion endpoint to also pass a value of 0 for prompt_logprobs, instead of only passing values greater than zero to the backend. The existing test_openai_completion_prompt_logprobs was parameterized to test this case as well. Signed-off-by: Ben Browning <[email protected]>
This is fantastic. Let me / @ehhuang know how we can help here. I will take a look at this in detail tonight or tomorrow.
I updated the PR description with the list of providers that may be working, marking which ones I've tested (remote::vllm and remote::ollama) and which ones I haven't. I also updated the commands I use to run the currently basic test suite against those providers, so that others can see how they'd run the tests against other providers.

Initial conversion of non-streaming completion API calls to Llama Stack inference calls is done as a fallback option for providers that don't natively support OpenAI APIs. I'll add the same non-streaming fallback logic for chat completions soon. Feedback on whether we want to do this translation and extend it to also support streaming, or whether we'd just prefer to raise an error stating that OpenAI APIs are unsupported for specific providers, would be useful. For now, I'm erring on the side of trying to make things Just Work for all inference providers, but that may not be trivial.

Also, I haven't yet had a chance to wire up all the providers that do support native OpenAI APIs, and will work on that. This includes databricks, nvidia, runpod, and sambanova specifically. Those are already using OpenAI clients, so it should be quite trivial to wire them up just like I did for vllm, ollama, and together.ai.
This wires up the openai_completion and openai_chat_completion API methods for the remote Nvidia inference provider, and adds it to the chat completions part of the OpenAI test suite. The hosted Nvidia service doesn't actually host any Llama models with a functioning completions endpoint, so for now the test suite only activates the nvidia provider for chat completions. Signed-off-by: Ben Browning <[email protected]>
After actually running the test_openai_completion.py tests against together.ai, turns out there were a couple of bugs in the initial implementation. This fixes those. Signed-off-by: Ben Browning <[email protected]>
This wires up the openai_completion and openai_chat_completion API methods for the remote Fireworks inference provider. Signed-off-by: Ben Browning <[email protected]>
ehhuang left a comment
Great start!
stream_options: Optional[Dict[str, Any]] = None,
temperature: Optional[float] = None,
tool_choice: Optional[Union[str, Dict[str, Any]]] = None,
tools: Optional[List[Dict[str, Any]]] = None,
should we define a tool type for this?
@@ -0,0 +1,216 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
I recently added a test suite to test OpenAI compat endpoints: https://github.com/meta-llama/llama-stack/blob/main/tests/verifications/README.md
We should move the new tests from here over there to consolidate, if possible.
top_logprobs: Optional[int] = None,
top_p: Optional[float] = None,
user: Optional[str] = None,
) -> OpenAIChatCompletion:
Is this correctly typed for streaming?
No, it's not. The type doesn't cover the streaming case at all, so even though streaming works in practice with the API as-is when used from OpenAI clients, the typing for streaming isn't handled yet.
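One possible shape for an annotation that covers both modes; the chunk type and alias names here are assumptions, not from the PR:

```python
from typing import AsyncIterator, Union

from pydantic import BaseModel


class OpenAIChatCompletion(BaseModel):
    """Placeholder standing in for the non-streaming response model."""


class OpenAIChatCompletionChunk(BaseModel):
    """Placeholder standing in for a streaming chunk model."""


# stream=False returns one full completion; stream=True yields chunks.
OpenAIChatCompletionResponse = Union[
    OpenAIChatCompletion,
    AsyncIterator[OpenAIChatCompletionChunk],
]
```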
@webmethod(route="/openai/v1/completions", method="POST")
async def openai_completion(
I wonder if we should have this under apis/openai/ so that OpenAI related things are in one place.
That's reasonable, and I went back-and-forth a bit here myself. I put the OpenAI models API endpoint under our models.py file and the OpenAI inference endpoints under our inference.py file simply because they mapped nicely to existing constructs. But, I don't have a strong preference there.
@json_schema_type
class OpenAICompletion(BaseModel):
Should we just import from openai.types.chat as we did in openai_compat.py?
I actually started with that. However, the API codegen wasn't able to successfully run with those types. I don't recall the exact errors now, but I can try an example out in a bit just to document what the actual issue was there. A secondary concern would be whether we want direct control over the public-facing API of Llama Stack or whether we want to let new versions of the OpenAI python client impact our API surface.
Here's an example of the kinds of errors the API spec codegen throws when using any of the OpenAI python client's types in our API:
Traceback (most recent call last):
File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 91, in <module>
fire.Fire(main)
File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 55, in main
spec = Specification(
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/utility.py", line 29, in __init__
self.document = generator.generate()
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 781, in generate
operation = self._build_operation(op)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 691, in _build_operation
responses = response_builder.build_response(response_options)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 374, in build_response
responses[status_code] = self._build_response(
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 393, in _build_response
content=self.content_builder.build_content(response_type, examples),
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 216, in build_content
return {media_type: self.build_media_type(item_type, examples)}
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 221, in build_media_type
schema = self.schema_builder.classdef_to_ref(item_type)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 135, in classdef_to_ref
type_schema = self.classdef_to_schema(typ)
File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 116, in classdef_to_schema
type_schema, type_definitions = self.schema_generator.classdef_to_schema(typ)
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 612, in classdef_to_schema
types_defined[sub_name] = self._type_to_schema_with_lookup(sub_type)
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 569, in _type_to_schema_with_lookup
type_schema = self.type_to_schema(data_type, force_expand=True)
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 518, in _type_to_schema
property_def = self.type_to_schema(property_type, json_schema_extra=json_schema_extra)
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 495, in _type_to_schema
for property_name, property_type in get_class_properties(typ):
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 571, in get_class_properties
resolved_hints = get_resolved_hints(typ)
File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 557, in get_resolved_hints
return typing.get_type_hints(typ, include_extras=True)
File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 1833, in get_type_hints
value = _eval_type(value, base_globals, base_locals)
File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 327, in _eval_type
return t._evaluate(globalns, localns, recursive_guard)
File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 694, in _evaluate
eval(self.__forward_code__, globalns, localns),
File "<string>", line 1, in <module>
NameError: name 'ClassVar' is not defined
It's probably solvable, but something about how the OpenAI types use ClassVar isn't liked by the strong_typing code in Llama Stack.
FYI I'm building on top of this to enable OAI compat for meta-reference inference
+1. We have tested the changes internally and this would enable our upcoming demos.
Chatted with Eric and Raghu. We feel comfortable merging this in, and iterating on it in origin/main both to indicate to folks that this is imminent and also to avoid unnecessary pain for @bbrowning. Others can also potentially jump in. Unless someone objects in the next hour or so, I will do so. We will try to get a release going later (or over the weekend) potentially as well.
Awesome, thanks for the quick turnaround! Now we should be able to get a lot more real-world testing of this to work out the edge cases.
What does this PR do?
This stubs in some OpenAI server-side compatibility with three new endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there isn't awesome, but the thinking is that Llama Stack's API is v1 and then our OpenAI compatibility layer is compatible with OpenAI V1. And, some OpenAI clients implicitly assume the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just returns all the models Llama Stack knows about.
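For instance (a sketch, not from the PR), those models can be listed with a stock OpenAI client:

```python
# The /v1/openai/v1/models endpoint surfaces every model the Llama Stack
# server knows about, so a plain OpenAI client can discover them.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
for model in client.models.list():
    print(model.id)
```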
The following providers should be working with the new OpenAI completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal is to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat completions, as I have real use-cases that need the older completions API as well.
Test Plan
vLLM

VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
ollama

INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
Documentation
Run a Llama Stack distribution that uses one of the providers mentioned in the list above. Then, use your favorite OpenAI client to send completion or chat completion requests with the base_url set to http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the host and port of your Llama Stack server, if different.
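For example, with the official OpenAI python client (the model name below is an assumption and should match a model registered in your distribution):

```python
# Send a chat completion request through the Llama Stack OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

chat = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)
```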