
Conversation

@bbrowning (Collaborator) commented Apr 8, 2025

What does this PR do?

This stubs in some OpenAI server-side compatibility with three new endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there aren't awesome, but the thinking is that Llama Stack's API is v1, and our OpenAI compatibility layer is compatible with OpenAI v1. Also, some OpenAI clients implicitly assume the URL ends with "v1", so this gives maximum compatibility.

The OpenAI models endpoint is implemented in the routing layer and simply returns all the models Llama Stack knows about.
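For example, here's a minimal sketch of listing models through that endpoint with the standard OpenAI Python client (the host/port and the placeholder API key are assumptions; adjust them for your deployment):

```python
# Minimal sketch: list the models Llama Stack knows about via the new
# OpenAI-compatible models endpoint. The base_url and api_key values are
# placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

for model in client.models.list():
    print(model.id)
```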

The following providers should be working with the new OpenAI completions and chat/completions API:

  • remote::anthropic (untested)
  • remote::cerebras-openai-compat (untested)
  • remote::fireworks (tested)
  • remote::fireworks-openai-compat (untested)
  • remote::gemini (untested)
  • remote::groq-openai-compat (untested)
  • remote::nvidia (tested)
  • remote::ollama (tested)
  • remote::openai (untested)
  • remote::passthrough (untested)
  • remote::sambanova-openai-compat (untested)
  • remote::together (tested)
  • remote::together-openai-compat (untested)
  • remote::vllm (tested)

The goal is to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses.
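As a rough illustration of the mixin idea (not the actual code in this PR - the native chat_completion signature, response attributes, and conversion details below are assumptions for the sketch):

```python
# Illustrative sketch only: a mixin that serves an OpenAI-style chat completion
# request by delegating to a provider's native Llama Stack inference method and
# repackaging the result in an OpenAI-shaped response body. The native method
# name, its parameters, and the response attribute access are assumptions.
from typing import Any, Dict, List, Optional


class OpenAIChatCompletionToLlamaStackMixin:
    async def openai_chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        # Translate the OpenAI-style request into a native inference call
        # (hypothetical signature).
        native = await self.chat_completion(
            model_id=model,
            messages=messages,
            sampling_params={"temperature": temperature, "max_tokens": max_tokens},
        )
        # Repackage the native response as an OpenAI-style chat completion.
        return {
            "id": "chatcmpl-llamastack",
            "object": "chat.completion",
            "model": model,
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": native.completion_message.content},
                    "finish_reason": "stop",
                }
            ],
        }
```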

This is related to #1817 but is a bit larger in scope than just chat completions, as I have real use-cases that need the older completions API as well.

Test Plan

vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

Documentation

Run a Llama Stack distribution that uses one of the providers mentioned in the list above. Then, use your favorite OpenAI client to send completion or chat completion requests with the base_url set to http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the host and port of your Llama Stack server, if different.
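For example, a minimal sketch with the official OpenAI Python client (the api_key is a placeholder, and the model name is an assumption that should match a model registered with your distribution):

```python
# Point any OpenAI client at the Llama Stack server's OpenAI-compatible base URL.
# The api_key is a placeholder; the model name must match one your distribution serves.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```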

@facebook-github-bot added the CLA Signed label Apr 8, 2025
@bbrowning changed the title from "Add OpenAI-Compatible models, completions, chat/completions endpoints" to "feat: OpenAI-Compatible models, completions, chat/completions" Apr 8, 2025
@bbrowning (Collaborator Author) commented:

This is just a draft that I'll keep pushing on a bit to see how hard it would be to get OpenAI compatible completion and chat completion endpoints across all existing providers. Most of the work will come in the translation layer for the providers that don't use OpenAI clients internally.

This also paves the way to implementing OpenAI Responses API compatibility, but starting with Models, Completions, and Chat Completions first for the sake of maximum compatibility with existing clients in the wild.

@terrytangyuan (Contributor) left a comment:

Thanks for the head start! Please keep me posted on how this goes

@bbrowning bbrowning force-pushed the openai_server_compat branch from 86927c7 to 39350db on April 8, 2025
@bbrowning (Collaborator Author) commented:

I'll start on some basic tests here, but note that I can't really write the typical inference integration tests until there's a release of the llama stack client python library with these API changes, since our integration tests use that to hit the inference endpoint.

@bbrowning (Collaborator Author) commented:

Nevermind my comment about not being able to write integration tests yet. It dawned on me that I can of course just use an OpenAI client to test these new APIs, so I've started adding those.

@saichandrapandraju commented:

Is it possible to support extra_body so that we can pass parameters that are not part of the OpenAI API but are specific to other backends like vLLM?

e.g., vLLM has a sampling parameter called prompt_logprobs that returns logprobs for prompt tokens (the 'logprobs' parameter returns them only for generated tokens). OpenAI has no 'prompt_logprobs' parameter, so I'm using extra_body={'prompt_logprobs': 1} to send this to the vLLM server.

This would require changes to the response class to support prompt_logprobs. vLLM returns these prompt logprobs under the 'prompt_logprobs' key in the choices list. Below is a sample response for the input prompt "Dave lives in Palm Coast, FL and is a lawyer. His personal interests include" and vLLM sampling params { "temperature": 0, "prompt_logprobs": 1 }:

```
{'id': 'cmpl-5b44f8444e20400f837777b6b5c6f132',
 'object': 'text_completion',
 'created': 1743818276,
 'model': 'opt',
 'choices': [{'index': 0,
   'text': ' the law, politics, and business. He is a member of the Palm Coast',
   'logprobs': None,
   'finish_reason': 'length',
   'stop_reason': None,
   'prompt_logprobs': [None,
    {'33857': {'logprob': -13.873104095458984,
      'rank': 6057,
      'decoded_token': 'Dave'},
     '100': {'logprob': -1.4258383512496948, 'rank': 1, 'decoded_token': 'I'}},
    {'1074': {'logprob': -10.792854309082031,
      'rank': 2880,
      'decoded_token': ' lives'},
     '219': {'logprob': -2.0438311100006104, 'rank': 1, 'decoded_token': 'y'}},
    {'11': {'logprob': -0.4757663607597351,
      'rank': 1,
      'decoded_token': ' in'}},
    {'7929': {'logprob': -7.719001770019531,
      'rank': 295,
      'decoded_token': ' Palm'},
     '5': {'logprob': -2.083259344100952, 'rank': 1, 'decoded_token': ' the'}},
    {'2565': {'logprob': -4.589813232421875,
      'rank': 4,
      'decoded_token': ' Coast'},
     '2467': {'logprob': -0.6835634708404541,
      'rank': 1,
      'decoded_token': ' Beach'}},
    {'6': {'logprob': -0.5932450890541077, 'rank': 1, 'decoded_token': ','}},
    {'8854': {'logprob': -1.4891626834869385,
      'rank': 1,
      'decoded_token': ' FL'}},
    {'8': {'logprob': -1.737820029258728, 'rank': 3, 'decoded_token': ' and'},
     '4': {'logprob': -0.886257529258728, 'rank': 1, 'decoded_token': '.'}},
    {'16': {'logprob': -1.8477180004119873,
      'rank': 1,
      'decoded_token': ' is'}},
    {'10': {'logprob': -1.1920616626739502, 'rank': 1, 'decoded_token': ' a'}},
    {'2470': {'logprob': -6.246567726135254,
      'rank': 79,
      'decoded_token': ' lawyer'},
     '3562': {'logprob': -2.969223737716675,
      'rank': 1,
      'decoded_token': ' retired'}},
    {'4': {'logprob': -1.4258630275726318, 'rank': 1, 'decoded_token': '.'}},
    {'832': {'logprob': -3.3757128715515137,
      'rank': 4,
      'decoded_token': ' His'},
     '91': {'logprob': -0.8600877523422241,
      'rank': 1,
      'decoded_token': ' He'}},
    {'1081': {'logprob': -5.170414924621582,
      'rank': 32,
      'decoded_token': ' personal'},
     '173': {'logprob': -2.475102424621582,
      'rank': 1,
      'decoded_token': ' work'}},
    {'3168': {'logprob': -3.6241774559020996,
      'rank': 6,
      'decoded_token': ' interests'},
     '301': {'logprob': -2.0382399559020996,
      'rank': 1,
      'decoded_token': ' life'}},
    {'680': {'logprob': -0.6599060893058777,
      'rank': 1,
      'decoded_token': ' include'}}]}],
 'usage': {'prompt_tokens': 17,
  'total_tokens': 33,
  'completion_tokens': 16,
  'prompt_tokens_details': None}}
```
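For context, a minimal sketch of how that parameter is sent from the client side today (targeting a vLLM server directly; the base_url and model name are placeholders taken from the example above):

```python
# Sketch: pass a vLLM-specific sampling parameter through the OpenAI client via
# extra_body. The base_url and model name are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.completions.create(
    model="opt",
    prompt="Dave lives in Palm Coast, FL and is a lawyer. His personal interests include",
    temperature=0,
    extra_body={"prompt_logprobs": 1},
)
print(response.choices[0].text)
```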

bbrowning added 14 commits April 9, 2025 15:47
This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there aren't awesome, but the thinking is
that Llama Stack's API is v1 and then our OpenAI compatibility layer
is compatible with OpenAI V1. And, some OpenAI clients implicitly
assume the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and
just returns all the models Llama Stack knows about.

The chat endpoints are only actually implemented for the remote-vllm
provider right now, and it just proxies the completion and chat
completion requests to the backend vLLM.

The goal is to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API,
we'll add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses
to OpenAI responses.
Importing the models from the OpenAI client library required a
top-level dependency on the openai python package, and also was
incompatible with our API generation code due to some quirks in how
the OpenAI pydantic models are defined.

So, this creates our own stubs of those pydantic models so that we're
in more direct control of our API surface for this OpenAI-compatible
API, so that it works with our code generation, and so that the openai
python client isn't a hard requirement of Llama Stack's API.
This adds OpenAI-compatible completions and chat completions support
for the native Together provider as well as all providers implemented
with litellm.
The OpenAI completion prompt field can be a string or an array, so
update things to use and pass that properly.

This also stubs in a basic conversion of OpenAI non-streaming
completion requests to Llama Stack completion calls, for those
providers that don't actually have an OpenAI backend to allow them to
still accept requests via the OpenAI APIs.

Signed-off-by: Ben Browning <[email protected]>
The OpenAI completion API supports strings, array of strings, array of
tokens, or array of token arrays. So, expand our type hinting to
support all of these types.

Signed-off-by: Ben Browning <[email protected]>
This starts to stub in some integration tests for the
OpenAI-compatible server APIs using an OpenAI client.

Signed-off-by: Ben Browning <[email protected]>
When called via the OpenAI API, ollama is responding with more brief
responses than when called via its native API. This adjusts the
prompting for its OpenAI calls to ask it to be more verbose.
This adds the vLLM-specific extra_body parameters of prompt_logprobs
and guided_choice to our openai_completion inference endpoint. The
plan here would be to expand this to support all common optional
parameters of any of the OpenAI providers, allowing each provider to
use or ignore these parameters based on whether their server supports them.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the openai_server_compat branch from 0684bbf to ac5dc8f on April 9, 2025
@bbrowning (Collaborator Author) commented:

@saichandrapandraju I've wired initial support for the extra_body params prompt_logprobs and guided_choice into this (I needed the latter for my own use-case), along with a basic integration test for each that passes for me against vLLM.

Just an implementation note: on the server side these are top-level parameters that I have to add to the openai_completion method, because extra_body parameters sent by the client arrive as top-level body parameters when the request hits the server. That's not relevant to clients, but to implement any additional extra_body parameters across the backend providers we just have to hoist them up as optional top-level parameters on the Inference API's openai_completion method.
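A hedged sketch of what that hoisting looks like on the method signature (the surrounding parameters are elided; this is illustrative, not the exact code in the PR):

```python
# Illustrative shape of the server-side hoisting: fields sent by clients via
# extra_body arrive as top-level body parameters, so they become optional
# top-level parameters on the Inference API method.
from typing import List, Optional


async def openai_completion(
    model: str,
    prompt: str,
    # ...standard OpenAI completion parameters elided for brevity...
    prompt_logprobs: Optional[int] = None,      # sent via extra_body on the client
    guided_choice: Optional[List[str]] = None,  # sent via extra_body on the client
):
    ...
```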

This adjusts the vllm openai_completion endpoint to also pass a
value of 0 for prompt_logprobs, instead of only passing values greater
than zero to the backend.

The existing test_openai_completion_prompt_logprobs was parameterized
to test this case as well.

Signed-off-by: Ben Browning <[email protected]>
@ashwinb (Contributor) commented Apr 10, 2025

This is fantastic. Let me / @ehhuang know how we can help here. I will take a look at this in detail tonight or tomorrow.

@bbrowning (Collaborator Author) commented:

I updated the PR description with the list of providers that may be working, marking which ones I've tested (remote::vllm and remote::ollama) and which ones I haven't. I also updated the commands I use to run the currently basic test suite against those providers, so that others can see how they'd run the tests against other providers.

Initial conversion of non-streaming completion API calls to Llama Stack inference calls is done as a fallback option for providers that don't natively support OpenAI APIs. I'll add the same non-streaming fallback logic for chat completions soon. Feedback would be useful on whether we want to do this translation and extend it to also support streaming, or whether we'd prefer to just raise an error stating that OpenAI APIs are unsupported for specific providers. For now, I'm erring on the side of trying to make things Just Work for all inference providers, but that may not be trivial.

Also, I haven't yet had a chance to wire up all the providers that do support native OpenAI APIs, and will work on that. This includes databricks, nvidia, runpod, and sambanova specifically. Those are already using OpenAI clients so it should be quite trivial to wire them up just like I did for vllm, ollama, and together.ai.

This wires up the openai_completion and openai_chat_completion API
methods for the remote Nvidia inference provider, and adds it to the
chat completions part of the OpenAI test suite.

The hosted Nvidia service doesn't actually host any Llama models with
functioning completions and chat completions endpoints, so for now the
test suite only activates the nvidia provider for chat completions.

Signed-off-by: Ben Browning <[email protected]>
After actually running the test_openai_completion.py tests against
together.ai, turns out there were a couple of bugs in the initial
implementation. This fixes those.

Signed-off-by: Ben Browning <[email protected]>
This wires up the openai_completion and openai_chat_completion API
methods for the remote Fireworks inference provider.

Signed-off-by: Ben Browning <[email protected]>
@ehhuang (Contributor) left a comment:

Great start!

stream_options: Optional[Dict[str, Any]] = None,
temperature: Optional[float] = None,
tool_choice: Optional[Union[str, Dict[str, Any]]] = None,
tools: Optional[List[Dict[str, Any]]] = None,
Contributor:

should we define a tool type for this?

@@ -0,0 +1,216 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:

I recently added a test suite to test OpenAI compat endpoints: https://github.com/meta-llama/llama-stack/blob/main/tests/verifications/README.md
We should move the new tests from here over there to consolidate, if possible.

top_logprobs: Optional[int] = None,
top_p: Optional[float] = None,
user: Optional[str] = None,
) -> OpenAIChatCompletion:
Contributor:

Is this correctly typed for streaming?

@bbrowning (Collaborator Author) replied:

No, it's not. The type doesn't cover the streaming case at all, so even though streaming works in practice with the API as-is when used from OpenAI clients, the typing for streaming isn't handled yet.

...

@webmethod(route="/openai/v1/completions", method="POST")
async def openai_completion(
Contributor:

I wonder if we should have this under apis/openai/ so that OpenAI-related things are in one place.

@bbrowning (Collaborator Author) replied:

That's reasonable, and I went back-and-forth a bit here myself. I put the OpenAI models API endpoint under our models.py file and the OpenAI inference endpoints under our inference.py file simply because they mapped nicely to existing constructs. But, I don't have a strong preference there.



@json_schema_type
class OpenAICompletion(BaseModel):
Contributor:

Should we just import from openai.types.chat as we did in openai_compat.py?

@bbrowning (Collaborator Author) replied:

I actually started with that. However, the API codegen wasn't able to successfully run with those types. I don't recall the exact errors now, but I can try an example out in a bit just to document what the actual issue was there. A secondary concern would be whether we want direct control over the public-facing API of Llama Stack or whether we want to let new versions of the OpenAI python client impact our API surface.

@bbrowning (Collaborator Author) replied:

Here's an example of the kinds of errors the API spec codegen throws when using any of the OpenAI python client's types in our API:

```
Traceback (most recent call last):
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 91, in <module>
    fire.Fire(main)
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 55, in main
    spec = Specification(
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/utility.py", line 29, in __init__
    self.document = generator.generate()
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 781, in generate
    operation = self._build_operation(op)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 691, in _build_operation
    responses = response_builder.build_response(response_options)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 374, in build_response
    responses[status_code] = self._build_response(
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 393, in _build_response
    content=self.content_builder.build_content(response_type, examples),
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 216, in build_content
    return {media_type: self.build_media_type(item_type, examples)}
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 221, in build_media_type
    schema = self.schema_builder.classdef_to_ref(item_type)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 135, in classdef_to_ref
    type_schema = self.classdef_to_schema(typ)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 116, in classdef_to_schema
    type_schema, type_definitions = self.schema_generator.classdef_to_schema(typ)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 612, in classdef_to_schema
    types_defined[sub_name] = self._type_to_schema_with_lookup(sub_type)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 569, in _type_to_schema_with_lookup
    type_schema = self.type_to_schema(data_type, force_expand=True)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
    return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 518, in _type_to_schema
    property_def = self.type_to_schema(property_type, json_schema_extra=json_schema_extra)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
    return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 495, in _type_to_schema
    for property_name, property_type in get_class_properties(typ):
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 571, in get_class_properties
    resolved_hints = get_resolved_hints(typ)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 557, in get_resolved_hints
    return typing.get_type_hints(typ, include_extras=True)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 1833, in get_type_hints
    value = _eval_type(value, base_globals, base_locals)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 327, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 694, in _evaluate
    eval(self.__forward_code__, globalns, localns),
  File "<string>", line 1, in <module>
NameError: name 'ClassVar' is not defined
```

It's probably solvable, but something about how the OpenAI types use ClassVar isn't liked by the strong_typing code in Llama Stack.

@ehhuang (Contributor) commented Apr 11, 2025

FYI I'm building on top of this to enable OAI compat for meta-reference inference

@terrytangyuan (Contributor) commented:

+1. We have tested the changes internally and this would enable our upcoming demos.

@ashwinb (Contributor) commented Apr 11, 2025

Chatted with Eric and Raghu. We feel comfortable merging this in and iterating on it in origin/main, both to indicate to folks that this is imminent and to avoid unnecessary pain for @bbrowning. Others can also potentially jump in.

Unless someone objects in the next hour or so, I will do so. We will try to get a release going later (or over the weekend) potentially as well.

@ashwinb ashwinb merged commit 2b2db5f into llamastack:main Apr 11, 2025
24 checks passed
@bbrowning (Collaborator Author) commented:

Awesome, thanks for the quick turnaround! Now we should be able to get a lot more real-world testing of this to work out the edge cases.

@bbrowning bbrowning deleted the openai_server_compat branch April 11, 2025 20:53
MichaelClifford pushed a commit to MichaelClifford/llama-stack that referenced this pull request Apr 14, 2025
…tack#1894)

facebook-github-bot pushed a commit to meta-pytorch/captum that referenced this pull request May 21, 2025
…LMAttribution and VLLMProvider (#1544)

Summary:
This PR introduces support for applying Captum's perturbation-based attribution algorithms to remotely hosted large language models (LLMs). It enables users to perform interpretability analyses on models served via APIs, such as those using [vLLM](https://github.com/vllm-project/vllm), without requiring access to model internals.

## Motivation:
Captum’s current LLM attribution framework requires access to local models, limiting its usability in production and hosted environments. With the rise of scalable remote inference backends and OpenAI-compatible APIs, this PR allows Captum to be used for black-box interpretability with hosted models, as long as they return token-level log probabilities.

This integration also aligns with ongoing efforts like [llama-stack](https://github.com/meta-llama/llama-stack), which aims to provide a unified API layer for inference (and also for RAG, Agents, Tools, Safety, Evals, and Telemetry) across multiple backends—further expanding Captum’s reach for model explainability.

## Key Additions:

- `RemoteLLMProvider` Interface:
A generic interface for fetching log probabilities from remote LLMs, making it easy to plug in various inference backends.
- `VLLMProvider` Implementation:
A concrete subclass of `RemoteLLMProvider` tailored for models served using vLLM, handling the specifics of communicating with vLLM endpoints to retrieve necessary data for attribution.
- `RemoteLLMAttribution` class:
A subclass of `LLMAttribution` that overrides internal methods to work with remote providers. It enables all perturbation-based algorithms (e.g., Feature Ablation, Shapley Values, KernelSHAP) using only the output logprobs from a remote LLM.
- OpenAI-Compatible API Support:
Used the openai client under the hood for querying remote models, as many LLM serving solutions now support the OpenAI-compatible API format (e.g., [vLLM OpenAI server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) and projects like `llama-stack`; see [here](llamastack/llama-stack#1894) for ongoing work related to this).

## Issue(s) related to this:
- #1529

Pull Request resolved: #1544

Reviewed By: aobo-y

Differential Revision: D75043583

Pulled By: craymichael

fbshipit-source-id: aa2e263ddd51777168db7de2a7f91637eb8279de