
Conversation

@bbrowning (Collaborator) commented Apr 8, 2025

What does this PR do?

This stubs in some OpenAI server-side compatibility with three new endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there aren't awesome, but the thinking is that Llama Stack's API is v1, and our OpenAI compatibility layer is compatible with OpenAI v1. Also, some OpenAI clients implicitly assume the URL ends with "v1", so this gives maximum compatibility.

The OpenAI models endpoint is implemented in the routing layer and simply returns all the models Llama Stack knows about.
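For example, here's a minimal sketch of listing models through that endpoint with the standard OpenAI Python client (the host/port and the placeholder API key are assumptions; adjust them for your deployment):

```python
# Minimal sketch: list the models Llama Stack knows about via the new
# OpenAI-compatible models endpoint. The base_url and api_key values are
# placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

for model in client.models.list():
    print(model.id)
```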

The following providers should be working with the new OpenAI completions and chat/completions API:

  • remote::anthropic (untested)
  • remote::cerebras-openai-compat (untested)
  • remote::fireworks (tested)
  • remote::fireworks-openai-compat (untested)
  • remote::gemini (untested)
  • remote::groq-openai-compat (untested)
  • remote::nvidia (tested)
  • remote::ollama (tested)
  • remote::openai (untested)
  • remote::passthrough (untested)
  • remote::sambanova-openai-compat (untested)
  • remote::together (tested)
  • remote::together-openai-compat (untested)
  • remote::vllm (tested)

The goal is to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses.
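As a rough illustration of the mixin idea (not the actual code in this PR - the native chat_completion signature, response attributes, and conversion details below are assumptions for the sketch):

```python
# Illustrative sketch only: a mixin that serves an OpenAI-style chat completion
# request by delegating to a provider's native Llama Stack inference method and
# repackaging the result in an OpenAI-shaped response body. The native method
# name, its parameters, and the response attribute access are assumptions.
from typing import Any, Dict, List, Optional


class OpenAIChatCompletionToLlamaStackMixin:
    async def openai_chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        # Translate the OpenAI-style request into a native inference call
        # (hypothetical signature).
        native = await self.chat_completion(
            model_id=model,
            messages=messages,
            sampling_params={"temperature": temperature, "max_tokens": max_tokens},
        )
        # Repackage the native response as an OpenAI-style chat completion.
        return {
            "id": "chatcmpl-llamastack",
            "object": "chat.completion",
            "model": model,
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": native.completion_message.content},
                    "finish_reason": "stop",
                }
            ],
        }
```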

This is related to #1817 but is a bit larger in scope than just chat completions, as I have real use-cases that need the older completions API as well.

Test Plan

vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

Documentation

Run a Llama Stack distribution that uses one of the providers mentioned in the list above. Then, use your favorite OpenAI client to send completion or chat completion requests with the base_url set to http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the host and port of your Llama Stack server, if different.
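For example, a minimal sketch with the official OpenAI Python client (the api_key is a placeholder, and the model name is an assumption that should match a model registered with your distribution):

```python
# Point any OpenAI client at the Llama Stack server's OpenAI-compatible base URL.
# The api_key is a placeholder; the model name must match one your distribution serves.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```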

@facebook-github-bot added the CLA Signed label Apr 8, 2025
@bbrowning changed the title from "Add OpenAI-Compatible models, completions, chat/completions endpoints" to "feat: OpenAI-Compatible models, completions, chat/completions" Apr 8, 2025
@bbrowning (Collaborator Author) commented:

This is just a draft that I'll keep pushing on a bit to see how hard it would be to get OpenAI compatible completion and chat completion endpoints across all existing providers. Most of the work will come in the translation layer for the providers that don't use OpenAI clients internally.

This also paves the way to implementing OpenAI Responses API compatibility, but starting with Models, Completions, and Chat Completions first for the sake of maximum compatibility with existing clients in the wild.

@terrytangyuan (Contributor) left a comment:

Thanks for the head start! Please keep me posted on how this goes

@bbrowning bbrowning force-pushed the openai_server_compat branch from 86927c7 to 39350db on April 8, 2025
@bbrowning (Collaborator Author) commented:

I'll start on some basic tests here, but note that I can't really write the typical inference integration tests until there's a release of the llama stack client python library with these API changes, since our integration tests use that to hit the inference endpoint.

@bbrowning (Collaborator Author) commented:

Nevermind my comment about not being able to write integration tests yet. It dawned on me that I can of course just use an OpenAI client to test these new APIs, so I've started adding those.

@saichandrapandraju commented:

Is it possible to support extra_body so that we can pass parameters that are not part of the OpenAI API but are specific to other backends like vLLM?

e.g., vLLM has a sampling parameter called prompt_logprobs that returns logprobs for prompt tokens (the 'logprobs' parameter returns them only for generated tokens). OpenAI has no 'prompt_logprobs' parameter, so I'm using extra_body={'prompt_logprobs': 1} to send this to the vLLM server.

This would require changes to the response class to support prompt_logprobs. vLLM returns these prompt logprobs under the 'prompt_logprobs' key in the choices list. Below is a sample response for the input prompt "Dave lives in Palm Coast, FL and is a lawyer. His personal interests include" and vLLM sampling params { "temperature": 0, "prompt_logprobs": 1 }:

```
{'id': 'cmpl-5b44f8444e20400f837777b6b5c6f132',
 'object': 'text_completion',
 'created': 1743818276,
 'model': 'opt',
 'choices': [{'index': 0,
   'text': ' the law, politics, and business. He is a member of the Palm Coast',
   'logprobs': None,
   'finish_reason': 'length',
   'stop_reason': None,
   'prompt_logprobs': [None,
    {'33857': {'logprob': -13.873104095458984,
      'rank': 6057,
      'decoded_token': 'Dave'},
     '100': {'logprob': -1.4258383512496948, 'rank': 1, 'decoded_token': 'I'}},
    {'1074': {'logprob': -10.792854309082031,
      'rank': 2880,
      'decoded_token': ' lives'},
     '219': {'logprob': -2.0438311100006104, 'rank': 1, 'decoded_token': 'y'}},
    {'11': {'logprob': -0.4757663607597351,
      'rank': 1,
      'decoded_token': ' in'}},
    {'7929': {'logprob': -7.719001770019531,
      'rank': 295,
      'decoded_token': ' Palm'},
     '5': {'logprob': -2.083259344100952, 'rank': 1, 'decoded_token': ' the'}},
    {'2565': {'logprob': -4.589813232421875,
      'rank': 4,
      'decoded_token': ' Coast'},
     '2467': {'logprob': -0.6835634708404541,
      'rank': 1,
      'decoded_token': ' Beach'}},
    {'6': {'logprob': -0.5932450890541077, 'rank': 1, 'decoded_token': ','}},
    {'8854': {'logprob': -1.4891626834869385,
      'rank': 1,
      'decoded_token': ' FL'}},
    {'8': {'logprob': -1.737820029258728, 'rank': 3, 'decoded_token': ' and'},
     '4': {'logprob': -0.886257529258728, 'rank': 1, 'decoded_token': '.'}},
    {'16': {'logprob': -1.8477180004119873,
      'rank': 1,
      'decoded_token': ' is'}},
    {'10': {'logprob': -1.1920616626739502, 'rank': 1, 'decoded_token': ' a'}},
    {'2470': {'logprob': -6.246567726135254,
      'rank': 79,
      'decoded_token': ' lawyer'},
     '3562': {'logprob': -2.969223737716675,
      'rank': 1,
      'decoded_token': ' retired'}},
    {'4': {'logprob': -1.4258630275726318, 'rank': 1, 'decoded_token': '.'}},
    {'832': {'logprob': -3.3757128715515137,
      'rank': 4,
      'decoded_token': ' His'},
     '91': {'logprob': -0.8600877523422241,
      'rank': 1,
      'decoded_token': ' He'}},
    {'1081': {'logprob': -5.170414924621582,
      'rank': 32,
      'decoded_token': ' personal'},
     '173': {'logprob': -2.475102424621582,
      'rank': 1,
      'decoded_token': ' work'}},
    {'3168': {'logprob': -3.6241774559020996,
      'rank': 6,
      'decoded_token': ' interests'},
     '301': {'logprob': -2.0382399559020996,
      'rank': 1,
      'decoded_token': ' life'}},
    {'680': {'logprob': -0.6599060893058777,
      'rank': 1,
      'decoded_token': ' include'}}]}],
 'usage': {'prompt_tokens': 17,
  'total_tokens': 33,
  'completion_tokens': 16,
  'prompt_tokens_details': None}}
```
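For context, a minimal sketch of how that parameter is sent from the client side today (targeting a vLLM server directly; the base_url and model name are placeholders taken from the example above):

```python
# Sketch: pass a vLLM-specific sampling parameter through the OpenAI client via
# extra_body. The base_url and model name are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.completions.create(
    model="opt",
    prompt="Dave lives in Palm Coast, FL and is a lawyer. His personal interests include",
    temperature=0,
    extra_body={"prompt_logprobs": 1},
)
print(response.choices[0].text)
```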

bbrowning added 14 commits April 9, 2025 15:47
This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there aren't awesome, but the thinking is
that Llama Stack's API is v1 and then our OpenAI compatibility layer
is compatible with OpenAI V1. And, some OpenAI clients implicitly
assume the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and
just returns all the models Llama Stack knows about.

The chat endpoints are only actually implemented for the remote-vllm
provider right now, and it just proxies the completion and chat
completion requests to the backend vLLM.

The goal is to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API,
we'll add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses
to OpenAI responses.
Importing the models from the OpenAI client library required a
top-level dependency on the openai python package, and also was
incompatible with our API generation code due to some quirks in how
the OpenAI pydantic models are defined.

So, this creates our own stubs of those pydantic models so that we're
in more direct control of our API surface for this OpenAI-compatible
API, so that it works with our code generation, and so that the openai
python client isn't a hard requirement of Llama Stack's API.
This adds OpenAI-compatible completions and chat completions support
for the native Together provider as well as all providers implemented
with litellm.
The OpenAI completion prompt field can be a string or an array, so
update things to use and pass that properly.

This also stubs in a basic conversion of OpenAI non-streaming
completion requests to Llama Stack completion calls, for those
providers that don't actually have an OpenAI backend to allow them to
still accept requests via the OpenAI APIs.

Signed-off-by: Ben Browning <[email protected]>
The OpenAI completion API supports strings, array of strings, array of
tokens, or array of token arrays. So, expand our type hinting to
support all of these types.

Signed-off-by: Ben Browning <[email protected]>
This starts to stub in some integration tests for the
OpenAI-compatible server APIs using an OpenAI client.

Signed-off-by: Ben Browning <[email protected]>
When called via the OpenAI API, ollama is responding with more brief
responses than when called via its native API. This adjusts the
prompting for its OpenAI calls to ask it to be more verbose.
This adds the vLLM-specific extra_body parameters of prompt_logprobs
and guided_choice to our openai_completion inference endpoint. The
plan here would be to expand this to support all common optional
parameters of any of the OpenAI providers, allowing each provider to
use or ignore these parameters based on whether their server supports them.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the openai_server_compat branch from 0684bbf to ac5dc8f on April 9, 2025
@bbrowning (Collaborator Author) commented:

@saichandrapandraju I've wired initial support for the extra_body params prompt_logprobs and guided_choice into this (I needed the latter for my own use-case), along with a basic integration test for each that passes for me against vLLM.

Just an implementation note: on the server side these are top-level parameters that I have to add to the openai_completion method, because extra_body parameters sent by the client arrive as top-level body parameters when the request hits the server. That's not relevant to clients, but to implement any additional extra_body parameters across the backend providers we just have to hoist them up as optional top-level parameters on the Inference API's openai_completion method.
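A hedged sketch of what that hoisting looks like on the method signature (the surrounding parameters are elided; this is illustrative, not the exact code in the PR):

```python
# Illustrative shape of the server-side hoisting: fields sent by clients via
# extra_body arrive as top-level body parameters, so they become optional
# top-level parameters on the Inference API method.
from typing import List, Optional


async def openai_completion(
    model: str,
    prompt: str,
    # ...standard OpenAI completion parameters elided for brevity...
    prompt_logprobs: Optional[int] = None,      # sent via extra_body on the client
    guided_choice: Optional[List[str]] = None,  # sent via extra_body on the client
):
    ...
```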

This adjusts the vllm openai_completion endpoint to also pass a
value of 0 for prompt_logprobs, instead of only passing values greater
than zero to the backend.

The existing test_openai_completion_prompt_logprobs was parameterized
to test this case as well.

Signed-off-by: Ben Browning <[email protected]>
@ashwinb (Contributor) commented Apr 10, 2025

This is fantastic. Let me / @ehhuang know how we can help here. I will take a look at this in detail tonight or tomorrow.

@bbrowning (Collaborator Author) commented:

I updated the PR description with the list of providers that may be working, marking which ones I've tested (remote::vllm and remote::ollama) and which ones I haven't. I also updated the commands I use to run the currently basic test suite against those providers, so that others can see how they'd run the tests against other providers.

Initial conversion of non-streaming completion API calls to Llama Stack inference calls is done as a fallback option for providers that don't natively support OpenAI APIs. I'll add the same non-streaming fallback logic for chat completions soon. Feedback would be useful on whether we want to do this translation and extend it to also support streaming, or whether we'd prefer to just raise an error stating that OpenAI APIs are unsupported for specific providers. For now, I'm erring on the side of trying to make things Just Work for all inference providers, but that may not be trivial.

Also, I haven't yet had a chance to wire up all the providers that do support native OpenAI APIs, and will work on that. This includes databricks, nvidia, runpod, and sambanova specifically. Those are already using OpenAI clients so it should be quite trivial to wire them up just like I did for vllm, ollama, and together.ai.

This wires up the openai_completion and openai_chat_completion API
methods for the remote Nvidia inference provider, and adds it to the
chat completions part of the OpenAI test suite.

The hosted Nvidia service doesn't actually host any Llama models with
functioning completions and chat completions endpoints, so for now the
test suite only activates the nvidia provider for chat completions.

Signed-off-by: Ben Browning <[email protected]>
After actually running the test_openai_completion.py tests against
together.ai, turns out there were a couple of bugs in the initial
implementation. This fixes those.

Signed-off-by: Ben Browning <[email protected]>
This wires up the openai_completion and openai_chat_completion API
methods for the remote Fireworks inference provider.

Signed-off-by: Ben Browning <[email protected]>
@ehhuang (Contributor) left a comment:

Great start!

stream_options: Optional[Dict[str, Any]] = None,
temperature: Optional[float] = None,
tool_choice: Optional[Union[str, Dict[str, Any]]] = None,
tools: Optional[List[Dict[str, Any]]] = None,
Contributor:

should we define a tool type for this?

@@ -0,0 +1,216 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:

I recently added a test suite to test OpenAI compat endpoints: https://github.com/meta-llama/llama-stack/blob/main/tests/verifications/README.md
We should move the new tests from here over there to consolidate, if possible.

top_logprobs: Optional[int] = None,
top_p: Optional[float] = None,
user: Optional[str] = None,
) -> OpenAIChatCompletion:
Contributor:

Is this correctly typed for streaming?

@bbrowning (Collaborator Author) replied:

No, it's not. The type doesn't cover the streaming case at all, so even though streaming works in practice with the API as-is when used from OpenAI clients, the typing for streaming isn't handled yet.

...

@webmethod(route="/openai/v1/completions", method="POST")
async def openai_completion(
Contributor:

I wonder if we should have this under apis/openai/ so that OpenAI-related things are in one place.

@bbrowning (Collaborator Author) replied:

That's reasonable, and I went back-and-forth a bit here myself. I put the OpenAI models API endpoint under our models.py file and the OpenAI inference endpoints under our inference.py file simply because they mapped nicely to existing constructs. But, I don't have a strong preference there.



@json_schema_type
class OpenAICompletion(BaseModel):
Contributor:

Should we just import from openai.types.chat as we did in openai_compat.py?

@bbrowning (Collaborator Author) replied:

I actually started with that. However, the API codegen wasn't able to successfully run with those types. I don't recall the exact errors now, but I can try an example out in a bit just to document what the actual issue was there. A secondary concern would be whether we want direct control over the public-facing API of Llama Stack or whether we want to let new versions of the OpenAI python client impact our API surface.

@bbrowning (Collaborator Author) replied:

Here's an example of the kinds of errors the API spec codegen throws when using any of the OpenAI python client's types in our API:

```
Traceback (most recent call last):
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 91, in <module>
    fire.Fire(main)
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/bbrowning/.cache/uv/archive-v0/fsDRrIpMBoxSdg6tsSQLY/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/generate.py", line 55, in main
    spec = Specification(
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/utility.py", line 29, in __init__
    self.document = generator.generate()
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 781, in generate
    operation = self._build_operation(op)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 691, in _build_operation
    responses = response_builder.build_response(response_options)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 374, in build_response
    responses[status_code] = self._build_response(
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 393, in _build_response
    content=self.content_builder.build_content(response_type, examples),
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 216, in build_content
    return {media_type: self.build_media_type(item_type, examples)}
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 221, in build_media_type
    schema = self.schema_builder.classdef_to_ref(item_type)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 135, in classdef_to_ref
    type_schema = self.classdef_to_schema(typ)
  File "/Volumes/SourceCode/llama-stack/docs/openapi_generator/pyopenapi/generator.py", line 116, in classdef_to_schema
    type_schema, type_definitions = self.schema_generator.classdef_to_schema(typ)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 612, in classdef_to_schema
    types_defined[sub_name] = self._type_to_schema_with_lookup(sub_type)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 569, in _type_to_schema_with_lookup
    type_schema = self.type_to_schema(data_type, force_expand=True)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
    return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 518, in _type_to_schema
    property_def = self.type_to_schema(property_type, json_schema_extra=json_schema_extra)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 321, in type_to_schema
    return self._type_to_schema(data_type, force_expand, json_schema_extra) | common_info
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/schema.py", line 495, in _type_to_schema
    for property_name, property_type in get_class_properties(typ):
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 571, in get_class_properties
    resolved_hints = get_resolved_hints(typ)
  File "/Volumes/SourceCode/llama-stack/llama_stack/strong_typing/inspection.py", line 557, in get_resolved_hints
    return typing.get_type_hints(typ, include_extras=True)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 1833, in get_type_hints
    value = _eval_type(value, base_globals, base_locals)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 327, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
  File "/Users/bbrowning/.pyenv/versions/3.10.16/lib/python3.10/typing.py", line 694, in _evaluate
    eval(self.__forward_code__, globalns, localns),
  File "<string>", line 1, in <module>
NameError: name 'ClassVar' is not defined
```

It's probably solvable, but something about how the OpenAI types use ClassVar isn't liked by the strong_typing code in Llama Stack.

@ehhuang (Contributor) commented Apr 11, 2025

FYI I'm building on top of this to enable OAI compat for meta-reference inference

@terrytangyuan (Contributor) commented:

+1. We have tested the changes internally and this would enable our upcoming demos.

@ashwinb (Contributor) commented Apr 11, 2025

Chatted with Eric and Raghu. We feel comfortable merging this in and iterating on it in origin/main, both to indicate to folks that this is imminent and to avoid unnecessary pain for @bbrowning. Others can also potentially jump in.

Unless someone objects in the next hour or so, I will do so. We will try to get a release going later (or over the weekend) potentially as well.

@ashwinb ashwinb merged commit 2b2db5f into llamastack:main Apr 11, 2025
24 checks passed
@bbrowning (Collaborator Author) commented:

Awesome, thanks for the quick turnaround! Now we should be able to get a lot more real-world testing of this to work out the edge cases.

@bbrowning bbrowning deleted the openai_server_compat branch April 11, 2025 20:53
MichaelClifford pushed a commit to MichaelClifford/llama-stack that referenced this pull request Apr 14, 2025
…tack#1894)

facebook-github-bot pushed a commit to meta-pytorch/captum that referenced this pull request May 21, 2025
…LMAttribution and VLLMProvider (#1544)

Summary:
This PR introduces support for applying Captum's perturbation-based attribution algorithms to remotely hosted large language models (LLMs). It enables users to perform interpretability analyses on models served via APIs, such as those using [vLLM](https://github.com/vllm-project/vllm), without requiring access to model internals.

## Motivation:
Captum’s current LLM attribution framework requires access to local models, limiting its usability in production and hosted environments. With the rise of scalable remote inference backends and OpenAI-compatible APIs, this PR allows Captum to be used for black-box interpretability with hosted models, as long as they return token-level log probabilities.

This integration also aligns with ongoing efforts like [llama-stack](https://github.com/meta-llama/llama-stack), which aims to provide a unified API layer for inference (and also for RAG, Agents, Tools, Safety, Evals, and Telemetry) across multiple backends—further expanding Captum’s reach for model explainability.

## Key Additions:

- `RemoteLLMProvider` Interface:
A generic interface for fetching log probabilities from remote LLMs, making it easy to plug in various inference backends.
- `VLLMProvider` Implementation:
A concrete subclass of `RemoteLLMProvider` tailored for models served using vLLM, handling the specifics of communicating with vLLM endpoints to retrieve necessary data for attribution.
- `RemoteLLMAttribution` class:
A subclass of `LLMAttribution` that overrides internal methods to work with remote providers. It enables all perturbation-based algorithms (e.g., Feature Ablation, Shapley Values, KernelSHAP) using only the output logprobs from a remote LLM.
- OpenAI-Compatible API Support:
Used the openai client under the hood for querying remote models, as many LLM serving solutions now support the OpenAI-compatible API format (e.g., [vLLM OpenAI server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) and projects like `llama-stack`; see [here](llamastack/llama-stack#1894) for ongoing work related to this).

## Issue(s) related to this:
- #1529

Pull Request resolved: #1544

Reviewed By: aobo-y

Differential Revision: D75043583

Pulled By: craymichael

fbshipit-source-id: aa2e263ddd51777168db7de2a7f91637eb8279de