Description
Hi LlamaStack team,
I'd like to request native support in the stack configuration for using custom OpenAI-compatible HTTP endpoints for both inference and embeddings.
Use Case
I'm integrating with internal APIs that expose OpenAI-style /v1/embeddings and /v1/completions endpoints (e.g., IBM Granite, MiniLM via APIcast). These endpoints require custom model IDs, auth tokens, and base URLs.
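For reference, this is roughly how I reach these endpoints today with ad-hoc client code (a minimal sketch using the openai Python package; the environment variable names mirror the config below, and the prompt/input values are just placeholders):

```python
import os
from openai import OpenAI

# Chat/completions endpoint (e.g. Granite behind APIcast).
# The client sends the key as "Authorization: Bearer <key>" by default.
llm = OpenAI(
    base_url=os.environ["LLM_MODEL_URL"],
    api_key=os.environ["LLM_ACCESS_KEY"],
)
resp = llm.completions.create(
    model=os.environ["LLM_MODEL_ID"],
    prompt="Hello",
)

# Embeddings endpoint (e.g. MiniLM).
emb_client = OpenAI(
    base_url=os.environ["MINILLM_MODEL_URL"],
    api_key=os.environ["MINILLM_ACCESS_KEY"],
)
emb = emb_client.embeddings.create(
    model=os.environ["MINILLM_MODEL_ID"],
    input=["some text to embed"],
)
```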
I'd like to declare this setup in YAML, something like:
```yaml
runtime:
  template: remote-httpx
  params:
    model_url: ${LLM_MODEL_URL}
    model_id: ${LLM_MODEL_ID}
    api_key: ${LLM_ACCESS_KEY}
    api_key_header: Authorization
    api_key_prefix: Bearer
embedding:
  template: remote-httpx
  params:
    model_url: ${MINILLM_MODEL_URL}
    model_id: ${MINILLM_MODEL_ID}
    api_key: ${MINILLM_ACCESS_KEY}
    api_key_header: Authorization
    api_key_prefix: Bearer
```
Why it matters
This would allow declarative use of enterprise models behind auth layers, without needing to override logic in Python. It would also make it easier to use hosted or private models (Granite, DeepSeek, Ollama, etc.).
Question
Is there a recommended workaround today for configuring these remote providers, or is this currently only possible by overriding behavior in code?