Commit 1b72b94
LCORE-169: Provide initial set of opinionated & tested llama-stack configurations
1 parent 45eb299 commit 1b72b94

File tree

8 files changed: +757 -1 lines changed

README.md

Lines changed: 15 additions & 0 deletions
@@ -35,6 +35,7 @@ The service includes comprehensive user data collection capabilities for various
* [K8s based authentication](#k8s-based-authentication)
* [JSON Web Keyset based authentication](#json-web-keyset-based-authentication)
* [No-op authentication](#no-op-authentication)
* [RAG Configuration](#rag-configuration)
* [Usage](#usage)
* [Make targets](#make-targets)
* [Running Linux container image](#running-linux-container-image)

@@ -451,7 +452,21 @@ service:
Credentials are not allowed with wildcard origins per CORS/Fetch spec.
See https://fastapi.tiangolo.com/tutorial/cors/

# RAG Configuration

The [RAG setup guide](docs/rag_guide.md) explains how to set up RAG and includes tested examples for both inference and vector store integration.

## Example configurations for inference

The following llama-stack configuration examples come from production deployments:

- [Granite on vLLM example](examples/vllm-granite-run.yaml)
- [Qwen3 on vLLM example](examples/vllm-qwen3-run.yaml)
- [Gemini example](examples/gemini-run.yaml)

> [!NOTE]
> RAG functionality is **not tested** for these configurations.

# Usage

docs/rag_guide.md

Lines changed: 124 additions & 1 deletion
@@ -61,7 +61,7 @@ Update the `run.yaml` file used by Llama Stack to point to:
* Your downloaded **embedding model**
* Your generated **vector database**

### FAISS Example

```yaml
models:
@@ -100,10 +100,115 @@ Where:
- `db_path` is the path to the vector index (the `.db` file in this case)
- `vector_db_id` is the index ID used to generate the db

See the full working [config example](examples/openai-faiss-run.yaml) for more details.

### pgvector Example

This example shows how to configure a remote PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension for storing embeddings.

> You will need to install PostgreSQL and the matching version of pgvector, then log in with `psql` and enable the extension with:
> ```sql
> CREATE EXTENSION IF NOT EXISTS vector;
> ```

Update the connection details (`host`, `port`, `db`, `user`, `password`) to match your PostgreSQL setup.

Each pgvector-backed table follows this schema:

- `id` (`text`): UUID identifier of the chunk
- `document` (`jsonb`): JSON containing the content and metadata associated with the embedding
- `embedding` (`vector(n)`): the embedding vector, where `n` is the embedding dimension and must match the model's output size (e.g. 768 for `all-mpnet-base-v2`)

> [!NOTE]
> The `vector_db_id` (e.g. `rhdocs`) maps to the table named `vector_store_rhdocs` in the specified database, which stores the vector embeddings.

```yaml
[...]
providers:
  [...]
  vector_io:
  - provider_id: pgvector-example
    provider_type: remote::pgvector
    config:
      host: localhost
      port: 5432
      db: pgvector_example # PostgreSQL database (psql -d pgvector_example)
      user: lightspeed # PostgreSQL user
      password: password123
      kvstore:
        type: sqlite
        db_path: .llama/distributions/pgvector/pgvector_registry.db

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: pgvector-example
  # A unique ID that becomes the PostgreSQL table name, prefixed with 'vector_store_'.
  # e.g., 'rhdocs' will create the table 'vector_store_rhdocs'.
  vector_db_id: rhdocs
```

See the full working [config example](examples/openai-pgvector-run.yaml) for more details.
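
After documents have been ingested, you can confirm that the table exists and matches the schema above with `psql`. This is a minimal sketch, assuming the example database, user, and `vector_db_id` shown above:

```bash
# Describe the table backing vector_db_id 'rhdocs'
# (expected columns: id text, document jsonb, embedding vector(768))
psql -d pgvector_example -U lightspeed -c '\d vector_store_rhdocs'

# Count the stored chunks
psql -d pgvector_example -U lightspeed -c 'SELECT count(*) FROM vector_store_rhdocs;'
```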

---

## Add an Inference Model (LLM)

### vLLM on RHEL AI (Llama 3.1) Example

The [`vllm-openai`](https://hub.docker.com/r/vllm/vllm-openai) Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on **RHEL AI** with `podman`:

```bash
podman run \
  --device "${CONTAINER_DEVICE}" \
  --gpus ${GPUS} \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p ${EXPORTED_PORT}:8000 \
  --ipc=host \
  docker.io/vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3.1_json.jinja
```

> The example command above enables tool calling for Llama 3.1 models.
> For other supported models and configuration options, see the vLLM documentation:
> [vLLM: Tool Calling](https://docs.vllm.ai/en/stable/features/tool_calling.html)

After starting the container, you can check which model is being served by running:

```bash
curl http://localhost:8000/v1/models # Replace localhost with the url of the vLLM instance
```
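
If `jq` is installed, the model IDs can be extracted from the response directly. This is a small sketch that assumes the standard OpenAI-compatible `/v1/models` response shape (`data[].id`):

```bash
# Print only the IDs of the models served by vLLM
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'
```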

The response will include the `model_id`, which you can then use in your `run.yaml` configuration.

```yaml
[...]
models:
[...]
- model_id: meta-llama/Llama-3.1-8B-Instruct
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:8000/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here>
```

See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.

### OpenAI Example

Add a provider for your language model (e.g., OpenAI):

```yaml
@@ -133,6 +238,24 @@ export OPENAI_API_KEY=<your-key-here>
> When experimenting with different `models`, `providers` and `vector_dbs`, you might need to manually unregister the old ones with the Llama Stack client CLI (e.g. `llama-stack-client vector_dbs list`)
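
A minimal cleanup sketch using the Llama Stack client CLI. Only the `list` subcommand appears in the note above; the `unregister` call is an assumption and may differ between client versions:

```bash
# Show what is currently registered
llama-stack-client models list
llama-stack-client vector_dbs list

# Remove a stale vector database registration before re-registering it
# (subcommand assumed; check `llama-stack-client vector_dbs --help`)
llama-stack-client vector_dbs unregister openshift-index
```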

See the full working [config example](examples/openai-faiss-run.yaml) for more details.

### Azure OpenAI

Not yet supported.

### Ollama

The `remote::ollama` provider can be used for inference. However, it does not support tool calling, including RAG.
While Ollama also exposes an OpenAI-compatible endpoint that supports tool calling, it cannot be used with `llama-stack` due to current limitations in the `remote::openai` provider.

There is an [ongoing discussion](https://github.com/meta-llama/llama-stack/discussions/3034) about enabling tool calling with Ollama.
Currently, tool calling is not supported out of the box. Some experimental patches exist (including internal workarounds), but they are not officially released.

### vLLM Mistral

RAG tool calls were not working properly when experimenting with `mistralai/Mistral-7B-Instruct-v0.3` on vLLM.

---
# Complete Configuration Reference

examples/gemini-run.yaml

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Example llama-stack configuration for Google Gemini inference
#
# Contributed by @eranco74 (2025-08). See https://github.com/rh-ecosystem-edge/assisted-chat/blob/main/template.yaml#L282-L386
# This file shows how to integrate Gemini with LCS.
#
# Notes:
# - You will need valid Gemini API credentials to run this.
# - You will need a postgres instance to run this config.
#
version: 2
image_name: gemini-config
apis:
- agents
- datasetio
- eval
- files
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
    provider_type: remote::gemini
    config:
      api_key: ${env.GEMINI_API_KEY}
  vector_io: []
  files: []
  safety: []
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
      responses_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "${LLAMA_STACK_OTEL_SERVICE_NAME}"
      sinks: ${LLAMA_STACK_TELEMETRY_SINKS}
      sqlite_db_path: ${STORAGE_MOUNT_PATH}/sqlite/trace_store.db
  eval: []
  datasetio: []
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
metadata_store:
  type: sqlite
  db_path: ${STORAGE_MOUNT_PATH}/sqlite/registry.db
inference_store:
  type: postgres
  host: ${env.LLAMA_STACK_POSTGRES_HOST}
  port: ${env.LLAMA_STACK_POSTGRES_PORT}
  db: ${env.LLAMA_STACK_POSTGRES_NAME}
  user: ${env.LLAMA_STACK_POSTGRES_USER}
  password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
models:
- metadata: {}
  model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  model_type: llm
shields: []
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: mcp::assisted
  provider_id: model-context-protocol
  mcp_endpoint:
    uri: "${MCP_SERVER_URL}"
server:
  port: ${LLAMA_STACK_SERVER_PORT}

examples/openai-faiss-run.yaml

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Example llama-stack configuration for OpenAI inference + FAISS (RAG)
#
# Notes:
# - You will need an OpenAI API key
# - You can generate the vector index with the rag-content tool (https://github.com/lightspeed-core/rag-content)
#
version: 2
image_name: openai-faiss-config

apis:
- agents
- inference
- vector_io
- tool_runtime
- safety

models:
- model_id: gpt-test
  provider_id: openai # This ID is a reference to 'providers.inference'
  model_type: llm
  provider_model_id: gpt-4o-mini

- model_id: sentence-transformers/all-mpnet-base-v2
  metadata:
    embedding_dimension: 768
  model_type: embedding
  provider_id: sentence-transformers # This ID is a reference to 'providers.inference'
  provider_model_id: /home/USER/lightspeed-stack/embedding_models/all-mpnet-base-v2

providers:
  inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}

  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        db_path: .llama/distributions/ollama/agents_store.db
      responses_store:
        type: sqlite
        db_path: .llama/distributions/ollama/responses_store.db

  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config:
      excluded_categories: []

  vector_io:
  - provider_id: ocp-docs
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
        namespace: null

  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}

# Enable the RAG tool
tool_groups:
- provider_id: rag-runtime
  toolgroup_id: builtin::rag
  args: null
  mcp_endpoint: null

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: ocp-docs # This ID is a reference to 'providers.vector_io'
  vector_db_id: openshift-index # This ID was defined during index generation
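
One possible way to try this configuration locally is sketched below. The launch command is an assumption and depends on how llama-stack is installed in your environment; adjust the config path as needed.

```bash
# Assumes the llama-stack CLI is installed and an OpenAI API key is available
export OPENAI_API_KEY=<your-key-here>
llama stack run examples/openai-faiss-run.yaml
```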
