15 changes: 15 additions & 0 deletions README.md
@@ -35,6 +35,7 @@ The service includes comprehensive user data collection capabilities for various
* [K8s based authentication](#k8s-based-authentication)
* [JSON Web Keyset based authentication](#json-web-keyset-based-authentication)
* [No-op authentication](#no-op-authentication)
* [RAG Configuration](#rag-configuration)
* [Usage](#usage)
* [Make targets](#make-targets)
* [Running Linux container image](#running-linux-container-image)
@@ -451,7 +452,21 @@ service:
Credentials are not allowed with wildcard origins per CORS/Fetch spec.
See https://fastapi.tiangolo.com/tutorial/cors/

# RAG Configuration

The [guide to RAG setup](docs/rag_guide.md) provides guidance on setting up RAG and includes tested examples for both inference and vector store integration.

## Example configurations for inference

The following llama-stack configuration examples come from production deployments:

- [Granite on vLLM example](examples/vllm-granite-run.yaml)
- [Qwen3 on vLLM example](examples/vllm-qwen3-run.yaml)
- [Gemini example](examples/gemini-run.yaml)
- [VertexAI example](examples/vertexai-run.yaml)

> [!NOTE]
> RAG functionality is **not tested** for these configurations.

# Usage

123 changes: 122 additions & 1 deletion docs/rag_guide.md
@@ -61,7 +61,7 @@ Update the `run.yaml` file used by Llama Stack to point to:
* Your downloaded **embedding model**
* Your generated **vector database**

Example:
### FAISS example

```yaml
models:
@@ -100,10 +100,113 @@ Where:
- `db_path` is the path to the vector index (.db file in this case)
- `vector_db_id` is the index ID used to generate the db

See the full working [config example](examples/openai-faiss-run.yaml) for more details.

Comment on lines +103 to +104

⚠️ Potential issue

Fix the relative links from docs/ to the examples/ directory.

From docs/, the correct relative path is ../examples/...

Update FAISS example reference:

-See the full working [config example](examples/openai-faiss-run.yaml) for more details.
+See the full working [config example](../examples/openai-faiss-run.yaml) for more details.

Update pgvector example reference:

-See the full working [config example](examples/openai-pgvector-run.yaml) for more details.
+See the full working [config example](../examples/openai-pgvector-run.yaml) for more details.

Update vLLM Llama example reference:

-See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.
+See the full working [config example](../examples/vllm-llama-faiss-run.yaml) for more details.

Update OpenAI example reference:

-See the full working [config example](examples/openai-faiss-run.yaml) for more details.
+See the full working [config example](../examples/openai-faiss-run.yaml) for more details.

Also applies to: 152-153, 208-209, 241-242


### pgvector example

This example shows how to configure a remote PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension for storing embeddings.

> You will need to install PostgreSQL together with a pgvector build that matches its version, then log in with `psql` and enable the extension with:
> ```sql
> CREATE EXTENSION IF NOT EXISTS vector;
> ```

Update the connection details (`host`, `port`, `db`, `user`, `password`) to match your PostgreSQL setup.

Each pgvector-backed table follows this schema:

- `id` (`text`): UUID identifier of the chunk
- `document` (`jsonb`): JSON containing the content and metadata associated with the embedding
- `embedding` (`vector(n)`): the embedding vector, where `n` is the embedding dimension and will match the model's output size (e.g. 768 for `all-mpnet-base-v2`)

> [!NOTE]
> The `vector_db_id` (e.g. `rhdocs`) is used to point to the table named `vector_store_rhdocs` in the specified database, which stores the vector embeddings.


```yaml
[...]
providers:
  [...]
  vector_io:
  - provider_id: pgvector-example
    provider_type: remote::pgvector
    config:
      host: localhost
      port: 5432
      db: pgvector_example # PostgreSQL database (psql -d pgvector_example)
      user: lightspeed # PostgreSQL user
      password: password123
      kvstore:
        type: sqlite
        db_path: .llama/distributions/pgvector/pgvector_registry.db

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: pgvector-example
  # A unique ID that becomes the PostgreSQL table name, prefixed with 'vector_store_'.
  # e.g., 'rhdocs' will create the table 'vector_store_rhdocs'.
  # If the table was already created, this value must match the ID used at creation.
  vector_db_id: rhdocs
```
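
For illustration, the table backing the `rhdocs` vector database could be created (or inspected) manually as sketched below. This is only a sketch of the schema described above; Llama Stack normally creates and manages the table itself, and the connection values are the ones from the example configuration.

```bash
# Illustrative only: the table Llama Stack manages for vector_db_id 'rhdocs',
# following the schema described above (768 matches all-mpnet-base-v2).
psql -d pgvector_example -U lightspeed <<'SQL'
CREATE TABLE IF NOT EXISTS vector_store_rhdocs (
    id        TEXT,         -- UUID identifier of the chunk
    document  JSONB,        -- content and metadata for the embedding
    embedding vector(768)   -- embedding vector; dimension matches the model
);
SQL
```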

See the full working [config example](examples/openai-pgvector-run.yaml) for more details.

---

## Add an Inference Model (LLM)

### vLLM on RHEL AI (Llama 3.1) example

> [!NOTE]
> The following example assumes that podman's CDI has been properly configured to [enable GPU support](https://podman-desktop.io/docs/podman/gpu).

The [`vllm-openai`](https://hub.docker.com/r/vllm/vllm-openai) Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on **RHEL AI** with `podman`:

```bash
podman run \
--device "${CONTAINER_DEVICE}" \
--gpus ${GPUS} \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p ${EXPORTED_PORT}:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3.1_json.jinja
```

> The example command above enables tool calling for Llama 3.1 models.
> For other supported models and configuration options, see the vLLM documentation:
> [vLLM: Tool Calling](https://docs.vllm.ai/en/stable/features/tool_calling.html)
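
Once the container is running, a quick way to confirm that the OpenAI-compatible endpoint is up and serving the expected model is to query its `/v1/models` route (a minimal sketch, assuming the port mapping from the `podman run` command above):

```bash
# Sanity check: list the models served by the vLLM container started above.
# EXPORTED_PORT is the host port used in the 'podman run' command.
curl -s "http://localhost:${EXPORTED_PORT}/v1/models" | python3 -m json.tool
```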

After starting the container, edit your `run.yaml` file so that `model_id` matches the model passed to the `podman run` command.

```yaml
[...]
models:
  [...]
- model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here> # if any
```

See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.

### OpenAI example

Add a provider for your language model (e.g., OpenAI):

```yaml
@@ -133,6 +236,24 @@ export OPENAI_API_KEY=<your-key-here>
> When experimenting with different `models`, `providers` and `vector_dbs`, you might need to manually unregister the old ones with the Llama Stack client CLI (e.g. `llama-stack-client vector_dbs list`)
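
As a sketch (exact subcommand names can vary between `llama-stack-client` versions), stale registrations can be listed and removed like this:

```bash
# Inspect what is currently registered with the running Llama Stack instance.
llama-stack-client models list
llama-stack-client vector_dbs list

# Remove a stale vector database registration before re-registering it
# ('rhdocs' is the example ID used in the pgvector section of this guide).
llama-stack-client vector_dbs unregister rhdocs
```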


See the full working [config example](examples/openai-faiss-run.yaml) for more details.

### Azure OpenAI

Not yet supported.

### Ollama

The `remote::ollama` provider can be used for inference; however, it does not support tool calling, including the RAG tool.
While Ollama also exposes an OpenAI-compatible endpoint that does support tool calling, that endpoint cannot be used with `llama-stack` due to current limitations in the `remote::openai` provider.

There is an [ongoing discussion](https://github.com/meta-llama/llama-stack/discussions/3034) about enabling tool calling with Ollama.
Currently, tool calling is not supported out of the box. Some experimental patches exist (including internal workarounds), but these are not officially released.

### vLLM Mistral

RAG tool calls were not working properly when experimenting with `mistralai/Mistral-7B-Instruct-v0.3` on vLLM.

---

# Complete Configuration Reference
112 changes: 112 additions & 0 deletions examples/gemini-run.yaml
@@ -0,0 +1,112 @@
# Example llama-stack configuration for Google Gemini inference
#
# Contributed by @eranco74 (2025-08). See https://github.com/rh-ecosystem-edge/assisted-chat/blob/main/template.yaml#L282-L386
# This file shows how to integrate Gemini with LCS.
#
# Notes:
# - You will need valid Gemini API credentials to run this.
# - You will need a postgres instance to run this config.
#
version: 2
image_name: gemini-config
apis:
- agents
- datasetio
- eval
- files
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
    provider_type: remote::gemini
    config:
      api_key: ${env.GEMINI_API_KEY}
  vector_io: []
  files: []
  safety: []
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
      responses_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "${LLAMA_STACK_OTEL_SERVICE_NAME}"
      sinks: ${LLAMA_STACK_TELEMETRY_SINKS}
      sqlite_db_path: ${STORAGE_MOUNT_PATH}/sqlite/trace_store.db
  eval: []
  datasetio: []
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
metadata_store:
  type: sqlite
  db_path: ${STORAGE_MOUNT_PATH}/sqlite/registry.db
inference_store:
  type: postgres
  host: ${env.LLAMA_STACK_POSTGRES_HOST}
  port: ${env.LLAMA_STACK_POSTGRES_PORT}
  db: ${env.LLAMA_STACK_POSTGRES_NAME}
  user: ${env.LLAMA_STACK_POSTGRES_USER}
  password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
models:
- metadata: {}
  model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  model_type: llm
shields: []
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: mcp::assisted
  provider_id: model-context-protocol
  mcp_endpoint:
    uri: "${MCP_SERVER_URL}"
server:
  port: ${LLAMA_STACK_SERVER_PORT}
83 changes: 83 additions & 0 deletions examples/openai-faiss-run.yaml
@@ -0,0 +1,83 @@
# Example llama-stack configuration for OpenAI inference + FAISS (RAG)
#
⚠️ Potential issue

Trim trailing spaces and add newline at EOF (yamllint/CI).

Multiple trailing spaces; missing newline at EOF. These are typical CI blockers.

Also applies to: 6-6, 18-18, 28-29, 32-32, 36-36, 59-59, 68-68, 81-81, 83-83


# Notes:
# - You will need an OpenAI API key
# - You can generate the vector index with the rag-content tool (https://github.com/lightspeed-core/rag-content)
#
version: 2
image_name: openai-faiss-config

apis:
- agents
- inference
- vector_io
- tool_runtime
- safety

models:
- model_id: gpt-test
  provider_id: openai # This ID is a reference to 'providers.inference'
  model_type: llm
  provider_model_id: gpt-4o-mini

- model_id: sentence-transformers/all-mpnet-base-v2
  metadata:
    embedding_dimension: 768
  model_type: embedding
  provider_id: sentence-transformers # This ID is a reference to 'providers.inference'
  provider_model_id: /home/USER/lightspeed-stack/embedding_models/all-mpnet-base-v2

providers:
  inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}

  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

  agents:
Comment on lines +31 to +41

🛠️ Refactor suggestion

Fix indentation to 4 spaces per level under providers and nested maps.

YAML lint expects 4-space indentation. Current blocks under providers/config/stores are under-indented by 2 spaces.

-providers:
-  inference:
-  - provider_id: sentence-transformers 
-    provider_type: inline::sentence-transformers
-    config: {}
-
-  - provider_id: openai 
-    provider_type: remote::openai
-    config:
-      api_key: ${env.OPENAI_API_KEY}
-
-  agents:
-  - provider_id: meta-reference
-    provider_type: inline::meta-reference
-    config:
-      persistence_store:
-        type: sqlite
-        db_path: .llama/distributions/ollama/agents_store.db
-      responses_store:
-        type: sqlite
-        db_path: .llama/distributions/ollama/responses_store.db
-
-  safety:
-  - provider_id: llama-guard
-    provider_type: inline::llama-guard
-    config:
-      excluded_categories: []
-
-  vector_io:
-  - provider_id: ocp-docs 
-    provider_type: inline::faiss
-    config:
-      kvstore:
-        type: sqlite
-        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
-        namespace: null
-
-  tool_runtime:
-  - provider_id: rag-runtime 
-    provider_type: inline::rag-runtime
-    config: {}
+providers:
+    inference:
+        - provider_id: sentence-transformers
+          provider_type: inline::sentence-transformers
+          config: {}
+
+        - provider_id: openai
+          provider_type: remote::openai
+          config:
+              api_key: ${env.OPENAI_API_KEY}
+
+    agents:
+        - provider_id: meta-reference
+          provider_type: inline::meta-reference
+          config:
+              persistence_store:
+                  type: sqlite
+                  db_path: .llama/distributions/ollama/agents_store.db
+              responses_store:
+                  type: sqlite
+                  db_path: .llama/distributions/ollama/responses_store.db
+
+    safety:
+        - provider_id: llama-guard
+          provider_type: inline::llama-guard
+          config:
+              excluded_categories: []
+
+    vector_io:
+        - provider_id: ocp-docs
+          provider_type: inline::faiss
+          config:
+              kvstore:
+                  type: sqlite
+                  db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
+                  namespace: null
+
+    tool_runtime:
+        - provider_id: rag-runtime
+          provider_type: inline::rag-runtime
+          config: {}

Also applies to: 45-51, 56-66, 67-71


  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        db_path: .llama/distributions/ollama/agents_store.db
      responses_store:
        type: sqlite
        db_path: .llama/distributions/ollama/responses_store.db

  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config:
      excluded_categories: []

  vector_io:
  - provider_id: ocp-docs
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
        namespace: null

  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}

# Enable the RAG tool
tool_groups:
- provider_id: rag-runtime
  toolgroup_id: builtin::rag
  args: null
  mcp_endpoint: null

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: ocp-docs # This ID is a reference to 'providers.vector_io'
  vector_db_id: openshift-index # This ID was defined during index generation