15 changes: 15 additions & 0 deletions README.md
@@ -35,6 +35,7 @@ The service includes comprehensive user data collection capabilities for various
* [K8s based authentication](#k8s-based-authentication)
* [JSON Web Keyset based authentication](#json-web-keyset-based-authentication)
* [No-op authentication](#no-op-authentication)
* [RAG Configuration](#rag-configuration)
* [Usage](#usage)
* [Make targets](#make-targets)
* [Running Linux container image](#running-linux-container-image)
@@ -451,7 +452,21 @@ service:
Credentials are not allowed with wildcard origins per CORS/Fetch spec.
See https://fastapi.tiangolo.com/tutorial/cors/

# RAG Configuration

The [guide to RAG setup](docs/rag_guide.md) provides guidance on setting up RAG and includes tested examples for both inference and vector store integration.

## Example configurations for inference

The following llama-stack configuration examples come from production deployments:

- [Granite on vLLM example](examples/vllm-granite-run.yaml)
- [Qwen3 on vLLM example](examples/vllm-qwen3-run.yaml)
- [Gemini example](examples/gemini-run.yaml)
- [VertexAI example](examples/vertexai-run.yaml)

> [!NOTE]
> RAG functionality is **not tested** for these configurations.

# Usage

123 changes: 122 additions & 1 deletion docs/rag_guide.md
@@ -61,7 +61,7 @@ Update the `run.yaml` file used by Llama Stack to point to:
* Your downloaded **embedding model**
* Your generated **vector database**

Example:
### FAISS example

```yaml
models:
@@ -100,10 +100,113 @@ Where:
- `db_path` is the path to the vector index (.db file in this case)
- `vector_db_id` is the index ID used to generate the db

See the full working [config example](examples/openai-faiss-run.yaml) for more details.

Comment on lines +103 to +104

⚠️ Potential issue

Fix the relative links from docs/ to the examples/ directory.

From docs/, the correct relative path is ../examples/...

Update FAISS example reference:

-See the full working [config example](examples/openai-faiss-run.yaml) for more details.
+See the full working [config example](../examples/openai-faiss-run.yaml) for more details.

Update pgvector example reference:

-See the full working [config example](examples/openai-pgvector-run.yaml) for more details.
+See the full working [config example](../examples/openai-pgvector-run.yaml) for more details.

Update vLLM Llama example reference:

-See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.
+See the full working [config example](../examples/vllm-llama-faiss-run.yaml) for more details.

Update OpenAI example reference:

-See the full working [config example](examples/openai-faiss-run.yaml) for more details.
+See the full working [config example](../examples/openai-faiss-run.yaml) for more details.

Also applies to: 152-153, 208-209, 241-242


### pgvector example

This example shows how to configure a remote PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension for storing embeddings.

> You will need to install PostgreSQL together with a pgvector build that matches its version, then log in with `psql` and enable the extension with:
> ```sql
> CREATE EXTENSION IF NOT EXISTS vector;
> ```

Update the connection details (`host`, `port`, `db`, `user`, `password`) to match your PostgreSQL setup.

Each pgvector-backed table follows this schema:

- `id` (`text`): UUID identifier of the chunk
- `document` (`jsonb`): JSON containing the content and metadata associated with the embedding
- `embedding` (`vector(n)`): the embedding vector, where `n` is the embedding dimension and will match the model's output size (e.g. 768 for `all-mpnet-base-v2`)

> [!NOTE]
> The `vector_db_id` (e.g. `rhdocs`) is used to point to the table named `vector_store_rhdocs` in the specified database, which stores the vector embeddings.


```yaml
[...]
providers:
  [...]
  vector_io:
  - provider_id: pgvector-example
    provider_type: remote::pgvector
    config:
      host: localhost
      port: 5432
      db: pgvector_example # PostgreSQL database (psql -d pgvector_example)
      user: lightspeed # PostgreSQL user
      password: password123
      kvstore:
        type: sqlite
        db_path: .llama/distributions/pgvector/pgvector_registry.db

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: pgvector-example
  # A unique ID that becomes the PostgreSQL table name, prefixed with 'vector_store_'.
  # e.g., 'rhdocs' will create the table 'vector_store_rhdocs'.
  # If the table was already created, this value must match the ID used at creation.
  vector_db_id: rhdocs
```
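
For illustration, the table backing the `rhdocs` vector database could be created (or inspected) manually as sketched below. This is only a sketch of the schema described above; Llama Stack normally creates and manages the table itself, and the connection values are the ones from the example configuration.

```bash
# Illustrative only: the table Llama Stack manages for vector_db_id 'rhdocs',
# following the schema described above (768 matches all-mpnet-base-v2).
psql -d pgvector_example -U lightspeed <<'SQL'
CREATE TABLE IF NOT EXISTS vector_store_rhdocs (
    id        TEXT,         -- UUID identifier of the chunk
    document  JSONB,        -- content and metadata for the embedding
    embedding vector(768)   -- embedding vector; dimension matches the model
);
SQL
```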

See the full working [config example](examples/openai-pgvector-run.yaml) for more details.

---

## Add an Inference Model (LLM)

### vLLM on RHEL AI (Llama 3.1) example

> [!NOTE]
> The following example assumes that podman's CDI has been properly configured to [enable GPU support](https://podman-desktop.io/docs/podman/gpu).

The [`vllm-openai`](https://hub.docker.com/r/vllm/vllm-openai) Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on **RHEL AI** with `podman`:

```bash
podman run \
--device "${CONTAINER_DEVICE}" \
--gpus ${GPUS} \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p ${EXPORTED_PORT}:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3.1_json.jinja
```

> The example command above enables tool calling for Llama 3.1 models.
> For other supported models and configuration options, see the vLLM documentation:
> [vLLM: Tool Calling](https://docs.vllm.ai/en/stable/features/tool_calling.html)
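
Once the container is running, a quick way to confirm that the OpenAI-compatible endpoint is up and serving the expected model is to query its `/v1/models` route (a minimal sketch, assuming the port mapping from the `podman run` command above):

```bash
# Sanity check: list the models served by the vLLM container started above.
# EXPORTED_PORT is the host port used in the 'podman run' command.
curl -s "http://localhost:${EXPORTED_PORT}/v1/models" | python3 -m json.tool
```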

After starting the container, edit your `run.yaml` file so that `model_id` matches the model passed to the `podman run` command.

```yaml
[...]
models:
  [...]
- model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here> # if any
```

See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.

### OpenAI example

Add a provider for your language model (e.g., OpenAI):

```yaml
@@ -133,6 +236,24 @@ export OPENAI_API_KEY=<your-key-here>
> When experimenting with different `models`, `providers` and `vector_dbs`, you might need to manually unregister the old ones with the Llama Stack client CLI (e.g. `llama-stack-client vector_dbs list`)
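
As a sketch (exact subcommand names can vary between `llama-stack-client` versions), stale registrations can be listed and removed like this:

```bash
# Inspect what is currently registered with the running Llama Stack instance.
llama-stack-client models list
llama-stack-client vector_dbs list

# Remove a stale vector database registration before re-registering it
# ('rhdocs' is the example ID used in the pgvector section of this guide).
llama-stack-client vector_dbs unregister rhdocs
```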


See the full working [config example](examples/openai-faiss-run.yaml) for more details.

### Azure OpenAI

Not yet supported.

### Ollama

The `remote::ollama` provider can be used for inference; however, it does not support tool calling, including the RAG tool.
While Ollama also exposes an OpenAI-compatible endpoint that does support tool calling, that endpoint cannot be used with `llama-stack` due to current limitations in the `remote::openai` provider.

There is an [ongoing discussion](https://github.com/meta-llama/llama-stack/discussions/3034) about enabling tool calling with Ollama.
Currently, tool calling is not supported out of the box. Some experimental patches exist (including internal workarounds), but these are not officially released.

### vLLM Mistral

RAG tool calls were not working properly when experimenting with `mistralai/Mistral-7B-Instruct-v0.3` on vLLM.

---

# Complete Configuration Reference
112 changes: 112 additions & 0 deletions examples/gemini-run.yaml
@@ -0,0 +1,112 @@
# Example llama-stack configuration for Google Gemini inference
#
# Contributed by @eranco74 (2025-08). See https://github.com/rh-ecosystem-edge/assisted-chat/blob/main/template.yaml#L282-L386
# This file shows how to integrate Gemini with LCS.
#
# Notes:
# - You will need valid Gemini API credentials to run this.
# - You will need a postgres instance to run this config.
#
version: 2
image_name: gemini-config
apis:
- agents
- datasetio
- eval
- files
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
    provider_type: remote::gemini
    config:
      api_key: ${env.GEMINI_API_KEY}
  vector_io: []
  files: []
  safety: []
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
      responses_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "${LLAMA_STACK_OTEL_SERVICE_NAME}"
      sinks: ${LLAMA_STACK_TELEMETRY_SINKS}
      sqlite_db_path: ${STORAGE_MOUNT_PATH}/sqlite/trace_store.db
  eval: []
  datasetio: []
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
metadata_store:
  type: sqlite
  db_path: ${STORAGE_MOUNT_PATH}/sqlite/registry.db
inference_store:
  type: postgres
  host: ${env.LLAMA_STACK_POSTGRES_HOST}
  port: ${env.LLAMA_STACK_POSTGRES_PORT}
  db: ${env.LLAMA_STACK_POSTGRES_NAME}
  user: ${env.LLAMA_STACK_POSTGRES_USER}
  password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
models:
- metadata: {}
  model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  model_type: llm
shields: []
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: mcp::assisted
  provider_id: model-context-protocol
  mcp_endpoint:
    uri: "${MCP_SERVER_URL}"
server:
  port: ${LLAMA_STACK_SERVER_PORT}
83 changes: 83 additions & 0 deletions examples/openai-faiss-run.yaml
@@ -0,0 +1,83 @@
# Example llama-stack configuration for OpenAI inference + FAISS (RAG)
#
⚠️ Potential issue

Trim trailing spaces and add newline at EOF (yamllint/CI).

Multiple trailing spaces; missing newline at EOF. These are typical CI blockers.

Also applies to: 6-6, 18-18, 28-29, 32-32, 36-36, 59-59, 68-68, 81-81, 83-83


# Notes:
# - You will need an OpenAI API key
# - You can generate the vector index with the rag-content tool (https://github.com/lightspeed-core/rag-content)
#
version: 2
image_name: openai-faiss-config

apis:
- agents
- inference
- vector_io
- tool_runtime
- safety

models:
- model_id: gpt-test
  provider_id: openai # This ID is a reference to 'providers.inference'
  model_type: llm
  provider_model_id: gpt-4o-mini

- model_id: sentence-transformers/all-mpnet-base-v2
  metadata:
    embedding_dimension: 768
  model_type: embedding
  provider_id: sentence-transformers # This ID is a reference to 'providers.inference'
  provider_model_id: /home/USER/lightspeed-stack/embedding_models/all-mpnet-base-v2

providers:
  inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}

  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

  agents:
Comment on lines +31 to +41

🛠️ Refactor suggestion

Fix indentation to 4 spaces per level under providers and nested maps.

YAML lint expects 4-space indentation. Current blocks under providers/config/stores are under-indented by 2 spaces.

-providers:
-  inference:
-  - provider_id: sentence-transformers 
-    provider_type: inline::sentence-transformers
-    config: {}
-
-  - provider_id: openai 
-    provider_type: remote::openai
-    config:
-      api_key: ${env.OPENAI_API_KEY}
-
-  agents:
-  - provider_id: meta-reference
-    provider_type: inline::meta-reference
-    config:
-      persistence_store:
-        type: sqlite
-        db_path: .llama/distributions/ollama/agents_store.db
-      responses_store:
-        type: sqlite
-        db_path: .llama/distributions/ollama/responses_store.db
-
-  safety:
-  - provider_id: llama-guard
-    provider_type: inline::llama-guard
-    config:
-      excluded_categories: []
-
-  vector_io:
-  - provider_id: ocp-docs 
-    provider_type: inline::faiss
-    config:
-      kvstore:
-        type: sqlite
-        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
-        namespace: null
-
-  tool_runtime:
-  - provider_id: rag-runtime 
-    provider_type: inline::rag-runtime
-    config: {}
+providers:
+    inference:
+        - provider_id: sentence-transformers
+          provider_type: inline::sentence-transformers
+          config: {}
+
+        - provider_id: openai
+          provider_type: remote::openai
+          config:
+              api_key: ${env.OPENAI_API_KEY}
+
+    agents:
+        - provider_id: meta-reference
+          provider_type: inline::meta-reference
+          config:
+              persistence_store:
+                  type: sqlite
+                  db_path: .llama/distributions/ollama/agents_store.db
+              responses_store:
+                  type: sqlite
+                  db_path: .llama/distributions/ollama/responses_store.db
+
+    safety:
+        - provider_id: llama-guard
+          provider_type: inline::llama-guard
+          config:
+              excluded_categories: []
+
+    vector_io:
+        - provider_id: ocp-docs
+          provider_type: inline::faiss
+          config:
+              kvstore:
+                  type: sqlite
+                  db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
+                  namespace: null
+
+    tool_runtime:
+        - provider_id: rag-runtime
+          provider_type: inline::rag-runtime
+          config: {}

Also applies to: 45-51, 56-66, 67-71


  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        db_path: .llama/distributions/ollama/agents_store.db
      responses_store:
        type: sqlite
        db_path: .llama/distributions/ollama/responses_store.db

  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config:
      excluded_categories: []

  vector_io:
  - provider_id: ocp-docs
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
        namespace: null

  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}

# Enable the RAG tool
tool_groups:
- provider_id: rag-runtime
  toolgroup_id: builtin::rag
  args: null
  mcp_endpoint: null

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: ocp-docs # This ID is a reference to 'providers.vector_io'
  vector_db_id: openshift-index # This ID was defined during index generation