Add a blueprint for Haystack Deep Research Agent (#461)

oryx1729 · web-flow · commit a3c7296739f6 · 2025-09-25T12:43:27.000Z
Adds a blueprint demonstrating how to build a deep research agent using Haystack Framework that combines web search and Retrieval-Augmented Generation (RAG) using the NeMo-Agent-Toolkit.

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

- New Features
  - Adds a Haystack Deep Research Agent example: web search + RAG + orchestrated agent, OpenSearch integration, optional startup indexing, sample data, and configurable agent/LLM settings.
- Documentation
  - Adds a comprehensive README with architecture, setup, configuration, usage examples, and troubleshooting.
- Tests
  - Adds config validation and conditional end-to-end tests for the example workflow.
- Chores
  - Adds packaging/metadata, component registration, sample data pointers, and a CI allowlist entry.

Authors:
- https://github.com/oryx1729
- Michele Pangrazzi (https://github.com/mpangrazzi)

Approvers:
- Will Killian (https://github.com/willkill07)

URL: #461
diff --git a/ci/scripts/path_checks.py b/ci/scripts/path_checks.py
@@ -165,6 +165,10 @@
         r"^examples/notebooks/retail_sales_agent/.*configs/",
         r"^\./retail_sales_agent/data/",
     ),
+    (
+        r"^examples/basic/frameworks/haystack_deep_research_agent/README.md",
+        r"^examples/basic/frameworks/haystack_deep_research_agent/data/bedrock-ug.pdf",
+    ),
     # ignore generated files
     (
         r"^docs/",
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/README.md b/examples/basic/frameworks/haystack_deep_research_agent/README.md
@@ -0,0 +1,215 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Haystack Deep Research Agent
+
+This example demonstrates how to build a deep research agent with the Haystack framework that combines web search and Retrieval-Augmented Generation (RAG) using the NeMo-Agent-Toolkit.
+
+## Overview
+
+The Haystack Deep Research Agent is an intelligent research assistant that can:
+
+- **Web Search**: Search the internet for current information using SerperDev API
+- **Document Retrieval**: Query an internal document database using RAG with OpenSearch
+- **Comprehensive Research**: Combine both sources to provide thorough, well-cited research reports
+- **Intelligent Routing**: Automatically decide when to use web search vs. internal documents
+
+## Architecture
+
+The workflow consists of three main components:
+
+1. **Web Search Tool**: Uses Haystack's SerperDevWebSearch and LinkContentFetcher to search the web and extract content from web pages
+2. **RAG Tool**: Uses OpenSearchDocumentStore to index and query internal documents with semantic retrieval
+3. **Deep Research Agent** (`register.py`): Orchestrates the agent and imports modular pipelines from `src/nat_haystack_deep_research_agent/pipelines/`:
+   - `search.py`: builds the web search tool
+   - `rag.py`: builds the RAG pipeline and tool
+   - `indexing.py`: startup indexing (PDF/TXT/MD) into OpenSearch
+
+## Prerequisites
+
+Before using this workflow, ensure you have:
+
+1. **NVIDIA API Key**: Required for the chat generator and RAG functionality
+   - Get your key from [NVIDIA API Catalog](https://build.nvidia.com/)
+   - Set as environment variable: `export NVIDIA_API_KEY=your_key_here`
+
+2. **SerperDev API Key**: Required for web search functionality
+   - Get your key from [SerperDev](https://serper.dev)
+   - Set as environment variable: `export SERPERDEV_API_KEY=your_key_here`
+
+3. **OpenSearch Instance**: Required for RAG functionality
+   - You can run OpenSearch locally using `docker`
+
+## Installation and Usage
+
+Follow the instructions in the [Install Guide](../../../../docs/source/quick-start/installing.md#install-from-source) to create the development environment and install NVIDIA NeMo Agent Toolkit.
+
+### Step 1: Set Your API Keys
+
+```bash
+export NVIDIA_API_KEY=<YOUR_NVIDIA_API_KEY>
+export SERPERDEV_API_KEY=<YOUR_SERPERDEV_API_KEY>
+```
+
+### Step 2: Start OpenSearch (if not already running)
+
+```bash
+docker run -d --name opensearch -p 9200:9200 -p 9600:9600 \
+  -e "discovery.type=single-node" \
+  -e "plugins.security.disabled=true" \
+  opensearchproject/opensearch:2.11.1
+```
+
+### Step 3: Install the Workflow
+
+```bash
+uv pip install -e examples/basic/frameworks/haystack_deep_research_agent
+```
+
+### Step 4: Add Sample Documents (Optional)
+
+Place documents in the example `data/` directory to enable RAG (PDF, TXT, or MD). On startup, the workflow indexes files from:
+
+- `workflow.data_dir` (default: `/data`)
+- If empty/missing, it falls back to this example's bundled `data/` directory
+
+```bash
+# Example: Download a sample PDF
+wget "https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf" \
+  -O examples/basic/frameworks/haystack_deep_research_agent/data/bedrock-ug.pdf
+```
+
+### Step 5: Run the Workflow
+
+```bash
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the latest updates on the Artemis moon mission?"
+```
+
+## Example Queries
+
+Here are some example queries you can try:
+
+**Web Search Examples:**
+
+```bash
+# Current events
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the latest developments in AI research for 2024?"
+
+# Technology news
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the new features in the latest Python release?"
+```
+
+**RAG Examples (if you have documents indexed):**
+
+```bash
+# Document-specific queries
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the key features of AWS Bedrock?"
+
+# Mixed queries (will use both web search and RAG)
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "How does AWS Bedrock compare to other AI platforms in 2024?"
+```
+
+**Web Search + RAG Examples:**
+
+```bash
+nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "Is panna (heavy cream) needed on carbonara? Check online the recipe and compare it with the one from our internal dataset."
+```
+
+## Testing
+
+### Quick smoke test (no external services)
+
+- Validates the workflow config without hitting LLMs or OpenSearch.
+
+```bash
+# In your virtual environment
+pytest -q examples/basic/frameworks/haystack_deep_research_agent/tests -k config_yaml_loads_and_has_keys
+```
+
+### End-to-end test (requires keys + OpenSearch)
+
+- Prerequisites:
+  - Set keys: `NVIDIA_API_KEY` and `SERPERDEV_API_KEY`
+  - OpenSearch running on `http://localhost:9200` (start with Docker):
+
+```bash
+docker run -d --name opensearch -p 9200:9200 -p 9600:9600 \
+  -e "discovery.type=single-node" \
+  -e "plugins.security.disabled=true" \
+  opensearchproject/opensearch:2.11.1
+```
+
+- Run the e2e test (ensure `pytest-asyncio` is installed in your virtual environment):
+
+```bash
+pip install pytest-asyncio  # if not already installed
+export NVIDIA_API_KEY=<YOUR_KEY>
+export SERPERDEV_API_KEY=<YOUR_KEY>
+pytest -q examples/basic/frameworks/haystack_deep_research_agent/tests -k full_workflow_e2e
+```
+
+## Configuration
+
+The workflow is configured via `config.yml`. Key configuration options include:
+
+- **Web Search Tool**:
+  - `search_top_k`: Number of search results to retrieve (default: 10)
+  - `timeout`: Timeout for fetching web content (default: 3 seconds)
+  - `retry_attempts`: Number of retry attempts for failed requests (default: 2)
+
+- **RAG Tool**:
+  - `opensearch_url`: OpenSearch host URL (default: `http://localhost:9200`)
+  - `index_name`: OpenSearch index name (fixed: `deep_research_docs`)
+  - `rag_top_k`: Number of documents to retrieve (default: 15)
+  - `index_on_startup`: If true, run indexing pipeline on start
+  - `data_dir`: Directory to scan for documents; if empty/missing, falls back to example `data/`
+
+- **Agent**:
+  - `max_agent_steps`: Maximum number of agent steps (default: 20)
+  - `system_prompt`: Customizable system prompt for the agent
+
+## Customization
+
+You can customize the workflow by:
+
+1. **Modifying the system prompt** in `config.yml` to change the agent's behavior
+2. **Adding more document types** by extending the RAG tool to support other file formats
+3. **Changing the LLM model** by updating the top-level `llms` section in `config.yml`. This example defines `agent_llm` and `rag_llm` using the `nim` provider so they can leverage common parameters like `temperature`, `top_p`, and `max_tokens`. The workflow references them via the builder. See Haystack's NvidiaChatGenerator docs: [NvidiaChatGenerator](https://docs.haystack.deepset.ai/docs/nvidiachatgenerator)
+4. **Adjusting search parameters** to optimize for your use case
+
+## Troubleshooting
+
+**Common Issues:**
+
+1. **OpenSearch Connection Error**: Ensure OpenSearch is running and accessible at the configured host
+2. **Missing API Keys**: Verify that both `NVIDIA_API_KEY` and `SERPERDEV_API_KEY` are set
+3. **No Documents Found**: Check that documents (PDF, TXT, or MD) are placed in the data directory and that `data_dir` points to it
+4. **Web Search Fails**: Verify your SerperDev API key is valid and has remaining quota
+
+**Logs**: Check the NeMo-Agent-Toolkit logs for detailed error information and debugging.
+
+## Architecture Details
+
+The workflow demonstrates several key NeMo-Agent-Toolkit patterns:
+
+- **Workflow Registration**: The agent is exposed as a workflow function with a Pydantic config
+- **Builder LLM Integration**: LLMs are defined under top-level `llms:` and accessed via `builder.get_llm_config(...)`
+- **Component Integration**: Haystack components are composed into tools within the workflow
+- **Error Handling**: Robust error handling with fallback behaviors
+- **Async Operations**: All operations are asynchronous for better performance
+
+This example shows how Haystack components can be integrated into NeMo-Agent-Toolkit workflows as tools orchestrated by a single agent.
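Review note on the README: the `data_dir` fallback described in Step 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `pipelines/indexing.py`, whose body is not shown in this diff. The names `list_documents`, `resolve_data_dir`, and `SUPPORTED_SUFFIXES`, and the choice of a non-recursive scan, are hypothetical.

```python
from pathlib import Path

# Assumption: these suffixes mirror the formats the README lists (PDF, TXT, MD).
SUPPORTED_SUFFIXES = {".pdf", ".txt", ".md"}


def list_documents(directory: Path) -> list[Path]:
    """Return supported files directly inside `directory` (empty if it is missing)."""
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def resolve_data_dir(configured: Path, bundled: Path) -> Path:
    """Prefer the configured directory; fall back to the bundled one if it has no documents."""
    return configured if list_documents(configured) else bundled
```

With the defaults in `config.yml`, `configured` would be `/data` and `bundled` the example's own `data/` directory, so a machine without `/data` would index the bundled sample files.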
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/configs b/examples/basic/frameworks/haystack_deep_research_agent/configs
@@ -0,0 +1 @@
+src/nat_haystack_deep_research_agent/configs
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/data b/examples/basic/frameworks/haystack_deep_research_agent/data
@@ -0,0 +1 @@
+src/nat_haystack_deep_research_agent/data
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/pyproject.toml b/examples/basic/frameworks/haystack_deep_research_agent/pyproject.toml
@@ -0,0 +1,28 @@
+[build-system]
+build-backend = "setuptools.build_meta"
+requires = ["setuptools >= 64", "setuptools-scm>=8"]
+
+[tool.setuptools_scm]
+root = "../../../.."
+
+[project]
+name = "nat_haystack_deep_research_agent"
+dynamic = ["version"]
+dependencies = [
+  "nvidia-nat~=1.3",
+  "haystack-ai>=2.17.0,<2.19",
+  "opensearch-haystack~=4.2.0",
+  "nvidia-haystack~=0.3.0",
+  "trafilatura~=2.0.0",
+  "pypdf~=5.8.0",
+  "docstring-parser~=0.16",
+]
+requires-python = ">=3.11,<3.13"
+description = "Haystack Deep Research Agent workflow for NVIDIA NeMo Agent Toolkit"
+classifiers = ["Programming Language :: Python"]
+
+[tool.uv.sources]
+nvidia-nat = { path = "../../../..", editable = true }
+
+[project.entry-points.'nat.components']
+nat_haystack_deep_research_agent = "nat_haystack_deep_research_agent.register"
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/__init__.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/__init__.py
@@ -0,0 +1,22 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Re-export pipelines helpers for convenience
+try:
+    from .pipelines.indexing import run_startup_indexing  # noqa: F401
+    from .pipelines.rag import create_rag_tool  # noqa: F401
+    from .pipelines.search import create_search_tool  # noqa: F401
+except Exception:  # pragma: no cover - optional during install time
+    pass
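Review note on `__init__.py`: the guarded re-export above swallows every exception, including real bugs inside the pipeline modules. One alternative, shown only as a sketch, is to catch `ImportError` alone and log it. `optional_attr` is a hypothetical helper and is not part of this PR.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_attr(module_name: str, attr: str):
    """Return `module_name.attr`, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Only missing modules are tolerated; other errors propagate to the caller.
        logger.debug("Optional import of %s skipped: %s", module_name, exc)
        return None
    return getattr(module, attr)
```

The trade-off is that a misspelled attribute or a failure while a pipeline module executes would now raise at import time instead of being hidden.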
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/configs/config.yml b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/configs/config.yml
@@ -0,0 +1,31 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+llms:
+  rag_llm:
+    _type: nim
+    model: nvidia/llama-3.3-nemotron-super-49b-v1
+  agent_llm:
+    _type: nim
+    model: nvidia/llama-3.3-nemotron-super-49b-v1
+
+workflow:
+  _type: haystack_deep_research_agent
+  max_agent_steps: 20
+  search_top_k: 10
+  rag_top_k: 15
+  opensearch_url: http://localhost:9200
+  index_on_startup: true
+  data_dir: /data
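Review note on `config.yml`: the README's quick smoke test checks that the config loads and has the expected keys. A dependency-free sketch of that kind of check follows. It works on an already-parsed mapping (for example the result of a YAML loader) and is not the test shipped in this PR. `REQUIRED_WORKFLOW_KEYS` and `check_config` are hypothetical names, and the key list is simply read off the file above.

```python
# Assumption: the required keys are exactly those present in the config.yml above.
REQUIRED_WORKFLOW_KEYS = {
    "_type", "max_agent_steps", "search_top_k", "rag_top_k",
    "opensearch_url", "index_on_startup", "data_dir",
}


def check_config(config: dict) -> list[str]:
    """Return a list of problems found in a parsed config; empty means it looks fine."""
    problems = []
    workflow = config.get("workflow", {})
    llms = config.get("llms", {})

    # Every workflow key used by the example must be present.
    missing = REQUIRED_WORKFLOW_KEYS - workflow.keys()
    problems += [f"workflow.{key} is missing" for key in sorted(missing)]

    # Both LLM entries defined in the example must exist.
    for name in ("rag_llm", "agent_llm"):
        if name not in llms:
            problems.append(f"llms.{name} is missing")

    # Numeric limits must be positive integers (bool is rejected explicitly).
    for key in ("max_agent_steps", "search_top_k", "rag_top_k"):
        if key in workflow:
            value = workflow[key]
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"workflow.{key} must be a positive integer")
    return problems
```

Returning a list of problems instead of raising on the first one lets a test report every missing key in a single run.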
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/data/carbonara.md b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/data/carbonara.md
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:620c7c473f1fc8913e017026287617069de7cf596d9501481b0916001c9f4291
+size 2740
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/data/sample_document.txt b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/data/sample_document.txt
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3f5c0860de3cf2ebe91c874b0023b5641706c611cf00032e860b2f78f60b8ad2
+size 2098
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/__init__.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/__init__.py
@@ -0,0 +1,18 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .indexing import run_startup_indexing  # noqa: F401
+from .rag import create_rag_tool  # noqa: F401
+from .search import create_search_tool  # noqa: F401
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/indexing.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/indexing.py
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/rag.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/rag.py
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/search.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/search.py
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/register.py b/examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/register.py
diff --git a/examples/basic/frameworks/haystack_deep_research_agent/tests/test_haystack_deep_research_agent.py b/examples/basic/frameworks/haystack_deep_research_agent/tests/test_haystack_deep_research_agent.py
