
Commit a3c7296

Add a blueprint for Haystack Deep Research Agent (#461)
Adds a blueprint demonstrating how to build a deep research agent using the Haystack framework that combines web search and Retrieval-Augmented Generation (RAG) using the NeMo-Agent-Toolkit.

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

- New Features
  - Adds a Haystack Deep Research Agent example: web search + RAG + orchestrated agent, OpenSearch integration, optional startup indexing, sample data, and configurable agent/LLM settings.
- Documentation
  - Adds a comprehensive README with architecture, setup, configuration, usage examples, and troubleshooting.
- Tests
  - Adds config validation and conditional end-to-end tests for the example workflow.
- Chores
  - Adds packaging/metadata, component registration, sample data pointers, and a CI allowlist entry.

Authors:
- https://github.com/oryx1729
- Michele Pangrazzi (https://github.com/mpangrazzi)

Approvers:
- Will Killian (https://github.com/willkill07)

URL: #461
1 parent b8accdd commit a3c7296

File tree

15 files changed: +817 -0 lines changed

ci/scripts/path_checks.py

Lines changed: 4 additions & 0 deletions
@@ -165,6 +165,10 @@
 r"^examples/notebooks/retail_sales_agent/.*configs/",
 r"^\./retail_sales_agent/data/",
 ),
+(
+r"^examples/basic/frameworks/haystack_deep_research_agent/README.md",
+r"^examples/basic/frameworks/haystack_deep_research_agent/data/bedrock-ug.pdf",
+),
 # ignore generated files
 (
 r"^docs/",
examples/basic/frameworks/haystack_deep_research_agent/README.md

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@

<!--
SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Haystack Deep Research Agent

This example demonstrates how to build a deep research agent with the Haystack framework, combining web search and Retrieval Augmented Generation (RAG) capabilities using the NeMo-Agent-Toolkit.

## Overview

The Haystack Deep Research Agent is an intelligent research assistant that can:

- **Web Search**: Search the internet for current information using the SerperDev API
- **Document Retrieval**: Query an internal document database using RAG with OpenSearch
- **Comprehensive Research**: Combine both sources to provide thorough, well-cited research reports
- **Intelligent Routing**: Automatically decide when to use web search vs. internal documents

## Architecture

The workflow consists of three main components:

1. **Web Search Tool**: Uses Haystack's SerperDevWebSearch and LinkContentFetcher to search the web and extract content from web pages (a minimal sketch follows this list)
2. **RAG Tool**: Uses OpenSearchDocumentStore to index and query internal documents with semantic retrieval
3. **Deep Research Agent** (`register.py`): Orchestrates the agent and imports modular pipelines from `src/nat_haystack_deep_research_agent/pipelines/`:
   - `search.py`: builds the web search tool
   - `rag.py`: builds the RAG pipeline and tool
   - `indexing.py`: startup indexing (PDF/TXT/MD) into OpenSearch
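The web search tool described above is assembled from standard Haystack 2.x components. Below is a minimal sketch of how such a pipeline can be composed; the component classes (`SerperDevWebSearch`, `LinkContentFetcher`, `HTMLToDocument`) are real Haystack components, while the `build_search_pipeline` helper and its exact wiring are illustrative assumptions rather than the contents of `search.py`.

```python
import os

from haystack import Pipeline
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.websearch import SerperDevWebSearch
from haystack.utils import Secret


def build_search_pipeline(top_k: int = 10) -> Pipeline:
    """Illustrative web search pipeline: search -> fetch pages -> convert to Documents."""
    pipeline = Pipeline()
    pipeline.add_component(
        "search", SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=top_k))
    pipeline.add_component("fetcher", LinkContentFetcher())
    pipeline.add_component("converter", HTMLToDocument())

    # SerperDevWebSearch emits result links; the fetcher downloads the pages and
    # the converter turns the raw HTML into Haystack Documents the agent can cite.
    pipeline.connect("search.links", "fetcher.urls")
    pipeline.connect("fetcher.streams", "converter.sources")
    return pipeline


if __name__ == "__main__":
    assert os.environ.get("SERPERDEV_API_KEY"), "export SERPERDEV_API_KEY first"
    result = build_search_pipeline().run({"search": {"query": "Artemis moon mission updates"}})
    print(len(result["converter"]["documents"]), "web documents fetched")
```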
## Prerequisites

Before using this workflow, ensure you have:

1. **NVIDIA API Key**: Required for the chat generator and RAG functionality
   - Get your key from [NVIDIA API Catalog](https://build.nvidia.com/)
   - Set as environment variable: `export NVIDIA_API_KEY=your_key_here`

2. **SerperDev API Key**: Required for web search functionality
   - Get your key from [SerperDev](https://serper.dev)
   - Set as environment variable: `export SERPERDEV_API_KEY=your_key_here`

3. **OpenSearch Instance**: Required for RAG functionality
   - You can run OpenSearch locally using `docker`

## Installation and Usage

Follow the instructions in the [Install Guide](../../../../docs/source/quick-start/installing.md#install-from-source) to create the development environment and install NVIDIA NeMo Agent Toolkit.

### Step 1: Set Your API Keys

```bash
export NVIDIA_API_KEY=<YOUR_NVIDIA_API_KEY>
export SERPERDEV_API_KEY=<YOUR_SERPERDEV_API_KEY>
```

### Step 2: Start OpenSearch (if not already running)

```bash
docker run -d --name opensearch -p 9200:9200 -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "plugins.security.disabled=true" \
  opensearchproject/opensearch:2.11.1
```
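Once the container is up, a couple of lines of Python can confirm the store is reachable before you run the workflow. This is an optional sanity check, not part of the example; it assumes the default URL and the `deep_research_docs` index name used by this workflow.

```python
# Optional sanity check that OpenSearch is reachable (assumes default URL and index name).
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

store = OpenSearchDocumentStore(hosts="http://localhost:9200", index="deep_research_docs")
print("Documents currently indexed:", store.count_documents())
```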
### Step 3: Install the Workflow

```bash
uv pip install -e examples/basic/frameworks/haystack_deep_research_agent
```

### Step 4: Add Sample Documents (Optional)

Place documents in the example `data/` directory to enable RAG (PDF, TXT, or MD). On startup, the workflow indexes files from:

- `workflow.data_dir` (default: `/data`)
- If empty/missing, it falls back to this example's bundled `data/` directory

```bash
# Example: Download a sample PDF
wget "https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf" \
  -O examples/basic/frameworks/haystack_deep_research_agent/data/bedrock-ug.pdf
```
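The startup indexing performed by `indexing.py` follows the usual Haystack convert, split, write pattern. The sketch below shows one way such a pipeline could look for PDFs; the actual module also handles TXT/MD files, and its converter choices, splitter settings, and helper names may differ.

```python
# A minimal sketch of a startup indexing pipeline (PDF -> OpenSearch); TXT/MD files
# would go through TextFileToDocument / MarkdownToDocument in the same way.
# Splitter settings and index layout are assumptions, not the example verbatim.
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

store = OpenSearchDocumentStore(hosts="http://localhost:9200", index="deep_research_docs")

indexing = Pipeline()
indexing.add_component("pdf_converter", PyPDFToDocument())
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
indexing.add_component("writer", DocumentWriter(document_store=store))
indexing.connect("pdf_converter.documents", "splitter.documents")
indexing.connect("splitter.documents", "writer.documents")

data_dir = Path("examples/basic/frameworks/haystack_deep_research_agent/data")
pdf_files = list(data_dir.glob("*.pdf"))
if pdf_files:
    indexing.run({"pdf_converter": {"sources": pdf_files}})
    print("Indexed documents:", store.count_documents())
```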
### Step 5: Run the Workflow

```bash
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the latest updates on the Artemis moon mission?"
```

## Example Queries

Here are some example queries you can try:

**Web Search Examples:**

```bash
# Current events
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the latest developments in AI research for 2024?"

# Technology news
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the new features in the latest Python release?"
```

**RAG Examples (if you have documents indexed):**

```bash
# Document-specific queries
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "What are the key features of AWS Bedrock?"

# Mixed queries (will use both web search and RAG)
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "How does AWS Bedrock compare to other AI platforms in 2024?"
```

**Web Search + RAG Examples:**

```bash
nat run --config_file=examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml --input "Is panna (heavy cream) needed on carbonara? Check online the recipe and compare it with the one from our internal dataset."
```

## Testing

### Quick smoke test (no external services)

- Validates the workflow config without hitting LLMs or OpenSearch.

```bash
# In your virtual environment
pytest -q examples/basic/frameworks/haystack_deep_research_agent/tests -k config_yaml_loads_and_has_keys
```
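The actual test lives under the example's `tests/` directory; the snippet below is only a sketch of the kind of assertion it makes (parse the YAML, check the expected sections exist) so you can see why it needs no external services. The test name matches the `-k` filter above, but the body is illustrative.

```python
# Illustrative config smoke test -- no LLM, SerperDev, or OpenSearch access needed.
from pathlib import Path

import yaml

CONFIG = Path("examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml")


def test_config_yaml_loads_and_has_keys():
    cfg = yaml.safe_load(CONFIG.read_text())
    assert cfg["workflow"]["_type"] == "haystack_deep_research_agent"
    assert {"agent_llm", "rag_llm"} <= set(cfg["llms"])
```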
### End-to-end test (requires keys + OpenSearch)

- Prerequisites:
  - Set keys: `NVIDIA_API_KEY` and `SERPERDEV_API_KEY`
  - OpenSearch running on `http://localhost:9200` (start with Docker):

```bash
docker run -d --name opensearch -p 9200:9200 -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "plugins.security.disabled=true" \
  opensearchproject/opensearch:2.11.1
```

- Run the e2e test (ensure `pytest-asyncio` is installed in your virtual environment):

```bash
pip install pytest-asyncio # if not already installed
export NVIDIA_API_KEY=<YOUR_KEY>
export SERPERDEV_API_KEY=<YOUR_KEY>
pytest -q examples/basic/frameworks/haystack_deep_research_agent/tests -k full_workflow_e2e
```

## Configuration

The workflow is configured via `config.yml`. Key configuration options include:

- **Web Search Tool**:
  - `top_k`: Number of search results to retrieve (default: 10)
  - `timeout`: Timeout for fetching web content (default: 3 seconds)
  - `retry_attempts`: Number of retry attempts for failed requests (default: 2)

- **RAG Tool** (see the sketch after this list):
  - `opensearch_url`: OpenSearch host URL (default: `http://localhost:9200`)
  - `index_name`: OpenSearch index name (fixed: `deep_research_docs`)
  - `top_k`: Number of documents to retrieve (default: 15)
  - `index_on_startup`: If true, run indexing pipeline on start
  - `data_dir`: Directory to scan for documents; if empty/missing, falls back to example `data/`

- **Agent**:
  - `max_agent_steps`: Maximum number of agent steps (default: 20)
  - `system_prompt`: Customizable system prompt for the agent
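To make the RAG settings above concrete, here is a rough sketch of how a query pipeline over the `deep_research_docs` index could be assembled. The embedder/retriever pairing is an assumption based on the example's dependencies (`nvidia-haystack`, `opensearch-haystack`); the actual `rag.py` may use different components, models, or defaults.

```python
# Illustrative RAG query pipeline: embed the query, retrieve top_k documents from OpenSearch.
# NvidiaTextEmbedder reads NVIDIA_API_KEY from the environment by default.
from haystack import Pipeline
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.retrievers.opensearch import OpenSearchEmbeddingRetriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

store = OpenSearchDocumentStore(hosts="http://localhost:9200", index="deep_research_docs")

rag = Pipeline()
rag.add_component("query_embedder", NvidiaTextEmbedder())
rag.add_component("retriever", OpenSearchEmbeddingRetriever(document_store=store, top_k=15))
rag.connect("query_embedder.embedding", "retriever.query_embedding")

result = rag.run({"query_embedder": {"text": "What are the key features of AWS Bedrock?"}})
for doc in result["retriever"]["documents"][:3]:
    print(doc.meta.get("file_path"), doc.score)
```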
## Customization

You can customize the workflow by:

1. **Modifying the system prompt** in `config.yml` to change the agent's behavior
2. **Adding more document types** by extending the RAG tool to support other file formats
3. **Changing the LLM model** by updating the top-level `llms` section in `config.yml`. This example defines `agent_llm` and `rag_llm` using the `nim` provider so they can leverage common parameters like `temperature`, `top_p`, and `max_tokens`. The workflow references them via the builder. See Haystack's NvidiaChatGenerator docs: [NvidiaChatGenerator](https://docs.haystack.deepset.ai/docs/nvidiachatgenerator), and the sketch after this list.
4. **Adjusting search parameters** to optimize for your use case
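If you do swap models, the `llms` entries ultimately drive a Haystack NVIDIA chat generator inside the workflow. The sketch below shows roughly how such a generator is constructed; the import path and the `model_arguments` parameter follow the public `nvidia-haystack` docs but are assumptions here, not a copy of this example's code.

```python
# Hypothetical helper mapping an llms: entry from config.yml onto a Haystack chat generator.
# Constructor parameters are assumed from the nvidia-haystack documentation.
from haystack.utils import Secret
from haystack_integrations.components.generators.nvidia import NvidiaChatGenerator


def make_chat_generator(model: str, temperature: float = 0.2, max_tokens: int = 2048) -> NvidiaChatGenerator:
    return NvidiaChatGenerator(
        model=model,  # e.g. nvidia/llama-3.3-nemotron-super-49b-v1 as in config.yml
        api_key=Secret.from_env_var("NVIDIA_API_KEY"),
        model_arguments={"temperature": temperature, "max_tokens": max_tokens},
    )


agent_llm = make_chat_generator("nvidia/llama-3.3-nemotron-super-49b-v1")
```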
## Troubleshooting

**Common Issues:**

1. **OpenSearch Connection Error**: Ensure OpenSearch is running and accessible at the configured host
2. **Missing API Keys**: Verify that both `NVIDIA_API_KEY` and `SERPERDEV_API_KEY` are set
3. **No Documents Found**: Check that PDF files are placed in the data directory and the path is correct
4. **Web Search Fails**: Verify your SerperDev API key is valid and has remaining quota

**Logs**: Check the NeMo-Agent-Toolkit logs for detailed error information and debugging.

## Architecture Details

The workflow demonstrates several key NeMo-Agent-Toolkit patterns:

- **Workflow Registration**: The agent is exposed as a workflow function with a Pydantic config (a sketch of such a config model follows this list)
- **Builder LLM Integration**: LLMs are defined under top-level `llms:` and accessed via `builder.get_llm_config(...)`
- **Component Integration**: Haystack components are composed into tools within the workflow
- **Error Handling**: Robust error handling with fallback behaviors
- **Async Operations**: All operations are asynchronous for better performance
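To make the "workflow function with a Pydantic config" pattern concrete, the sketch below shows a config model matching the fields in this example's `config.yml`. It is illustrative only: the real `register.py` derives its config from the toolkit's workflow config base class and registers the function with the builder, which is not reproduced here.

```python
# Illustrative config model mirroring configs/config.yml; the actual example
# subclasses the NeMo-Agent-Toolkit's workflow config base class instead of BaseModel.
from pydantic import BaseModel, Field


class HaystackDeepResearchAgentConfig(BaseModel):
    max_agent_steps: int = Field(default=20, description="Maximum number of agent steps")
    search_top_k: int = Field(default=10, description="Web search results to retrieve")
    rag_top_k: int = Field(default=15, description="Documents to retrieve from OpenSearch")
    opensearch_url: str = "http://localhost:9200"
    index_on_startup: bool = True
    data_dir: str = "/data"
```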
This example showcases how the Haystack AI framework can be seamlessly integrated into NeMo-Agent-Toolkit workflows while maintaining the flexibility and power of the underlying architecture.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
src/nat_haystack_deep_research_agent/configs
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
src/nat_haystack_deep_research_agent/data
examples/basic/frameworks/haystack_deep_research_agent/pyproject.toml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@

[build-system]
build-backend = "setuptools.build_meta"
requires = ["setuptools >= 64", "setuptools-scm>=8"]

[tool.setuptools_scm]
root = "../../../.."

[project]
name = "nat_haystack_deep_research_agent"
dynamic = ["version"]
dependencies = [
  "nvidia-nat~=1.3",
  "haystack-ai>=2.17.0,<2.19",
  "opensearch-haystack~=4.2.0",
  "nvidia-haystack~=0.3.0",
  "trafilatura~=2.0.0",
  "pypdf~=5.8.0",
  "docstring-parser~=0.16",
]
requires-python = ">=3.11,<3.13"
description = "Haystack Deep Research Agent workflow for NVIDIA NeMo Agent Toolkit"
classifiers = ["Programming Language :: Python"]

[tool.uv.sources]
nvidia-nat = { path = "../../../..", editable = true }

[project.entry-points.'nat.components']
nat_haystack_deep_research_agent = "nat_haystack_deep_research_agent.register"
examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/__init__.py

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@

# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Re-export pipelines helpers for convenience
try:
    from .pipelines.indexing import run_startup_indexing  # noqa: F401
    from .pipelines.rag import create_rag_tool  # noqa: F401
    from .pipelines.search import create_search_tool  # noqa: F401
except Exception:  # pragma: no cover - optional during install time
    pass
examples/basic/frameworks/haystack_deep_research_agent/configs/config.yml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@

# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

llms:
  rag_llm:
    _type: nim
    model: nvidia/llama-3.3-nemotron-super-49b-v1
  agent_llm:
    _type: nim
    model: nvidia/llama-3.3-nemotron-super-49b-v1

workflow:
  _type: haystack_deep_research_agent
  max_agent_steps: 20
  search_top_k: 10
  rag_top_k: 15
  opensearch_url: http://localhost:9200
  index_on_startup: true
  data_dir: /data
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:620c7c473f1fc8913e017026287617069de7cf596d9501481b0916001c9f4291
size 2740
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:3f5c0860de3cf2ebe91c874b0023b5641706c611cf00032e860b2f78f60b8ad2
size 2098
examples/basic/frameworks/haystack_deep_research_agent/src/nat_haystack_deep_research_agent/pipelines/__init__.py

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@

# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .indexing import run_startup_indexing  # noqa: F401
from .rag import create_rag_tool  # noqa: F401
from .search import create_search_tool  # noqa: F401
