This repository contains a minimal Retrieval-Augmented Generation (RAG) MCP server example implemented in Python. The server exposes a single MCP tool named `retrieve` that looks up short, directly relevant snippets from a local corpus of research papers (the `papers/` directory) using embeddings and a vector store.
`rag.py` contains the core code for the server. It loads the Markdown files under `papers/`, splits them into chunks, embeds them with a HuggingFace embedding model (default `all-MiniLM-L6-v2`), and stores the embeddings in a Chroma vector store, which also handles similarity search. The server exposes an MCP tool `retrieve(prompt: str)` that returns the top-k closest text chunks for a keyword/topic prompt.
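As a rough sketch of what that pipeline looks like (illustrative only; the loader choice and variable names here are assumptions, see `rag.py` for the actual code):

```python
# Illustrative sketch of the rag.py pipeline; see rag.py for the real implementation.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("rag")

# Load the Markdown files under papers/ and split them into overlapping chunks.
docs = DirectoryLoader("papers", glob="**/*.md", loader_cls=TextLoader).load()
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=100).split_documents(docs)

# Embed the chunks and build an in-memory Chroma index with top-k similarity search.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
retriever = Chroma.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 10})

@mcp.tool()
def retrieve(prompt: str) -> str:
    """Return the text of the top-k chunks most similar to the prompt."""
    return "\n\n".join(doc.page_content for doc in retriever.invoke(prompt))

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, matching the .vscode/mcp.json config below
```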
The retrieved text chunks can then be processed by another LLM, e.g. GPT-5 in VS Code Copilot. Another option would be to add an intermediate LLM that processes/summarises the chunks first and then passes the summary to the orchestrator agent (e.g. Copilot), but I'm concerned this would be too lossy; the workflow above gives the orchestrator direct access to the retrieved chunks.
For VS Code, create a file at `.vscode/mcp.json` with the contents:

```json
{
  "servers": {
    "rag": {
      "type": "stdio",
      "command": "python",
      "args": ["rag.py"]
    }
  }
}
```
We can then ask Copilot to call the tool, e.g. in agent mode: "use the rag MCP tool to give an overview of simulation-based inference". The server should start automatically; if it doesn't (a VS Code restart sometimes helps), see https://code.visualstudio.com/docs/copilot/customization/mcp-servers.
I found it cumbersome to constantly ask Copilot to call the MCP tool. A more effective approach is to set up a prompt file (https://code.visualstudio.com/docs/copilot/customization/prompt-files). I've put an example in `.github/prompts/rag.prompt.md`, which tells the orchestrator (whichever LLM you've selected in Copilot) how to use the MCP tool, e.g. by translating the user's query into a set of keywords before the vector search. You can then prompt Copilot with things like `/rag write an introduction to simulation-based inference in intro.txt, including references`.
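If you'd rather write your own prompt file, something along these lines should work (an illustration only, not the exact contents of the file in this repo; the front-matter fields are described in the VS Code docs linked above):

```markdown
---
mode: agent
description: Answer research questions using the rag MCP tool
---
Translate the user's request into a short list of keywords/topics, call the
`retrieve` tool from the `rag` MCP server for each of them, and answer using
only the returned snippets, citing the source papers where possible.
```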
Cursor, Claude, etc. should have a very similar setup process; the main difference, I think, is the syntax of the JSON config file.
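For example, I believe Cursor reads a project-level `.cursor/mcp.json` along these lines (check Cursor's MCP docs for the exact schema):

```json
{
  "mcpServers": {
    "rag": {
      "command": "python",
      "args": ["rag.py"]
    }
  }
}
```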
Dependencies/customizations:

- `langchain` and `langchain_community` for document loading and text splitting.
- Chunking: `CharacterTextSplitter` configured with `chunk_size=500` and `chunk_overlap=100`.
- Embeddings: `HuggingFaceEmbeddings` from `langchain_huggingface.embeddings` (model: `all-MiniLM-L6-v2`) for generating dense embeddings. This can be swapped out, e.g. for `OllamaEmbeddings`.
- `Chroma` vector store and search via `langchain_community.vectorstores`. Currently the index is an in-memory Chroma instance; I still need to set up a persistent index, which I imagine just requires changing the Chroma configuration (see the sketch after this list).
- Retrieval: requests the `k=10` top matches by default; change `search_kwargs` to adjust the number of retrieved snippets.
- `mcp.server.fastmcp.FastMCP` to expose a simple MCP server interface.
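On the persistence point above, my understanding is that the Chroma wrapper only needs a `persist_directory` to write the index to disk; a minimal untested sketch (the directory name is arbitrary, and `chunks` would come from the splitter in `rag.py`):

```python
# Rough sketch of a persistent Chroma index (assumption: persist_directory is enough;
# older versions of the wrapper may also require calling vectorstore.persist()).
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
chunks = [Document(page_content="example chunk")]  # placeholder; rag.py gets these from the splitter

# First run: build the index and write it to ./chroma_db on disk.
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")

# Later runs: reopen the existing index instead of re-embedding papers/.
vectorstore = Chroma(persist_directory="chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
```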