Merged
81 changes: 81 additions & 0 deletions .github/workflows/graph-rag.yml
@@ -0,0 +1,81 @@
name: Graph RAG CI

on:
  push:
    paths:
      - graph-rag/**
      - .github/workflows/graph-rag.yml
  pull_request:
    paths:
      - graph-rag/**
      - .github/workflows/graph-rag.yml

jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        runner: [curl, java, langchain4j]

    env:
      ARCADEDB_URL: http://localhost:2480
      ARCADEDB_USER: root
      ARCADEDB_PASS: arcadedb

    steps:
      - name: Checkout
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          fetch-depth: 1

      - name: Set up Java
        if: matrix.runner == 'java' || matrix.runner == 'langchain4j'
        uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Cache Maven repository
        if: matrix.runner == 'java' || matrix.runner == 'langchain4j'
        uses: actions/cache@cdf6c1fa76f9f475f3d7449005a359c84ca0f306 # v5.0.3
        with:
          path: ~/.m2
          key: ${{ runner.os }}-m2-${{ matrix.runner }}-${{ hashFiles('graph-rag/java/pom.xml', 'graph-rag/langchain4j/pom.xml') }}
          restore-keys: ${{ runner.os }}-m2-${{ matrix.runner }}-

      - name: Start ArcadeDB
        working-directory: graph-rag
        run: docker compose up -d

      - name: Setup database
        working-directory: graph-rag
        run: ./setup.sh

      - name: Run curl queries
        if: matrix.runner == 'curl'
        working-directory: graph-rag
        run: ./queries/queries.sh

      - name: Build and run Java (Bolt)
        if: matrix.runner == 'java'
        working-directory: graph-rag/java
        run: |
          mvn package --no-transfer-progress
          java -jar target/graph-rag.jar

      - name: Build and run LangChain4j
        if: matrix.runner == 'langchain4j'
        working-directory: graph-rag/langchain4j
        run: |
          mvn package --no-transfer-progress
          java -jar target/graph-rag-langchain4j.jar
          java -cp target/graph-rag-langchain4j.jar com.arcadedb.examples.GraphRAGContentRetriever

      - name: Teardown
        if: always()
        working-directory: graph-rag
        run: docker compose down
4 changes: 3 additions & 1 deletion README.md
@@ -10,6 +10,8 @@ and runnable demos via both `curl` and a Java program.
| Directory | Description | ArcadeDB features |
|-----------|-------------|-------------------|
| [recommendation-engine](./recommendation-engine/) | Intelligent product and content recommendations | Graph traversal, Vector similarity, Time-series |
| [knowledge-graphs](./knowledge-graphs/) | Academic research knowledge graph with co-authorship and citation networks | Graph traversal, Vector similarity, Full-text search, Time-series |
| [graph-rag](./graph-rag/) | Graph RAG system combining knowledge graphs with vector search for retrieval-augmented generation | Graph traversal, Vector similarity, Full-text indexing, Neo4j Bolt, LangChain4j |

## Structure

@@ -19,5 +21,5 @@ Each use case directory contains:
- `sql/01-schema.sql` — vertex/edge type definitions
- `sql/02-data.sql` — sample data
- `queries/queries.sh` — all queries via `curl`
- `java/` — standalone Maven project running the same queries via `arcadedb-network`
- `java/` — standalone Maven project running the same queries via Java
- `README.md` — quickstart guide
161 changes: 161 additions & 0 deletions docs/plans/2026-02-26-graph-rag-design.md
@@ -0,0 +1,161 @@
# Graph RAG Use Case — Design

**Date:** 2026-02-26
**Branch:** feat/graph-rag
**ArcadeDB version:** 26.2.1

## Overview

Implement the [ArcadeDB Graph RAG](https://arcadedb.com/graph-rag.html) use case following the same structure as the recommendation-engine. The use case demonstrates ArcadeDB's ability to unify vector search, graph traversal, and full-text indexing for retrieval-augmented generation — without requiring multiple databases or ETL pipelines.

Key differences from recommendation-engine:
- Java module uses **Neo4j Bolt driver** (`neo4j-java-driver`) and **Cypher** as query language, connecting via `bolt://localhost:2424`
- Additional **langchain4j** submodule demonstrates `Neo4jEmbeddingStore` and `EmbeddingStoreContentRetriever` with local `AllMiniLmL6V2` embeddings (no external API keys)

## Repository Structure

```
graph-rag/
├── README.md
├── docker-compose.yml
├── setup.sh
├── sql/
│   ├── 01-schema.sql
│   └── 02-data.sql
├── queries/
│   └── queries.sh
├── java/
│   ├── pom.xml
│   └── src/main/java/com/arcadedb/examples/
│       └── GraphRAG.java
└── langchain4j/
    ├── pom.xml
    └── src/main/java/com/arcadedb/examples/
        ├── GraphRAGEmbeddingStore.java
        └── GraphRAGContentRetriever.java
```

## Docker Compose

- Single service: `arcadedata/arcadedb:26.2.1`
- Ports exposed: `2480` (HTTP API), `2424` (Bolt)
- Root password via `JAVA_OPTS: -Darcadedb.server.rootPassword=arcadedb`
- Health check on `http://localhost:2480/api/v1/ready`

## Schema (`sql/01-schema.sql`)

One document type, four vertex types, and four edge types:

**Document:**
- `Chunk` — `content` (STRING), `source` (STRING), `chunkIndex` (INTEGER), `embedding` (LIST)
- Vector index on `Chunk(embedding)`: LSM, 4 dimensions, COSINE

**Vertices:**
- `Entity` — `name` (STRING)
- `Person EXTENDS Entity`
- `Concept EXTENDS Entity`
- `Organization EXTENDS Entity`

**Edges:**
- `MENTIONS` — Chunk -> Entity
- `RELATES_TO` — Entity -> Entity
- `WORKS_AT` — Person -> Organization
- `AUTHORED` — Person -> Chunk

## Sample Data (`sql/02-data.sql`)

**Domain:** Fictional tech company "ArcadeSoft" knowledge base.

**Chunks (~8-10):** Snippets from internal documentation:
- "Getting Started with GraphRAG" (2 chunks)
- "Microservices Architecture Guide" (2 chunks)
- "Vector Search Best Practices" (2 chunks)
- "Team Onboarding Handbook" (2 chunks)

Each chunk has a hand-crafted 4D embedding reflecting its topic (e.g. graph-heavy docs: `[0.9, 0.1, 0.2, 0.1]`, vector-heavy: `[0.1, 0.9, 0.2, 0.1]`).
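
To make the effect of these hand-crafted vectors concrete, here is a minimal sketch of cosine similarity applied to the two example embeddings above. The class name, method name, and the query vector are hypothetical and only serve the illustration; the database computes similarity through its own vector index, not through code like this.

```java
// Illustrative only: class and method names are hypothetical.
final class EmbeddingSimilaritySketch {

    /** Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] graphHeavy = {0.9, 0.1, 0.2, 0.1};  // example vector from the design
        double[] vectorHeavy = {0.1, 0.9, 0.2, 0.1}; // example vector from the design
        double[] graphQuery = {0.8, 0.2, 0.2, 0.1};  // assumed query vector, not in the sample data

        System.out.println("graph doc vs vector doc: " + cosine(graphHeavy, vectorHeavy));
        System.out.println("query vs graph doc:      " + cosine(graphQuery, graphHeavy));
        System.out.println("query vs vector doc:     " + cosine(graphQuery, vectorHeavy));
    }
}
```

The two example embeddings have a cosine similarity of 0.23 / 0.87, roughly 0.26, so a graph-oriented query vector ranks graph-heavy chunks well ahead of vector-heavy ones.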

**Entities (~8-10):**
- Persons: Alice Chen, Bob Martinez, Carol Wu, Dave Park
- Concepts: GraphRAG, Vector Search, Microservices, Knowledge Graph
- Organizations: ArcadeSoft, Platform Team, Research Team

**Edges (~20-25):**
- MENTIONS: chunks reference concepts and people
- RELATES_TO: GraphRAG -> Vector Search, GraphRAG -> Knowledge Graph, Microservices -> Knowledge Graph
- WORKS_AT: Alice -> Research Team, Bob -> Platform Team, Carol -> ArcadeSoft, Dave -> Platform Team
- AUTHORED: Alice -> GraphRAG doc chunks, Bob -> Microservices doc chunks

**Design intent:** Multi-hop queries work because querying "Vector Search" finds a chunk that MENTIONS the "GraphRAG" concept, which is MENTIONED by other chunks about GraphRAG — creating entity bridges. RELATES_TO edges form a small concept graph for traversal.
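
The entity-bridge idea can be sketched in memory, independent of the database. The chunk ids and MENTIONS assignments below are hypothetical stand-ins for the sample data, and the class is an illustration of the traversal shape (chunk -> entity -> other chunk), not the query the use case actually runs.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: chunk ids and MENTIONS assignments are hypothetical.
final class EntityBridgeSketch {

    /**
     * Returns every other chunk that shares at least one mentioned entity
     * with the start chunk, mapped to the entities that form the bridge.
     */
    static Map<String, Set<String>> bridgedChunks(Map<String, Set<String>> mentions, String startChunk) {
        Set<String> startEntities = mentions.getOrDefault(startChunk, Set.of());
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : mentions.entrySet()) {
            if (entry.getKey().equals(startChunk)) {
                continue; // skip the chunk we started from
            }
            for (String entity : entry.getValue()) {
                if (startEntities.contains(entity)) {
                    result.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>()).add(entity);
                }
            }
        }
        return result;
    }

    /** Hypothetical MENTIONS edges: chunk id -> mentioned entity names. */
    static Map<String, Set<String>> sampleMentions() {
        Map<String, Set<String>> m = new LinkedHashMap<>();
        m.put("vector-search-1", new LinkedHashSet<>(List.of("Vector Search", "GraphRAG")));
        m.put("graphrag-1", new LinkedHashSet<>(List.of("GraphRAG", "Knowledge Graph")));
        m.put("graphrag-2", new LinkedHashSet<>(List.of("GraphRAG")));
        m.put("onboarding-1", new LinkedHashSet<>(List.of("ArcadeSoft")));
        return m;
    }

    public static void main(String[] args) {
        bridgedChunks(sampleMentions(), "vector-search-1")
            .forEach((chunk, via) -> System.out.println(chunk + " via " + via));
    }
}
```

Starting from the hypothetical `vector-search-1` chunk, both GraphRAG chunks are reached through the shared `GraphRAG` entity, while the onboarding chunk is not, because it shares no entity with the start chunk.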

## Queries

### `queries/queries.sh` — 5 labeled sections via curl

| # | Pattern | Language | Description |
|---|---------|----------|-------------|
| 1 | Hybrid Vector + Graph | Cypher | Vector search for similar chunks, traverse MENTIONS to find entities and connected chunks |
| 2 | Multi-Hop Entity Bridge | Cypher | Find chunks connected through entity chains: query chunk -> entity -> related chunk |
| 3 | Temporal-Aware Retrieval | Cypher | Filter chunks by `chunkIndex` ordering, return most recent context first |
| 4 | Triple Hybrid | SQL | Composite scoring: vector distance + `CONTAINSTEXT` keyword + entity connection count |
| 5 | Agentic RAG Steps | Mixed | 4-step sequence: vector search, graph expansion, full-text lookup, context assembly |
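
As a rough illustration of the composite scoring in query 4, the sketch below combines the three signals in plain Java. The weights, the `Candidate` record, and the use of a similarity (higher is better) instead of a distance are assumptions made for the example, and the substring check only stands in for a real full-text match.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only: weights and types are assumptions, not the query's actual formula.
final class TripleHybridScoreSketch {

    record Candidate(String id, String content, double vectorSimilarity, int entityConnections) {}

    // Assumed weights, chosen only for illustration.
    static final double W_VECTOR = 0.6;
    static final double W_KEYWORD = 0.3;
    static final double W_GRAPH = 0.1;

    /** Weighted sum of vector similarity, keyword hit (0 or 1), and normalized entity count. */
    static double score(Candidate c, String keyword, int maxConnections) {
        boolean hit = c.content().toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
        double keywordSignal = hit ? 1.0 : 0.0;
        double graphSignal = maxConnections == 0 ? 0.0 : (double) c.entityConnections() / maxConnections;
        return W_VECTOR * c.vectorSimilarity() + W_KEYWORD * keywordSignal + W_GRAPH * graphSignal;
    }

    /** Returns a new list sorted by descending composite score. */
    static List<Candidate> rank(List<Candidate> candidates, String keyword) {
        int max = candidates.stream().mapToInt(Candidate::entityConnections).max().orElse(0);
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> score(c, keyword, max)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("onboarding-1", "Onboarding checklist", 0.80, 0),
            new Candidate("graphrag-1", "GraphRAG combines graphs and vectors", 0.70, 3));
        rank(candidates, "graphrag").forEach(c -> System.out.println(c.id()));
    }
}
```

With these assumed weights, a chunk that matches the keyword and is well connected can outrank a chunk with a slightly better vector similarity, which is the point of combining the signals.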

### `java/GraphRAG.java` — All Cypher via Bolt

Adapts the 5 patterns to pure Cypher. Queries that rely on SQL-specific features are adapted:
- Query 4: vector distance + entity count (2-signal composite, no full-text)
- Query 5: vector search -> graph expansion -> collect results (3 steps, no full-text)

### `langchain4j/` — 2 example classes

1. **GraphRAGEmbeddingStore** — ingest text chunks, generate real 384D embeddings with AllMiniLmL6V2, store in ArcadeDB via `Neo4jEmbeddingStore` over Bolt, run similarity searches
2. **GraphRAGContentRetriever** — wire `Neo4jEmbeddingStore` into a langchain4j `EmbeddingStoreContentRetriever` pipeline, query with natural language, print retrieved chunks with scores
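
The retrieval step behind these two classes boils down to ranking stored embeddings against a query embedding. The following is a brute-force sketch of that step, assuming a `maxResults` / `minScore` style of filtering and a relevance score of `(cosine + 1) / 2`; all type and method names are hypothetical, and toy 4D vectors replace the real 384D embeddings.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative only: a brute-force stand-in for an embedding store search.
final class TopKRetrievalSketch {

    record Stored(String text, double[] embedding) {}

    record Match(String text, double score) {}

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Maps cosine similarity from [-1, 1] to a relevance score in [0, 1] (assumed convention). */
    static double relevance(double cosine) {
        return (cosine + 1.0) / 2.0;
    }

    /** Scores every stored chunk, drops those below minScore, returns the best maxResults. */
    static List<Match> search(List<Stored> store, double[] query, int maxResults, double minScore) {
        return store.stream()
            .map(s -> new Match(s.text(), relevance(cosine(s.embedding(), query))))
            .filter(m -> m.score() >= minScore)
            .sorted(Comparator.comparingDouble(Match::score).reversed())
            .limit(maxResults)
            .toList();
    }

    public static void main(String[] args) {
        List<Stored> store = List.of(
            new Stored("graph-heavy chunk", new double[] {0.9, 0.1, 0.2, 0.1}),
            new Stored("vector-heavy chunk", new double[] {0.1, 0.9, 0.2, 0.1}));
        double[] query = {0.8, 0.2, 0.2, 0.1}; // assumed query embedding
        search(store, query, 2, 0.0).forEach(m -> System.out.println(m.score() + "  " + m.text()));
    }
}
```

A linear scan like this is only reasonable for a handful of chunks; it is shown here to make the scoring and filtering explicit, not as a replacement for an indexed search.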

## Java Module (`java/`)

- **Build tool:** Maven (standalone `pom.xml`, no parent)
- **Dependency:** `org.neo4j.driver:neo4j-java-driver:5.28.x`
- **Java:** 21
- **Output:** fat JAR via maven-assembly-plugin -> `graph-rag.jar`
- **Entry point:** `GraphRAG.java` with `main` method that:
  1. Opens a Neo4j `Driver` connection to `bolt://localhost:2424`
  2. Runs all 5 queries sequentially in Cypher
  3. Prints header and formatted results for each query
  4. Closes the driver

## LangChain4j Module (`langchain4j/`)

- **Build tool:** Maven (standalone `pom.xml`, no parent, no Spring Boot)
- **Dependencies:** `langchain4j-community-neo4j`, `langchain4j-embeddings-all-minilm-l6-v2`, `neo4j-java-driver`
- **Java:** 21
- **Output:** fat JAR via maven-assembly-plugin -> `graph-rag-langchain4j.jar`
- **No external API keys required** — AllMiniLmL6V2 runs in-process

## Setup

`setup.sh` follows the recommendation-engine pattern:
1. Wait for ArcadeDB ready endpoint
2. Create database `GraphRAG` via HTTP API
3. Apply `sql/01-schema.sql`
4. Apply `sql/02-data.sql`
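
Step 1 is a bounded polling loop. `setup.sh` itself is a shell script, so the following Java sketch only illustrates the logic: the class name, attempt count, delay, and timeouts are assumptions, and treating any 2xx response as "ready" is a simplification.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Illustrative only: mirrors the "wait for ready endpoint" step, not setup.sh itself.
final class ReadinessWaitSketch {

    /** Calls the probe up to maxAttempts times, sleeping between failed attempts. */
    static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, Duration delay) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return false;
                }
            }
        }
        return false;
    }

    /** Probe that reports ready when a GET on the URL returns a 2xx status (assumption). */
    static BooleanSupplier httpProbe(String url) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
        return () -> {
            try {
                int status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
                return status >= 200 && status < 300;
            } catch (IOException e) {
                return false; // server is not accepting connections yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("usage: ReadinessWaitSketch <ready-url>");
            return;
        }
        boolean ready = waitUntilReady(httpProbe(args[0]), 30, Duration.ofSeconds(1));
        System.out.println(ready ? "ArcadeDB is ready" : "ArcadeDB did not become ready in time");
    }
}
```

Passing `http://localhost:2480/api/v1/ready` (the health-check URL from the Docker Compose section) as the argument polls the local server. Keeping the probe separate from the retry loop makes the loop easy to exercise without a running database.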

## Success Criteria

- `docker compose up` starts ArcadeDB with both HTTP and Bolt ports
- SQL files apply cleanly via `setup.sh`
- `queries.sh` runs all 5 queries and returns non-empty result sets
- `mvn package && java -jar target/graph-rag.jar` connects via Bolt, runs all 5 Cypher queries
- `mvn package && java -jar target/graph-rag-langchain4j.jar` ingests chunks, generates embeddings, runs similarity search and content retrieval

## Implementation Deviations

The following changes were made during integration testing:

| Design | Implementation | Reason |
|--------|---------------|--------|
| Bolt port 2424 | Port 7687 | ArcadeDB's BoltProtocolPlugin defaults to 7687 (standard Neo4j port) |
| `neo4j-java-driver:5.28.x` | `neo4j-java-driver:4.4.12` | ArcadeDB's Bolt implements protocol v4; driver 5.x fails handshake |
| `Chunk` as DOCUMENT TYPE | `Chunk` as VERTEX TYPE | Edges (MENTIONS, AUTHORED) require vertex endpoints |
| `:Entity` label in Cypher | Unlabeled `(entity)` | ArcadeDB Cypher doesn't resolve parent type labels to subtypes |
| `Neo4jEmbeddingStore` via langchain4j-community-neo4j | Direct Neo4j driver + LangChain4j `CosineSimilarity` | ArcadeDB doesn't support `SHOW VECTOR INDEX` DDL used by Neo4jEmbeddingStore |
| `vectorDistance` in SQL subquery | Direct `vectorNeighbors` ordering | `vectorDistance` doesn't work in subqueries in ArcadeDB 26.2.1 |
| Docker JAVA_OPTS single line | Multi-line with plugins | BoltProtocolPlugin must be explicitly enabled via `arcadedb.server.plugins` |