
feat: add Graph RAG use case#10

Merged: robfrank merged 15 commits into main from feat/graph-rag on Feb 26, 2026

@robfrank

Contributor

Summary

  • Add graph-rag/ use case demonstrating ArcadeDB's multi-model capabilities for Retrieval-Augmented Generation
  • Implements three retrieval signals in a single database: graph traversal, vector similarity, and full-text indexing
  • Four runnable interfaces: curl/HTTP queries, Java via Neo4j Bolt driver, and two LangChain4j demos (EmbeddingStore + ContentRetriever)
  • Includes CI workflow testing all three runners (curl, java, langchain4j)
  • Updates root README with all use cases

What's included

  • docker-compose.yml with BoltProtocolPlugin enabled (ports 2480 + 7687)
  • SQL schema (vertices, edges, 4D vector index) and sample data (8 chunks, 11 entities, ~25 edges)
  • queries.sh with 5 query patterns (hybrid vector+graph, multi-hop bridging, temporal, composite scoring, agentic RAG)
  • java/ module: 5 Cypher queries via Neo4j Bolt driver (4.4.12)
  • langchain4j/ module: AllMiniLmL6V2 384D embeddings + cosine similarity search + graph context expansion
  • Design doc and implementation plan in docs/plans/

Test plan

  • CI passes for all 3 matrix runners (curl, java, langchain4j)
  • docker compose up -d && ./setup.sh creates database successfully
  • ./queries/queries.sh returns non-empty results for all 5 queries
  • java -jar target/graph-rag.jar connects via Bolt and runs all 5 queries
  • java -jar target/graph-rag-langchain4j.jar ingests chunks with 384D embeddings and returns similarity results
  • java -cp target/graph-rag-langchain4j.jar com.arcadedb.examples.GraphRAGContentRetriever runs semantic search + graph expansion

🤖 Generated with Claude Code

robfrank and others added 14 commits February 26, 2026 09:23
Defines schema, sample data, queries, and module structure for the
Graph RAG use case based on arcadedb.com/graph-rag.html. Includes
Neo4j Bolt driver Java module and langchain4j submodule with local
AllMiniLmL6V2 embeddings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
11-task plan covering Docker Compose, schema, sample data, curl queries,
Java Bolt/Cypher module, langchain4j embedding store and content retriever
modules, README, and integration smoke test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r examples

Add GraphRAGEmbeddingStore.java demonstrating vector ingestion and
similarity search, and GraphRAGContentRetriever.java showing the
RAG content retriever pipeline with min-score filtering.

Fix embedding model dependency version to use community beta release
(1.11.0-beta19) since the stable artifact is not published.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change Chunk from DOCUMENT TYPE to VERTEX TYPE (edges require vertices)
- Enable BoltProtocolPlugin in docker-compose.yml (port 7687)
- Downgrade neo4j-java-driver to 4.4.12 (compatible with ArcadeDB Bolt v4)
- Remove :Entity labels from Cypher queries (ArcadeDB doesn't resolve parent type)
- Simplify SQL vector queries (remove vectorDistance subquery)
- Rewrite langchain4j to use Neo4j driver directly (ArcadeDB doesn't support
  Neo4j's SHOW VECTOR INDEX DDL used by Neo4jEmbeddingStore)
- Update README with correct port, schema types, and run instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename Query 1 header in GraphRAG.java to "Graph Traversal with Entity
  Collection" (was misleadingly "Hybrid Vector + Graph" without vector)
- Add empty-result guard in GraphRAGContentRetriever
- Add version comment in langchain4j pom.xml for embedding model artifact
- Document all implementation deviations in design doc (port, driver version,
  vertex type, Cypher labels, Neo4jEmbeddingStore incompatibility)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test curl queries, Java Bolt driver, and LangChain4j demos in CI.
Also add knowledge-graphs and graph-rag to root README use case table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist


Summary of Changes

Hello @robfrank, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project by adding a comprehensive Graph RAG use case. This new example effectively showcases ArcadeDB's ability to seamlessly integrate graph traversal, vector search, and full-text indexing for advanced retrieval-augmented generation. It provides practical, runnable demonstrations across various platforms, offering a clear guide for implementing sophisticated RAG solutions.

Highlights

  • New Graph RAG Use Case: Introduced a new graph-rag/ use case to demonstrate ArcadeDB's multi-model capabilities for Retrieval-Augmented Generation.
  • Unified Retrieval Signals: The use case implements three retrieval signals within a single database: graph traversal, vector similarity, and full-text indexing.
  • Multiple Runnable Interfaces: Provided four runnable interfaces for interaction: curl/HTTP queries, Java via Neo4j Bolt driver, and two LangChain4j demos (EmbeddingStore and ContentRetriever).
  • CI Workflow Integration: Included a CI workflow to test all three runners (curl, Java, LangChain4j) for robust validation.
  • Documentation Updates: Updated the root README to list the new Graph RAG use case alongside existing examples.

Changelog
  • README.md
    • Updated the main README to list the new knowledge-graphs and graph-rag use cases.
    • Clarified the description for the java/ module in the main README.
  • docs/plans/2026-02-26-graph-rag-design.md
    • Added a detailed design document outlining the architecture, schema, data, queries, and implementation deviations for the Graph RAG use case.
  • docs/plans/2026-02-26-graph-rag.md
    • Added a comprehensive implementation plan for the Graph RAG use case, breaking down the development into tasks for Docker Compose, SQL schema/data, query scripts, Java modules, LangChain4j modules, and README.
  • graph-rag/README.md
    • Added a dedicated README for the Graph RAG use case, providing an overview, prerequisites, quickstart instructions, schema, query patterns, sample data, LangChain4j module details, and ArcadeDB version notes.
  • graph-rag/docker-compose.yml
    • Added a Docker Compose file to set up ArcadeDB with HTTP and Bolt ports, enabling the BoltProtocolPlugin and setting the default database.
  • graph-rag/java/pom.xml
    • Added a Maven pom.xml file for the Java module, configuring it to use the Neo4j Java driver and build a fat JAR.
  • graph-rag/java/src/main/java/com/arcadedb/examples/GraphRAG.java
    • Added a Java application demonstrating five Cypher query patterns for Graph RAG via the Neo4j Bolt driver.
  • graph-rag/langchain4j/pom.xml
    • Added a Maven pom.xml file for the LangChain4j module, including dependencies for LangChain4j, AllMiniLmL6V2 embeddings, and the Neo4j Java driver.
  • graph-rag/langchain4j/src/main/java/com/arcadedb/examples/GraphRAGContentRetriever.java
    • Added a LangChain4j example demonstrating a content retrieval pipeline that combines embeddings with graph traversal for enriched context.
  • graph-rag/langchain4j/src/main/java/com/arcadedb/examples/GraphRAGEmbeddingStore.java
    • Added a LangChain4j example demonstrating embedding generation and storage in ArcadeDB via Bolt, followed by similarity search.
  • graph-rag/queries/queries.sh
    • Added a shell script containing five curl-based query patterns for the Graph RAG use case, showcasing hybrid vector+graph, multi-hop, temporal, composite scoring, and agentic RAG steps.
  • graph-rag/setup.sh
    • Added a setup script to wait for ArcadeDB, create the GraphRAG database, and apply the schema and sample data SQL files.
  • graph-rag/sql/01-schema.sql
    • Added the SQL schema definition for the Graph RAG use case, including Chunk vertex type with vector index, Entity hierarchy, and various edge types.
  • graph-rag/sql/02-data.sql
    • Added SQL statements to insert sample data for the Graph RAG use case, representing a fictional knowledge base with chunks, entities, and relationships.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/graph-rag.yml
Activity
  • No human activity (comments, reviews) has occurred on this pull request yet.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an excellent and comprehensive Graph RAG use case, complete with Docker setup, schema, sample data, and multiple runnable examples. The documentation is thorough, and the inclusion of a design document with implementation deviations is particularly helpful. My review identifies a few areas for improvement: I suggest removing a temporary implementation plan file that is inconsistent with the final code, clarifying a misleading method name in the Java example, and refactoring an inefficient and polluting data storage pattern in one of the LangChain4j demos to better showcase the database's capabilities.

Comment on lines +51 to +52
session.run("CREATE (c:LCChunk {content: $content, embedding: $embedding})",
Values.parameters("content", text, "embedding", vector));


high

This implementation creates a new vertex type LCChunk on-the-fly without a corresponding schema definition or a vector index. This has two main issues:

  1. Inefficient Pattern: It performs a full scan of LCChunk vertices and computes similarity in-memory on the client side. This is highly inefficient and does not showcase the database's vector search capabilities. A proper implementation should leverage a vector index for an efficient nearest-neighbor search within the database.
  2. Database Pollution: The LCChunk vertices are created and left in the database, polluting the schema and data with temporary objects from a demo run.

It is strongly recommended to refactor this. If the goal is to show storing new embeddings, a proper schema for LCChunk (including a 384D vector index) should be added to 01-schema.sql. If a Bolt protocol limitation prevents an indexed search, this should be documented. Alternatively, the demo should clean up the data it creates after it runs.

Contributor Author


Fixed in 7e61222. Added MATCH (c:LCChunk) DELETE c cleanup before inserting new nodes to prevent data pollution across runs. Also added Javadoc explaining the in-memory similarity approach (ArcadeDB's vectorNeighbors() is SQL-only, not available over Bolt).

Comment thread docs/plans/2026-02-26-graph-rag.md
Comment thread graph-rag/java/src/main/java/com/arcadedb/examples/GraphRAG.java Outdated
@claude

claude Bot commented Feb 26, 2026


PR Review — Graph RAG Use Case

Good addition overall. The structure follows the existing recommendation-engine pattern, the CI workflow uses pinned action SHAs, and the design deviation table in the docs is genuinely useful. A few issues worth addressing before merge:


Bugs / Correctness

1. LCChunk vertex type is undefined in the schema

GraphRAGEmbeddingStore.java creates vertices with type LCChunk but sql/01-schema.sql never declares it. ArcadeDB will create the type on the fly, bypassing the explicit schema. More importantly, every run of GraphRAGEmbeddingStore appends 4 new nodes without cleanup — running the demo twice doubles the data, degrading similarity results.

Suggestion: add CREATE VERTEX TYPE LCChunk IF NOT EXISTS to the schema, or have the demo drop+recreate the nodes at startup.

2. GraphRAGContentRetriever does not use ArcadeDB's vector index

The content retriever re-embeds chunk text to generate 384D vectors, then computes cosine similarity in-memory via LangChain4j's CosineSimilarity.between(). It never calls vectorNeighbors(). This is fine as a demo of the retrieval pattern, but the README/Javadoc implies database-side vector search. Worth clarifying with a comment such as:

// Similarity is computed in-memory because the stored embeddings are 4D;
// in production, store 384D embeddings and use vectorNeighbors() instead.

3. Query 3 "Temporal-Aware Retrieval" is misleading

MATCH (c:Chunk)
WHERE c.chunkIndex = 1
RETURN c.content, c.source, c.chunkIndex
ORDER BY c.chunkIndex DESC  // no-op: all results have chunkIndex = 1

Filtering WHERE c.chunkIndex = 1 and then ORDER BY c.chunkIndex DESC does nothing — every row has the same index. Consider removing the WHERE and relying on ORDER BY + LIMIT to return the highest-indexed chunk per source, or drop the "temporal" framing and rename this to "Latest Chunk Per Document".


Design / Maintenance

4. Internal AI plan files committed to the public repo

docs/plans/2026-02-26-graph-rag.md opens with:

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

These are internal implementation artifacts. Committing them to a public example repo is noisy for external contributors. Consider adding docs/plans/ to .gitignore or removing these files from this PR.

5. setup.sh readiness loop has no timeout

until curl -sf -u "..." "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done

If ArcadeDB never becomes healthy, this loops forever. The CI 15-minute timeout is a backstop, but locally a developer would be stuck. A simple counter with a max-attempts guard would help.

6. LangChain4j version mismatch

langchain4j/pom.xml uses two separate version properties (1.11.0 vs 1.11.0-beta19). The inline comment explains it, but this will silently break once the GA artifact is published. Consider a TODO comment or a GitHub issue to track the upgrade.


Minor Notes

  • setup.sh SQL parser splits on newlines, so any future multi-line statement would silently fail. All current statements are single-line, but a comment warning future contributors would help.
  • docker-compose.yml exposes port 7687 on 0.0.0.0. For a local demo this is fine, but "127.0.0.1:7687:7687" is worth suggesting in the README for shared-machine environments.
  • tryRun() in GraphRAG.java catches exceptions without setting a non-zero exit code. A failed query prints to stderr but the process exits 0, so CI still shows green. Consistent with RecommendationEngine.java, so low priority.
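
The exit-code point in the last note could be handled along the following lines. This is a minimal sketch, not the code from GraphRAG.java: the class name `TryRunSketch`, the `Query` interface, and the `exitCode()` helper are hypothetical names introduced for illustration, and the real `tryRun()` signature may differ. The idea is simply to count failures while still running every query, then report a non-zero status at the end.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the demo's tryRun() wrapper; all names are illustrative.
class TryRunSketch {
    @FunctionalInterface
    interface Query {
        void run() throws Exception;
    }

    private static final AtomicInteger failures = new AtomicInteger();

    // Runs one query, logs any failure to stderr, and records that it happened.
    static void tryRun(String name, Query query) {
        try {
            query.run();
        } catch (Exception e) {
            failures.incrementAndGet();
            System.err.println(name + " failed: " + e.getMessage());
        }
    }

    // 0 when every query so far succeeded, 1 otherwise.
    static int exitCode() {
        return failures.get() == 0 ? 0 : 1;
    }

    public static void main(String[] args) {
        tryRun("Query 1", () -> System.out.println("query 1 ok"));
        tryRun("Query 2", () -> System.out.println("query 2 ok"));
        // Only force a non-zero status when something actually failed.
        int code = exitCode();
        if (code != 0) {
            System.exit(code);
        }
    }
}
```

With this shape, a query that throws still lets the remaining queries run, but the process no longer exits 0, so a CI step would turn red.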

What's Working Well

  • CI workflow uses commit-SHA pinning for all actions — good supply-chain hygiene.
  • jq -cn --arg for JSON construction in shell scripts avoids injection risk.
  • Java code uses try-with-resources correctly throughout.
  • The deviation table in the design doc clearly explains ArcadeDB-specific workarounds (Bolt port, driver version, type inheritance limits) — very useful for future maintainers.
  • Queries demonstrate a genuine mix of SQL, Cypher, and multi-step retrieval patterns across three distinct retrieval signals.

Generated with Claude Code

- Rename Query 1 method to runQuery1GraphTraversal (was misleadingly
  named HybridVectorGraph despite being graph-only over Bolt)
- Fix Query 3: remove no-op WHERE chunkIndex=1 filter, rename from
  "Temporal-Aware Retrieval" to "Latest Chunk Per Document"
- Clean up LCChunk nodes before inserting in GraphRAGEmbeddingStore
  to prevent data accumulation across repeated demo runs
- Clarify in Javadoc that similarity is computed in-memory because
  vectorNeighbors() is SQL-only, not available over Bolt protocol

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank

Contributor Author

Addressed review feedback in 7e61222:

Fixed:

By design:

@robfrank robfrank merged commit a7b6535 into main Feb 26, 2026
13 checks passed
@claude

claude Bot commented Feb 26, 2026


Code Review: Graph RAG Use Case

Overall this is a well-structured addition that follows the existing recommendation-engine pattern closely. The implementation deviation table in docs/plans/ is particularly valuable for documenting ArcadeDB-specific workarounds. Below are my findings, grouped by severity.


Issues

LCChunk vertex type not defined in schema (GraphRAGEmbeddingStore.java, sql/01-schema.sql)

GraphRAGEmbeddingStore creates nodes with CREATE (c:LCChunk {...}) via Cypher, but LCChunk is never declared as a vertex type in 01-schema.sql. ArcadeDB may auto-create the type, but that behavior is undocumented and could break silently under strict schema-enforcement modes. Consider adding CREATE VERTEX TYPE LCChunk IF NOT EXISTS to the schema, or document the auto-creation behavior explicitly.


Query 3 does not actually return the latest chunk per document (queries/queries.sh, GraphRAG.java)

The label says "Get the highest-indexed chunk per source document", but the query:

MATCH (c:Chunk)
RETURN c.content, c.source, c.chunkIndex
ORDER BY c.source, c.chunkIndex DESC
LIMIT 10

returns all chunks sorted, not the maximum-index chunk per source. With only 8 chunks and a LIMIT 10, everything is returned anyway, which masks the issue. A correct implementation would use aggregation or a subquery. As-is, the query does not validate the pattern it claims to demonstrate.
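
To make the intended semantics concrete, here is a small client-side sketch of "one row per source, the one with the highest chunkIndex". It is not the query from the PR and not a claim about how ArcadeDB should express it; the `ChunkRow` record, the `latestPerSource` helper, and the sample rows are hypothetical and only mirror the RETURN columns above. A server-side fix would do the equivalent with aggregation or a subquery.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LatestChunkSketch {
    // Hypothetical row shape mirroring the query's RETURN columns.
    record ChunkRow(String content, String source, int chunkIndex) {}

    // Keeps, for each source, only the row with the largest chunkIndex.
    static List<ChunkRow> latestPerSource(List<ChunkRow> rows) {
        Map<String, ChunkRow> best = new LinkedHashMap<>();
        for (ChunkRow row : rows) {
            best.merge(row.source(), row,
                (current, candidate) ->
                    current.chunkIndex() >= candidate.chunkIndex() ? current : candidate);
        }
        List<ChunkRow> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparing(ChunkRow::source));
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration; not the PR's sample data.
        List<ChunkRow> rows = List.of(
            new ChunkRow("intro", "doc-a", 0),
            new ChunkRow("details", "doc-a", 1),
            new ChunkRow("overview", "doc-b", 0));
        latestPerSource(rows).forEach(System.out::println);
    }
}
```

A test that compares the query output against this kind of reduction, on data with more than one chunk per source, would expose the difference that the current LIMIT 10 hides.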


DELETE c instead of DETACH DELETE c (GraphRAGEmbeddingStore.java)

session.run("MATCH (c:LCChunk) DELETE c");

If LCChunk nodes ever acquire relationships (e.g., a follow-up demo adds edges), this will fail at runtime. DETACH DELETE c is the safe, idiomatic choice.


No timeout on ArcadeDB health-check loop (setup.sh)

until curl -sf ... "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done

If ArcadeDB fails to start, this loop spins indefinitely on a local machine. The CI job's 15-minute timeout provides an outer bound, but a local ./setup.sh run could hang forever. Adding a counter (e.g., MAX_RETRIES=30) with an error exit would be a small improvement.
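
The bounded-retry idea is independent of the shell; a sketch of the pattern in Java (used here only to stay in the same language as the repo's demo modules) might look as follows. The `ReadinessSketch` class, the `waitUntilReady` helper, and the fake probe are hypothetical; in setup.sh the probe would be the curl call against /api/v1/ready and the sleep would stay at 2 seconds.

```java
import java.util.function.BooleanSupplier;

class ReadinessSketch {
    // Polls the probe until it returns true or maxRetries attempts are used up.
    // Returns false on timeout or interruption so the caller can exit with an error.
    static boolean waitUntilReady(BooleanSupplier probe, int maxRetries, long sleepMillis) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake probe that reports ready on its third call; a real probe would
        // issue the HTTP readiness request instead.
        boolean ready = waitUntilReady(() -> ++calls[0] >= 3, 30, 10);
        System.out.println(ready ? "ready after " + calls[0] + " attempts" : "gave up");
    }
}
```

The important property is that the loop has a finite upper bound (30 attempts here, matching the MAX_RETRIES=30 suggestion) and a distinct failure result.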


maven-assembly-plugin version not pinned (java/pom.xml, langchain4j/pom.xml)

Both pom.xml files omit the <version> for maven-assembly-plugin. Maven will resolve to whatever the super-POM or local repository provides, making builds non-reproducible. Pin to a specific version (e.g., 3.7.1).


Minor observations

In-memory cosine similarity limits scalability (GraphRAGContentRetriever.java, GraphRAGEmbeddingStore.java)

Both LangChain4j classes fetch all chunk embeddings from the DB and rank them client-side. The inline comments explain the reason (ArcadeDB's vectorNeighbors() is SQL-only, not available over Bolt). This is fine for a demo, but a note in the README's "ArcadeDB Version Notes" section would help users who try to adapt the code for larger datasets.
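
For readers adapting the demos, the client-side ranking pattern described above can be sketched in plain Java. This is a hand-rolled stand-in, not the PR's code: the demos use LangChain4j's CosineSimilarity.between(), whereas the `cosine` and `topK` helpers, the `Scored` record, and the toy 2D vectors below are assumptions made for illustration. It shows why cost grows linearly with the number of stored chunks: every embedding is scored on each query.

```java
import java.util.ArrayList;
import java.util.List;

class InMemoryRankingSketch {
    // Position of a stored embedding plus its similarity to the query.
    record Scored(int index, double score) {}

    // Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every stored vector (a full scan), drops those below minScore,
    // and returns the k best in descending score order.
    static List<Scored> topK(float[] query, List<float[]> stored, int k, double minScore) {
        List<Scored> scored = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            double s = cosine(query, stored.get(i));
            if (s >= minScore) {
                scored.add(new Scored(i, s));
            }
        }
        scored.sort((x, y) -> Double.compare(y.score(), x.score()));
        return new ArrayList<>(scored.subList(0, Math.min(k, scored.size())));
    }

    public static void main(String[] args) {
        // Toy 2D vectors; the demos use 384D AllMiniLmL6V2 embeddings.
        List<float[]> stored = List.of(
            new float[] {0f, 1f}, new float[] {1f, 0f}, new float[] {1f, 1f});
        topK(new float[] {1f, 0f}, stored, 2, 0.5).forEach(System.out::println);
    }
}
```

An indexed nearest-neighbor search inside the database avoids this full scan, which is the trade-off the README note should spell out.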


truncate utility is duplicated (GraphRAG.java, GraphRAGContentRetriever.java)

Identical private static method defined in both classes. Acceptable for standalone demos, just worth noting.


CI does not validate result non-emptiness (.github/workflows/graph-rag.yml, queries/queries.sh)

The CI passes as long as curl exits with status 0. A query returning {"result":[]} would not fail the job. Piping through jq 'if (.result | length) == 0 then error("empty result") else . end' would catch silent empty results in the curl runner.


Mixed LangChain4j dependency versions (langchain4j/pom.xml)

The pom uses 1.11.0 GA for langchain4j but 1.11.0-beta19 for langchain4j-embeddings-all-minilm-l6-v2. The comment explains the workaround; consider tracking the LangChain4j release and unifying to GA once that artifact is published.


Design doc schema section is slightly stale (docs/plans/2026-02-26-graph-rag-design.md)

The Schema section still says "One document type, four vertex types" with Chunk listed as a document type. The Implementation Deviations table correctly documents that Chunk became a vertex type, but the Schema section itself was not updated.


What works well

  • Consistent directory structure with recommendation-engine
  • set -euo pipefail on all shell scripts
  • jq -cn --arg for JSON construction avoids shell injection in SQL statements
  • Pinned action SHAs in CI for supply-chain security
  • fail-fast: false in the CI matrix ensures all three runners execute even if one fails
  • Implementation Deviations table is excellent documentation of ArcadeDB-specific workarounds (Bolt port, driver version, vectorDistance subquery limitation, etc.)
  • try-with-resources and tryRun wrapper in the Java module handle cleanup and partial failures gracefully
  • No external API keys required — fully self-contained demo

Generated with Claude Code
