feat: add Graph RAG use case #10
Defines schema, sample data, queries, and module structure for the Graph RAG use case based on arcadedb.com/graph-rag.html. Includes Neo4j Bolt driver Java module and langchain4j submodule with local AllMiniLmL6V2 embeddings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
11-task plan covering Docker Compose, schema, sample data, curl queries, Java Bolt/Cypher module, langchain4j embedding store and content retriever modules, README, and integration smoke test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r examples Add GraphRAGEmbeddingStore.java demonstrating vector ingestion and similarity search, and GraphRAGContentRetriever.java showing the RAG content retriever pipeline with min-score filtering. Fix embedding model dependency version to use community beta release (1.11.0-beta19) since the stable artifact is not published. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change Chunk from DOCUMENT TYPE to VERTEX TYPE (edges require vertices)
- Enable BoltProtocolPlugin in docker-compose.yml (port 7687)
- Downgrade neo4j-java-driver to 4.4.12 (compatible with ArcadeDB Bolt v4)
- Remove :Entity labels from Cypher queries (ArcadeDB doesn't resolve parent type)
- Simplify SQL vector queries (remove vectorDistance subquery)
- Rewrite langchain4j to use Neo4j driver directly (ArcadeDB doesn't support Neo4j's SHOW VECTOR INDEX DDL used by Neo4jEmbeddingStore)
- Update README with correct port, schema types, and run instructions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename Query 1 header in GraphRAG.java to "Graph Traversal with Entity Collection" (was misleadingly "Hybrid Vector + Graph" without vector)
- Add empty-result guard in GraphRAGContentRetriever
- Add version comment in langchain4j pom.xml for embedding model artifact
- Document all implementation deviations in design doc (port, driver version, vertex type, Cypher labels, Neo4jEmbeddingStore incompatibility)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test curl queries, Java Bolt driver, and LangChain4j demos in CI. Also add knowledge-graphs and graph-rag to root README use case table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes
Hello @robfrank, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request adds a comprehensive Graph RAG use case. The new example shows how ArcadeDB combines graph traversal, vector search, and full-text indexing for retrieval-augmented generation, and it provides runnable demonstrations via curl, the Java Bolt driver, and LangChain4j.
Code Review
This pull request introduces an excellent and comprehensive Graph RAG use case, complete with Docker setup, schema, sample data, and multiple runnable examples. The documentation is thorough, and the inclusion of a design document with implementation deviations is particularly helpful. My review identifies a few areas for improvement: I suggest removing a temporary implementation plan file that is inconsistent with the final code, clarifying a misleading method name in the Java example, and refactoring an inefficient and polluting data storage pattern in one of the LangChain4j demos to better showcase the database's capabilities.
session.run("CREATE (c:LCChunk {content: $content, embedding: $embedding})",
    Values.parameters("content", text, "embedding", vector));
This implementation creates a new vertex type LCChunk on-the-fly without a corresponding schema definition or a vector index. This has two main issues:
- Inefficient Pattern: It performs a full scan of LCChunk vertices and computes similarity in-memory on the client side. This is highly inefficient and does not showcase the database's vector search capabilities. A proper implementation should leverage a vector index for an efficient nearest-neighbor search within the database.
- Database Pollution: The LCChunk vertices are created and left in the database, polluting the schema and data with temporary objects from a demo run.
It is strongly recommended to refactor this. If the goal is to show storing new embeddings, a proper schema for LCChunk (including a 384D vector index) should be added to 01-schema.sql. If a Bolt protocol limitation prevents an indexed search, this should be documented. Alternatively, the demo should clean up the data it creates after it runs.
Fixed in 7e61222. Added MATCH (c:LCChunk) DELETE c cleanup before inserting new nodes to prevent data pollution across runs. Also added Javadoc explaining the in-memory similarity approach (ArcadeDB's vectorNeighbors() is SQL-only, not available over Bolt).
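The in-memory similarity approach described in this reply can be illustrated with a minimal sketch. It is not the code from the PR: the class name `InMemorySimilarity`, the method names, and the `Map<String, float[]>` input shape are assumptions made for illustration. It only shows the general technique of scoring every stored vector against a query vector on the client, applying a minimum-score filter, and keeping the top k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PR: ranks chunk embeddings client-side.
class InMemorySimilarity {

    /** Cosine similarity of two equal-length vectors; 0.0 if either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Scores every chunk against the query (a full scan), drops scores below
     * minScore, and returns at most k entries ordered by descending score.
     * An empty list means nothing passed the filter.
     */
    static List<Map.Entry<String, Double>> topK(float[] query, Map<String, float[]> chunks,
                                                int k, double minScore) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : chunks.entrySet()) {
            double score = cosine(query, e.getValue());
            if (score >= minScore) {
                scored.add(Map.entry(e.getKey(), score));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(scored.subList(0, Math.min(k, scored.size())));
    }
}
```

Because every vector is fetched and scored, the cost grows linearly with the number of chunks, which is the scalability concern the reviewers raise for anything beyond a small demo dataset.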
PR Review: Graph RAG Use Case
Good addition overall. The structure follows the existing
Bugs / Correctness
1.
Suggestion: add
2.
The content retriever re-embeds chunk text to generate 384D vectors, then computes cosine similarity in-memory via LangChain4j's
3. Query 3 "Temporal-Aware Retrieval" is misleading
MATCH (c:Chunk)
WHERE c.chunkIndex = 1
RETURN c.content, c.source, c.chunkIndex
ORDER BY c.chunkIndex DESC // no-op: all results have chunkIndex = 1
Filtering
Design / Maintenance
4. Internal AI plan files committed to the public repo
These are internal implementation artifacts. Committing them to a public example repo is noisy for external contributors. Consider adding
5.
until curl -sf -u "..." "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done
If ArcadeDB never becomes healthy, this loops forever. The CI 15-minute timeout is a backstop, but locally a developer would be stuck. A simple counter with a max-attempts guard would help.
6. LangChain4j version mismatch
Minor Notes
What's Working Well
Generated with Claude Code
- Rename Query 1 method to runQuery1GraphTraversal (was misleadingly named HybridVectorGraph despite being graph-only over Bolt)
- Fix Query 3: remove no-op WHERE chunkIndex=1 filter, rename from "Temporal-Aware Retrieval" to "Latest Chunk Per Document"
- Clean up LCChunk nodes before inserting in GraphRAGEmbeddingStore to prevent data accumulation across repeated demo runs
- Clarify in Javadoc that similarity is computed in-memory because vectorNeighbors() is SQL-only, not available over Bolt protocol
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressed review feedback in 7e61222: Fixed:
By design:
Code Review: Graph RAG Use Case
Overall this is a well-structured addition that follows the existing
Issues
Query 3 does not actually return the latest chunk per document
The label says "Get the highest-indexed chunk per source document", but the query:
returns all chunks sorted, not the maximum-index chunk per source. With only 8 chunks and a LIMIT 10, everything is returned anyway, which masks the issue. A correct implementation would use aggregation or a subquery. As-is, the query does not validate the pattern it claims to demonstrate.
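To make the difference between "all chunks sorted" and "maximum-index chunk per source" concrete, here is a minimal client-side sketch of the aggregation. It is an illustration only: the `LatestChunkPerSource` class and its nested `Chunk` type are hypothetical, with fields modelled on the `content`, `source`, and `chunkIndex` properties returned by the query under review. Whether the same grouping can be pushed into the Cypher query over ArcadeDB's Bolt endpoint is not assumed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PR: keeps one chunk per source document.
class LatestChunkPerSource {

    /** Minimal stand-in for a row returned by the Chunk query (assumed shape). */
    static final class Chunk {
        final String source;
        final int chunkIndex;
        final String content;

        Chunk(String source, int chunkIndex, String content) {
            this.source = source;
            this.chunkIndex = chunkIndex;
            this.content = content;
        }
    }

    /**
     * Groups chunks by source and keeps only the one with the highest
     * chunkIndex in each group. Sources appear in first-seen order.
     */
    static List<Chunk> latestPerSource(List<Chunk> chunks) {
        Map<String, Chunk> latest = new LinkedHashMap<>();
        for (Chunk c : chunks) {
            // merge() keeps whichever of the stored and incoming chunk has the larger index
            latest.merge(c.source, c, (kept, next) -> next.chunkIndex > kept.chunkIndex ? next : kept);
        }
        return new ArrayList<>(latest.values());
    }
}
```

With this grouping, a dataset of 8 chunks spread over a few sources yields one row per source rather than all 8 rows, so a LIMIT can no longer hide the missing aggregation.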
session.run("MATCH (c:LCChunk) DELETE c");
If
No timeout on ArcadeDB health-check loop
until curl -sf ... "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done
If ArcadeDB fails to start, this loop spins indefinitely on a local machine. The CI job's 15-minute timeout provides an outer bound, but a local
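The max-attempts guard that both reviews ask for is a small pattern. The loop under review is a shell script; the sketch below shows the same idea in Java, the language of the demo modules, purely as an illustration. The `BoundedWait` class, its method name, and the `BooleanSupplier` probe are assumptions; in practice the probe would wrap whatever readiness check the caller uses, such as a request to the `/api/v1/ready` endpoint shown above.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of the PR: polls a readiness probe a bounded number of times.
class BoundedWait {

    /**
     * Calls the probe up to maxAttempts times, sleeping delayMillis between
     * attempts. Returns true as soon as the probe succeeds, or false once the
     * attempts are exhausted or the thread is interrupted.
     */
    static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }
}
```

A caller can then fail fast with a clear message when the method returns false instead of waiting indefinitely, which is the behaviour the reviews want for local runs.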
Both pom.xml files omit the
Minor observations
In-memory cosine similarity limits scalability
Both LangChain4j classes fetch all chunk embeddings from the DB and rank them client-side. The inline comments explain the reason (ArcadeDB's
Identical private static method defined in both classes. Acceptable for standalone demos, just worth noting.
CI does not validate result non-emptiness
The CI passes as long as
Mixed LangChain4j dependency versions
The pom uses
Design doc schema section is slightly stale
The Schema section still says "One document type, four vertex types" with
What works well
Generated with Claude Code
Summary
What's included
- docker-compose.yml with BoltProtocolPlugin enabled (ports 2480 + 7687)
- queries.sh with 5 query patterns (hybrid vector+graph, multi-hop bridging, temporal, composite scoring, agentic RAG)
- java/ module: 5 Cypher queries via Neo4j Bolt driver (4.4.12)
- langchain4j/ module: AllMiniLmL6V2 384D embeddings + cosine similarity search + graph context expansion
- docs/plans/
Test plan
- docker compose up -d && ./setup.sh creates database successfully
- ./queries/queries.sh returns non-empty results for all 5 queries
- java -jar target/graph-rag.jar connects via Bolt and runs all 5 queries
- java -jar target/graph-rag-langchain4j.jar ingests chunks with 384D embeddings and returns similarity results
- java -cp target/graph-rag-langchain4j.jar com.arcadedb.examples.GraphRAGContentRetriever runs semantic search + graph expansion
🤖 Generated with Claude Code