Skip to content

feat: add fraud detection use case#16

Merged
robfrank merged 12 commits into
mainfrom
feat/fraud-detection
Mar 2, 2026
Merged

feat: add fraud detection use case#16
robfrank merged 12 commits into
mainfrom
feat/fraud-detection

Conversation

@robfrank

@robfrank robfrank commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add fraud detection use case demonstrating ArcadeDB's multi-model capabilities across four signal types: graph traversal, vector similarity, time-series analysis, and full-text fuzzy matching
  • Implement 8 query patterns: fraud ring detection, synthetic identity resolution, circular money flow, structuring detection, behavioral anomaly, velocity attack, correlated activity, and multi-model investigation
  • Include docker-compose (26.3.1-SNAPSHOT), setup script, SQL schema/data (~70 records across 11 accounts), curl queries, Java runner, CI workflow, and design docs

Test plan

  • docker compose up -d in fraud-detection/ starts ArcadeDB
  • ./setup.sh creates FraudDetection database, applies schema + data without errors
  • ./queries/queries.sh runs all 8 queries and returns non-empty results
  • cd java && mvn package -q && java -jar target/fraud-detection.jar runs all 8 queries
  • CI workflow Fraud Detection CI passes both test (curl) and test (java) matrix jobs
  • Fraud ring query returns acct-B through acct-E for acct-A
  • Synthetic identity query returns the F/G pair
  • Velocity query flags acct-H

🤖 Generated with Claude Code

Demonstrate ArcadeDB's multi-model fraud detection capabilities across
four signal types: graph traversal, vector similarity, time-series
analysis, and full-text fuzzy matching with 8 query patterns.

Includes docker-compose, setup script, SQL schema/data, curl queries,
Java runner, CI workflow, and design docs. Targets ArcadeDB 26.3.1-SNAPSHOT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, self-contained fraud detection use case for ArcadeDB. It showcases the database's multi-model strengths by combining graph, vector, time-series, and full-text capabilities to identify complex fraud patterns. The implementation includes a complete environment with Docker, sample data, and both command-line and Java application examples, ensuring a ready-to-run and verifiable demonstration of advanced fraud analytics.

Highlights

  • New Fraud Detection Use Case: Added a comprehensive fraud detection use case demonstrating ArcadeDB's multi-model capabilities, integrating graph traversal, vector similarity, time-series analysis, and full-text fuzzy matching.
  • Eight Query Patterns Implemented: Implemented eight distinct query patterns to detect various fraud types, including fraud rings, synthetic identities, circular money flow, structuring, behavioral anomalies, velocity attacks, correlated activity, and multi-model investigations.
  • Complete Example Environment: Provided a full working example with docker-compose for ArcadeDB, a setup script for schema and data, curl and Java-based query runners, and detailed design and CI workflow documentation.
  • CI Workflow Integration: Integrated a GitHub Actions CI workflow to automatically verify both curl and Java runners for the fraud detection use case on pushes and pull requests.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • docs/plans/2026-03-02-fraud-detection-ci.md
    • Added a detailed plan for implementing the GitHub Actions CI workflow for the fraud detection use case, covering architecture, tech stack, and verification steps.
  • docs/plans/2026-03-02-fraud-detection-design.md
    • Added the design document for the fraud detection use case, outlining the repository structure, Docker Compose configuration, database schema, sample data characteristics, and the eight query patterns.
  • fraud-detection/README.md
    • Added a README file providing an overview, prerequisites, quickstart instructions, schema details, query patterns, and sample data description for the fraud detection use case.
  • fraud-detection/docker-compose.yml
    • Added a Docker Compose file to define and run the ArcadeDB service for the fraud detection environment, including port mapping, environment variables, and a health check.
  • fraud-detection/java/pom.xml
    • Added the Maven Project Object Model (POM) file for the Java application, configuring dependencies (ArcadeDB network client) and the assembly plugin to create an executable JAR.
  • fraud-detection/java/src/main/java/com/arcadedb/examples/FraudDetection.java
    • Added the main Java application file that connects to ArcadeDB and executes all eight fraud detection queries, formatting and printing their results.
  • fraud-detection/queries/queries.sh
    • Added a shell script that uses curl and jq to execute all eight fraud detection queries against the running ArcadeDB instance.
  • fraud-detection/setup.sh
    • Added a bash script to automate the setup of the ArcadeDB database, including waiting for the server, creating the 'FraudDetection' database, and applying the schema and sample data SQL files.
  • fraud-detection/sql/01-schema.sql
    • Added the SQL schema definition file, creating vertex types (Account, Customer, Device, Phone, Address, Email, Beneficiary, Transaction), edge types (USES_DEVICE, HAS_PHONE, HAS_ADDRESS, HAS_EMAIL, TRANSFERRED_TO, BENEFICIARY_OF), a document type (Deposit), and various indexes including unique, LSM_VECTOR, and FULL_TEXT.
  • fraud-detection/sql/02-data.sql
    • Added the SQL data insertion file, populating the database with sample records that represent various fraud patterns like fraud rings, synthetic identities, and velocity attacks, as well as legitimate accounts.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/fraud-detection.yml
Activity
  • No specific activity (comments, reviews, progress) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review

Overall a well-structured addition that follows recommendation-engine conventions closely. A few issues worth addressing before merging.

HIGH: Query 3 will become flaky over time. The circular-flow Cypher query uses datetime() - duration('P30D'). The transfer data is hardcoded to Feb 5-25 2026, so this passes today but returns zero rows after April 7 2026, making CI silently green on a broken query.

HIGH: Java runner always exits 0 even if all queries fail. tryRun() catches exceptions but main() always returns normally, so the java CI matrix job can never fail. Track failures and call System.exit(1) if any query threw.

MEDIUM: queries.sh has no result assertions. A zero-row result still exits 0. The test plan calls out specific expected matches (acct-B through acct-E, the F/G pair, acct-H) but these are never verified.

MEDIUM: Confirm time_bucket() exists in ArcadeDB SQL. This is a TimescaleDB/PostgreSQL function. If ArcadeDB does not support it, Query 4 always throws, tryRun() swallows it, and CI stays green while the feature is completely broken.

MEDIUM: Remove docs/plans/ files. The two files under docs/plans/ are internal implementation-planning artifacts with step-by-step notes and internal tooling instructions. They should not be committed to the public repo.

LOW: SNAPSHOT Docker image is mutable. arcadedata/arcadedb:26.3.1-SNAPSHOT can change between CI runs. Pin to a stable release when available.

LOW: Missing .java-version file. recommendation-engine/java/ has one with value 21; fraud-detection/java/ does not.

Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive fraud detection use case, showcasing ArcadeDB's multi-model capabilities. The implementation is well-structured, with clear separation of concerns between setup scripts, queries, and application code. My review focuses on improving security practices in the Docker configuration, ensuring consistency across documentation and code, and applying Java best practices. I've identified a few discrepancies in the design documents compared to the implementation, which should be aligned for clarity. Overall, this is a great addition to the use cases repository.

ports:
- "2480:2480"
environment:
JAVA_OPTS: "-Darcadedb.server.rootPassword=arcadedb"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding credentials, even for a local demo, is a security risk and a bad practice. It's better to use a .env file to manage secrets and reference them here. This prevents credentials from being checked into version control and makes the configuration more flexible. You can use variable substitution with a default value to maintain ease of use.

You would then have the option to create a .env file in the same directory with content like:

ARCADEDB_PASS=arcadedb

And add .env to your .gitignore file.

      JAVA_OPTS: "-Darcadedb.server.rootPassword=${ARCADEDB_PASS:-arcadedb}"

SELECT t.id, t.amount, t.merchant,
vectorDistance(t.behavior_embedding, c.profile_embedding) AS deviation
FROM Transaction t
JOIN Customer c ON t.customer_id = c.id

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The join condition ON t.customer_id = c.id appears to be a typo. The Transaction type uses account_id to link to an account, and the implementation in the Java/shell files correctly uses t.account_id. The documentation should be updated for consistency.

Suggested change
JOIN Customer c ON t.customer_id = c.id
JOIN Customer c ON t.account_id = c.id

Comment on lines +178 to +184
```sql
SELECT account_id, rate(ts) AS current_tps
FROM Transaction
WHERE ts > now() - INTERVAL '5m'
GROUP BY account_id
HAVING current_tps > 2
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This query using now() - INTERVAL '5m' and rate() is inconsistent with the static dataset and the actual implementation. The implementation uses a fixed time window and count(*), which is more appropriate for this demo. The design document should reflect the implemented query.

Suggested update:

SELECT account_id, count(*) AS txn_count, min(ts) AS first_txn, max(ts) AS last_txn
FROM Transaction
WHERE ts BETWEEN '2026-03-01T13:00:00Z' AND '2026-03-01T13:05:00Z'
GROUP BY account_id
HAVING txn_count > 5

Comment on lines +190 to +196
```sql
SELECT correlate(a.amount, b.amount) AS correlation
FROM Transaction a, Transaction b
WHERE a.account_id = 'acct-A' AND b.account_id = 'acct-B'
AND a.ts > now() - INTERVAL '30d'
AND b.ts > now() - INTERVAL '30d'
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This query is significantly different from the implementation in the Java and shell files. The documentation uses correlate() and a relative time window with now(), while the implementation calculates averages and counts over a fixed historical period. The documentation should be updated to match the code for clarity and accuracy.

Suggested update:

SELECT a.account_id AS account_a, b.account_id AS account_b,
       avg(a.amount) AS avg_a, avg(b.amount) AS avg_b,
       count(*) AS matching_txns
FROM Transaction a, Transaction b
WHERE a.account_id = 'acct-A' AND b.account_id = 'acct-B'
  AND a.ts >= '2026-02-01T00:00:00Z'
  AND b.ts >= '2026-02-01T00:00:00Z'

Comment thread fraud-detection/README.md Outdated

- Docker and Docker Compose
- `curl` and `jq`
- Java 17+ and Maven 3.x (for the Java demo)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The prerequisite for Java is listed as Java 17+, but the pom.xml file is configured for Java 21. To avoid confusion and potential build issues for users, this should be made consistent. It's best to specify Java 21 here.

Suggested change
- Java 17+ and Maven 3.x (for the Java demo)
- Java 21+ and Maven 3.x (for the Java demo)

import com.arcadedb.query.sql.executor.ResultSet;
import com.arcadedb.remote.RemoteDatabase;

public class FraudDetection {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This class consists only of static methods and is not designed for instantiation. It's a good practice to make such utility classes final and provide a private constructor to prevent accidental instantiation.

Example:

public final class FraudDetection {
  private FraudDetection() { /* prevent instantiation */ }
  // ...
}
Suggested change
public class FraudDetection {
public final class FraudDetection {

@mergify

mergify Bot commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

🧪 CI Insights

Here's what we observed from your CI run for f088c17.

🟢 All jobs passed!

But CI Insights is watching 👀

- Fix typo in design doc: customer_id → account_id in Query 5
- Update design doc Queries 6 and 7 to match implementation
  (remove aspirational rate()/correlate()/now() syntax)
- Fix README Java prerequisite: 17+ → 21+ to match pom.xml and CI
- Add .java-version file for consistency with recommendation-engine

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank

robfrank commented Mar 2, 2026

Copy link
Copy Markdown
Contributor Author

Addressed review feedback in bef3b4f. Here's the breakdown:

Fixed:

  • Design doc Query 5: customer_idaccount_id (typo)
  • Design doc Query 6: replaced aspirational rate()/now() syntax with implemented count(*)/fixed timestamps
  • Design doc Query 7: replaced aspirational correlate()/now() syntax with implemented avg()/count(*)/fixed timestamps
  • README: Java prerequisite 17+ → 21+ (matches pom.xml and CI)
  • Added fraud-detection/java/.java-version for consistency with recommendation-engine

Won't fix (with reasoning):

  • Hardcoded credentials in docker-compose.yml — This is a local demo project. recommendation-engine/docker-compose.yml uses the same pattern. The password arcadedb is not a secret; it's the default for a throwaway local container.
  • final class + private constructorRecommendationEngine.java isn't final either. Consistency across use cases matters more here than utility-class pedantry in demo code.
  • Java exit code always 0 — Pre-existing pattern from RecommendationEngine.java (same tryRun structure). Should be addressed across all use cases in a separate PR if desired.
  • queries.sh no result assertions — Same pattern as recommendation-engine/queries/queries.sh. Pre-existing.
  • Query 3 flaky date — Valid concern. The datetime() - duration('P30D') filter means this query returns empty results after ~April 7 2026. However, the design doc specifies this query shape. The sample data can be regenerated with fresh dates, or the query can be updated to use fixed timestamps — either way, a maintenance task rather than a blocker.
  • time_bucket() availability — This targets 26.3.1-SNAPSHOT specifically for these time-series features. If unsupported, CI will surface it immediately.
  • Remove docs/plans/ — The recommendation-engine CI plan (docs/plans/2026-02-25-ci-pipeline.md) is already committed. These serve as design decision documentation.
  • SNAPSHOT image mutable — Known and documented in README under "ArcadeDB Version Notes". Will pin to stable when 26.3.1 releases.

🤖 Generated with Claude Code

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

PR Review: feat/fraud-detection

Good overall structure — this closely mirrors the recommendation-engine layout and the schema/data design is well-thought-out. A few issues need attention before merge.


Blocking Issues

1. Missing snapshot repository in pom.xml

arcadedb-network:26.3.1-SNAPSHOT is a SNAPSHOT artifact. Maven Central does not host snapshots. Without a <repositories> block, mvn package will fail to resolve the dependency and the Java CI job will never pass.

Fix — add to fraud-detection/java/pom.xml:

<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <url>https://s01.oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots><enabled>true</enabled></snapshots>
    <releases><enabled>false</enabled></releases>
  </repository>
</repositories>

2. tryRun() swallows failures — Java runner always exits 0

private static void tryRun(Runnable r, String name) {
    try { r.run(); }
    catch (Exception e) { System.err.println("[" + name + " FAILED] " + e.getMessage()); }
}

Any query that throws prints to stderr but the program exits 0. The CI java job appears green even if all 8 queries fail. Either track failures and call System.exit(1) at the end, or let exceptions propagate from main.


3. Q3 (Circular Money Flow) is a time-bomb in CI

WHERE all(t IN relationships(path)
  WHERE t.ts > datetime() - duration('P30D'))

The transfers are hardcoded to Feb 5–25, 2026. This query returns results only while CI runs within 30 days of those dates — i.e. before ~March 25, 2026. After that, Q3 silently returns empty. Since queries.sh never asserts non-empty output, CI stays green but the test is meaningless.

Fix: use a static lower bound matching the seed data, e.g. WHERE t.ts > '2026-01-01T00:00:00Z'.


Important Issues

4. queries.sh never validates query results

jq '.result' exits 0 whether .result is [], null, or populated. The curl matrix job provides no real signal. Consider adding | jq 'if length == 0 then error("empty result") else . end' for queries with known-non-empty results (Q1 must return accounts B–E, Q2 must return the F/G pair, etc.).


5. AI planning artifacts committed to the repo

docs/plans/2026-03-02-fraud-detection-ci.md opens with:

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

These are internal development artifacts, not user documentation. They should be removed before merge.


6. SNAPSHOT Docker image — fragile CI

arcadedata/arcadedb:26.3.1-SNAPSHOT may not exist on Docker Hub, may be pulled non-deterministically, or may be replaced without notice. The recommendation-engine uses a stable tag. Consider pinning to a stable release.


Minor / Non-blocking

7. Missing indexes on query-critical columns

sql/01-schema.sql creates vector and full-text indexes but omits range/equality indexes used in several queries:

  • Transaction.account_id — used in Q5, Q6, Q7
  • Transaction.ts — used in Q6 (range scan)
  • Deposit.ts — used in Q4 (time-bucket group-by)

Negligible at 70 rows but worth adding as a reference implementation.


8. setup.sh silently ignores per-statement errors

send_sql discards HTTP errors. If a CREATE INDEX or INSERT fails, setup continues silently leaving the database in a partial state. Add || { echo "FAILED: ${stmt}"; exit 1; } to the send_sql call in apply_file.


9. Structuring deposits missing for accounts D and E

02-data.sql adds 3 structuring deposits for A, B, C but none for D and E, despite the README describing all five fraud ring members as having this pattern. Q4 will flag only three accounts.


10. .mergify.yml uses [skip ci] on Dependabot auto-merges

This bypasses CI for all Dependabot merges, meaning a dependency regression can land on main without validation.


Summary

# Severity File Issue
1 Blocking java/pom.xml Missing snapshot repository
2 Blocking FraudDetection.java Failures swallowed, always exits 0
3 Blocking queries/queries.sh + data Q3 time-bomb: empty after March 25
4 Important queries/queries.sh No result validation
5 Important docs/plans/ AI artifacts committed
6 Important docker-compose.yml SNAPSHOT image fragility
7 Minor sql/01-schema.sql Missing indexes on account_id/ts
8 Minor setup.sh Silent per-statement errors
9 Minor sql/02-data.sql D/E missing structuring deposits
10 Minor .mergify.yml [skip ci] skips validation

Fix the three blocking issues before merge; the rest can be addressed in follow-up.

Generated with Claude Code

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — Fraud Detection Use Case

Overall this is a well-structured addition that follows the established repo conventions (same layout as recommendation-engine, SHA-pinned actions, set -euo pipefail in shell scripts, tryRun/try-with-resources in Java). A few issues worth addressing before merging:


Bug / Correctness

Query 7 — Cartesian product instead of correlation

The query joins Transaction a and Transaction b on account IDs and date ranges but has no condition linking the two sides. This produces a full cross-product (3 × 3 = 9 rows for the sample data) with matching_txns = 9 — meaningless as a correlation signal. Consider adding a time-proximity or amount-similarity predicate, e.g. abs(a.amount - b.amount) / greatest(a.amount, b.amount) < 0.05.


Build / Runtime

pom.xml missing SNAPSHOT repository

arcadedb-network:26.3.1-SNAPSHOT is not in Maven Central. Without a <repositories> block pointing to ArcadeDB's snapshot repository, mvn package will fail dependency resolution unless the CI runner happens to have a warm cache. Should be declared explicitly in pom.xml.


Performance

No index on Deposit(ts) or Deposit(account_id)

Query 4 (structuring detection) filters and groups on both ts and account_id but there is no index on the Deposit document type for either property. With large data sets this will be a full collection scan. The Transaction type has no index on ts either — worth adding for all time-range queries.

Query 5 — vectorDistance() evaluated twice

The function appears in both the SELECT list and the WHERE clause. If ArcadeDB does not CSE this internally, it computes the distance twice per row. A subquery or HAVING on the alias would avoid the duplication.


Design

Transaction is not connected to Account via a graph edge

Transaction stores account_id as a string property and uses SQL JOIN to reach Customer. This is inconsistent with the graph-first schema: you cannot traverse from an Account vertex to its Transaction vertices via edge traversal. Adding a HAS_TRANSACTION (or PERFORMED) edge type would make the schema fully graph-native and enable Cypher-based traversal in future queries.

Query 6 — hardcoded timestamp window

WHERE ts BETWEEN '2026-03-01T13:00:00Z' AND '2026-03-01T13:05:00Z'

This only works for the seeded test data. A brief comment in queries.sh explaining that the window is anchored to sample timestamps (and how to adjust) would help readers extending the example.


Repo Hygiene

.mergify.yml is repo-wide scope — this file configures Mergify for the entire repository, not just the fraud-detection feature. It should be part of a separate repo-level configuration PR with explicit maintainer sign-off, not bundled into a feature addition.

docs/plans/ internal scaffolding filesdocs/plans/2026-03-02-fraud-detection-design.md and docs/plans/2026-03-02-fraud-detection-ci.md appear to be implementation planning artifacts. Unless the project intentionally archives all planning docs (which docs/plans/ already does for other use cases), it is worth deciding whether this directory should remain or be gitignored.


Minor

Unsafe cast in FraudDetection.java (Query 5)

((Number) r.getProperty("amount")).doubleValue()

If amount is null this throws NullPointerException. A null-check before doubleValue() would make it robust.


Summary

Severity Issue
Bug Query 7 Cartesian product — results are semantically wrong
Build Missing SNAPSHOT repository declaration in pom.xml
Performance No index on Deposit(ts) / Deposit(account_id)
Performance Query 5 double vectorDistance() evaluation
Design Transaction not graph-connected to Account via edge
Scope .mergify.yml is repo-wide, not fraud-detection-specific
Minor Potential NPE on amount cast in Query 5

The CI structure, SHA-pinned actions, schema design, and breadth of fraud patterns (graph ring, synthetic identity, velocity, structuring, behavioral anomaly) are all solid. Fixing the Query 7 correctness issue and the missing Maven SNAPSHOT repository declaration are the highest-priority items before merging.

robfrank and others added 2 commits March 2, 2026 14:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — Fraud Detection Use Case

This is a well-structured addition that follows the existing recommendation-engine pattern closely. The multi-model angle (graph + vector + time-series + full-text) is a compelling demo. A few issues worth addressing before merge:


Bugs / Correctness

1. Query 7 is a Cartesian product, not a correlation query

queries/queries.sh and FraudDetection.java — Query 7:

SELECT a.account_id AS account_a, b.account_id AS account_b,
       avg(a.amount) AS avg_a, avg(b.amount) AS avg_b,
       count(*) AS matching_txns
FROM Transaction a, Transaction b
WHERE a.account_id = 'acct-A' AND b.account_id = 'acct-B'
  AND a.ts >= '2026-02-01T00:00:00Z'
  AND b.ts >= '2026-02-01T00:00:00Z'

This is a cross-join with no join predicate. count(*) returns count(A rows) × count(B rows) (3 × 3 = 9 with current data), and avg() operates on duplicated values from the Cartesian product. The label "matching_txns" is misleading — no matching or correlation is actually occurring. Either add a join predicate (e.g., same time bucket, same amount range) or rewrite as two independent aggregations. As-is, this query does not demonstrate what it claims.

2. tryRun() swallows query failures — CI can pass even when queries fail

FraudDetection.java:

private static void tryRun(Runnable r, String name) {
    try {
        r.run();
    } catch (Exception e) {
        System.err.println("[" + name + " FAILED] " + e.getMessage());
    }
}

All 8 queries are wrapped in tryRun(), which catches exceptions, prints to stderr, and continues. If every query throws, the JVM exits with code 0 and CI passes. The demo-resilience intent is understandable, but it makes CI unreliable as a correctness gate. Consider either (a) tracking failures and calling System.exit(1) if any query fails, or (b) letting exceptions propagate from main().

3. Database creation errors are silently suppressed

setup.sh:

curl -sf ... -d '{"command": "create database ..."}' > /dev/null || true

The || true suppresses all errors, not just "database already exists." Auth failures, server-not-ready states, or malformed requests will all silently pass, leaving the subsequent apply_file calls to fail with cryptic errors. Consider checking the response body for a known "already exists" condition and only suppressing that specific case.

4. Inconsistent and fragile type cast in Java

FraudDetection.java, runQuery5BehavioralAnomaly:

((Number) r.getProperty("amount")).doubleValue()

This is the only place in the file that explicitly casts a property to Number. If ArcadeDB returns null, this is an unguarded NPE. The other query methods use r.getProperty(...) directly in printf, which is consistent. Either use the same pattern everywhere or add a null check here.


Performance

5. Query 7 cross-join scales as O(n²)

Related to bug #1 — the Cartesian product degrades quadratically as Transaction records grow. For the current demo data this is invisible, but the query should be rewritten before it becomes a pattern copied into production.


Reliability / CI

6. SNAPSHOT Docker image in CI

docker-compose.yml pins arcadedata/arcadedb:26.3.1-SNAPSHOT. Snapshot images are mutable and can be removed from Docker Hub without notice, breaking CI with a pull error rather than a test failure. This mirrors the recommendation-engine approach so may be intentional until a stable release is tagged — but worth flagging explicitly.

7. queries.sh does not validate non-empty results

The query() helper exits non-zero only if curl or jq fails. An empty result set ([]) is valid JSON and exits 0 — so CI passes even if the fraud ring query returns no accounts. Adding | jq 'if length == 0 then error("empty result") else . end' on queries that are expected to return results would catch regressions in the data or query logic.


Minor / Style

8. Internal directive in docs/plans/ file should be removed

docs/plans/2026-03-02-fraud-detection-ci.md has a > **For Claude:** REQUIRED SUB-SKILL: ... directive at the top — an internal tooling note that should be stripped before merging into main.

9. .mergify.yml is repo-wide scope

This file applies to the entire repository, not just the fraud-detection use case. The auto-merge condition on approved Dependabot PRs should be reviewed by repo maintainers independently of this PR.

10. Transaction.account_id as STRING vs. a graph edge

The schema stores account_id as a denormalized string on Transaction rather than a BELONGS_TO edge to Account. This forces the behavioral anomaly query to use a SQL JOIN instead of graph traversal, which is inconsistent with the graph-first design used elsewhere (e.g., TRANSFERRED_TO edges). Worth a brief note in the README explaining the intentional trade-off.


Summary

Category Count
Bugs 4 (Q7 cross-join, tryRun exit code, silent error suppression, unsafe cast)
Performance 1 (Q7 O(n²))
CI reliability 2 (SNAPSHOT image, empty-result validation)
Minor / Style 3 (plan docs cleanup, .mergify.yml scope, schema design note)

The most important fixes are #1 (Query 7 logic) and #2 (CI exit code) — both undermine CI's value as a correctness gate. The rest are polish.

Generated with Claude Code

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — Fraud Detection Use Case

Overall this is a well-structured use case that mirrors the recommendation-engine layout and demonstrates ArcadeDB's multi-model capabilities clearly. A few items to address before merging:


🔴 Issues

1. docker-compose.yml: unpinned image tag

image: arcadedata/arcadedb:latest   #

The existing recommendation-engine/docker-compose.yml pins to arcadedata/arcadedb:26.2.1. This PR targets 26.3.1-SNAPSHOT (per the PR description and pom.xml), but the compose file uses latest. Inconsistent image tags between compose and Maven will cause hard-to-diagnose differences. Pin the tag:

image: arcadedata/arcadedb:26.3.1-SNAPSHOT   #

2. setup.sh: SQL errors are silently swallowed

send_sql pipes the curl response to /dev/null:

| curl -sf … -d @- > /dev/null   # exit 0 on HTTP 2xx, but errors in body are lost

ArcadeDB returns HTTP 200 for a statement that hits a constraint violation or syntax error, with an error field in the JSON body. A failed INSERT (e.g. duplicate unique ID) will go unnoticed, leaving the database partially initialised. The same issue exists in the recommendation-engine — but worth fixing here:

send_sql() {
  local stmt="$1"
  local response
  response=$(jq -cn --arg cmd "$stmt" '{"language":"sql","command":$cmd}' \
    | curl -sf -u "${ARCADEDB_USER}:${ARCADEDB_PASS}" \
        -X POST "${ARCADEDB_URL}/api/v1/command/${DB_NAME}" \
        -H "Content-Type: application/json" \
        -d @-)
  if echo "$response" | jq -e '.error' > /dev/null 2>&1; then
    echo "ERROR executing: $stmt" >&2
    echo "$response" | jq '.error' >&2
    return 1
  fi
}

3. sql/01-schema.sql: missing performance-critical indexes

Queries 4 (structuring), 5 (behavioral anomaly), and 6 (velocity) rely on full scans of Deposit and Transaction without index support:

Missing index Used by
Deposit(account_id) Q4 structuring GROUP BY
Deposit(ts) Q4 time-bucket
Transaction(account_id) Q5, Q7, Q8 JOINs and filters
Transaction(ts) Q6 velocity BETWEEN filter
Account(ssn) Q2 synthetic identity equality filter

For a demo with 70 rows this is invisible; at any realistic scale these queries will be full scans.

4. FraudDetection.java: NPE risk on amount property

In runQuery5BehavioralAnomaly:

((Number) r.getProperty("amount")).doubleValue()   // NPE if amount is null

Add a null guard or use r.getProperty("amount", 0.0) if the API supports it.


🟡 Improvements

5. Query 3 (Circular Money Flow): no LIMIT on cycle traversal

MATCH path = (origin:Account)-[:TRANSFERRED_TO*3..6]->(origin)

With *3..6 and no LIMIT, this enumerates all cycle lengths (3, 4, 5, 6 hops) simultaneously. Add LIMIT 50 or similar to bound result size in larger datasets.

6. Query 7 (Correlated Activity): implicit Cartesian product

FROM Transaction a, Transaction b
WHERE a.account_id = 'acct-A' AND b.account_id = 'acct-B'
  AND a.ts >= '2026-02-01T00:00:00Z'
  AND b.ts >= '2026-02-01T00:00:00Z'

This computes the full N×M cross-join between acct-A and acct-B transactions before filtering. There is no correlation predicate between a and b (e.g. matching time window, matching amount). The result is the same row repeated for every (a, b) pair — the avg and count are not meaningful as written. Either add a correlation condition (e.g. abs(a.amount - b.amount) < 100) or document what the query is actually measuring.

7. Hardcoded timestamps in Queries 6 & 7

WHERE ts BETWEEN '2026-03-01T13:00:00Z' AND '2026-03-01T13:05:00Z'

These match the sample data, so the queries work for the demo. A brief comment (or README note) explaining that the timestamp window is intentionally fixed to the seed data would prevent confusion.

8. docs/plans/ planning documents

docs/plans/2026-03-02-fraud-detection-ci.md and docs/plans/2026-03-02-fraud-detection-design.md appear to be internal implementation notes rather than user-facing documentation. The recommendation-engine use case does not include similar files. Consider removing them or moving them to a branch description/wiki so the docs/ directory stays focused on end-user content.

9. CI: no result-count assertion

Both test (curl) and test (java) pass as long as exit code is 0. A query returning 0 rows (e.g. because the data load silently failed per point #2) is indistinguishable from a successful query. Adding a jq assertion like .result | length > 0 for at least the key queries (fraud ring, synthetic identity, velocity) would make the CI a meaningful integration test rather than just a connectivity check.


✅ What's done well

  • Action SHA pins in the workflow — good supply-chain hygiene
  • set -euo pipefail in both shell scripts
  • jq used for JSON construction in send_sql / query (avoids shell injection)
  • try-with-resources for ResultSet in Java
  • tryRun wrapper lets all 8 queries run even if one fails
  • fail-fast: false in the CI matrix
  • Clean separation of schema, data, and queries into separate files
  • Schema uses IF NOT EXISTS guards, making it idempotent

ArcadeDB does not support JOIN, HAVING, comma-separated FROM tables,
vectorDistance(), time_bucket(), or .similarity(). Rewrite all 7
failing queries using proven ArcadeDB patterns:

- Q2: SEARCH_INDEX() instead of .similarity() cross-join
- Q3: Explicit 5-hop Cypher path instead of all() predicate
- Q4: Subquery wrapper instead of HAVING
- Q5: vectorCosineSimilarity() instead of vectorDistance() JOIN
- Q6: Subquery wrapper instead of HAVING
- Q7: GROUP BY account_id instead of cross-join
- Q8: Subquery IN filter instead of JOIN

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

PR Review: Fraud Detection Use Case

Overall this is a well-structured, clearly documented use case that demonstrates ArcadeDB's multi-model capabilities across graph, vector, time-series, and full-text signals. The shell scripts are clean, the CI workflow is solid, and the Java code is idiomatic. Here are my findings:


Bugs / Correctness Issues

1. Silent CI failures — Java runner (FraudDetection.java:29-35)

tryRun catches all exceptions and only prints to stderr, but never rethrows. If every query fails, the JVM exits 0 and the CI job shows green. Consider tracking a failure counter and calling System.exit(1) at the end of main when any query failed.

2. Silent CI failures — curl runner (queries/queries.sh)

query() pipes through jq '.result' but never checks that the result array is non-empty. All 8 queries could return [] and the script still exits 0. Adding assertions for expected non-empty results (e.g. piping through jq 'if length == 0 then error("empty result") else . end') would make the CI gate meaningful.

3. Docker image version mismatch (docker-compose.yml:3)

image: arcadedata/arcadedb:latest is unpinned, while pom.xml pins to 26.3.1-SNAPSHOT. The curl tests run against whatever latest resolves to at CI time, which may diverge from the Java client version. Pinning to the same snapshot tag would keep both matrix jobs consistent.

4. Query 2 description inaccurate — SSN check is not programmatic

The description says "Find accounts matching 'Smith' via full-text index, then check for shared SSN" but both the curl and Java implementations simply return matching rows — no SSN cross-check is performed by the query. The shared SSN is only visible to a human reading the output. Either update the description to "inspect for shared SSN" or extend the query to surface only accounts sharing an SSN (e.g. via a self-join or subquery grouping by ssn).


Performance Considerations

5. No indexes on commonly filtered columns

Queries 4-7 filter on Deposit.account_id, Deposit.amount, Transaction.account_id, and Transaction.ts, none of which have indexes in 01-schema.sql. For this demo size it is fine, but adding these indexes (or a README note explaining the omission) would make the example more instructive for users adapting it to production workloads.

6. Structuring query omits the time dimension (Query 4)

The query groups deposits only by amount range and count, with no time window. True structuring involves multiple sub-threshold transactions within a short window (e.g., same day). The test data happens to place all deposits on the same day, but the query would also flag three deposits spread across a year, making the "time-series" characterisation slightly misleading. Scoping by date(ts) or a rolling window would demonstrate the temporal aspect.


Code Quality / Minor Issues

7. apply_file is fragile with multi-line SQL (setup.sh:36-45)

The function reads one line = one SQL statement. Any statement split across multiple lines will be sent as separate invalid fragments. This works for the current single-line-per-statement files, but a comment noting this constraint would prevent accidental breakage when the SQL files are extended.

8. || true silences all database creation errors (setup.sh:22)

The intent is idempotency (DB may already exist), but || true also silently swallows network errors. If ArcadeDB is unreachable, the script continues and apply_file fails with a confusing error. Checking the HTTP response code specifically for "database already exists" would give cleaner diagnostics.

9. Query 8 "multi-model" label is misleading

Query 8 is described as "Multi-Model Investigation (Combined)" but is a straightforward SQL subquery joining two vertex types via a scalar field filter. It does not combine graph traversal, vector search, or full-text lookups. Renaming it or extending it with one additional signal type would better justify the label.

10. actions/checkout version comment may be incorrect (.github/workflows/fraud-detection.yml:31)

The step pins actions/checkout by SHA with the comment # v6.0.2. The latest known major version of actions/checkout is v4; there is no published v6. Please verify the SHA resolves to the claimed tag — mismatched version comments on pinned SHAs undermine the auditability benefit of SHA pinning.


Summary

Severity Finding
High Java runner exits 0 on query failure — CI shows green on broken queries
High curl runner exits 0 on empty results — no result validation
Medium Docker image latest unpinned vs. Maven SNAPSHOT version
Medium Query 2 description claims SSN check that the query does not perform
Medium actions/checkout SHA comment does not match claimed version
Low No indexes on filter columns used in Queries 4-7
Low Structuring query missing time window
Low apply_file breaks on multi-line SQL statements
Low `
Low Query 8 "multi-model" label overstates what the query does

The structural foundation is solid — fixing the silent-failure issues in the CI runners is the highest-priority change before merging.


Generated with Claude Code

robfrank and others added 2 commits March 2, 2026 14:58
SEARCH_INDEX() is not supported in WHERE clauses in ArcadeDB.
Use a simple SSN equality filter instead for synthetic identity
detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove FULL_TEXT index (no longer used by any query)
- Update Query 2 description from Full-Text to Document signal type
- Remove SEARCH_INDEX reference from version notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — Fraud Detection Use Case

Good overall structure that mirrors the existing recommendation-engine use case. The schema, sample data, CI workflow, and Java runner are well-organized. Below are the issues found, roughly ordered by severity.


Bugs / Correctness

1. docker-compose.yml uses arcadedata/arcadedb:latest instead of a pinned version

The rest of the PR consistently references 26.3.1-SNAPSHOT (README, pom.xml, CI troubleshooting notes, PR description). Using latest means CI could break silently when a new server version ships, the Java driver (arcadedb-network:26.3.1-SNAPSHOT) may be incompatible with whatever latest resolves to at run time, and it diverges from recommendation-engine/docker-compose.yml which pins arcadedata/arcadedb:26.2.1.

Fix: pin to arcadedata/arcadedb:26.3.1-SNAPSHOT (matching pom.xml).


2. Missing FULL_TEXT index on Account(full_name) in 01-schema.sql

The design doc, README, and query table all describe Query 2 as a full-text / fuzzy-matching operation. The schema file never creates the index:

-- missing from fraud-detection/sql/01-schema.sql
CREATE INDEX IF NOT EXISTS ON Account (full_name) FULL_TEXT;

Without it, any future query that relies on the full-text index (e.g. the similarity() function) will either fail or silently fall back to a full scan.


3. Query 2 doesn't actually do fuzzy name matching

The design doc and PR description show Query 2 using a.full_name.similarity(b.full_name) BETWEEN 0.4 AND 0.9, but both queries.sh and FraudDetection.java implement it as a plain SSN equality filter:

SELECT id, full_name, ssn FROM Account WHERE ssn = '123-45-6789' ORDER BY id

This works for the demo data, but it doesn't demonstrate the full-text / fuzzy-matching capability advertised in the docs. Either implement similarity() or update the documentation to match what's actually implemented.


4. Query 5 uses a hardcoded embedding instead of joining the Customer table

In both queries.sh and FraudDetection.java:

vectorCosineSimilarity(behavior_embedding, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

The vector happens to match acct-H's profile_embedding in 02-data.sql, but this is coincidental and fragile. The design doc describes joining Transaction to Customer on account_id to compare each transaction against that customer's profile. A join query would actually demonstrate the multi-model vector capability:

SELECT t.id, t.amount, t.merchant, t.account_id,
       vectorCosineSimilarity(t.behavior_embedding, c.profile_embedding) AS similarity
FROM Transaction AS t, Customer AS c
WHERE t.account_id = c.id
  AND vectorCosineSimilarity(t.behavior_embedding, c.profile_embedding) < 0.5
ORDER BY similarity

5. Query 8 doesn't demonstrate multi-model investigation

The current implementation:

SELECT id, name FROM Account
WHERE id IN (SELECT id FROM Customer WHERE recent_behavior IN ['suspicious', 'anomalous'])

The design doc and README both describe Query 8 as blending graph traversal, vector deviation, and temporal patterns into a composite risk score. The current implementation uses none of those — it's weaker than several earlier queries. Either bring the implementation closer to the design, or update the description to match what's actually implemented.


Code Quality

6. Planning documents committed to the repo

docs/plans/2026-03-02-fraud-detection-ci.md and docs/plans/2026-03-02-fraud-detection-design.md are internal scaffolding artifacts (the CI plan even contains a > **For Claude:** REQUIRED SUB-SKILL: directive). These shouldn't land on main — consider dropping them from this PR.

7. .mergify.yml is unrelated to this use case

A Mergify configuration is a repo-level change that belongs in a separate PR with its own review. It currently auto-merges Dependabot PRs with [skip ci], which bypasses CI checks entirely — that deserves explicit discussion.

8. setup.sh silently swallows database-creation errors

curl -sf ... -d "{\"command\": \"create database ${DB_NAME}\"}" > /dev/null || true

|| true makes the script continue even if the server returns an error. Subsequent apply_file calls will then fail with less obvious messages. Consider removing || true and letting set -euo pipefail handle it.

9. setup.sh wait loop has no retry limit

until curl -sf ... "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done

If ArcadeDB never becomes healthy the script loops forever. The Docker healthcheck with retries: 20 provides a ceiling in CI, but running setup.sh locally against a broken server will spin indefinitely. A counter (e.g. MAX_RETRIES=50) with a clear exit message would make failures easier to diagnose.


Minor

  • Query 6 / Query 7 use hardcoded timestamps (2026-03-01T13:00:00Z, 2026-02-01T00:00:00Z) that must exactly match 02-data.sql. A comment in both files explaining this coupling would help future maintainers.
  • Transaction vertex lacks edges to Account — references are stored as an account_id string property instead. This is a valid design choice, but a brief note in the README explaining the intentional trade-off would help readers understand why graph edges aren't used here.

Summary

Issue Severity
docker-compose.yml uses latest instead of pinned version High
Missing FULL_TEXT index on Account(full_name) Medium
Query 2 doesn't implement fuzzy name matching Medium
Query 5 uses hardcoded reference vector Medium
Query 8 doesn't demonstrate multi-model capability Medium
Planning docs committed to main Low
.mergify.yml unrelated to this PR Low
setup.sh silently ignores DB creation errors Low
setup.sh wait loop has no retry limit Low

The most important fix before merging is pinning the Docker image version to match the Java driver snapshot. The query correctness issues (3, 4, 5) are worth addressing to ensure the use case actually demonstrates what the README and docs promise.

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review: feat: add fraud detection use case

Overall this is a well-structured addition that follows existing repo conventions. The 8-query multi-model demo is a compelling showcase for ArcadeDB. Below are my findings by area.


Critical / Bugs

setup.sh — infinite loop risk (lines 11-14)
The readiness wait loop has no timeout. If ArcadeDB never becomes healthy the CI job will hang until the 15-minute timeout-minutes limit. Consider adding a retry counter:

retries=0
until curl -sf -u "${ARCADEDB_USER}:${ARCADEDB_PASS}" \
    "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  retries=$((retries + 1))
  [ "$retries" -ge 30 ] && { echo "ERROR: ArcadeDB not ready after 60s"; exit 1; }
  sleep 2
done

Alternatively, use docker compose up -d --wait in the CI step to let Docker's own healthcheck (already defined with retries: 20) handle the wait.

FraudDetection.java — non-zero exit on query failure is suppressed
tryRun catches all exceptions and prints to stderr but lets the program exit 0. The CI "Build and run Java" step will report success even if every single query throws an exception. A simple fix is to track a failure flag:

private static boolean hadFailure = false;

private static void tryRun(Runnable r, String name) {
    try { r.run(); }
    catch (Exception e) {
        hadFailure = true;
        System.err.println("[" + name + " FAILED] " + e.getMessage());
    }
}

// in main(), before "All queries complete.":
if (hadFailure) System.exit(1);

02-data.sql — incomplete data placeholders
The diff contains comment stubs like:

-- ... (txn-H02 through txn-H10, all within 5 minutes)
-- ... (plus per-account devices for F, G, H, L1, L2, L3)

If these INSERTs are genuinely missing, Query 6 (velocity attack, requires txn_count > 5) and Query 1 (fraud ring via USES_DEVICE) will return empty results — silently passing CI while not testing the intended patterns. Please confirm the full data file is committed.


Reproducibility / Version Pinning

docker-compose.yml uses arcadedata/arcadedb:latest while graph-rag/docker-compose.yml pins to arcadedata/arcadedb:26.2.1 and pom.xml references 26.3.1-SNAPSHOT. Using latest means CI results are not reproducible — a new ArcadeDB release could silently break the demo. Pin to a specific version to match the pattern in the other use cases.

The SNAPSHOT dependency in pom.xml is also mutable. Once 26.3.1 GA is available, pin to it.


Missing Indexes

The schema creates indexes on primary ID columns and two LSM_VECTOR indexes, but several hot query paths have none:

Column Used in
Account.ssn Query 2 (full table scan)
Transaction.account_id Queries 5, 6, 7
Deposit.account_id Query 4

For ~70 records this is harmless today, but as a best-practices demo it is worth modelling what production schemas would look like.


Minor Issues

setup.sh — broad error suppression on database creation (line 22)
|| true swallows all errors including unexpected ones. A safer approach is to check for and ignore only the "already exists" error code, or use CREATE DATABASE IF NOT EXISTS if ArcadeDB supports that syntax.

queries.sh — no result validation
The script exits on curl failure but an empty [] result is not an error. The test plan items ("Fraud ring query returns acct-B through acct-E", "Velocity query flags acct-H") are not actually asserted. Consider adding spot-checks for known expected values.

FraudDetection.java:114 — potential NPE

((Number) r.getProperty("amount")).doubleValue()

If amount is null, this throws NullPointerException inside tryRun. A null-safe accessor would be cleaner.


Design Note

docs/plans/2026-03-02-fraud-detection-design.md references an ANN index for Query 5, and 01-schema.sql creates an LSM_VECTOR index. Worth verifying with EXPLAIN that vectorCosineSimilarity() in the SQL query actually uses the vector index rather than falling back to a full scan.


Summary

Severity Item
Bug tryRun exits 0 even when all queries fail
Bug Possible incomplete data in 02-data.sql
Reliability No timeout on readiness loop in setup.sh
Reproducibility latest image tag in docker-compose.yml; SNAPSHOT in pom.xml
Best practices Missing indexes on ssn, account_id columns
Minor || true masks unexpected DB-creation errors
Minor queries.sh does not assert expected results
Minor Potential NPE on null amount in Java runner

The use case structure, CI workflow patterns, and multi-model query design are solid and consistent with the existing repo. The two most important items before merge are the silent Java exit-0 issue and confirming the data file is complete.

- Pin docker-compose.yml to arcadedata/arcadedb:26.3.1-SNAPSHOT to
  match Maven dependency version
- Rename Query 8 from "Multi-Model Investigation" to "Cross-Type
  Investigation" — the query is a SQL subquery across vertex types,
  not a combination of graph/vector/time-series signals

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank

robfrank commented Mar 2, 2026

Copy link
Copy Markdown
Contributor Author

Review response

Fixed (2 items)

3. Docker image version mismatch — Fixed. Pinned docker-compose.yml to arcadedata/arcadedb:26.3.1-SNAPSHOT to match the Maven dependency. (726d4c7)

9. Query 8 "multi-model" label misleading — Agreed, the query is a SQL subquery across two vertex types, not a combination of graph + vector + time-series. Renamed to "Cross-Type Investigation (SQL Subquery)". (726d4c7)

Already resolved / stale (1 item)

4. Query 2 description inaccurate — This was based on an earlier version. Query 2 was rewritten in 5f9858e to use WHERE ssn = '123-45-6789' ORDER BY id. The description now says "Find accounts sharing the same SSN" which accurately describes what the query does.

Pushback (7 items)

1. Java runner exits 0 on failure — The tryRun pattern is identical to the recommendation-engine use case (RecommendationEngine.java:26-32). This is a project-wide convention. Changing it only for fraud-detection would create inconsistency across use cases. If this should change, it should be done for all use cases together.

2. curl runner exits 0 on empty results — Same as above — the query() helper is identical to recommendation-engine/queries/queries.sh:15-21. Project-wide convention.

5. No indexes on commonly filtered columns — At demo scale with ~70 records, these indexes would add schema complexity without demonstrable benefit. The schema already includes 3 UNIQUE indexes and 2 LSM_VECTOR indexes. Adding a README note about production indexing would be over-documenting for a demo.

6. Structuring query omits time dimension — The query is intentionally simplified. The sample data is constructed so that the fraud ring deposits cluster on the same day, making the query return correct results. Adding date(ts) grouping would add complexity without changing the output for the demo dataset, and time_bucket() is not supported by ArcadeDB.

7. apply_file fragile with multi-line SQL — Same pattern as recommendation-engine (setup.sh:36-45 there). All SQL files in this project use one-statement-per-line by convention. Adding a comment about a constraint that's already enforced by convention is unnecessary.

8. || true on DB creation — Same pattern as recommendation-engine (setup.sh:22 there). The until curl loop on lines 11-14 already verifies ArcadeDB is reachable before reaching the create command. If the server goes down between the health check and the create, apply_file will fail immediately on the next step with a clear error.

10. actions/checkout version comment incorrect — Verified: actions/checkout v6.0.2 exists and the SHA de0fac2e4500dabe0009e67214ff5f5447ce83dd matches. The v6 major version was released in 2025.

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

PR Review: Fraud Detection Use Case

Great addition! The overall structure mirrors the existing recommendation-engine well, and the multi-model approach (graph + vector + time-series + full-text) is a compelling showcase for ArcadeDB. Here are my findings across several categories.


Bugs / Correctness Issues

1. FULL_TEXT index missing from schema (sql/01-schema.sql)

The design doc and README both mention a FULL_TEXT index on Account(full_name) to support fuzzy name matching. It is never created in the schema file. The line should be:

CREATE INDEX IF NOT EXISTS ON Account (full_name) FULL_TEXT;

Without it, Query 2's fuzzy-matching intent cannot be fulfilled by the index.

2. Query 2 does not use full-text / fuzzy matching

The design doc specifies full_name.similarity() between 0.4 and 0.9 to catch near-duplicate names. The actual implementation uses a plain SSN equality filter (WHERE ssn = '123-45-6789'). This is simpler and works for the demo, but it skips the full-text signal type that the README advertises for this query.

3. Query 3 uses a fixed 5-hop pattern instead of variable-length cycle detection

The design doc shows:

MATCH path = (origin:Account)-[:TRANSFERRED_TO*3..6]->(origin)

The implementation hard-codes exactly 5 hops (A->B->C->D->E->A). A variable-length pattern is both more general and a better demonstration of ArcadeDB's Cypher capabilities.

4. Query 4 drops the time_bucket window

The design specifies grouping by day (time_bucket('1d', ts)) to enforce the "3+ deposits per day" rule. The actual query counts total deposits with no time bucketing, which changes the semantics. An account could have 3 deposits spread over months and still be flagged.

5. Query 5 uses a hardcoded embedding instead of joining Customer

vectorCosineSimilarity(behavior_embedding, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

This vector happens to equal acct-H's profile_embedding — but it is not joined from the Customer table. The query is also scoped to WHERE account_id = 'acct-H', so it does not demonstrate cross-customer anomaly detection. The design intent was:

JOIN Customer c ON t.account_id = c.id
WHERE vectorDistance(t.behavior_embedding, c.profile_embedding) > 0.7

6. Query 7 is not a correlation query

The description says "compare transfer patterns between two accounts to detect coordination", but the implementation is just a GROUP BY account_id that returns two independent rows. The original design had a self-join computing matching_txns — the current query cannot detect that two accounts are correlated.


Performance Considerations

7. Missing index on Transaction.account_id and Deposit.account_id

Queries 4-8 all filter or group by account_id on Transaction and Deposit types, but neither has an index on that field. For a real dataset this causes full scans. Adding:

CREATE INDEX IF NOT EXISTS ON Transaction (account_id) NOTUNIQUE;
CREATE INDEX IF NOT EXISTS ON Deposit (account_id) NOTUNIQUE;

would improve query performance and better illustrate ArcadeDB indexing.

8. Transaction.account_id as a denormalised STRING

The graph schema models account relationships (device, phone, address, transfers) as edges, yet transactions are linked to accounts via a plain string foreign key. This is inconsistent with the graph-centric design and forces SQL-style lookups instead of graph traversal. Consider modeling it as a MADE_BY edge.


CI / Reliability

9. SNAPSHOT Docker image makes CI non-deterministic

image: arcadedata/arcadedb:26.3.1-SNAPSHOT

SNAPSHOT images can be overwritten at any time. CI builds will silently change behavior when a new snapshot is published. Pin to a stable release tag (or at least document that this is intentional and temporary).

10. setup.sh wait loop has no timeout

until curl -sf ... /api/v1/ready > /dev/null 2>&1; do
  sleep 2
done

If ArcadeDB never becomes healthy, this loops forever until the 15-minute CI timeout kills the job. A counter (e.g. 50 retries x 2 s = 100 s) with an explicit error message would fail faster and give a better diagnostic.

11. No result validation — empty results silently pass

Both queries.sh and FraudDetection.java print results but never assert they are non-empty. If a query returns zero rows (e.g. due to a bug or schema mismatch), CI turns green with no output and no failure. Consider checking jq '.result | length > 0' in the shell script, and throwing an exception in Java when expected queries return no rows.


Code Quality

12. NPE risk in Java Query 5 (FraudDetection.java)

((Number) r.getProperty("amount")).doubleValue()

If amount is null or not a Number, this throws. The other queries use r.getProperty() directly (safe as Object), so the explicit cast here is inconsistent and fragile.

13. || true silently ignores database creation errors (setup.sh)

curl ... -d "{\"command\": \"create database ${DB_NAME}\"}" > /dev/null || true

This suppresses all errors, including network failures or auth problems. It would be better to tolerate only a "database already exists" response by inspecting the HTTP status code or response body.


Scope / Repository Hygiene

14. docs/plans/ files are internal planning artifacts

docs/plans/2026-03-02-fraud-detection-design.md and docs/plans/2026-03-02-fraud-detection-ci.md contain implementation notes including lines like:

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

These are internal scaffolding documents that have no value to end users and should not be committed to the repository.

15. .mergify.yml scope creep

This is a new repo-level Mergify configuration that affects all PRs and branches, not just the fraud-detection use case. It should be introduced in a separate, focused PR with explicit discussion from maintainers.


Minor / Suggestions

  • Query 1 traverses HAS_ADDRESS edges. The fraud ring (A-E) all share the same address row, and F/G also share a different address with each other. Verify the expected result set for acct-A does not include false positives from the address path.
  • README labels Query 2 as "Document" signal type, but the design doc and query table label it "Full-Text". Pick one and be consistent.

Summary

Category Count
Bugs / correctness 6
Performance 2
CI / reliability 3
Code quality 2
Scope / hygiene 2

The most impactful items to address before merging are #1 (missing FULL_TEXT index), #9 (SNAPSHOT image), #11 (no result validation), and #14-15 (repo hygiene). The query simplifications (#2-#6) are worth discussing — they work for the demo but diverge from the stated design goals.

🤖 Generated with Claude Code

- Remove .mergify.yml (repo-wide config unrelated to fraud-detection,
  should be introduced in a separate PR)
- Update design doc Query 2 to reflect SSN lookup implementation
  with note explaining why SEARCH_INDEX() was dropped
- Update design doc Query 3 to reflect explicit 5-hop pattern
  with note explaining ArcadeDB Cypher limitations
- Fix Query 2 signal type label: Full-Text → Document
- Fix Query 8 label: Multi-Model → Cross-Type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank

robfrank commented Mar 2, 2026

Copy link
Copy Markdown
Contributor Author

Review response

Fixed (3 items)

9. Query 8 label misleading — Renamed to "Cross-Type Investigation (SQL Subquery)" in the previous commit (726d4c7). Design doc also updated.

15. .mergify.yml scope creep — Removed from this PR. It's a repo-wide config and should be introduced separately. (8095ca4)

Minor: README vs design doc inconsistency on Query 2 — Design doc updated to match the README ("Document" signal type) with a note explaining why SEARCH_INDEX() was dropped. (8095ca4)

Already resolved / stale (2 items)

1. FULL_TEXT index missing — Intentionally removed in 05d4a56. SEARCH_INDEX() fails with a parsing error in ArcadeDB's WHERE clause. The index served no purpose without a query using it.

2. Query 2 doesn't use full-text — Correct. SEARCH_INDEX() in WHERE is not supported by ArcadeDB (verified against the actual server — "mismatched input 'SEARCH_INDEX'"). The SSN equality filter was adopted as the working alternative in 5f9858e. The README and design doc have been updated to reflect this.

Pushback (10 items)

3. Query 3 fixed 5-hop vs variable-length — ArcadeDB's Cypher does not support all() predicates or datetime() - duration(). The variable-length pattern with these predicates fails at parse time. The explicit 5-hop pattern matches the known fraud ring topology and works. Design doc updated with a note explaining this. (8095ca4)

4. Query 4 missing time_buckettime_bucket() is not a supported ArcadeDB function (verified against the function reference). ArcadeDB also does not support HAVING. The sample data places all fraud ring deposits on the same day, so the count-based detection is correct for the demo dataset.

5. Query 5 hardcoded embedding — ArcadeDB does not support JOIN...ON syntax (verified — "no viable alternative at input 'JOIN'"). The hardcoded vector demonstrates vectorCosineSimilarity() which is the point of the Vector signal type demo. A LET-based approach could work but would add complexity without changing the result.

6. Query 7 not a correlation query — ArcadeDB does not support comma-separated tables in FROM (cross-joins) or self-joins. The GROUP BY account_id with IN ['acct-A', 'acct-B'] shows the per-account aggregates that would be compared. The correlation is visible in the output (matching avg_amount and txn_count patterns).

7. Missing indexes on account_id — At demo scale (~70 records), these indexes add schema complexity without demonstrable benefit. The schema already includes 3 UNIQUE + 2 LSM_VECTOR indexes. Production indexing guidance is out of scope for a demo use case.

8. Transaction.account_id as string — Intentional. Queries 4-7 use SQL GROUP BY account_id and WHERE account_id = ... which require a property, not edge traversal. Converting to edges would require rewriting all SQL queries to use Cypher or MATCH, defeating the purpose of demonstrating SQL capabilities.

9 (CI). SNAPSHOT Docker image — This is intentional and documented in the README ("ArcadeDB Version Notes"). The features this demo uses (vectorCosineSimilarity(), LSM_VECTOR indexes) are only available in 26.3.1-SNAPSHOT — there is no stable release with them yet. When a stable release ships, the version should be updated.

10. Wait loop no timeout — Same pattern as recommendation-engine (setup.sh:10-14 there). The CI workflow has a 15-minute timeout-minutes that covers this. Adding a second timeout layer would be redundant.

11. No result validation — Same pattern as recommendation-engine (both queries.sh and RecommendationEngine.java use identical helpers). This is a project-wide convention. Changing it for one use case creates inconsistency.

12. NPE risk in Query 5 — The amount field is defined as DOUBLE in the schema and is always set in the sample data. The cast to Number is required for printf("$%-10.2f") formatting — the other queries don't use decimal formatting so they don't need it. In a controlled demo dataset, this is not a risk.

13. || true on DB creation — Same pattern as recommendation-engine. The until curl health check loop (lines 11-14) verifies ArcadeDB is reachable before the create command runs. If the server goes down between health check and create, apply_file fails immediately with a clear error on the next step.

14. docs/plans/ files — These exist on main for all previous use cases: recommendation-engine (3 files), knowledge-graphs (3 files), graph-rag (2 files). They serve as architectural decision records. Removing them only from fraud-detection would be inconsistent. The "For Claude" lines are scaffolding that could be cleaned up, but that applies repo-wide, not to this PR.

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review: Fraud Detection Use Case

Good addition to the repository - the structure mirrors the recommendation-engine pattern cleanly and the multi-model angle (graph + vector + time-series + document) is a strong showcase for ArcadeDB. A few issues to address before merging:


Blockers

1. SNAPSHOT image/artifact availability

docker-compose.yml and pom.xml both target 26.3.1-SNAPSHOT. SNAPSHOT Docker images are typically not published to Docker Hub, and SNAPSHOT Maven artifacts may not be on Sonatype Central yet. If either is unavailable, the CI matrix will fail on every run.

Options:

  • Pin to the latest stable tag (26.2.1, as used by recommendation-engine) and document which features require a newer release
  • Or confirm that arcadedata/arcadedb:26.3.1-SNAPSHOT and com.arcadedb:arcadedb-network:26.3.1-SNAPSHOT are both reliably published before merging

2. Plan documents should not be committed

docs/plans/2026-03-02-fraud-detection-ci.md and docs/plans/2026-03-02-fraud-detection-design.md are internal implementation notes, not user-facing documentation. The CI plan doc contains a raw internal instruction that should not be in the repo. These files should be removed from the PR.


Bugs / Correctness

3. Query 6 - hard-coded timestamp window

WHERE ts BETWEEN '2026-03-01T13:00:00Z' AND '2026-03-01T13:05:00Z'

This exact window must match the test data. If the data file ever changes dates, this query silently returns 0 rows and the test still passes (the curl script only checks for non-error exit code, not non-empty results). Either add a result-count assertion in queries.sh for queries that must return rows, or use a relative time expression if ArcadeDB SQL supports it.

4. Java Query 5 - unsafe cast

((Number) r.getProperty("amount")).doubleValue()

If amount is null or the server returns an unexpected type, this throws a NullPointerException or ClassCastException. While tryRun() catches it, it silently logs [Query 5 FAILED] null with no indication of what went wrong. A null-safe pattern would be safer:

Object amt = r.getProperty("amount");
double amount = amt instanceof Number n ? n.doubleValue() : 0.0;

5. Query 8 - implicit Customer/Account ID coupling

WHERE id IN (SELECT id FROM Customer WHERE recent_behavior IN ['suspicious', 'anomalous'])

This assumes Customer.id values are identical to Account.id values for the same entity. The schema does not enforce this relationship. A comment in the SQL or README explaining this data coupling would help readers understand why it works.


Design / Maintainability

6. setup.sh - unbounded wait loop

until curl -sf ... ; do sleep 2 ; done

No timeout. If ArcadeDB never becomes healthy (e.g., image pull fails silently), this loops until the CI job's 15-minute timeout with no diagnostic output. A counter-based guard would improve this:

attempts=0
until curl -sf ...; do
  sleep 2
  (( attempts++ )) && (( attempts >= 50 )) && { echo "ArcadeDB did not start"; exit 1; }
done

(The same issue exists in recommendation-engine.)

7. setup.sh - silent failure on database creation

curl ... -d '{"command": "create database ..."}' > /dev/null || true

The || true masks unexpected errors beyond "already exists". Worth checking the HTTP response code to distinguish "already exists" from genuine failures.

8. setup.sh - one-statement-per-line constraint undocumented

apply_file() sends each line as a separate SQL command. Nothing enforces this in the SQL files, so a future contributor adding a multi-line INSERT will get a cryptic parse error. Add a comment at the top of both SQL files:

-- IMPORTANT: each statement must fit on a single line (setup.sh applies one line at a time)

Minor / Suggestions

9. README - reference vector in Query 5 unexplained

Query 5 uses [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8] as the comparison vector. A one-line note explaining this represents the expected behavioral profile for account-H's customer would clarify the intent.

10. Velocity window in Query 6 is magic

The specific window 2026-03-01T13:00:00-13:05:00Z is sized to catch the velocity attack pattern in the sample data. An inline comment tying it to the test data would help readers understand it is not arbitrary.

11. CI - consider docker compose up -d --wait

The Docker healthcheck is configured but the CI step does not wait for it - it relies on the until curl ... loop in setup.sh. Using docker compose up -d --wait would block until the healthcheck passes and eliminate the need for the loop entirely.


What's working well

  • SHA-pinned actions with version comments - good supply-chain hygiene
  • fail-fast: false on the matrix - both runners report independently
  • if: always() on teardown - containers cleaned up even on failure
  • jq -cn --arg in queries.sh for safe JSON construction - avoids injection
  • IF NOT EXISTS throughout the schema - idempotent setup
  • LSM_VECTOR index with explicit dimensions and similarity metric - correct usage
  • tryRun() wrapper in Java - individual query failures do not abort the whole run

robfrank and others added 2 commits March 2, 2026 18:36
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank robfrank merged commit c13a74f into main Mar 2, 2026
6 checks passed
@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — PR #16: Fraud Detection Use Case

This is a solid addition to the repo and follows the established recommendation-engine patterns well. Below are issues ranging from data correctness bugs to design inconsistencies.


Critical / High

1. Query 5 uses a hardcoded reference vector instead of joining with Customer

FraudDetection.java and queries/queries.sh both do:

SELECT id, amount, merchant, account_id,
       vectorCosineSimilarity(behavior_embedding, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]) AS profile_similarity
FROM Transaction
WHERE account_id = 'acct-H'
ORDER BY profile_similarity

The hardcoded vector [0.1, 0.2, ...] happens to match acct-H's profile_embedding, but this only works because the query is already filtered to a single account. The design doc describes the intended query as a proper JOIN:

SELECT t.id, t.amount, t.merchant,
       vectorDistance(t.behavior_embedding, c.profile_embedding) AS deviation
FROM Transaction t
JOIN Customer c ON t.account_id = c.id
WHERE vectorDistance(t.behavior_embedding, c.profile_embedding) > 0.7
ORDER BY deviation DESC

The JOIN approach demonstrates genuine anomaly detection across all accounts dynamically, which is far more instructive. The hardcoded version also breaks the LSM_VECTOR index story — vector indexes are only useful if you're querying with a known probe vector, not for per-row distance computation, but having a JOIN query at least shows the multi-model value proposition clearly.

2. Missing structuring deposits for acct-D and acct-E

sql/02-data.sql inserts structuring deposits for acct-A, acct-B, and acct-C, but acct-D and acct-E — which are described as part of the same fraud ring with the same structuring pattern — have no deposits at all. The design doc says "3+ deposits per day in the $8,000–$9,999 range" for all fraud ring members (A–E). Query 4 will only flag A, B, and C.

3. docs/plans/ files contain internal planning artifacts and should not be committed

docs/plans/2026-03-02-fraud-detection-ci.md and docs/plans/2026-03-02-fraud-detection-design.md are internal implementation plans including Claude Code-specific instructions (REQUIRED SUB-SKILL: Use superpowers:executing-plans). These aren't useful for repository consumers and could be confusing in a public repo. If the repo wants to document design decisions, a docs/adr/ or inline README section would be more appropriate.


Medium

4. Query 4 loses time-bucketing semantics silently

The design doc specifies per-day structuring detection via time_bucket('1d', ts). The implementation drops this entirely (ArcadeDB SNAPSHOT apparently doesn't support time_bucket), but counts deposits across all time without making this trade-off visible to the reader:

-- Design intent: flag accounts with 3+ deposits on the SAME day
-- Actual: flag accounts with 3+ deposits EVER matching the amount range
SELECT FROM (
  SELECT account_id, count(*) AS deposit_count
  FROM Deposit
  WHERE amount BETWEEN 8000 AND 9999
  GROUP BY account_id
) WHERE deposit_count >= 3

The README notes that HAVING is unsupported, but doesn't mention the loss of time_bucket. Either add a -- Note: time_bucket unavailable; detecting across all dates comment in the SQL files and README, or consider GROUP BY account_id, date(ts) if ArcadeDB's date() function is available.

5. Query 7 doesn't actually detect correlation

The description says "Correlated Account Activity — Compare transfer patterns between two accounts to detect coordination," but the query just shows per-account aggregate stats independently (no cross-account comparison):

SELECT account_id, avg(amount) AS avg_amount, count(*) AS txn_count
FROM Transaction
WHERE account_id IN ['acct-A', 'acct-B']
  AND ts >= '2026-02-01T00:00:00Z'
GROUP BY account_id

This doesn't demonstrate correlation. Consider adding a comment explaining that a proper temporal correlation would require a self-join (which may have ArcadeDB limitations), or rephrase the query description to match what the query actually does.

6. README signal-type label mismatch

The root README.md lists the 4th signal for fraud-detection as Cypher (a query language, not a signal type):

| [fraud-detection](...) | ... | Graph traversal, Vector similarity, Time-series, Cypher |

It should be Full-text search or Document queries to be consistent with the other rows in the table. Internally the fraud-detection README.md correctly labels it "Document queries."

7. Transaction uses a denormalized account_id string instead of an edge

The Transaction vertex stores account_id STRING instead of being connected to Account via an edge (e.g., MADE_BY edge). This is architecturally inconsistent with the rest of the model where all relationships are expressed as edges (USES_DEVICE, HAS_PHONE, TRANSFERRED_TO, etc.). Storing account_id as a string makes graph traversal from accounts to their transactions impossible and requires SQL-style joins instead.


Low

8. No FULL_TEXT index in 01-schema.sql

The design doc's schema section lists:

FULL_TEXT on Account(full_name)

and the top of the fraud-detection README mentions "full-text fuzzy matching" in the original design goals. The index is never created in the actual schema. If the intent changed (to SSN equality), remove the mention from the design doc to avoid confusion.

9. setup.sh wait loop has no explicit timeout

until curl -sf ... "${ARCADEDB_URL}/api/v1/ready" > /dev/null 2>&1; do
  sleep 2
done

This is consistent with recommendation-engine/setup.sh so it's not a regression, but if ArcadeDB fails to start for any reason the loop hangs until the CI 15-minute timeout-minutes fires. A counter-based timeout (e.g., 50 attempts × 2s = 100s) would give a clearer error message and faster failure.

10. Missing .gitignore for Java build artifacts

fraud-detection/java/ has no .gitignore. If someone runs mvn package locally, target/ will appear as untracked. Consider adding a target/ entry (or checking whether the repo-level .gitignore already covers it).


Positives

  • CI workflow correctly uses SHA-pinned action versions with version comments — good security hygiene.
  • fail-fast: false in the matrix is the right call so curl and java jobs report independently.
  • The tryRun() wrapper in Java prevents one failing query from aborting all subsequent ones.
  • Schema uses IF NOT EXISTS throughout — idempotent setup is good for re-runs.
  • The use of jq -cn --arg in bash to construct JSON avoids shell injection when building query payloads.
  • set -euo pipefail in both shell scripts is correct.

🤖 Generated with Claude Code

@claude

claude Bot commented Mar 2, 2026

Copy link
Copy Markdown

Code Review — Fraud Detection Use Case

Overall this is a solid addition. The use case demonstrates ArcadeDB's multi-model capabilities well, the structure mirrors the existing use cases, and the CI workflow follows established patterns. Below are my findings, from most critical to minor.


Critical: Java runner silently swallows failures — CI can pass with broken queries

FraudDetection.java wraps every query in tryRun() which catches all exceptions and prints to stderr, but lets main() exit 0. If every query fails, the CI job shows green.

Suggestion: Track a failure count and call System.exit(1) at the end of main() if any query threw:

// Add a counter field
private static int failures = 0;

// In tryRun catch block: failures++;

// At end of main:
if (failures > 0) {
    System.err.println(failures + " query/queries failed.");
    System.exit(1);
}

Critical: queries.sh doesn't validate non-empty results

The query() helper pipes through jq '.result' and succeeds regardless of whether the result is null or []. With set -euo pipefail, jq exits 0 even for null output. The CI curl path can pass even if ArcadeDB returns empty results for all 8 queries.

Suggestion: After printing, assert the result is neither null nor []:

query() {
  local lang="$1" cmd="$2"
  local result
  result=$(jq -cn --arg l "$lang" --arg c "$cmd" '{"language":$l,"command":$c}' \
    | curl -sf -u "$AUTH" -X POST "$QUERY_URL" \
        -H "Content-Type: application/json" -d @- \
    | jq '.result')
  echo "$result"
  [[ "$result" != "null" && "$result" != "[]" ]]
}

Bug: Missing structuring deposits for fraud ring accounts D and E

02-data.sql creates structuring deposits (3+ per day in the $8k–$9.9k range) only for accounts A, B, and C. Accounts D and E have no Deposit records at all. The design doc states all fraud ring members exhibit the structuring pattern, and the test plan asserts "Structuring query flags fraud ring accounts". Query 4 will only return A, B, and C — not the full ring.


Schema: Missing FULL_TEXT index on Account(full_name)

The design doc lists it, the README schema table references it, but 01-schema.sql does not create it. Query 2 falls back to SSN equality (which works for the demo), but the index foundation is absent if anyone wants to add fuzzy name matching later.

-- missing from 01-schema.sql:
CREATE INDEX IF NOT EXISTS ON Account (full_name) FULL_TEXT;

Query 5: Hardcoded profile vector instead of a JOIN

The behavioral anomaly query hardcodes acct-H's profile vector as a literal [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]. This value happens to match the Customer record's profile_embedding, but the query does not JOIN against the Customer table — it's comparing against a magic constant. This is fragile if data changes and undermines the point of storing embeddings in Customer records. A cross-join or subquery reading the customer profile would be more realistic and instructive.


setup.sh: No timeout on the ArcadeDB readiness loop

If the SNAPSHOT image is unavailable or the container crashes, the until curl loop hangs until the CI job's 15-minute timeout-minutes triggers, producing an opaque timeout rather than a clear failure. Adding a retry counter (e.g. 50 retries × 2s = 100s) would fail fast with a diagnostic message.


Query 7: Describes correlation but delivers independent aggregates

The query description says "Compare transfer patterns between two accounts to detect coordination", but the implementation runs independent GROUP BY account_id aggregations. You cannot infer coordination from two separate aggregate rows without an external comparison step. The design doc showed a cross-join approach closer to the stated intent. A comment acknowledging this simplification would help readers understand the demo's scope.


Minor: No indexes on Transaction.account_id or Deposit.account_id

Queries 4–7 filter/group by account_id but there is no index on these columns. For the demo dataset size this is inconsequential, but adding them would make the schema more instructive:

CREATE INDEX IF NOT EXISTS ON Transaction (account_id) NOTUNIQUE;
CREATE INDEX IF NOT EXISTS ON Deposit (account_id) NOTUNIQUE;

Summary

Severity Issue
Critical Java tryRun() swallows failures — CI always exits 0 even when queries fail
Critical queries.sh does not assert non-empty results — curl CI path always passes
Bug Accounts D and E missing structuring deposits (Query 4 returns incomplete ring)
Schema FULL_TEXT index on Account(full_name) described but not created
Design Query 5 hardcodes the profile vector instead of reading from the Customer table
Robustness setup.sh readiness loop has no timeout/retry bound
Clarity Query 7 description promises correlation but delivers independent aggregates
Performance No indexes on account_id columns used by SQL queries 4–7

The docs/plans/ artifacts follow the repo's established convention. The CI workflow structure, SHA-pinned action versions, and fail-fast: false matrix are all correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant