#3730 feat: Raft-based High Availability using Apache Ratis#3731
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a fundamental shift in ArcadeDB's high availability architecture by integrating Apache Ratis, a robust Raft consensus implementation. This change aims to provide stronger consistency guarantees, improved fault tolerance, and a more reliable distributed system. The new design replaces custom replication logic with a proven consensus algorithm, backed by extensive testing for various failure scenarios and network conditions. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
2 similar comments
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
There was a problem hiding this comment.
Code Review
This pull request introduces a new Raft-based High Availability (HA) implementation for ArcadeDB, replacing the existing custom protocol. It includes a new ha-raft Maven module with components like RaftLogEntryCodec for serializing WAL and schema changes, ArcadeStateMachine for applying these changes, RaftHAServer for managing the Ratis cluster, and RaftReplicatedDatabase for intercepting database operations. New global configuration properties have been added for Raft-specific settings, and the server startup logic is updated to integrate the new HA plugin. Extensive unit and integration tests have been added to cover various replication, failover, and network partition scenarios, including a new /api/v1/cluster endpoint for status. The .gitignore file was updated. Feedback includes a dependency misconfiguration in ha-raft/pom.xml, a suggestion for .gitignore hygiene, a potential edge case in TransactionManager.applyChanges regarding bucket count updates, an incomplete replica lag monitoring implementation in RaftHAServer, and a recommendation for more specific error messages in AbstractServerHttpHandler for cluster token validation.
| </dependencies> | ||
|
|
||
| <build> | ||
| <plugins> | ||
| <plugin> | ||
| <groupId>org.apache.maven.plugins</groupId> |
There was a problem hiding this comment.
There appears to be a dependency misconfiguration in this pom.xml. The arcadedb-server artifact is included twice: once with provided scope (which is correct for a plugin) and again here with compile scope. The compile scope dependency also uses a hardcoded version (26.4.1-SNAPSHOT) instead of the ${project.parent.version} property.
This can lead to unpredictable build behavior and dependency conflicts. The compile scope dependency on arcadedb-server should be removed to avoid these issues.
| *.lsmvecidx | ||
| *.metadata.json | ||
|
|
||
| notes.txt |
There was a problem hiding this comment.
The file notes.txt appears to be a personal note file. It's generally better to add such user-specific files to your global .gitignore file (e.g., ~/.config/git/ignore) or the repository's local exclude file (.git/info/exclude) rather than the project's shared .gitignore. This helps keep the project's ignore list clean and focused on project-specific generated files and artifacts.
| if (changed) { | ||
| for (final Map.Entry<Integer, Integer> entry : bucketRecordDelta.entrySet()) { | ||
| final LocalBucket bucket = (LocalBucket) database.getSchema().getBucketById(entry.getKey()); |
There was a problem hiding this comment.
This loop over bucketRecordDelta is now inside if (changed). This is a good optimization to avoid unnecessary work when no pages were actually modified in the transaction. However, the changed flag is only set to true inside the loop over tx.pages. If a transaction only contains bucket count changes but no page modifications (which might be a rare edge case), this change would prevent the bucket counts from being updated.
If it's guaranteed that any transaction affecting bucket counts will also have modified pages, then this is fine. Otherwise, it might be safer to update the bucket counts regardless of the changed flag, as was the case previously.
| private void checkReplicaLag() { | ||
| try { | ||
| final var division = raftServer.getDivision(raftGroup.getGroupId()); | ||
| final var info = division.getInfo(); | ||
| if (!info.isLeader()) | ||
| return; | ||
| final long commitIndex = info.getLastAppliedIndex(); | ||
| clusterMonitor.updateLeaderCommitIndex(commitIndex); | ||
| } catch (final Exception e) { | ||
| LogManager.instance().log(this, Level.FINE, "Error checking replica lag", e); | ||
| } | ||
| } |
There was a problem hiding this comment.
The implementation of checkReplicaLag seems incomplete. As the comment inside the method notes, the Ratis public API may not expose per-follower match indexes, so this implementation only updates the leader's commit index in the ClusterMonitor. This means the lag monitoring feature will not function as intended because it cannot calculate or report any lag for individual replicas.
Consider either removing this monitoring logic until it can be fully implemented or adding a more prominent // TODO and logging a warning at startup that replica lag monitoring is not yet functional.
| sendErrorResponse(exchange, 401, "Invalid cluster token", null, null); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
The response for an invalid cluster token is a generic 401 error. For improved diagnostics in a misconfigured cluster, it would be helpful to provide a more specific error message in the response body, such as {"error": "Invalid cluster token"}. This would make it easier for operators to identify the root cause of authentication failures when forwarding requests between nodes.
Code Review: Raft-based HA using Apache RatisThis is a well-architected and thoroughly tested addition. The use of Apache Ratis is a sound choice for proven consensus semantics, and the plugin-based design with ServiceLoader is clean. Below are issues found, ordered by severity. Critical1. Duplicate At the bottom of 2. Project guidelines require updating ATTRIBUTIONS.md when adding dependencies. Apache Ratis 3.2.0 ( High Priority3. Nine Nine AI-generated implementation plan documents are being added to 4. The 5. The format changed from 6. No TLS for Raft gRPC communication All inter-node Raft traffic is unencrypted. This should be documented clearly, and a follow-up issue should track TLS support. Medium Priority7. The build section only references 8. If file creation succeeds for some files but throws on a later one, the state machine is left in a partially-applied state. Created files are not rolled back, potentially leaving a replica in an inconsistent schema state. 9. This flag is documented as not intended for production use. Having it in 10.
Minor / Style11. Typo in test class filename: 12. 13. Summary
The critical The core architecture and test coverage are excellent - the Raft consensus approach is the right foundation for long-term HA correctness. |
Code Review: Raft-based High Availability (Apache Ratis)This PR introduces a new Critical / Data-Correctness Bugs1. Leadership-change race in if (raftHAServer != null && raftHAServer.isLeader()) {
// Skip – leader already committed locally
return;
}Leadership is checked at apply time, not at commit time. Ratis delivers 2. Two-phase commit is not atomic under concurrent leadership change In final RaftClientReply reply = raftHAServer.getClient().io().send(...);
if (!reply.isSuccess()) throw ...;
if (leader)
tx.commit2ndPhase(phase1); // <- leadership may have changed hereIf the node loses leadership between 3. Non-atomic schema version increment schemaJson.put("schemaVersion", schemaJson.getLong("schemaVersion") + 1);The version is read and incremented client-side before replication. Concurrent DDL operations can both read the same base version and both submit 4. No bounds checking in final int pageCount = buf.getInt();
tx.pages = new WALFile.WALPage[pageCount];
for (int i = 0; i < pageCount; i++) {
final int deltaSize = page.changesTo - page.changesFrom + 1;
final byte[] content = new byte[deltaSize];A corrupted or malicious log entry with a large Security Concerns5. Cluster token stored without file permission hardening
tokenFile.toPath().toFile().setReadable(false, false);
tokenFile.toPath().toFile().setReadable(true, true);Or use 6. Cluster-internal auth bypasses authorisation, not just authentication A successfully validated cluster token causes the handler to resolve the user from 7. Leader-forwarding uses plain HTTP URI.create("http://" + leaderHttpAddress + "/api/v1/command/" + databaseName)The cluster token and original Code Quality8.
9. Nine markdown files under 10. final boolean isRaftPlugin = autoDiscoverRaft && "RaftHAPlugin".equals(name);If the class is renamed this check silently breaks. A marker interface or an annotation would be more robust. 11. Wildcard imports in several files
12. Commented-out code in // private ServerMonitor serverMonitor;Dead code like this should be removed before merging. Performance13. Synchronous Raft final RaftClientReply reply = raftHAServer.getClient().io().send(Message.valueOf(entry));Every transaction commit blocks the calling thread waiting for quorum acknowledgement. For workloads with many small transactions this will be a severe throughput regression compared to the existing pipelined async HA. Consider documenting this trade-off and, if possible, exploring a pipeline or batch-commit option in Ratis ( 14. Duplicate static There is a Completeness15. Snapshot-based resync is not implemented
Test Coverage16. No test for leadership-change race during commit The two critical races described in items 1 and 2 above are not tested. The integration tests cover clean failovers, but not leadership changes that occur mid-commit. 17. No test for token file persistence across node restarts There is no test verifying that after a restart, Licensing Compliance18. Per
Apache Ratis is Apache 2.0 and has its own Summary
The architectural direction is solid. The critical correctness and security issues should be resolved before this merges to avoid data-loss scenarios in production clusters. |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
|
Reviewed the automated bot feedback — no actionable items: github-advanced-security:
gemini-code-assist:
|
|
Thank you for your detailed response, @robfrank. I'd like to clarify a couple of points regarding the feedback provided by
|
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
PR Review: feat: Raft-based High Availability using Apache RatisThis is a well-structured and ambitious PR. The TDD approach is followed, separation of concerns is clean, the That said, here are specific concerns worth addressing before merge: Critical / Correctness1. The state machine skips applying entries on the leader because "it was already applied via
The correct design is for all nodes (including the leader) to apply every committed log entry. If the leader already applied it locally before the commit, Ratis's idempotency (via log index) and 2.
3. Commit protocol has a failure window on the leader In
If step 3 fails (e.g., disk error), the followers have committed the entry but the leader's local database is in a half-applied state. On leader restart it will skip re-applying (per issue #1), making the leader permanently inconsistent with its own log. Security4. Cluster auth token forwarding over plain HTTP
Missing ATTRIBUTIONS.md update5. Apache Ratis dependency not added to ATTRIBUTIONS.md CLAUDE.md explicitly requires updating Design / Robustness6. Two-node MAJORITY quorum will cause writes to block whenever one node is down. NONE quorum on a multi-node cluster allows split-brain. Currently the plugin only logs warnings. For cases that are definitively unsafe (e.g., NONE quorum with ≥3 nodes), consider throwing a 7.
8. Leader failover during HTTP command forwarding has no retry
9. Schema entry decoder silently swallows The WAL section decoder inside Minor / Style10. Typo in test utility filename
11. Six AI implementation plan files ( 12. Over-broad GitHub Actions permissions in The 13. The handler calls SummaryThe architectural approach (Ratis + WAL page diffs as log entries + wrapping |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
|
PR Review posted via separate file. |
|
FULL REVIEW POSTED BELOW - please see the next comment for the complete review content |
PR Review: Raft-based High Availability using Apache RatisThis is an ambitious and well-structured PR. Architecture is sound and test coverage is impressive (unit + integration + e2e with Toxiproxy). Issues found, ordered by severity: Critical 1: Snapshot resync incomplete but log purging enabled Critical 2: deserializeWalTransaction duplicates WAL internal format Critical 3: ignoreErrors=true swallows replay failures Concurrency 4: databaseWrapper field is not volatile Concurrency 5: rewrapDatabases() is not atomic Concurrency 6: Static HTTP_CLIENT in RaftReplicatedDatabase Design 7: HA_SERVER_LIST format change breaks existing users Design 8: Peer identity tied to server name numeric suffix Design 9: maven-shade-plugin has no configuration Security 10: /api/v1/cluster authentication Security 11: Cluster token in plaintext Minor 12: ATTRIBUTIONS.md and NOTICE not updated for Apache Ratis 3.2.0. The primary blockers are the incomplete snapshot resync with enabled log purging (1), WAL deserialization duplication (2), ignoreErrors=true during replay (3), and the HA_SERVER_LIST breaking change (7). Concurrency issues (4-6) and the committed plan documents (13) should also be fixed before merge. Generated with Claude Code |
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferencesFootnotes
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3731 +/- ##
==========================================
- Coverage 64.80% 64.19% -0.62%
==========================================
Files 1582 1586 +4
Lines 117464 118511 +1047
Branches 24962 25179 +217
==========================================
- Hits 76128 76078 -50
- Misses 30947 31850 +903
- Partials 10389 10583 +194 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Code Review: Raft-Based High Availability (#3731)This is an ambitious and well-structured PR that replaces the custom replication protocol with Apache Ratis (Raft consensus). The architecture is sound and the code is generally clean. Below is my detailed review. Summary82 files changed, ~15,900 additions. Introduces a new Critical Issues1. Snapshot-based recovery not implemented ( The comments explicitly acknowledge that 2. Race condition in cluster token initialization ( If multiple nodes start simultaneously, they may both attempt to create/write the cluster token file without any locking or atomic file creation. Use High Priority Issues3. ThreadLocal buffer leak ( private static final ThreadLocal<List<byte[]>> schemaWalBuffer = ...
private static final ThreadLocal<List<Map<Integer, Integer>>> schemaBucketDeltaBuffer = ...If an exception is thrown between the buffer-clear and buffer-send operations, the buffers are left populated. Wrap buffer usage in try-finally to guarantee cleanup. In a thread-pool environment this causes state to leak into the next operation on the same thread. 4. Blocking HTTP call without timeout ( final HttpResponse<String> response = HTTP_CLIENT.send(builder.build(), HttpResponse.BodyHandlers.ofString());This blocking call has no timeout. If the leader is slow or unreachable, the calling thread hangs indefinitely. Configure 5. Silent error handling in forwarded command responses ( When the leader returns an error response (e.g. Medium Priority Issues6. Database name not URL-encoded in forwarded HTTP URI "http://" + leaderHttpAddress + "/api/v1/command/" + getName()While ArcadeDB name validation likely prevents special characters, URL-encode the database name as defense-in-depth (e.g. using 7. Server name parsing is fragile ( Parsing the peer ID by looking for the last hyphen/separator assumes a specific naming convention. A server name like 8. Leader forwarding idempotency gap When a replica forwards a DDL command to the leader and the leader crashes mid-execution, a retry from the replica could re-execute the same DDL on the new leader, causing unexpected side effects. Consider adding an idempotency/request ID header to forwarded commands. Code Style / Minor Issues9. Broad 10. 11. Test assertions — Per project conventions, prefer Security Notes
Performance Notes
Test CoverageGood breadth of testing. Specific gaps worth addressing before production:
Summary AssessmentThe architecture is solid and the Raft integration is well-done. The code is readable and follows project conventions well. The main concerns are:
Happy to discuss any of these points further. |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
2 similar comments
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
Code Review: Raft-based HA using Apache RatisOverall, this is a well-structured and well-tested implementation of a Raft-based HA stack. The use of Apache Ratis (Apache 2.0, battle-tested) is a sound choice, and the module structure correctly follows the project's That said, there are several issues that should be addressed before merge. BLOCKERS 1.
2. The project requires: When adding a dependency, you MUST update ATTRIBUTIONS.md and, if Apache-licensed with a NOTICE file, incorporate required notices into the main NOTICE file. Apache Ratis is Apache-licensed and ships a NOTICE file. Neither 3. Unresolved TODO in codec — empty bucket deltas
IMPORTANT ISSUES 4.
5. One test is disabled: 6. Typo in test class name
MINOR ISSUES 7. maven-shade-plugin — verify no classpath conflicts The 8. Verify the token is never logged at any level in 9. Peer ID resolution relies on the server name having a numeric suffix separated by 10. E2E workflow passes The e2e-ha.yml workflow runs POSITIVE HIGHLIGHTS
|
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
PR Review: Raft-based High Availability using Apache RatisThis is a substantial architectural change. The overall approach is sound and the test coverage is commendable. Here are the issues found. CRITICAL / BLOCKERS 1. ATTRIBUTIONS.md and NOTICE not updated 2. Snapshot-based resync is not wired — incomplete feature 3. Inter-node command forwarding uses plain HTTP with no auth fallback 4. rewrapDatabases() has a non-atomic replace race SIGNIFICANT ISSUES 5. Wildcard imports in RaftReplicatedDatabase 6. Commented-out code left in ArcadeDBServer.java 7. Typo in ClusterDatatbaseChecker.java 8. forwardCommandToLeaderViaRaft — misleading name 9. persistLastApplied uses StandardOpenOption.SYNC on every transaction 10. ClusterMonitor.getReplicaLags() creates a ConcurrentHashMap unnecessarily 11. startLagMonitor / stopLagMonitor — no synchronization on executor lifecycle 12. ArcadeStateMachine.applyTxEntry — leader may double-apply transactions MINOR / STYLE 13. Plan documents committed to the repo 14. findLocalPeerId — poor error message on non-numeric suffix 15. ha-raft not included in package/distribution 16. e2e-ha workflow uses --fail-never POSITIVE HIGHLIGHTS
Reviewed with Claude Code |
Code ReviewThis is a substantial and ambitious PR — replacing a custom HA protocol with Apache Ratis (a proven Raft implementation) is the right architectural direction. The module structure is clean, the test coverage is broad, and the e2e container test setup with Toxiproxy is excellent for realistic failure scenarios. That said, there are several blocking issues that need to be addressed before merge. 🔴 Critical — Blockers1. Data integrity: Leadership-change race in
|
- Add 8 new GlobalConfiguration entries: HA_STOP_SERVER_ON_REPLICATION_FAILURE, HA_SNAPSHOT_WRITE_TIMEOUT, HA_SNAPSHOT_WATCHDOG_TIMEOUT, HA_SNAPSHOT_GAP_TOLERANCE, HA_SNAPSHOT_MAX_ENTRY_SIZE, HA_IDEMPOTENCY_CACHE_TTL_MS, HA_IDEMPOTENCY_CACHE_MAX_ENTRIES, HA_GRPC_ALLOWLIST_REFRESH_MS - these are used by subsequent snapshot hardening, idempotency cache, gRPC security, and step-down retry tasks - RemoteSchema: make types/buckets fields volatile and reload() synchronized; build new maps locally then publish atomically to fix ConcurrentModificationException race on shared RemoteDatabase instances - PostVerifyDatabaseHandler: parallel peer queries via CompletableFuture, improved error handling, path traversal protection, response byte limit, and non-leader early return path - KubernetesAutoJoin: pod-ordinal-based jitter, Mode.ADD atomic config change, buildProbeProperties() inheriting TLS settings, method renamed tryAutoJoin() - RetryStep: add TimeoutException to catch clause alongside NeedRetryException - PostCommandHandler: fix duplicate requestMap variable (pre-existing compile error) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… phase-2 failure Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hardening, idempotency cache, gRPC security - Linearizable reads on followers via ReadIndex protocol (fetchReadIndex, ensureLinearizableRead, ensureLinearizableFollowerRead) - Snapshot hardening: write timeout watchdog, CountingInputStream compression ratio check, configurable watchdog timeout - HTTP idempotency cache for POST request deduplication via X-Request-Id header - PeerAddressAllowlistFilter + RaftGrpcServicesCustomizer for gRPC peer IP validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eady-applied pages applyChanges() previously threw ConcurrentModificationException on the first already-applied page and aborted the entire multi-page transaction. After a partial apply (crash recovery, Phase-2 failure), the remaining un-applied pages were silently lost, causing WALVersionGapException cascades on every subsequent transaction touching those pages. Now already-applied pages are always skipped individually so the remaining pages in the same transaction are still processed, preventing gap spread. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rd, HTTP watchdog - RaftTransactionBroker: remove timeoutMs parameter from all replicate*() methods; timeout is now derived from quorumTimeout inside RaftGroupCommitter - RaftLogEntryCodec: add trailing-byte validation after decode to detect corrupted entries - SnapshotInstaller: extract MAX_ZIP_ENTRY_UNCOMPRESSED_BYTES constant, set server-wide snapshot-in-progress flag during install - ArcadeDBServer: add snapshotInstallInProgress flag - AbstractServerHttpHandler: return 503 with Retry-After during snapshot installation - RemoteHttpComponent: add sendWithWatchdog() to prevent infinite hangs on HTTP requests - Load test: tune parameters for faster iteration, switch to UNIQUE_HASH indexes - Add SnapshotSymlinkProtectionTest, GremlinValueComparatorTest, codec trailing-byte tests - run_adb scripts: add typeDefaultBuckets=5 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rding reads The cluster-internal forwarded-auth block in AbstractServerHttpHandler was dropped in 137d8d46f while porting apache-ratis hardening, leaving replicas unable to authenticate any request to the leader whose original token was a per-node Bearer AU- session (Studio 401 after login on a follower). In parallel PostCommandHandler.execute() was forwarding every POST to the leader on replicas, including reads from PostQueryHandler, which bounced queries that belong locally on the follower. - Restore validateClusterForwardedAuth in AbstractServerHttpHandler with constant-time cluster-token compare; runs before the Authorization check. - Drop redundant HTTP-level forwarding from PostCommandHandler (and the now-obsolete buildForwardRequest / PostCommandHandlerForwardAuthTest). Writes keep being forwarded at the engine level by RaftReplicatedDatabase.command() via isExecutedByTheLeader()/isDDL(); reads run locally on followers. - Add FollowerSessionTokenQueryIT covering both paths: follower SELECT with an AU- session token returns 200, and follower INSERT with an AU- token is forwarded to the leader and returns 200.
GraphEngine.getEdgeHeadChunk catches RecordNotFoundException when walking a vertex's in/out edge list and continues safely, but logs the event at WARNING with a full stack trace. Under concurrent commits on the leader - phase 2 of RaftReplicatedDatabase.commit runs under executeInReadLock so multiple commits apply pages in parallel - a post-commit MV refresh opens a fresh transaction and traverses the graph. One commit may have written edge E and its container vertex's edge list before the target vertex record's page is visible to another thread, so the traversal's lazy-load of that target throws RecordNotFoundException even for an insert-only workload (no delete required). Distinguish the two failure modes by the RID state at catch time: - rid == null: the vertex itself could not be lazy-loaded. Happens on transient page-visibility races during concurrent phase 2 commits, and also on real stale references from concurrent deletes. Log at FINE. - rid != null: the edge-chunk segment pointed to by the vertex is orphaned. Genuine corruption worth surfacing — keep WARNING. No behavior change: both paths still return null and the traversal skips the missing vertex. MV refresh correctness under concurrent commits is a separate issue (transiently-missing rows are re-covered on the next refresh).
TypeIndex.getAssociatedBucketId() always returns -1, so two TypeIndexes on the same property but different types (e.g. User[id] and Photo[id]) produced the same structural key "-1:[id]", causing HashMap collision in compareIndexes(). The last-iterated TypeIndex for that key won in indexes2Map, causing cross-type comparisons (Photo from DB1 vs User from DB2) and false "0 vs 50000" difference reports on identical databases. Fix: include the type name in the structural key when bucketId == -1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
splitIndex() created new files outside recordFileChanges(), so followers never received the new file IDs and diverged from the leader after compaction (observed: geospatial index with 2.1M entries on leader vs 198K on followers after a load test). - DatabaseInternal: add default runWithCompactionReplication() hook - LSMTreeIndex.compact(): delegate through getWrappedDatabaseInstance() so the HA layer can intercept without an engine->ha-raft dependency - RaftReplicatedDatabase: override captures new/removed files, serializes all pages of each new compaction file as synthetic WAL (txId=-1), then calls replicateSchema(); followers return false to skip independent compaction that would create diverging file IDs - WALFile.WALTransaction: add forceApply flag (default false) - TransactionManager.applyChanges(): respect forceApply to skip the version-gap check for compaction pages whose version > 1 - ArcadeStateMachine.deserializeWalTransaction(): set forceApply when txId < 0 (the sentinel for compaction WAL entries) - PaginatedComponentFile: add readPage(int, ByteBuffer) helper - RaftIndexCompactionReplicationIT: enable lsmTreeCompactionReplication and compactionReplicationWithConcurrentWrites tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tweaks from the compaction divergence investigation: reduced numOfUsers, enabled photos and likes, added progress logging in DatabaseWrapper, and labelled paths in ClusterDatabaseChecker comparisons. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… compaction When runWithCompactionReplication() called startRecordingChanges(), the shared getRecordedChanges() \!= null flag in commit() caused ALL concurrent user-transaction threads to buffer their WAL to their per-thread schemaWalBuffer instead of replicating via Raft TX_ENTRY. Those buffers were never consumed (runWithCompactionReplication uses serializeFilePagesAsWal, not schemaWalBuffer), so the writes were applied locally but never sent to followers, leaving them N page-versions behind and triggering WALVersionGapException on the next TX_ENTRY that touched those pages. Fix: replace the shared getRecordedChanges() \!= null check with a thread-local isSchemaCommitThread flag that is set only by the thread executing the recordFileChanges() DDL callback (which already holds a write lock, so no concurrency issue exists there). Compaction and all other concurrent threads now replicate normally. Also: add @tag("slow") test with real concurrent writer threads running alongside compaction to prevent regression; rename the prior sequential variant to compactionReplicationWithSequentialWrites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused 'File with id N was not found' SchemaException in applySchemaEntry() on followers receiving compaction SCHEMA_ENTRY: 1. TransactionManager.applyChanges() called getFileById() (throws on missing file) while the null-check below it assumed getFileByIdIfExists() semantics (returns null). The null check was dead code. Fixed by using getFileByIdIfExists() so the page-count/vector-index update is skipped when the component is not yet registered in the schema. 2. applySchemaEntry() reloaded the schema BEFORE writing WAL pages to disk. New compaction files created by createNewFiles() are initially empty; the schema reload saw empty files, logged 'Cannot find index ... Ignoring it', and left the new files unregistered. Subsequent WAL application then could not find the file in the schema (via the now-throwing getFileById). Fixed by moving WAL application before the schema reload so load() finds files with valid content and registers them properly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dex diagnostics The count-comparison loop iterated over indexes1 and called entry2.countEntries() without guarding against null. When an index was present in DB1 but absent in DB2, entry2 was null (already reported by the preceding 'not present in DB2' loop), causing an NPE that propagated through compare() before any collected error messages could be printed. Fix: skip entries where entry2 == null in the count loop; the missing-index report was already emitted above. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…actions Root Cause 1: cluster token mismatch between follower (PBKDF2-derived) and leader (raw config value). Followers derive the token via PBKDF2-HMAC-SHA256 from cluster name + root password when HA_CLUSTER_TOKEN is not explicitly set. The leader validated the forwarded request against the raw config value (empty), so every follower-to-leader forwarded command was rejected with 403/500. Fix: Added getClusterToken() to HAServerPlugin interface (default: null) and overrode it in RaftHAPlugin to delegate to RaftHAServer.getClusterToken(). AbstractServerHttpHandler.validateClusterForwardedAuth() now uses the effective token (PBKDF2-derived when not configured) instead of the raw config value. Root Cause 2: ArcadeStateMachine.startTransaction() unconditionally set originatedLocally=true for ALL client requests on the leader, including transactions forwarded from a follower's RaftClient. For follower-originated transactions, Phase 2 never runs on the leader - only on the originating follower (which also skips Phase 2). The result: the leader's state machine skipped applying the WAL data, leaving the leader with stale pages. Fix: startTransaction() now compares request.getClientId() against the leader's own RaftClient.getId(). Only requests submitted by this node's own client are marked originatedLocally=true. Follower-forwarded requests get context=null, so the leader correctly applies their WAL via applyTxEntry(). Also fixed the test to wait for ALL servers (not just the creating server) before asserting cross-server record visibility, removing a race condition where the follower's applyTransaction could still be pending. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise MAX_COMPRESSION_RATIO from 200 to 100_000: ArcadeDB dictionary pages (327,680 bytes, all-zeros) legitimately compress at ~946:1 with DEFLATE; the old 200:1 guard rejected valid snapshots. Added regression test covering the sparse-page case. - Fix HttpServer.handleServerStartException to walk the full exception cause chain when detecting BindException. Undertow/XNIO wraps the bind failure more than one level deep, so the single-level check caused the server to throw instead of retrying the next port in the range. - Fix BaseRaftHATest.startServers() to patch each server's httpAddresses map with the actual bound HTTP ports after startup. Pre-start estimates in HA_SERVER_LIST can be stale when ports shift; this ensures follower-to-leader write forwarding uses the correct address. - Fix FollowerSessionTokenQueryIT and RaftVerifyDatabaseIT to use getServer(index).getHttpServer().getPort() instead of hardcoded 2480 + index, so tests are robust to dynamic port assignment. - Fix RaftVerifyDatabaseIT assertions to match the actual handler response shape: leader response wraps under "result", uses "peers" with "status" strings, not "nodes" with "match" booleans. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused 500 errors when reading from replicas after writes:
1. RaftHTTP2ServersIT was calling waitForReplicationIsCompleted(serverIndex)
which only waits for the writing server to catch up. When that server is
the Raft leader the check is trivially instant, leaving follower replicas
behind. Cross-server reads then hit stale dictionary state or missing
records. Fixed by adding waitForAllServers() to BaseRaftHATest and using
it throughout checkDeleteGraphElements.
2. BinarySerializer.getPropertyNames() caught dictionary lookup exceptions
and returned null, which caused a confusing secondary NPE
('c is null') in ResultInternal.getPropertyNames via result.addAll(null).
Fixed by returning Collections.emptySet() instead of null so the real
SEVERE-level log is the diagnostic and no downstream NPE occurs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
📜 License Compliance Check✅ License check passed. See artifacts for full report. License Summary (first 50 lines) |
feat(ha): replace legacy HA with Apache Ratis-based Raft implementation
Removes the custom leader-follower replication stack (HAServer,
ReplicatedDatabase, Leader2ReplicaNetworkExecutor, and the full message
protocol) and replaces it with a new `ha-raft` module built on Apache
Ratis 3.2.2.
Core new components:
- RaftHAPlugin / RaftHAServer: Ratis RaftServer wrapper with peer
discovery, dynamic membership, K8s auto-join, and configurable gRPC
port (HA_RAFT_PORT)
- RaftReplicatedDatabase: LocalDatabase wrapper that commits transactions
through the Raft log (3-phase commit with schema save under lock)
- ArcadeStateMachine: SimpleStateMachineStorage-backed state machine that
applies WAL pages and schema entries; tracks applied-index persistently
to survive restarts
- RaftGroupCommitter: batched Raft submission with LZ4 WAL compression,
ALL-quorum watch, cancellable pending entries, and an idempotency cache
- RaftTransactionBroker: single facade for all commit/schema/snapshot
paths; codec validation and snapshot 503 guard
- SnapshotInstaller: crash-safe snapshot install with zip-slip/zip-bomb
protection, NIO streams, symlink defence, and exponential backoff retry
- ClusterMonitor: replication-lag tracking and leader health watchdog
Hardening and features ported from apache-ratis branch:
- Linearizable reads and configurable read consistency
- PBKDF2 cluster token with auth forwarding (replica-to-leader and
snapshot download)
- LeaderProxy: HTTP-based write forwarding so follower-originated
transactions and schema changes reach the leader transparently
- IdempotencyCache for at-most-once server command delivery
- WAL version-gap detection (WALVersionGapException) and gap handling on
follower apply
- LSM index compaction replication to followers
- User create/drop routed through Raft (SECURITY_USERS_ENTRY log type)
- Database drop/import/restore replication (DROP_DATABASE_ENTRY and
INSTALL_DATABASE_ENTRY with forceSnapshot flag)
- New peer seeded with current users after peer-add
- Emergency stop and step-down retry loop for ALL-quorum failures
- HALog configurable-level verbose logging utility
- Quorum enum replacing string-based HA_QUORUM config
Server-layer changes:
- HAServerPlugin and HAReplicatedDatabase interfaces decouple server from
HA implementation
- ServerSocketFactory relocated to com.arcadedb.server.network
- HttpServer shares a single LeaderProxy instance instead of one per
handler
- AbstractServerHttpHandler gains forwarded-auth and X-Forwarded-User
handling
- ServerSecurity user hooks made non-synchronized to unblock Raft apply
- GlobalConfiguration pruned of legacy-only HA entries; new entries added
for Raft storage, snapshot threshold, election timeouts, buffer sizes,
and retry parameters
- DatabaseComparator extended with per-error reporter for cluster
consistency checks
Studio: enhanced cluster monitor UI with live Raft metrics (leader,
term, commit index, replication lag per peer)
Tests:
- All legacy server/ha IT tests removed
- New ha-raft integration tests cover: write forwarding, schema/view
replication, HTTP layer, index operations and compaction, crash-and-
recover, snapshot resync, split-brain via MiniRaftClusterWithGrpc,
user replication, drop/import/restore database propagation, read
consistency, and peer-add user seed
- New e2e-ha module with TestContainers + Toxiproxy scenarios: 3-node
happy path, leader failover, rolling restart, network partition,
packet loss, network delay, split-brain, and database lifecycle
propagation (drop, import, restore, user management)
- GitHub Actions workflow for e2e-ha tests added
(cherry picked from commit f6afe41)
Summary
Redesigns the ArcadeDB HA stack using Apache Ratis (Raft consensus) for leader election, log replication, and cluster management. Replaces the custom replication protocol with proven consensus semantics.
Closes #3730
What's included
New module:
ha-raft/api/v1/clusterendpoint for cluster statusNew module:
e2e-haEnd-to-end container tests using TestContainers + Toxiproxy for realistic cluster scenarios.
Key features
host:raftPort:httpPort:priority)Test plan
mvn test -pl ha-raftandmvn compile test-compile -pl e2e-ha)🤖 Generated with Claude Code