RATIS-2236 Fixed bug where manual triggerSnapshot would never finish #1207

OneSizeFitsQuorum · 2025-01-07T11:25:32Z

see https://issues.apache.org/jira/browse/RATIS-2236

Signed-off-by: OneSizeFitQuorum <[email protected]>

OneSizeFitsQuorum · 2025-01-07T11:25:44Z

SzyWilliam

+1 the solution looks good to me

szetszwo · 2025-01-07T18:47:40Z

@OneSizeFitsQuorum, the manually triggered snapshot cannot complete because StateMachineUpdater is looping in waitForCommit(). IoTDB may retry indefinitely, but it can never succeed.

However, if we let StateMachineUpdater pass waitForCommit() without satisfying the basic Raft condition

appliedIndex <= commitIndex,

we may get some weird bugs later on.

So, how about we let it take snapshot within waitForCommit()?

diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
index 8632242b18..9c73cb1e32 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
@@ -113,6 +113,10 @@ class SnapshotManagementRequestHandler {
     return pending.get().map(PendingRequest::shouldTriggerTakingSnapshot).orElse(false);
   }
 
+  boolean hasPendingRequest() {
+    return pending.get().isPresent();
+  }
+
   void completeTakingSnapshot(long index) {
     pending.getAndSetNull().ifPresent(p -> p.complete(index));
   }
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
index f13ee0d6d2..7474d2606f 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
@@ -216,6 +216,10 @@ class StateMachineUpdater implements Runnable {
     // Thus it is possible to have applied > committed initially.
     final long applied = getLastAppliedIndex();
     for(; applied >= raftLog.getLastCommittedIndex() && state == State.RUNNING && !shouldStop(); ) {
+      if (server.getSnapshotRequestHandler().hasPendingRequest()) {
+        takeSnapshot();
+      }
+
       if (awaitForSignal.await(100, TimeUnit.MILLISECONDS)) {
         return;
       }
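
A minimal sketch of why the request hangs and how a check inside the loop unblocks it. The class `WaitLoopSketch`, its fields, and the `serveSnapshotWhileWaiting` switch are hypothetical stand-ins, not Ratis code; a round counter replaces the 100 ms `awaitForSignal.await(...)` so that the example terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the wait loop; not the Ratis StateMachineUpdater.
class WaitLoopSketch {
  private final long appliedIndex;
  private final long committedIndex;
  // At most one pending manual snapshot request.
  private final AtomicReference<CompletableFuture<Long>> pending = new AtomicReference<>();

  WaitLoopSketch(long appliedIndex, long committedIndex) {
    this.appliedIndex = appliedIndex;
    this.committedIndex = committedIndex;
  }

  // Registers a manual snapshot request; the future completes with the snapshot index.
  CompletableFuture<Long> requestSnapshot() {
    final CompletableFuture<Long> f = new CompletableFuture<>();
    pending.set(f);
    return f;
  }

  // Spins while applied >= committed, for at most maxIterations rounds.
  // Returns the number of rounds spent waiting.
  int waitForCommit(int maxIterations, boolean serveSnapshotWhileWaiting) {
    int rounds = 0;
    while (appliedIndex >= committedIndex && rounds < maxIterations) {
      if (serveSnapshotWhileWaiting) {
        // Consume the request so it is served once, at the applied index.
        final CompletableFuture<Long> f = pending.getAndSet(null);
        if (f != null) {
          f.complete(appliedIndex);
        }
      }
      rounds++; // stands in for awaitForSignal.await(100, TimeUnit.MILLISECONDS)
    }
    return rounds;
  }
}
```

With applied == committed == 10 and no new commits, the future returned by `requestSnapshot()` stays incomplete when the switch is off and completes with index 10 when it is on.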

OneSizeFitsQuorum · 2025-01-08T02:58:49Z

@szetszwo
I don't think exiting waitForCommit early has any impact. During continuous writes, appliedIndex is always smaller than commitIndex, so the StateMachineUpdater never blocks in waitForCommit anyway. Correctness is preserved for two reasons: first, we always run applyLog before checkAndTakeSnapshot, so appliedIndex reaches commitIndex before any snapshot is taken; second, takeSnapshot is defined to take the snapshot at appliedIndex.

  public void run() {
    for(; state != State.STOP; ) {
      try {
        waitForCommit();

        if (state == State.RELOAD) {
          reload();
        }

        final MemoizedSupplier<List<CompletableFuture<Message>>> futures = applyLog();
        checkAndTakeSnapshot(futures);

        if (shouldStop()) {
          checkAndTakeSnapshot(futures);
          stop();
        }
      } catch (Throwable t) {
        if (t instanceof InterruptedException && state == State.STOP) {
          Thread.currentThread().interrupt();
          LOG.info("{} was interrupted.  Exiting ...", this);
        } else {
          state = State.EXCEPTION;
          LOG.error(this + " caught a Throwable.", t);
          server.close();
        }
      }
    }
  }

  /**
   * @return the largest index of the log entry that has been applied to the
   *         state machine and also included in the snapshot. Note the log purge
   *         should be handled separately.
   */
  // TODO: refactor this
  long takeSnapshot() throws IOException;

Of course, your patch also works perfectly fine, and I've already modified the change to match your approach.
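
The ordering argument can be sketched as follows. `UpdaterOrderSketch` and its members are hypothetical names for illustration only, assuming that a snapshot records whatever the applied index is at the moment it is taken.

```java
// Hypothetical model of one iteration of the updater loop; not Ratis code.
class UpdaterOrderSketch {
  private long appliedIndex;
  private final long committedIndex;
  private long snapshotIndex = -1; // -1 means no snapshot taken yet

  UpdaterOrderSketch(long appliedIndex, long committedIndex) {
    this.appliedIndex = appliedIndex;
    this.committedIndex = committedIndex;
  }

  // Applies every committed entry that has not been applied yet.
  void applyLog() {
    while (appliedIndex < committedIndex) {
      appliedIndex++;
    }
  }

  // Takes the snapshot at the current applied index.
  void takeSnapshot() {
    snapshotIndex = appliedIndex;
  }

  // Mirrors the order in run(): apply first, then snapshot.
  void runOnce() {
    applyLog();
    takeSnapshot();
  }

  long getSnapshotIndex() {
    return snapshotIndex;
  }
}
```

Because `applyLog()` runs first, the snapshot index equals the committed index whether or not the loop had to wait beforehand.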

Signed-off-by: OneSizeFitQuorum <[email protected]>

SzyWilliam

@OneSizeFitsQuorum thanks for working on the patch, and @szetszwo thanks for reviewing it. Both solutions sound good to me; there is only one small issue on the implementation side.

SzyWilliam · 2025-01-08T04:49:26Z

ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java

@@ -216,6 +216,9 @@ private void waitForCommit() throws InterruptedException {
    // Thus it is possible to have applied > committed initially.
    final long applied = getLastAppliedIndex();
    for(; applied >= raftLog.getLastCommittedIndex() && state == State.RUNNING && !shouldStop(); ) {
+      if (server.getSnapshotRequestHandler().getPending().get().isPresent()) {


We should use server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot() in this scenario, see

ratis/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java

Lines 112 to 114 in 17c9652

  boolean shouldTriggerTakingSnapshot() {
    return pending.get().map(PendingRequest::shouldTriggerTakingSnapshot).orElse(false);
  }

This method clears the flag, which guarantees that exactly one snapshot is taken per request. Otherwise, one request may trigger two snapshots.
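
A minimal sketch of the difference between a presence check and a one-shot flag, assuming the request is still pending when the loop polls it again. `PendingRequestSketch` and `SnapshotTriggerDemo` are hypothetical names, and the `AtomicBoolean` flag is an assumption about how such a one-shot check can be built, not a copy of the Ratis `PendingRequest` class.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of one pending snapshot request.
class PendingRequestSketch {
  private final AtomicBoolean triggerTakingSnapshot = new AtomicBoolean(true);

  // Presence-style check: true for as long as the request is pending.
  boolean isPresent() {
    return true;
  }

  // One-shot check: returns true the first time only, then false.
  boolean shouldTriggerTakingSnapshot() {
    return triggerTakingSnapshot.getAndSet(false);
  }
}

class SnapshotTriggerDemo {
  // Counts how many snapshots start while the same request is polled repeatedly.
  static int countSnapshots(int polls, boolean oneShot) {
    final PendingRequestSketch request = new PendingRequestSketch();
    int snapshots = 0;
    for (int i = 0; i < polls; i++) {
      final boolean trigger = oneShot
          ? request.shouldTriggerTakingSnapshot()
          : request.isPresent();
      if (trigger) {
        snapshots++;
      }
    }
    return snapshots;
  }
}
```

Polling three times starts one snapshot with the one-shot check and three with the presence check, which is the duplication the review comment warns about.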

Thanks for the detailed review! Yes, if I do not check it in the for loop as in the first commit, I should use shouldTriggerTakingSnapshot instead.

@SzyWilliam, it is a bug in my suggestion, not in @OneSizeFitsQuorum's earlier change. The earlier change calls shouldTakeSnapshot() and then server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot().

Signed-off-by: OneSizeFitQuorum <[email protected]>

szetszwo

+1 the change looks good.

@OneSizeFitsQuorum, thanks for accommodating my suggestion! Your original change is also good since it keeps waitForCommit() purely waiting, without performing any actions. The action part (applying log, taking snapshot, etc.) remains in the loop in run().

Anyway, let's merge the current change to unblock the 3.1.3 release.

szetszwo · 2025-01-08T17:18:47Z

@SzyWilliam, thanks a lot for reviewing this!

fix bug

70463b3

Signed-off-by: OneSizeFitQuorum <[email protected]>

SzyWilliam reviewed Jan 7, 2025

fix review

6c90e8c

Signed-off-by: OneSizeFitQuorum <[email protected]>

OneSizeFitsQuorum force-pushed the jira2236 branch from 21d83f4 to 6c90e8c Compare January 8, 2025 02:59

SzyWilliam reviewed Jan 8, 2025

fix review

eca0c58

Signed-off-by: OneSizeFitQuorum <[email protected]>

szetszwo approved these changes Jan 8, 2025

szetszwo closed this Jan 8, 2025

szetszwo reopened this Jan 8, 2025

szetszwo merged commit bdde3ae into apache:master Jan 8, 2025
17 of 19 checks passed

OneSizeFitsQuorum deleted the jira2236 branch January 9, 2025 01:59

SzyWilliam pushed a commit that referenced this pull request Jan 9, 2025

RATIS-2236 Fixed bug where manual triggerSnapshot would never finish (#…

ac5c6e8

…1207)

SzyWilliam pushed a commit that referenced this pull request Jan 9, 2025

RATIS-2236 Fixed bug where manual triggerSnapshot would never finish (#…

0bd730e

…1207)

SzyWilliam pushed a commit that referenced this pull request Jan 9, 2025

RATIS-2236 Fixed bug where manual triggerSnapshot would never finish (#…

3255448

…1207)

SzyWilliam pushed a commit to SzyWilliam/ratis that referenced this pull request Jan 9, 2025

RATIS-2236 Fixed bug where manual triggerSnapshot would never finish (a…

f3353f6

…pache#1207)

szetszwo pushed a commit that referenced this pull request Jan 9, 2025

RATIS-2236 Fixed bug where manual triggerSnapshot would never finish (#…

dd8486a

…1207)
