-
Notifications
You must be signed in to change notification settings - Fork 426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RATIS-2236 Fixed bug where manual triggerSnapshot would never finish #1207
Conversation
Signed-off-by: OneSizeFitQuorum <[email protected]>
@szetszwo @adoroszlai PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the solution looks good to me
@OneSizeFitsQuorum , The manual trigger snapshot cannot be completed since StateMachineUpdater is looping in waitForCommit(). IoTDB may retry indefinitely but it could never succeed. However, if we let StateMachineUpdater pass waitForCommit() without satisfying the basic Raft condition
we may get some wired bugs later on. So, how about we let it take snapshot within waitForCommit()? diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
index 8632242b18..9c73cb1e32 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
@@ -113,6 +113,10 @@ class SnapshotManagementRequestHandler {
return pending.get().map(PendingRequest::shouldTriggerTakingSnapshot).orElse(false);
}
+ boolean hasPendingRequest() {
+ return pending.get().isPresent();
+ }
+
void completeTakingSnapshot(long index) {
pending.getAndSetNull().ifPresent(p -> p.complete(index));
}
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
index f13ee0d6d2..7474d2606f 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java
@@ -216,6 +216,10 @@ class StateMachineUpdater implements Runnable {
// Thus it is possible to have applied > committed initially.
final long applied = getLastAppliedIndex();
for(; applied >= raftLog.getLastCommittedIndex() && state == State.RUNNING && !shouldStop(); ) {
+ if (server.getSnapshotRequestHandler().hasPendingRequest()) {
+ takeSnapshot();
+ }
+
if (awaitForSignal.await(100, TimeUnit.MILLISECONDS)) {
return;
} |
@szetszwo public void run() {
for(; state != State.STOP; ) {
try {
waitForCommit();
if (state == State.RELOAD) {
reload();
}
final MemoizedSupplier<List<CompletableFuture<Message>>> futures = applyLog();
checkAndTakeSnapshot(futures);
if (shouldStop()) {
checkAndTakeSnapshot(futures);
stop();
}
} catch (Throwable t) {
if (t instanceof InterruptedException && state == State.STOP) {
Thread.currentThread().interrupt();
LOG.info("{} was interrupted. Exiting ...", this);
} else {
state = State.EXCEPTION;
LOG.error(this + " caught a Throwable.", t);
server.close();
}
}
}
} * @return the largest index of the log entry that has been applied to the
* state machine and also included in the snapshot. Note the log purge
* should be handled separately.
*/
// TODO: refactor this
long takeSnapshot() throws IOException; Of course, your patch can also work perfectly fine, and I've already modified it to match your approach. |
Signed-off-by: OneSizeFitQuorum <[email protected]>
21d83f4
to
6c90e8c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OneSizeFitsQuorum thanks for working on the patch, @szetszwo thanks for reviewing the patch. Both solutions sound good to me, there's only a small issue on the implementation side.
@@ -216,6 +216,9 @@ private void waitForCommit() throws InterruptedException { | |||
// Thus it is possible to have applied > committed initially. | |||
final long applied = getLastAppliedIndex(); | |||
for(; applied >= raftLog.getLastCommittedIndex() && state == State.RUNNING && !shouldStop(); ) { | |||
if (server.getSnapshotRequestHandler().getPending().get().isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shall use server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot()
in this scenario, see
ratis/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotManagementRequestHandler.java
Lines 112 to 114 in 17c9652
boolean shouldTriggerTakingSnapshot() { | |
return pending.get().map(PendingRequest::shouldTriggerTakingSnapshot).orElse(false); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review! Yes, if i do not check it in the for loop as the first commit. I should use shouldTriggerTakingSnapshot
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SzyWilliam , it is a bug in my suggestion but not in @OneSizeFitsQuorum 's early change. In the early change, it calls shouldTakeSnapshot()
and then server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot()
.
Signed-off-by: OneSizeFitQuorum <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the change looks good.
@OneSizeFitsQuorum, thanks for accommodating my suggestion! Your original change is also good since it keeps waitForCommit()
just waiting but not doing any actions. The action part (applying log, taking snapshot, etc.) remains in the loop in run()
.
Anyway, let's merge the current change to unblock the 3.1.3 release.
@SzyWilliam , thanks a lot for reviewing this! |
see https://issues.apache.org/jira/browse/RATIS-2236