Implement detection and potential mitigation of recovery failure cycles #435
sebastianburckhardt merged 4 commits into main from
Conversation
…o boost tracing and disable prefetch during replay
davidmrdavid left a comment
Two nits, otherwise looks fantastic!
src/DurableTask.Netherite/StorageLayer/Faster/PartitionStorage.cs
if (this.CheckpointInfo.RecoveryAttempts > 0 || DateTimeOffset.UtcNow - lastModified > TimeSpan.FromMinutes(5))
{
    this.CheckpointInfo.RecoveryAttempts++;

    this.TraceHelper.FasterProgress($"Incremented recovery attempt counter to {this.CheckpointInfo.RecoveryAttempts} in {this.checkpointCompletedBlob.Name}.");

    await this.WriteCheckpointMetadataAsync();

    if (this.CheckpointInfo.RecoveryAttempts > 3 && this.CheckpointInfo.RecoveryAttempts < 30)
    {
        this.TraceHelper.BoostTracing = true;
    }
}

return true;
given that this could fail indefinitely - should we have a cap on how big this integer can grow? Maybe if it's larger than ~100, it's not worth increasing it further, or is it?
I don't see why capping the counter itself would be useful. No matter how large, it will still give us useful information (also in the traces).
Or did you mean to cap the actual recovery attempts?
In the extreme: I just worry about the integer getting too large to represent, and then causing another class of issues. In general, I think there's no benefit in increasing this counter past ~10k, for example. I'd prefer to have an upper limit here. After ~10k, we know it is simply "too many" anyways. I do feel a bit strongly about this.
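The cap being discussed could look like the following minimal sketch (Python for illustration; the helper name and the ~10k cap are taken from the comment above, not from the PR's actual C# code):

```python
# Hypothetical capped increment for the persisted recovery-attempt counter.
# Past the cap, the exact count no longer carries information ("too many"),
# and clamping rules out any overflow concerns.
MAX_RECOVERY_ATTEMPTS = 10_000

def increment_recovery_attempts(attempts: int) -> int:
    """Increment the counter, but never past MAX_RECOVERY_ATTEMPTS."""
    return min(attempts + 1, MAX_RECOVERY_ATTEMPTS)
```

Clamping at write time (rather than at read time) keeps the persisted metadata bounded as well.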
I am running a final test. If that works we can merge and release.
In light of recent issues with FASTER crashing repeatedly during recovery while replaying the commit log, this PR implements several steps that should help us troubleshoot the issue (and possibly mitigate it).
We are adding a recovery attempt counter to the last-checkpoint.json file so we can detect if partition recovery is repeatedly failing.
If the number of recovery attempts is between 3 and 30, we are boosting the tracing for the duration of the recovery. This may help us pinpoint where the crash happens.
If the number of recovery attempts is larger than 6, we are disabling the prefetch during the replay. This means FASTER executes fetch operations sequentially during replay, which slows down the replay a lot but makes it more deterministic, so we can better pinpoint the failure. It is also possible that this eliminates the failure (e.g. if the bug is a race condition). Slowing the replay down would be a bad idea in general, but is actually desirable in this situation, since it also reduces the frequency of crashes caused by the struggling partition.
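Taken together, the escalation policy described above can be sketched as follows (Python for illustration; the thresholds 3, 30, and 6 come from the PR description, while the function and field names are invented):

```python
# Sketch of the recovery-escalation policy: each failed recovery attempt
# bumps a persisted counter, and the counter's value selects how the next
# attempt runs.
def recovery_settings(attempts: int) -> dict:
    return {
        # boost tracing while attempts are strictly between 3 and 30
        "boost_tracing": 3 < attempts < 30,
        # after more than 6 failures, replay sequentially (no prefetch)
        # to make the replay deterministic
        "use_prefetch": attempts <= 6,
    }
```

Note the two mitigations overlap: attempts 7 through 29 get both boosted tracing and sequential replay, which is the most informative configuration for diagnosing a repeated crash.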