Implement detection and potential mitigation of recovery failure cycles #435
sebastianburckhardt merged 4 commits into main from
Conversation
…o boost tracing and disable prefetch during replay
davidmrdavid left a comment
Two nits, otherwise looks fantastic!
src/DurableTask.Netherite/StorageLayer/Faster/PartitionStorage.cs
if (this.CheckpointInfo.RecoveryAttempts > 0 || DateTimeOffset.UtcNow - lastModified > TimeSpan.FromMinutes(5))
{
    this.CheckpointInfo.RecoveryAttempts++;

    this.TraceHelper.FasterProgress($"Incremented recovery attempt counter to {this.CheckpointInfo.RecoveryAttempts} in {this.checkpointCompletedBlob.Name}.");

    await this.WriteCheckpointMetadataAsync();

    if (this.CheckpointInfo.RecoveryAttempts > 3 && this.CheckpointInfo.RecoveryAttempts < 30)
    {
        this.TraceHelper.BoostTracing = true;
    }
}

return true;
given that this could fail indefinitely - should we have a cap on how big this integer can grow? Maybe if it's larger than ~100, it's not worth increasing it further, or is it?
I don't see why capping the counter itself would be useful. No matter how large, it will still give us useful information (also in the traces).
Or did you mean to cap the actual recovery attempts?
In the extreme: I just worry about the integer getting too large to represent, and then causing another class of issues. In general, I think there's no benefit in increasing this counter past ~10k, for example. I'd prefer to have an upper limit here. After ~10k, we know it is simply "too many" anyways. I do feel a bit strongly about this.
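The cap being discussed could look like the following minimal sketch (Python for illustration; the helper name and the ~10k cap are taken from the comment above, not from the PR's actual C# code):

```python
# Hypothetical capped increment for the persisted recovery-attempt counter.
# Past the cap, the exact count no longer carries information ("too many"),
# and clamping rules out any overflow concerns.
MAX_RECOVERY_ATTEMPTS = 10_000

def increment_recovery_attempts(attempts: int) -> int:
    """Increment the counter, but never past MAX_RECOVERY_ATTEMPTS."""
    return min(attempts + 1, MAX_RECOVERY_ATTEMPTS)
```

Clamping at write time (rather than at read time) keeps the persisted metadata bounded as well.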
I am running a final test. If that works we can merge and release.
In light of recent issues with FASTER crashing repeatedly during recovery while replaying the commit log, this PR implements several steps that should help us troubleshoot the issue (and possibly mitigate it).
We are adding a recovery attempt counter to the last-checkpoint.json file so we can detect if partition recovery is repeatedly failing.
If the number of recovery attempts is between 3 and 30, we are boosting the tracing for the duration of the recovery. This may help us pinpoint where the crash happens.
If the number of recovery attempts is larger than 6, we are disabling the prefetch during the replay. This means FASTER executes fetch operations sequentially during replay, which slows down the replay a lot but makes it more deterministic, so we can better pinpoint the failure. It is also possible that this eliminates the failure (e.g. if the bug is a race condition). Slowing the replay down would be a bad idea in general, but is actually desirable in this situation, since it also reduces the frequency of crashes caused by the struggling partition.
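Taken together, the escalation policy described above can be sketched as follows (Python for illustration; the thresholds 3, 30, and 6 come from the PR description, while the function and field names are invented):

```python
# Sketch of the recovery-escalation policy: each failed recovery attempt
# bumps a persisted counter, and the counter's value selects how the next
# attempt runs.
def recovery_settings(attempts: int) -> dict:
    return {
        # boost tracing while attempts are strictly between 3 and 30
        "boost_tracing": 3 < attempts < 30,
        # after more than 6 failures, replay sequentially (no prefetch)
        # to make the replay deterministic
        "use_prefetch": attempts <= 6,
    }
```

Note the two mitigations overlap: attempts 7 through 29 get both boosted tracing and sequential replay, which is the most informative configuration for diagnosing a repeated crash.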