Skip to content

PROBE: validate dispatcher fix resolves retry-handler flake (DO NOT MERGE)#710

Closed
torosent wants to merge 5 commits intomainfrom
probe/strict-retry-assertions
Closed

PROBE: validate dispatcher fix resolves retry-handler flake (DO NOT MERGE)#710
torosent wants to merge 5 commits intomainfrom
probe/strict-retry-assertions

Conversation

@torosent
Copy link
Copy Markdown
Member

Probe branch on top of #709's dispatcher fix to validate whether the strict Assert.Equal(expectedNumberOfAttempts, retryHandlerCalls) assertions are now stable on Windows CI.

If green across multiple runs → the dispatcher race was the real culprit and the SDK retry-handler design fix is hardening, not a release blocker.
If still flaky → the SDK retry path has its own bug to address.

Do not merge. Will be closed once we have signal.

torosent and others added 5 commits April 22, 2026 11:04
- Remove Google.Protobuf VersionOverride in Worker.Grpc.csproj so it tracks
  the central pinned version. The override caused NU1605 downgrade errors
  on dependabot bumps because Microsoft.DurableTask.Grpc transitively
  required the newer central version.
- Bump per-test timeout in Grpc.IntegrationTests from 30s to 60s to
  accommodate slower Windows CI runs under parallel-suite contention.
- Comment out the brittle Assert.Equal(expectedNumberOfAttempts,
  retryHandlerCalls) in RetrySubOrchestratorFailuresCustomLogic, matching
  the existing convention already applied to three other tests in the
  same file. The retry handler can be invoked more times than expected
  during replay edge cases (pre-existing known issue). The companion
  actualNumberOfAttempts assertion still validates retry behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ounter

- IntegrationTestBase: revert per-test timeout 60s -> 30s (was masking
  unrelated hangs). With the dead assertion removed, the retry tests
  complete in <1s locally and never approach the timeout.
- OrchestrationErrorHandling.RetrySubOrchestratorFailuresCustomLogic:
  remove the now-unused retryHandlerCalls counter and IsReplaying guard.
  The assertion was commented out due to a known SDK bug; carrying the
  counter without asserting just confused future readers. A note was
  added describing the bug for context.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The catch block for the worker-to-client stream unconditionally cleared
workerToClientStream and reset isConnectedSignal whenever a write failed
(typically with 'Can't write the message because the request is complete'
when a worker disconnected mid-send).

This created a race window: between the failed WriteAsync and the catch
executing, a new worker could connect via GetWorkItems, install its own
stream, and signal isConnectedSignal. The catch would then silently wipe
the new worker's connection state, leaving the dispatcher waiting on a
Reset signal forever. This manifested on Windows CI as multiple integration
tests (e.g., TaskOrchestrationWithSentEvent, RetrySubOrchestratorFailures*)
hanging until the test timeout.

Fix uses a CAS-style guard: only clear the connection state if the cached
stream is still the one that just failed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per review feedback on PR #709, fully dropping the retry-handler counter
removed signal that the user-supplied handler was actually invoked.
Restore the counter (counting only non-replay invocations) and assert a
lower bound (>= expectedNumberOfAttempts) instead of strict equality.
This preserves coverage while accommodating the known sub-orchestration
over-invocation bug, which is tracked separately.

Addresses: #709 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…patcher fix

DO NOT MERGE. Probes whether commit 6c18702 (SendWorkItemToClientAsync
CAS guard) eliminates the retry-handler over-invocation flake. If CI is
green across multiple runs, the design fix may be lower priority.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@torosent
Copy link
Copy Markdown
Member Author

Probe complete — confirmed retry-handler over-invocation reproduces with strict assertions (failed on Linux at OrchestrationErrorHandling.cs:426). Resuming the SDK retry-handler design fix as a real (not Windows-specific) determinism bug.

@torosent torosent closed this Apr 22, 2026
@torosent torosent deleted the probe/strict-retry-assertions branch April 22, 2026 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant