fix: report Tcp.CommandFailed when a scheduled connect retry throws (#8195)#8214
Conversation
…kkadotnet#8195) TcpOutgoingConnection scheduled its connect retry as a raw Action on the HashedWheelTimer scheduler thread. When Socket.ConnectAsync threw inside that callback (PlatformNotSupportedException on Linux when reusing a socket after a failed connection attempt), the exception was logged and swallowed by the scheduler. The commander never received Tcp.Connected or Tcp.CommandFailed and stayed stuck permanently. The retry now runs inside the actor's message loop: it is scheduled as a RetryConnect self-message via IWithTimers, and the ConnectAsync call is wrapped in ReportConnectFailure, so any exception is surfaced to the commander as Tcp.CommandFailed and the connection actor stops. Using a timer also cancels the pending retry automatically when the actor stops.
| var self = Self; | ||
| var previousAddress = (IPEndPoint)args.RemoteEndPoint; | ||
| args.RemoteEndPoint = fallbackAddress; | ||
| Context.System.Scheduler.Advanced.ScheduleOnce(TimeSpan.FromMilliseconds(1), () => |
There was a problem hiding this comment.
yeah so this was the real bug - having a lambda running outside the actor attempting the socket reconnect. We probably should have never designed it that way. Refactoring it to use an in-actor scheduled message instead.
| // to the commander as Tcp.CommandFailed instead of being swallowed by the scheduler. | ||
| ReportConnectFailure(() => | ||
| { | ||
| if (!Socket.ConnectAsync(args)) |
There was a problem hiding this comment.
somewhat confusing: does not return a Task. Returns true if the connect operation is pending (and we'll get a SocketAsyncEventArgs event to that effect later when it comples,) returns false if the operation already completed synchronously, which is why we send a SocketConnected message to ourselves immediately.
| } | ||
|
|
||
| private void ScheduleConnectRetry() | ||
| => Timers.StartSingleTimer(RetryConnectTimerKey, RetryConnect.Instance, TimeSpan.FromMilliseconds(1)); |
There was a problem hiding this comment.
preserves the same timing mechanics as before, just via messaging rather than delegates.
|
I think this fix is probably inapplicable to the |
…kkadotnet#8195) TcpOutgoingConnection scheduled its connect retry as a raw Action on the HashedWheelTimer scheduler thread. When Socket.ConnectAsync threw inside that callback (PlatformNotSupportedException on Linux when reusing a socket after a failed connection attempt), the exception was logged and swallowed by the scheduler. The commander never received Tcp.Connected or Tcp.CommandFailed and stayed stuck permanently. The retry now runs inside the actor's message loop: it is scheduled as a RetryConnect self-message via IWithTimers, and the ConnectAsync call is wrapped in ReportConnectFailure, so any exception is surfaced to the commander as Tcp.CommandFailed and the connection actor stops. Using a timer also cancels the pending retry automatically when the actor stops. Forward-port of akkadotnet#8214 (merged to v1.5).
…8195) (#8215) TcpOutgoingConnection scheduled its connect retry as a raw Action on the HashedWheelTimer scheduler thread. When Socket.ConnectAsync threw inside that callback (PlatformNotSupportedException on Linux when reusing a socket after a failed connection attempt), the exception was logged and swallowed by the scheduler. The commander never received Tcp.Connected or Tcp.CommandFailed and stayed stuck permanently. The retry now runs inside the actor's message loop: it is scheduled as a RetryConnect self-message via IWithTimers, and the ConnectAsync call is wrapped in ReportConnectFailure, so any exception is surfaced to the commander as Tcp.CommandFailed and the connection actor stops. Using a timer also cancels the pending retry automatically when the actor stops. Forward-port of #8214 (merged to v1.5).
Summary
Fixes #8195. On Linux, a dropped TCP connection could leave the commander/user actor permanently stuck — it never received
Tcp.ConnectedorTcp.CommandFailed, and the only recovery was a process restart.Root cause
When a connect attempt fails,
TcpOutgoingConnection.Connectingschedules a retry. That retry was scheduled as a rawActionviaContext.System.Scheduler.Advanced.ScheduleOnce(...), so it ran on the HashedWheelTimer scheduler thread — outside the actor's message loop — and calledSocket.ConnectAsyncdirectly.When that call threw (
PlatformNotSupportedExceptionon Linux when reusing a socket after a failed connect attempt;NotSupportedExceptionon the IPv4/IPv6 DNS-fallback path), the exception propagated intoHashedWheelTimerScheduler.Bucket.Execute, which logs and swallows it. Because the exception never re-entered the actor, the existingReportConnectFailure→Stoppath never ran, soTcp.CommandFailedwas never delivered.Fix
The retry now runs inside the actor's message loop:
TcpOutgoingConnectionimplementsIWithTimers(consistent withTcpListenerin the same module).RetryConnectself-message viaTimers.StartSingleTimer.Receive<RetryConnect>handler performs theSocket.ConnectAsynccall wrapped in the existingReportConnectFailure, so any exception is surfaced to the commander asTcp.CommandFailedand the connection actor stops cleanly.This also removes a latent bug: the old raw action could run
Socket.ConnectAsyncon an already-disposed socket if the actor stopped before the scheduled callback fired. WithIWithTimers, the pending timer is canceled automatically when the actor stops.Testing
Added
Should_report_CommandFailed_when_a_scheduled_connect_retry_throwstoTcpIntegrationSpec.PlatformNotSupportedExceptionis architecture-specific and does not reproduce on x64, so the test instead drives the same swallowed-exception code path deterministically and cross-platform: it pre-seeds the DNS cache so a host resolves to both IPv4 and IPv6, forces an IPv4-only socket (OutgoingSocketForceIpv4), and lets the IPv6 fallback retry throwNotSupportedException(thrown on every platform).Verified the test fails without the fix (
Error while executing scheduled task ... NotSupportedException, then a 10s timeout waiting forCommandFailed) and passes with it. The fullAkka.Tests.IOsuite passes.