fix: report Tcp.CommandFailed when a scheduled connect retry throws (#8195) #8214

Merged
Aaronontheweb merged 1 commit into akkadotnet:v1.5 from Aaronontheweb:fix/8195-tcp-connect-retry-swallowed-exception
May 16, 2026

@Aaronontheweb
Member

Summary

Fixes #8195. On Linux, a dropped TCP connection could leave the commander/user actor permanently stuck — it never received Tcp.Connected or Tcp.CommandFailed, and the only recovery was a process restart.

Root cause

When a connect attempt fails, TcpOutgoingConnection.Connecting schedules a retry. That retry was scheduled as a raw Action via Context.System.Scheduler.Advanced.ScheduleOnce(...), so it ran on the HashedWheelTimer scheduler thread — outside the actor's message loop — and called Socket.ConnectAsync directly.

When that call threw (PlatformNotSupportedException on Linux when reusing a socket after a failed connect attempt; NotSupportedException on the IPv4/IPv6 DNS-fallback path), the exception propagated into HashedWheelTimerScheduler.Bucket.Execute, which logs and swallows it. Because the exception never re-entered the actor, the existing ReportConnectFailureStop path never ran, so Tcp.CommandFailed was never delivered.
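The swallowed-exception behavior described above can be illustrated without Akka.NET at all. The following is a minimal sketch, not the actual HashedWheelTimerScheduler code: `ScheduleOnce`, `commanderInbox`, and `schedulerLog` are hypothetical names introduced here for illustration. It shows that a callback which throws on a scheduler-owned thread only reaches the scheduler's log, so nothing is ever delivered to the party waiting for a reply.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SwallowedRetrySketch
{
    // Hypothetical stand-in for a scheduler that runs raw callbacks off the
    // actor's thread and logs-and-swallows whatever they throw.
    public static Task ScheduleOnce(TimeSpan delay, Action callback, Action<Exception> log)
    {
        return Task.Delay(delay).ContinueWith(_ =>
        {
            try { callback(); }
            catch (Exception ex) { log(ex); } // the exception stops here
        });
    }

    public static (int commanderMessages, int schedulerLogEntries) Run()
    {
        // Messages the commander would receive (assumption: nothing else writes here).
        var commanderInbox = new List<string>();
        // What the scheduler writes to its own log.
        var schedulerLog = new List<string>();

        // The retry is a raw delegate, so the only handler for its exception
        // is the scheduler's catch block. There is no route to the commander.
        ScheduleOnce(
            TimeSpan.FromMilliseconds(1),
            () => throw new NotSupportedException("simulated ConnectAsync failure"),
            ex => schedulerLog.Add("Error while executing scheduled task: " + ex.GetType().Name)
        ).Wait();

        return (commanderInbox.Count, schedulerLog.Count);
    }
}
```

Calling `SwallowedRetrySketch.Run()` leaves the commander's inbox empty while the scheduler log holds one entry, which mirrors the stuck-commander symptom reported in #8195.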

Fix

The retry now runs inside the actor's message loop:

  • TcpOutgoingConnection implements IWithTimers (consistent with TcpListener in the same module).
  • The retry is scheduled as a RetryConnect self-message via Timers.StartSingleTimer.
  • A Receive<RetryConnect> handler performs the Socket.ConnectAsync call wrapped in the existing ReportConnectFailure, so any exception is surfaced to the commander as Tcp.CommandFailed and the connection actor stops cleanly.
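The three steps above can be sketched as a small standalone actor. This is an illustration under stated assumptions, not the TcpOutgoingConnection source: `RetryingConnector`, `ConnectFailed`, the injected `connect` delegate, and the `ReportFailure` helper are hypothetical stand-ins (the real code uses ReportConnectFailure and replies with Tcp.CommandFailed). Only `IWithTimers`, `ITimerScheduler`, `Timers.StartSingleTimer`, and `ReceiveActor` are Akka.NET APIs.

```csharp
using System;
using Akka.Actor;

public sealed class RetryingConnector : ReceiveActor, IWithTimers
{
    // Self-message that triggers the retry inside the actor's message loop.
    public sealed class RetryConnect
    {
        public static readonly RetryConnect Instance = new RetryConnect();
        private RetryConnect() { }
    }

    // Hypothetical failure reply; stands in for Tcp.CommandFailed.
    public sealed class ConnectFailed
    {
        public ConnectFailed(Exception cause) { Cause = cause; }
        public Exception Cause { get; }
    }

    private const string RetryTimerKey = "retry-connect";
    private readonly IActorRef _commander;
    private readonly Action _connect; // stands in for the Socket.ConnectAsync call

    // Set by Akka.NET because the actor implements IWithTimers.
    public ITimerScheduler Timers { get; set; }

    public RetryingConnector(IActorRef commander, Action connect)
    {
        _commander = commander;
        _connect = connect;

        // The retry runs as an ordinary message handler, so a throw is
        // caught by the actor's own failure-reporting wrapper.
        Receive<RetryConnect>(_ => ReportFailure(_connect));
    }

    protected override void PreStart()
    {
        // Schedule the retry as a message to Self instead of a raw delegate.
        // The timer is canceled automatically if the actor stops first.
        Timers.StartSingleTimer(RetryTimerKey, RetryConnect.Instance, TimeSpan.FromMilliseconds(1));
    }

    private void ReportFailure(Action thunk)
    {
        try
        {
            thunk();
        }
        catch (Exception ex)
        {
            // Surface the failure to the commander, then stop cleanly.
            _commander.Tell(new ConnectFailed(ex));
            Context.Stop(Self);
        }
    }
}
```

When creating this actor with `Props.Create(() => new RetryingConnector(...))`, build the failing delegate in a local variable first, since a throw expression cannot appear inside the expression tree that `Props.Create` takes.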

This also removes a latent bug: the old raw action could run Socket.ConnectAsync on an already-disposed socket if the actor stopped before the scheduled callback fired. With IWithTimers, the pending timer is canceled automatically when the actor stops.

Testing

Added Should_report_CommandFailed_when_a_scheduled_connect_retry_throws to TcpIntegrationSpec. PlatformNotSupportedException is architecture-specific and does not reproduce on x64, so the test instead drives the same swallowed-exception code path deterministically and cross-platform: it pre-seeds the DNS cache so a host resolves to both IPv4 and IPv6, forces an IPv4-only socket (OutgoingSocketForceIpv4), and lets the IPv6 fallback retry throw NotSupportedException (thrown on every platform).

Verified the test fails without the fix (Error while executing scheduled task ... NotSupportedException, then a 10s timeout waiting for CommandFailed) and passes with it. The full Akka.Tests.IO suite passes.

…kkadotnet#8195)

TcpOutgoingConnection scheduled its connect retry as a raw Action on the
HashedWheelTimer scheduler thread. When Socket.ConnectAsync threw inside that
callback (PlatformNotSupportedException on Linux when reusing a socket after a
failed connection attempt), the exception was logged and swallowed by the
scheduler. The commander never received Tcp.Connected or Tcp.CommandFailed and
stayed stuck permanently.

The retry now runs inside the actor's message loop: it is scheduled as a
RetryConnect self-message via IWithTimers, and the ConnectAsync call is wrapped
in ReportConnectFailure, so any exception is surfaced to the commander as
Tcp.CommandFailed and the connection actor stops. Using a timer also cancels
the pending retry automatically when the actor stops.
Member Author

@Aaronontheweb Aaronontheweb left a comment

LGTM

var self = Self;
var previousAddress = (IPEndPoint)args.RemoteEndPoint;
args.RemoteEndPoint = fallbackAddress;
Context.System.Scheduler.Advanced.ScheduleOnce(TimeSpan.FromMilliseconds(1), () =>
Member Author

yeah so this was the real bug - having a lambda running outside the actor attempting the socket reconnect. We probably should have never designed it that way. Refactoring it to use an in-actor scheduled message instead.

// to the commander as Tcp.CommandFailed instead of being swallowed by the scheduler.
ReportConnectFailure(() =>
{
if (!Socket.ConnectAsync(args))
Member Author

somewhat confusing: Socket.ConnectAsync does not return a Task here. It returns true if the connect operation is pending (and we'll get a SocketAsyncEventArgs event to that effect later when it completes), and false if the operation already completed synchronously, which is why we send a SocketConnected message to ourselves immediately.
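As a rough illustration of that return-value contract, here is a minimal sketch outside of any actor. `ConnectAsyncSketch` and its `Connect` helper are hypothetical names; the sketch blocks on a TaskCompletionSource purely for demonstration, whereas the connection actor sends itself a message instead.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ConnectAsyncSketch
{
    // Connects to `remote` and returns the resulting SocketError.
    public static SocketError Connect(IPEndPoint remote)
    {
        using (var socket = new Socket(remote.AddressFamily, SocketType.Stream, ProtocolType.Tcp))
        using (var args = new SocketAsyncEventArgs())
        {
            var completed = new TaskCompletionSource<bool>();
            args.RemoteEndPoint = remote;
            args.Completed += (sender, e) => completed.TrySetResult(true);

            // true  => operation is pending; the Completed event fires later.
            // false => operation finished synchronously; Completed is NOT raised,
            //          so the caller must handle the result right away.
            var pending = socket.ConnectAsync(args);
            if (pending)
                completed.Task.Wait();

            // In both cases the outcome is available on the event args.
            return args.SocketError;
        }
    }
}
```

Handling the `false` branch explicitly matters because no event will ever arrive for it, which is the reason the reviewed code notifies itself immediately in that case.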

}

private void ScheduleConnectRetry()
=> Timers.StartSingleTimer(RetryConnectTimerKey, RetryConnect.Instance, TimeSpan.FromMilliseconds(1));
Member Author

preserves the same timing mechanics as before, just via messaging rather than delegates.

@Aaronontheweb Aaronontheweb marked this pull request as ready for review May 16, 2026 17:09
@Aaronontheweb Aaronontheweb merged commit e5afb74 into akkadotnet:v1.5 May 16, 2026
8 of 11 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/8195-tcp-connect-retry-swallowed-exception branch May 16, 2026 17:10
@Aaronontheweb
Member Author

I think this fix is probably inapplicable to the dev branch given the divergence around TcpConnection, but I'll double check just in case.

@Aaronontheweb Aaronontheweb added the akka.net v1.5 Issues affecting Akka.NET v1.5 label May 16, 2026
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request May 16, 2026
…kkadotnet#8195)

Forward-port of akkadotnet#8214 (merged to v1.5).
Aaronontheweb added a commit that referenced this pull request May 17, 2026
…8195) (#8215)

Forward-port of #8214 (merged to v1.5).
This was referenced May 21, 2026
Labels

akka.net v1.5 (Issues affecting Akka.NET v1.5), akka-io
