fix: report Tcp.CommandFailed when a scheduled connect retry throws (#8195) #8215

Merged: Aaronontheweb merged 1 commit into akkadotnet:dev from Aaronontheweb:fix/8195-tcp-connect-retry-swallowed-exception-dev on May 17, 2026

@Aaronontheweb
Member

Summary

Forward-port of #8214 (merged to v1.5) to dev. The #8132 Akka.IO transport rewrite changed the transport layer but left the outgoing-connection state machine intact, so the bug from #8195 is present on dev as well.

On Linux, a dropped TCP connection could leave the commander/user actor permanently stuck — it never received Tcp.Connected or Tcp.CommandFailed, and the only recovery was a process restart.

Root cause

When a connect attempt fails, TcpOutgoingConnection.Connecting schedules a retry. That retry was scheduled as a raw Action via Context.System.Scheduler.Advanced.ScheduleOnce(...), so it ran on the HashedWheelTimer scheduler thread — outside the actor's message loop — and called Socket.ConnectAsync directly.

When that call threw (PlatformNotSupportedException on Linux when reusing a socket after a failed connect attempt), the exception propagated into HashedWheelTimerScheduler.Bucket.Execute, which logs and swallows it. Because the exception never re-entered the actor, the existing ReportConnectFailureStop path never ran, so Tcp.CommandFailed was never delivered.
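The failure mode can be illustrated with a toy model. This is a minimal Python sketch, not Akka.NET code: `run_on_timer_thread`, `failing_connect`, `commander_inbox`, and `scheduler_log` are hypothetical stand-ins for the scheduler, the throwing `Socket.ConnectAsync` call, the commander's mailbox, and the scheduler's log. It only shows why a callback that throws on a timer thread leaves the commander with no reply.

```python
import queue
import threading


def run_on_timer_thread(callback, log, delay=0.01):
    """Hypothetical scheduler: runs the callback on its own thread,
    and logs and swallows any exception the callback raises."""
    def guarded():
        try:
            callback()
        except Exception as exc:
            # The error stops here; nothing is forwarded to any actor.
            log.append(f"scheduler swallowed: {exc!r}")

    timer = threading.Timer(delay, guarded)
    timer.start()
    return timer


def failing_connect():
    # Stand-in for a connect call that throws when the socket is reused.
    raise OSError("socket cannot be reused after a failed connect")


commander_inbox = queue.Queue()   # what the commander would receive
scheduler_log = []                # what the scheduler logged

timer = run_on_timer_thread(failing_connect, scheduler_log)
timer.join()                      # wait for the scheduled callback to finish

# The scheduler saw the error, but the commander was never told.
commander_got_reply = not commander_inbox.empty()
```

In this model the error reaches only `scheduler_log`; `commander_inbox` stays empty, which corresponds to the commander waiting indefinitely.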

Fix

The retry now runs inside the actor's message loop (identical approach to #8214):

  • TcpOutgoingConnection implements IWithTimers (consistent with TcpListener in the same module).
  • The retry is scheduled as a RetryConnect self-message via Timers.StartSingleTimer.
  • A Receive<RetryConnect> handler performs the Socket.ConnectAsync call wrapped in the existing ReportConnectFailure, so any exception is surfaced to the commander as Tcp.CommandFailed and the connection actor stops cleanly.

This also removes a latent bug: the old raw action could run Socket.ConnectAsync on an already-disposed socket if the actor stopped before the scheduled callback fired. With IWithTimers, the pending timer is canceled automatically when the actor stops.
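The shape of the fix can be sketched the same way. Again this is a toy Python model under stated assumptions, not the Akka.NET implementation: `ConnectionActor`, `schedule_retry`, `report_connect_failure`, and the `RETRY_CONNECT` marker are hypothetical names that mirror the roles of the connection actor, `Timers.StartSingleTimer`, `ReportConnectFailure`, and `RetryConnect`. The timer only enqueues a self-message; the connect attempt runs in the actor's own handler, where a failure is reported to the commander, and stopping the actor cancels any pending timer.

```python
import queue
import threading

RETRY_CONNECT = "RetryConnect"  # hypothetical self-message marker


class ConnectionActor:
    def __init__(self, connect, commander_inbox):
        self.connect = connect
        self.commander_inbox = commander_inbox
        self.mailbox = queue.Queue()
        self.pending_timer = None
        self.stopped = False

    def schedule_retry(self, delay):
        # The timer thread only enqueues a message; it never calls connect.
        self.pending_timer = threading.Timer(
            delay, self.mailbox.put, args=(RETRY_CONNECT,))
        self.pending_timer.start()

    def report_connect_failure(self, thunk):
        # Any exception becomes a reply to the commander, then the actor stops.
        try:
            thunk()
        except Exception as exc:
            self.commander_inbox.put(("CommandFailed", repr(exc)))
            self.stop()

    def handle(self, message):
        if message == RETRY_CONNECT and not self.stopped:
            self.report_connect_failure(self.connect)

    def stop(self):
        self.stopped = True
        if self.pending_timer is not None:
            self.pending_timer.cancel()  # a pending retry never fires


def failing_connect():
    raise OSError("socket cannot be reused after a failed connect")


# Case 1: the retry fails inside the handler and the commander is told.
commander_inbox = queue.Queue()
actor = ConnectionActor(failing_connect, commander_inbox)
actor.schedule_retry(0.01)
actor.handle(actor.mailbox.get(timeout=5))
reply = commander_inbox.get_nowait()

# Case 2: the actor stops before the timer fires, so no retry is enqueued.
idle = ConnectionActor(failing_connect, queue.Queue())
idle.schedule_retry(0.5)
idle.stop()
idle.pending_timer.join()
```

In the first case `reply` is a `("CommandFailed", ...)` tuple and the actor is stopped. In the second, the cancelled timer leaves `idle.mailbox` empty, so the connect call is never attempted after the stop.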

The dev change is slightly smaller than #8214 because dev no longer has the DNS IPv4/IPv6 fallback path (Connecting has a single retry case).

Testing

Added Should_report_CommandFailed_when_outgoing_connection_is_refused to TcpIntegrationSpec — a behavioral guard asserting that a refused outgoing connection always ends with Tcp.CommandFailed (the actor must never hang).

Note: #8214 used a deterministic cross-platform regression test built on the DNS IPv4→IPv6 fallback retry path. dev removed that fallback path, and PlatformNotSupportedException is architecture-specific (does not reproduce on x64), so that exact test cannot be ported here. This behavioral test catches the regression on the affected platform (arm64 Linux) and guards the "never hang" contract everywhere; on x64 it passes regardless, consistent with x64 not exhibiting the bug.

Full Akka.Tests.IO suite (52 tests) passes locally.

…kkadotnet#8195)

TcpOutgoingConnection scheduled its connect retry as a raw Action on the
HashedWheelTimer scheduler thread. When Socket.ConnectAsync threw inside that
callback (PlatformNotSupportedException on Linux when reusing a socket after a
failed connection attempt), the exception was logged and swallowed by the
scheduler. The commander never received Tcp.Connected or Tcp.CommandFailed and
stayed stuck permanently.

The retry now runs inside the actor's message loop: it is scheduled as a
RetryConnect self-message via IWithTimers, and the ConnectAsync call is wrapped
in ReportConnectFailure, so any exception is surfaced to the commander as
Tcp.CommandFailed and the connection actor stops. Using a timer also cancels
the pending retry automatically when the actor stops.

Forward-port of akkadotnet#8214 (merged to v1.5).

Aaronontheweb force-pushed the fix/8195-tcp-connect-retry-swallowed-exception-dev branch from 2bbc3c4 to 2612865 on May 16, 2026 18:28
Aaronontheweb merged commit c7f0fbd into akkadotnet:dev on May 17, 2026 (8 of 11 checks passed)
Aaronontheweb deleted the fix/8195-tcp-connect-retry-swallowed-exception-dev branch on May 17, 2026 19:31