Fix stale ACK causing irrecoverable quarantine after transient network disruption #8116

Merged: Aaronontheweb merged 2 commits into akkadotnet:dev from Arkatufus:#6414-Fix-stale-ack-disassiciation on Mar 19, 2026

@Arkatufus (Contributor)

Fixes #6414

Problem

When a transient network disruption occurs and a node reconnects with the same UID (process stayed alive), a stale ACK from the remote node's previous receive buffer state can trigger an ArgumentException in AckedSendBuffer<T>.Acknowledge(). This exception is wrapped in a HopelessAssociation, which causes an irrecoverable quarantine of the remote node.

In production, this quarantine creates an "indirectly connected" topology that can cause SBR's keep-majority strategy to down an entire cluster.

The root cause is a guard check that throws when ack.CumulativeAck > MaxSeq:

Error encountered while processing system message acknowledgement buffer: [-1 []] ack: ACK[0, []]
---> System.ArgumentException: Highest SEQ so far was -1 but cumulative ACK is 0

This happens because:

  1. The sender's ReliableDeliverySupervisor creates a fresh AckedSendBuffer with MaxSeq = -1 (new endpoint)
  2. The receiver's EndpointReader.PreStart() restores a stale AckedReceiveBuffer from the shared _receiveBuffers dictionary (UID matches), and sends ACK[0] based on old state
  3. The guard check 0 > -1 fires and throws, quarantining the association
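
The three steps above can be sketched in isolation. This is a simplified model, not the Akka.Remote code: `OldSendBufferSketch` and `StaleAckRepro` are hypothetical names introduced here, and the buffer only tracks sequence numbers. It mirrors the guard as described (cumulative ACK compared against MaxSeq) to show why a fresh buffer rejects ACK[0].

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration only; NOT Akka.Remote's AckedSendBuffer<T>.
public sealed class OldSendBufferSketch
{
    // -1 means "nothing buffered yet", matching the MaxSeq = -1 state in the log above.
    public long MaxSeq { get; private set; } = -1;

    private readonly List<long> _nonAcked = new();

    public void Buffer(long seq)
    {
        _nonAcked.Add(seq);
        MaxSeq = Math.Max(MaxSeq, seq);
    }

    // Old behaviour as described in this PR: an ACK beyond MaxSeq is a hard error.
    public void Acknowledge(long cumulativeAck)
    {
        if (cumulativeAck > MaxSeq)
            throw new ArgumentException(
                $"Highest SEQ so far was {MaxSeq} but cumulative ACK is {cumulativeAck}");

        // Drop every buffered sequence number covered by the cumulative ACK.
        _nonAcked.RemoveAll(seq => seq <= cumulativeAck);
    }
}

public static class StaleAckRepro
{
    // Returns true when the old guard throws for a stale ACK[0] on a fresh buffer.
    public static bool ThrowsOnStaleAck()
    {
        var fresh = new OldSendBufferSketch(); // step 1: new endpoint, MaxSeq = -1
        const long staleCumulativeAck = 0;     // step 2: receiver answers from old state
        try
        {
            fresh.Acknowledge(staleCumulativeAck); // step 3: 0 > -1, so the guard fires
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

In this model, `StaleAckRepro.ThrowsOnStaleAck()` returns `true`; in the real system that exception is what gets wrapped in a HopelessAssociation.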

Fix

Changed the guard in AckedSendBuffer<T>.Acknowledge() from a hard throw to a no-op return this. When the ACK references a sequence number higher than anything in the buffer, there is nothing to acknowledge and nothing to corrupt - the buffer is either empty or contains only messages with lower sequence numbers.
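
As a rough illustration of the changed guard, here is a minimal sketch under stated assumptions: `SendBufferSketch` is a hypothetical, simplified immutable buffer that stores only sequence numbers, and it is not the actual `AckedSendBuffer<T>` source. The only point it demonstrates is that an ACK beyond MaxSeq returns the same instance unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified immutable buffer; NOT Akka.Remote's AckedSendBuffer<T>.
public sealed class SendBufferSketch
{
    public long MaxSeq { get; }
    public IReadOnlyList<long> NonAcked { get; }

    // A fresh buffer has seen no sequence numbers yet.
    public SendBufferSketch() : this(-1, Array.Empty<long>()) { }

    private SendBufferSketch(long maxSeq, IReadOnlyList<long> nonAcked)
    {
        MaxSeq = maxSeq;
        NonAcked = nonAcked;
    }

    public SendBufferSketch Buffer(long seq) =>
        new SendBufferSketch(Math.Max(MaxSeq, seq), NonAcked.Append(seq).ToList());

    public SendBufferSketch Acknowledge(long cumulativeAck)
    {
        // New behaviour as described in this PR: an ACK beyond MaxSeq is a no-op.
        // Nothing in the buffer can match it, so the same instance is returned.
        if (cumulativeAck > MaxSeq)
            return this;

        // Normal path: keep only sequence numbers not covered by the cumulative ACK.
        return new SendBufferSketch(
            MaxSeq, NonAcked.Where(seq => seq > cumulativeAck).ToList());
    }
}
```

Because the no-op path returns the same instance, a caller in this model could detect an ignored ACK with a reference comparison on the result of `Acknowledge`.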

Added a warning log at the ReliableDeliverySupervisor call site so operators have visibility when stale ACKs are being ignored.

Tests

Added two unit tests to AckedDeliverySpec that reproduce the exact failure:

Both tests fail with the original ArgumentException on the old code and pass with the fix.

@Aaronontheweb (Member) left a comment

LGTM

@Aaronontheweb (Member) left a comment

Actually, not sure about something

Comment thread src/core/Akka.Remote/Endpoint.cs Outdated
{
var oldBuffer = _resendBuffer;
_resendBuffer = _resendBuffer.Acknowledge(ack);
if (ReferenceEquals(_resendBuffer, oldBuffer) && ack.CumulativeAck > _resendBuffer.MaxSeq)
Member

@Arkatufus why is this necessary?

Contributor Author

It's there for logging feedback; it can be removed.

@Aaronontheweb (Member) left a comment

LGTM

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) March 19, 2026 22:13
@Aaronontheweb Aaronontheweb merged commit 3b58d43 into akkadotnet:dev Mar 19, 2026
12 checks passed
Arkatufus added a commit to Arkatufus/akka.net that referenced this pull request Mar 23, 2026
…k disruption (akkadotnet#8116)

* Fix stale remote ack causes disassociation / quarantine

* remove redundant code

(cherry picked from commit 3b58d43)
Aaronontheweb pushed a commit that referenced this pull request Mar 23, 2026
…k disruption (#8116) (#8124)

* Fix stale remote ack causes disassociation / quarantine

* remove redundant code

(cherry picked from commit 3b58d43)
This was referenced May 21, 2026

Development

Successfully merging this pull request may close these issues.

Association to [akka_ tcp://name_server:49766] with UId [1896650320] is irrecoverably failed. Quarantining address

2 participants