
[FIXED] Stuck consumer after leader change #6469

Merged
derekcollison merged 1 commit into main from maurice/stuck-consumer-after-leader-change
Feb 7, 2025

@MauriceVanVeen
Member

When a client requests messages, `o.deliverMsg` does the following:

	// Update delivered first.
	o.updateDelivered(dseq, seq, dc, ts)

	// Send message.
	o.outq.send(pmsg)

`o.updateDelivered` needs to establish quorum so that all servers know the message was delivered, and this can fail. The message is sent either way, so the client can receive messages that a new leader will not know were ever delivered.

Once a new leader is elected and receives an ACK for a message it doesn't know was delivered, it would move `o.sseq` ahead. This is incorrect: since the message is not in `o.pending`, the ack floors are not updated. And if any messages before the acked one had not been acked or NAK-ed, they would never be redelivered, resulting in the stuck consumer symptom.
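To make the failure mode concrete, here is a minimal toy model, not the server's actual code. The `consumer` struct, `newLeader`, `ackBuggy`, and `ackFixed` are hypothetical names; only `sseq`, `pending`, and the ack floor echo the fields mentioned above. It assumes the fix amounts to ignoring acks for deliveries the leader never recorded, which is one reading of the description rather than a confirmed summary of the diff.

```go
package main

import "fmt"

// consumer is a hypothetical, simplified stand-in for the leader's state.
type consumer struct {
	sseq     uint64          // next stream sequence to deliver
	ackFloor uint64          // everything at or below this is considered acked
	pending  map[uint64]bool // delivered but not yet acked
}

// newLeader models a freshly elected leader that never learned about
// any deliveries made by the previous leader.
func newLeader() *consumer {
	return &consumer{sseq: 1, pending: map[uint64]bool{}}
}

// ackBuggy models the old behaviour: an ack for an unknown delivery
// still moves sseq ahead, while pending and the ack floor stay untouched.
func (c *consumer) ackBuggy(seq uint64) {
	if !c.pending[seq] {
		if seq >= c.sseq {
			c.sseq = seq + 1
		}
		return
	}
	delete(c.pending, seq) // ack floor bookkeeping omitted in this sketch
}

// ackFixed models the assumed fix: an ack for an unknown delivery is ignored.
func (c *consumer) ackFixed(seq uint64) {
	if !c.pending[seq] {
		return
	}
	delete(c.pending, seq) // ack floor bookkeeping omitted in this sketch
}

func main() {
	// The old leader sent messages 1, 2 and 3; the client acks only 3.
	buggy, fixed := newLeader(), newLeader()
	buggy.ackBuggy(3)
	fixed.ackFixed(3)
	fmt.Println("buggy next delivery:", buggy.sseq, "ack floor:", buggy.ackFloor)
	fmt.Println("fixed next delivery:", fixed.sseq, "ack floor:", fixed.ackFloor)
}
```

In this model the buggy leader resumes delivery at sequence 4 with an ack floor of 0, so messages 1 and 2 are neither pending nor due for delivery. The fixed leader still resumes at 1, so they are delivered again.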

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner February 7, 2025 11:12
Member

@neilalexander neilalexander left a comment

LGTM!

@derekcollison derekcollison merged commit cd93fef into main Feb 7, 2025
@derekcollison derekcollison deleted the maurice/stuck-consumer-after-leader-change branch February 7, 2025 15:22
neilalexander added a commit that referenced this pull request Feb 10, 2025
Includes the following:

- #6465
- #6464
- #6469
- #6471
- #6472
- #6474
- #6477
- #6480
- #6487
- #6488

Signed-off-by: Neil Twigg <neil@nats.io>
neilalexander added a commit that referenced this pull request Apr 17, 2025
Related to #6469, about the
following code:
```go
	// Update delivered first.
	o.updateDelivered(dseq, seq, dc, ts)

	// Send message.
	o.outq.send(pmsg)
```

`o.updateDelivered` requires proposing delivered state through Raft, and
even if the proposal fails, we immediately send the message to the
client. This is great for performance, but really bad for properly
replicating this piece of data. Before the aforementioned PR there
would be a bunch of nasty side effects: stuck consumers, perceived data
loss through missed redeliveries, etc., because clients could get
messages that a new leader wouldn't know about if proposals failed.

The core issue is that we should only send the message AFTER we have
quorum on updating delivered state. Otherwise the following could
happen: the message gets sent to the client, the `updateDelivered`
proposal fails, the leader changes, and `AckSync` now keeps timing out,
even with retries, because the new leader doesn't know this message was
ever delivered.
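
The ordering argument can be sketched as follows. This is a simplified, synchronous illustration, not the server's implementation: `proposer`, `errNoQuorum`, `deliverEager`, and `deliverAfterQuorum` are hypothetical names, and a real Raft proposal would be asynchronous.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoQuorum is a hypothetical error for a proposal that did not reach quorum.
var errNoQuorum = errors.New("no quorum")

// proposer is a hypothetical stand-in for replicating delivered state.
type proposer func(seq uint64) error

// deliverEager mirrors the order in the snippet above: the outcome of the
// proposal does not gate the send, so the client gets the message regardless.
func deliverEager(propose proposer, seq uint64, out *[]uint64) error {
	err := propose(seq)
	*out = append(*out, seq)
	return err
}

// deliverAfterQuorum sends only once the delivered state has been accepted.
func deliverAfterQuorum(propose proposer, seq uint64, out *[]uint64) error {
	if err := propose(seq); err != nil {
		return err // nothing was sent, so no follower is left unaware of a delivery
	}
	*out = append(*out, seq)
	return nil
}

func main() {
	failing := func(uint64) error { return errNoQuorum }

	var eager, gated []uint64
	_ = deliverEager(failing, 1, &eager)
	_ = deliverAfterQuorum(failing, 1, &gated)

	fmt.Println("eager sent:", len(eager), "gated sent:", len(gated))
}
```

With a failing proposal, the eager variant still hands one message to the client while the gated variant hands over none, which is the property the paragraph above asks for.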

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>