
rabbit_quorum_queue: Shrink batches of QQs in parallel (backport #15081)#15765

Merged
the-mikedavis merged 1 commit into v4.3.x from mergify/bp/v4.3.x/pr-15081 on Mar 18, 2026

Conversation


@mergify mergify bot commented Mar 18, 2026

Shrinking a member node off a QQ can be parallelized. The operation involves:

  • removing the node from the QQ's cluster membership (appending a command to the log and committing it) with ra:remove_member/3
  • updating the metadata store to remove the member from the QQ type state with rabbit_amqqueue:update/2
  • deleting the queue data from the node with ra:force_delete_server/2 if the node can be reached
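The per-queue steps above run in order, but many queues can be shrunk at once. A minimal illustrative sketch in Python (the actual implementation is Erlang; `remove_member`, `update_metadata`, and `force_delete` are hypothetical stand-ins for `ra:remove_member/3`, `rabbit_amqqueue:update/2`, and `ra:force_delete_server/2`):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the Ra / metadata-store calls named above.
def remove_member(queue, node):
    return ("removed", queue, node)

def update_metadata(queue, node):
    return ("updated", queue, node)

def force_delete(queue, node):
    return ("deleted", queue, node)

def shrink_queue(queue, node):
    # The three steps for one queue stay sequential.
    remove_member(queue, node)
    update_metadata(queue, node)
    return force_delete(queue, node)

def shrink_all(queues, node):
    # Fan the per-queue work out across a pool instead of looping serially.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda q: shrink_queue(q, node), queues))

results = shrink_all(["qq.1", "qq.2", "qq.3"], "rabbit@node3")
```

The key design point is that only the cross-queue loop is parallelized; each queue's membership change must still commit before its metadata update and deletion.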

All of these operations are I/O bound. Updating the cluster membership and metadata store means appending commands to those logs and replicating them. Writing commands to Ra synchronously and serially is fairly slow; sending many commands in parallel is much more efficient. By parallelizing these steps we can write larger chunks of commands to the WAL(s).
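The WAL-batching effect can be shown with a toy model (not Ra's actual WAL; `ToyWAL` is a made-up class where a flush stands in for an expensive sync):

```python
class ToyWAL:
    """Toy write-ahead log: all commands pending at flush time share one sync."""
    def __init__(self):
        self.pending = []
        self.flushes = 0

    def submit(self, cmd):
        self.pending.append(cmd)

    def flush(self):
        if self.pending:
            self.flushes += 1   # one (expensive) sync for the whole batch
            self.pending.clear()

# Serial: each command is written and synced on its own.
serial = ToyWAL()
for cmd in range(100):
    serial.submit(cmd)
    serial.flush()

# Parallel: many in-flight commands land in the WAL before one flush.
batched = ToyWAL()
for cmd in range(100):
    batched.submit(cmd)
batched.flush()

print(serial.flushes, batched.flushes)  # prints "100 1"
```

Same hundred commands, two orders of magnitude fewer syncs when they are in flight together.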

ra:force_delete_server/2 also benefits from parallelization when the node being shrunk off is no longer reachable, for example after a hardware failure. The underlying rpc:call/4 attempts to auto-connect to the node, and that attempt can take some time to time out. When the calls run in parallel, each rpc:call/4 reuses the same underlying distribution connection, so all of the calls fail together as soon as the connection fails to establish.
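The wall-clock win on an unreachable node can be sketched with a simulation (again illustrative Python, not the real Erlang code; `rpc_to_down_node` and `CONNECT_TIMEOUT` are made-up stand-ins for rpc:call/4 and the distribution connect timeout):

```python
import time
from concurrent.futures import ThreadPoolExecutor

CONNECT_TIMEOUT = 0.2  # illustrative; real connect timeouts are seconds

def rpc_to_down_node(_queue):
    # Simulate an auto-connect attempt that times out.
    time.sleep(CONNECT_TIMEOUT)
    return "noconnection"

queues = [f"qq.{i}" for i in range(10)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(queues)) as pool:
    results = list(pool.map(rpc_to_down_node, queues))
parallel_elapsed = time.monotonic() - start

# Serially this would cost ~len(queues) * CONNECT_TIMEOUT; in parallel the
# calls all time out together in roughly one CONNECT_TIMEOUT.
```

With ten queues, the serial loop pays ten connect timeouts back to back, while the parallel version pays roughly one.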

Discussed in #15057


This is an automatic backport of pull request #15081 done by Mergify.

(cherry picked from commit 511692a)
@the-mikedavis
Collaborator

Any thoughts on backporting this to v4.2.x as well? I think it should backport cleanly.

@the-mikedavis the-mikedavis added this to the 4.3.0 milestone Mar 18, 2026
@the-mikedavis the-mikedavis merged commit 48a1770 into v4.3.x Mar 18, 2026
184 checks passed
@the-mikedavis the-mikedavis deleted the mergify/bp/v4.3.x/pr-15081 branch March 18, 2026 18:22
the-mikedavis added a commit that referenced this pull request Mar 18, 2026
rabbit_quorum_queue: Shrink batches of QQs in parallel (backport #15081) (backport #15765)