Commit 2a0a0df
rabbit_peer_discovery: Rewrite the core logic
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only, even though we knew about it:
we had hit many issues caused by the fact that `join_cluster` and
peer discovery use different code paths to create a cluster.
To add support for Khepri, the first version of this patch moved
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_mnesia` to `rabbit_db_cluster`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.
Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, unless out-of-band
countermeasures were used, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.
After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.
[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.
We wanted the new algorithm to fulfill the following properties:
1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.
2. The selection of the node to join should be designed in a way that
   all nodes select the same node, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criteria (in that order):
   1. the size of the cluster a discovered node is part of; sorted from
      bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger
      nodes
   3. the name of a discovered node; sorted alphabetically
   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.
3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):
   * The list must contain `node()` (i.e. the node running peer
     discovery itself).
   * If RabbitMQ's cluster size hint is greater than 1, the list
     must have at least two nodes. The cluster size hint is the maximum
     of the configured target cluster size hint and the number of
     elements in the nodes list returned by the backend.
   If one of the constraints is not met, the entire peer discovery
   process is restarted after a delay.
4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   start without acquiring the lock.
5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.
6. After some delay, peer discovery will be re-evaluated to further
   reduce the chances of having multiple clusters instead of one.
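
The sorting criteria of property 2 and the constraints of property 3 can be illustrated with a small sketch. It is written in Python for readability and is not the Erlang code from this commit: `NodeInfo`, `sort_nodes`, `select_node_to_join` and the returned action tuples are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical view of one discovered node (not a RabbitMQ type)."""
    name: str           # node name, e.g. "rabbit@a"
    cluster_size: int   # size of the cluster this node is currently part of
    start_time: float   # node start time; a smaller value means an older node


def sort_nodes(nodes):
    """Sort by: bigger cluster first, then older node first, then name."""
    return sorted(nodes, key=lambda n: (-n.cluster_size, n.start_time, n.name))


def select_node_to_join(nodes, this_node, target_size_hint, backend_count):
    """Return ("boot", None), ("join", name) or ("retry", None).

    `nodes` is the filtered list; `backend_count` is the number of nodes the
    backend originally returned. Both parameters are assumptions of this
    sketch.
    """
    names = [n.name for n in sort_nodes(nodes)]
    # Constraint: the list must contain the node running peer discovery.
    if this_node not in names:
        return ("retry", None)
    # Constraint: with a cluster size hint above 1, need at least two nodes.
    size_hint = max(target_size_hint, backend_count)
    if size_hint > 1 and len(names) < 2:
        return ("retry", None)
    first = names[0]
    # The first node in the list boots on its own; others join it.
    return ("boot", None) if first == this_node else ("join", first)
```

Because every node sorts on the same key, any two nodes that see the same list agree on its first element, whatever order the backend returned the nodes in.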
This commit covers properties 1 to 4. The remaining properties will be
addressed in additional pull requests once this one works.
If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval
The default parameters were bumped to 30 retries with a delay of 1
second between each.
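
The restart-after-delay behaviour, together with the rule that the lock only protects the join, can be sketched as follows. This is an illustration in Python, not the Erlang implementation: `run_peer_discovery`, its injected callables and its return values are hypothetical. It also assumes that the retry limit counts total attempts and that the process simply gives up once the limit is reached, neither of which is stated above.

```python
import time


def run_peer_discovery(discover, select, join, lock, this_node,
                       retry_limit=30, retry_interval=1.0, sleep=time.sleep):
    """Run discovery, selection and join; restart the whole process on failure.

    Hypothetical stand-ins supplied by the caller:
      discover()    -> list of node names (may raise on backend failure)
      select(nodes) -> name of the node to join, or None if constraints fail
      join(node)    -> joins that node (may raise)
      lock          -> context manager held only around the join
    """
    for _attempt in range(retry_limit):
        try:
            nodes = discover()          # the backend is queried without the lock
            selected = select(nodes)
            if selected is None:
                raise RuntimeError("node list constraints not met")
            if selected == this_node:
                return "boot"           # first node: boots without the lock
            with lock:                  # the lock protects the join only
                join(selected)
            return "joined"
        except Exception:
            # Any failure restarts the entire process after a delay.
            sleep(retry_interval)
    return "gave_up"                    # assumption: outcome when retries run out
```

The defaults mirror the values mentioned above (30 retries, 1 second apart); passing `sleep` as a parameter makes the sketch easy to exercise without waiting.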
The locking retries/interval parameters are not used by the new
algorithm anymore.
There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
make sure we use the same backend throughout the entire process and
when we call `maybe_unregister/0` even if the configuration changed
for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
the end of a successful peer discovery process. `rabbit_db_cluster`
had to call `maybe_register/0` if the node was not virgin anyway, so
we make it simpler and always call it in `rabbit_db_cluster` regardless
of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
directly. There is no need to explicitly call another function for
that.
* Messages are logged using `?LOG_*()` macros instead of the old
`rabbit_log` module.
1 parent 8a59ae3 commit 2a0a0df
File tree
10 files changed, +916 -340 lines changed
- deps
- rabbitmq_cli/test/diagnostics
- rabbit
- src
- test