Commit ef13342
Prevent global:sync/0 from being stuck
Prior to this commit, global:sync/0 would sometimes get stuck when
either performing a rolling update on Kubernetes or when creating a new
RabbitMQ cluster on Kubernetes.
When performing a rolling update, the node being booted gets stuck in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```
During cluster creation, an example log of global:sync/0 being stuck can
be found in bullet point 2 of
#5331 (review)
When global:sync/0 is stuck, it never receives a message at
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942
This issue can be observed on both `kind` and GKE.
`kind` uses CoreDNS, GKE uses kube-dns.
CoreDNS does not resolve the hostnames of the RabbitMQ node and its
peers correctly for up to 30 seconds after node startup.
This is because the default CoreDNS cache duration is 30 seconds and
CoreDNS has a bug described in
kubernetes/kubernetes#92559
global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to
`true`.
When either:
1. setting CoreDNS cache value to 1 second (see
#5322 (comment)
on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`
rolling updates do NOT get stuck anymore.
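For reference, the CoreDNS cache duration is set via the `cache` plugin
in the Corefile. A minimal sketch of the relevant stanza, assuming a
typical Kubernetes Corefile (the full Corefile in a given cluster will
differ):

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Default is 30 seconds; lowering it makes stale records expire faster.
    cache 1
    forward . /etc/resolv.conf
    loop
}
```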
This means we are hitting a combination of:
1. Kubernetes DNS bug not updating DNS caches promptly for headless
services with `publishNotReadyAddresses: true`, and
2. Erlang bug which causes global:sync/0 to hang forever in the presence
of network failures.
The Erlang bug is fixed by setting `prevent_overlapping_partitions` to
`true` (the default in Erlang/OTP 25).
In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions`
to `false` because we fear other issues could arise if we set this
parameter to `true`.
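For illustration, this kernel parameter can be passed as an Erlang VM
argument; a hedged sketch (the exact mechanism RabbitMQ uses to set it
may differ, and `RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS` is just one way to
inject extra VM arguments):

```
# e.g. via RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS:
-kernel prevent_overlapping_partitions true
```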
Luckily, to resolve this issue of global:sync/0 being stuck, we can
simply call the function rabbit_node_monitor:global_sync/0, which
provides a workaround. This function was introduced 8 years ago in
9fcb31f
With this commit applied, rolling updates no longer get stuck, and the
debug log shows the workaround sometimes being applied.
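The general shape of such a workaround is to run global:sync/0 under a
timeout so the boot process cannot hang forever. A hypothetical sketch
of that pattern (this is NOT the actual rabbit_node_monitor:global_sync/0
implementation; the function name, timeout handling, and recovery action
are illustrative only):

```erlang
%% Hypothetical sketch: run global:sync/0 in a separate process and
%% give up after Timeout milliseconds instead of hanging forever.
%% The real rabbit_node_monitor:global_sync/0 may recover differently.
global_sync_with_timeout(Timeout) ->
    {Pid, Ref} = spawn_monitor(fun() -> ok = global:sync() end),
    receive
        {'DOWN', Ref, process, Pid, normal} ->
            ok;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        %% Sync did not complete in time: stop waiting and clean up.
        erlang:demonitor(Ref, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```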
(cherry picked from commit 4bf78d8)