-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
The RabbitMQ “global hang workaround” as described in
rabbitmq-server/deps/rabbit/src/rabbit_node_monitor.erl
Lines 271 to 285 in 238995b
| %%---------------------------------------------------------------------------- | |
| %% "global" hang workaround. | |
| %%---------------------------------------------------------------------------- | |
| %% This code works around a possible inconsistency in the "global" | |
| %% state, causing global:sync/0 to never return. | |
| %% | |
| %% 1. A process is spawned. | |
| %% 2. If after 10", global:sync() didn't return, the "global" | |
| %% state is parsed. | |
| %% 3. If it detects that a sync is blocked for more than 10", | |
| %% the process sends fake nodedown/nodeup events to the two | |
| %% nodes involved (one local, one remote). | |
| %% 4. Both "global" instances restart their synchronisation. | |
| %% 5. global:sync() finally returns. |
breaks from OTP 25.1 onwards because the messages being sent in
rabbitmq-server/deps/rabbit/src/rabbit_node_monitor.erl
Lines 334 to 337 in 238995b
| {global_name_server, ThisNode} ! {nodedown, PeerNode}, | |
| {global_name_server, PeerNode} ! {nodedown, ThisNode}, | |
| {global_name_server, ThisNode} ! {nodeup, PeerNode}, | |
| {global_name_server, PeerNode} ! {nodeup, ThisNode}, |
global due to the following commit: erlang/otp@9274e89
This means rolling updates in RabbitMQ on Kubernetes with image rabbitmq:3.11.0 containing Erlang 25.1.1 will get (sometimes) stuck.
The reasoning of using "global hang workaround" is further described in #5438.
In short, a combination of the following issues leads us to relying on that workaround:
- the new feature flags controller v2 introduced in 3.11 calls
global:sync/0early on boot - following Kubernetes bug: Headless Service not publishing not ready addresses for statefulset kubernetes/kubernetes#92559
- Erlang
globalbug (got fixed in Erlang by setting parameterprevent_overlapping_partitionstotrue) - RabbitMQ not being able to set
prevent_overlapping_partitionstotrueas described in Revert "Set kernel param prevent_overlapping_partitions to true" #5483
To reproduce, trigger a few times a rolling update on kindest/node:v1.25.2 using a 5 node RabbitMQ cluster with image rabbitmq:3.11.0-management.
The logs show the following before the node gets stuck:
2022-10-06 07:13:05.032492+00:00 [debug] <0.232.0> Register `rabbit` process (<0.232.0>) for rabbit_node_monitor
## ## RabbitMQ 3.11.0
## ##
########## Copyright (c) 2007-2022 VMware, Inc. or its affiliates.
###### ##
########## Licensed under the MPL 2.0. Website: https://rabbitmq.com
Erlang: 25.1.1 [jit]
TLS Library: OpenSSL - OpenSSL 1.1.1q 5 Jul 2022
Release series support status: supported
Doc guides: https://rabbitmq.com/documentation.html
Support: https://rabbitmq.com/contact.html
Tutorials: https://rabbitmq.com/getstarted.html
Monitoring: https://rabbitmq.com/monitoring.html
Logs: /var/log/rabbitmq/[email protected]_upgrade.log
<stdout>
Config file(s): /etc/rabbitmq/conf.d/10-defaults.conf
/etc/rabbitmq/conf.d/10-operatorDefaults.conf
/etc/rabbitmq/conf.d/11-default_user.conf
/etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
Starting broker...2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> node : [email protected]
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> home dir : /var/lib/rabbitmq
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> config file(s) : /etc/rabbitmq/conf.d/10-defaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> : /etc/rabbitmq/conf.d/10-operatorDefaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> : /etc/rabbitmq/conf.d/11-default_user.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> : /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> cookie hash : idxtUKhQEU9Dav23AqS/0g==
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> log(s) : /var/log/rabbitmq/[email protected]_upgrade.log
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> : <stdout>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0> database dir : /var/lib/rabbitmq/mnesia/[email protected]
2022-10-06 07:13:05.033162+00:00 [debug] <0.232.0>
2022-10-06 07:13:05.033173+00:00 [debug] <0.232.0> == Plugins (prelaunch phase) ==
2022-10-06 07:13:05.033184+00:00 [debug] <0.232.0> Setting plugins up
2022-10-06 07:13:05.047274+00:00 [debug] <0.232.0> Plugins discovery: ignoring getopt, not a RabbitMQ plugin
2022-10-06 07:13:05.047316+00:00 [debug] <0.232.0> Plugins discovery: ignoring quantile_estimator, not a RabbitMQ plugin
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> Loading the following plugins: [rabbitmq_management_agent,cowlib,cowboy,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> rabbitmq_web_dispatch,amqp_client,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> rabbitmq_management,prometheus,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> rabbitmq_peer_discovery_common,accept,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> rabbitmq_peer_discovery_k8s,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> rabbitmq_prometheus]
2022-10-06 07:13:05.068152+00:00 [debug] <0.232.0> Feature flags: REFRESHING after applications load...
2022-10-06 07:13:05.068197+00:00 [debug] <0.377.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-10-06 07:13:05.068231+00:00 [debug] <0.377.0> Feature flags: [global sync] @ [email protected]
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Global hang workaround: global state on [email protected] seems inconsistent
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> * Peer global state: {status,<25933.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [[{'$ancestors',[kernel_sup,<25933.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [122137426785371025|262422651984364825]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {creation_extension,269151976804057088},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {'$initial_call',{global,init,1}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> running,<25933.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {"Parent",<25933.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [],[],'[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> <25933.56.0>,<25933.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> * Local global state: {status,<0.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [[{{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> <25933.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> -576460752303423477}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [2150540948049099|39771336696590305]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> -576460752303423477},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {creation_extension,20109732664573952},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {'$initial_call',{global,init,1}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {'$ancestors',[kernel_sup,<0.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> -576460752303423469},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 6631665,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 9346854,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 9344269,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 6616455},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> <25934.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> -576460752303423469}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> running,<0.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {"Parent",<0.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 12032275,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 1029715,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 1027052,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> 1021319},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> [<0.426.0>],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> '[email protected]',<0.56.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> <0.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Faking nodedown/nodeup between [email protected] and [email protected]
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> handle_info({nodedown,'[email protected]'}, _)
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0>
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> handle_info({nodeup,'[email protected]'}, _)
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0>
</details>