Skip to content

Rolling updates hang on K8s with rabbitmq:3.11.0 #6034

@ansd

Description

@ansd

The RabbitMQ “global hang workaround” as described in

%%----------------------------------------------------------------------------
%% "global" hang workaround.
%%----------------------------------------------------------------------------
%% This code works around a possible inconsistency in the "global"
%% state, causing global:sync/0 to never return.
%%
%% 1. A process is spawned.
%% 2. If after 10", global:sync() didn't return, the "global"
%% state is parsed.
%% 3. If it detects that a sync is blocked for more than 10",
%% the process sends fake nodedown/nodeup events to the two
%% nodes involved (one local, one remote).
%% 4. Both "global" instances restart their synchronisation.
%% 5. global:sync() finally returns.

breaks from OTP 25.1 onwards because the messages being sent in
{global_name_server, ThisNode} ! {nodedown, PeerNode},
{global_name_server, PeerNode} ! {nodedown, ThisNode},
{global_name_server, ThisNode} ! {nodeup, PeerNode},
{global_name_server, PeerNode} ! {nodeup, ThisNode},
are not handled anymore in global due to the following commit: erlang/otp@9274e89

This means rolling updates in RabbitMQ on Kubernetes with image rabbitmq:3.11.0 containing Erlang 25.1.1 will get (sometimes) stuck.

The reasoning of using "global hang workaround" is further described in #5438.
In short, a combination of the following issues leads us to relying on that workaround:

  1. the new feature flags controller v2 introduced in 3.11 calls global:sync/0 early on boot
  2. following Kubernetes bug: Headless Service not publishing not ready addresses for statefulset kubernetes/kubernetes#92559
  3. Erlang global bug (got fixed in Erlang by setting parameter prevent_overlapping_partitions to true)
  4. RabbitMQ not being able to set prevent_overlapping_partitions to true as described in Revert "Set kernel param prevent_overlapping_partitions to true" #5483

To reproduce, trigger a few times a rolling update on kindest/node:v1.25.2 using a 5 node RabbitMQ cluster with image rabbitmq:3.11.0-management.

The logs show the following before the node gets stuck:

2022-10-06 07:13:05.032492+00:00 [debug] <0.232.0> Register `rabbit` process (<0.232.0>) for rabbit_node_monitor

  ##  ##      RabbitMQ 3.11.0
  ##  ##
  ##########  Copyright (c) 2007-2022 VMware, Inc. or its affiliates.
  ######  ##
  ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com

  Erlang:      25.1.1 [jit]
  TLS Library: OpenSSL - OpenSSL 1.1.1q  5 Jul 2022
  Release series support status: supported

  Doc guides:  https://rabbitmq.com/documentation.html
  Support:     https://rabbitmq.com/contact.html
  Tutorials:   https://rabbitmq.com/getstarted.html
  Monitoring:  https://rabbitmq.com/monitoring.html

  Logs: /var/log/rabbitmq/[email protected]_upgrade.log
        <stdout>

  Config file(s): /etc/rabbitmq/conf.d/10-defaults.conf
                  /etc/rabbitmq/conf.d/10-operatorDefaults.conf
                  /etc/rabbitmq/conf.d/11-default_user.conf
                  /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf

  Starting broker...2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  node           : [email protected]
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  home dir       : /var/lib/rabbitmq
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  config file(s) : /etc/rabbitmq/conf.d/10-defaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/10-operatorDefaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/11-default_user.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  cookie hash    : idxtUKhQEU9Dav23AqS/0g==
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  log(s)         : /var/log/rabbitmq/[email protected]_upgrade.log
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : <stdout>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  database dir   : /var/lib/rabbitmq/mnesia/[email protected]
2022-10-06 07:13:05.033162+00:00 [debug] <0.232.0>
2022-10-06 07:13:05.033173+00:00 [debug] <0.232.0> == Plugins (prelaunch phase) ==
2022-10-06 07:13:05.033184+00:00 [debug] <0.232.0> Setting plugins up
2022-10-06 07:13:05.047274+00:00 [debug] <0.232.0> Plugins discovery: ignoring getopt, not a RabbitMQ plugin
2022-10-06 07:13:05.047316+00:00 [debug] <0.232.0> Plugins discovery: ignoring quantile_estimator, not a RabbitMQ plugin
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> Loading the following plugins: [rabbitmq_management_agent,cowlib,cowboy,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_web_dispatch,amqp_client,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_management,prometheus,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_peer_discovery_common,accept,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_peer_discovery_k8s,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_prometheus]
2022-10-06 07:13:05.068152+00:00 [debug] <0.232.0> Feature flags: REFRESHING after applications load...
2022-10-06 07:13:05.068197+00:00 [debug] <0.377.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-10-06 07:13:05.068231+00:00 [debug] <0.377.0> Feature flags: [global sync] @ [email protected]
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Global hang workaround: global state on [email protected] seems inconsistent
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>  * Peer global state:  {status,<25933.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         [[{'$ancestors',[kernel_sup,<25933.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             [122137426785371025|262422651984364825]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {creation_extension,269151976804057088},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$initial_call',{global,init,1}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          running,<25933.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Parent",<25933.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [],[],'[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               <25933.56.0>,<25933.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>  * Local global state: {status,<0.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         [[{{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              <25933.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             -576460752303423477}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             [2150540948049099|39771336696590305]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            -576460752303423477},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {creation_extension,20109732664573952},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$initial_call',{global,init,1}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$ancestors',[kernel_sup,<0.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            -576460752303423469},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 6631665,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 9346854,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 9344269,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 6616455},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              <25934.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             -576460752303423469}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          running,<0.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Parent",<0.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12032275,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1029715,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1027052,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1021319},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [<0.426.0>],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               '[email protected]',<0.56.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               <0.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Faking nodedown/nodeup between [email protected] and [email protected]
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> handle_info({nodedown,'[email protected]'}, _)
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0>
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> handle_info({nodeup,'[email protected]'}, _)
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0>
</details>

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions