Handle case of non matching gun conn pids when receiving a gun_down message #611

Merged

Conversation

Th3-M4jor
Contributor

Occasionally, when my bot has to reconnect, I would see 2 ready events in quick succession. I noticed from the stacktrace below that the PID in the gun_down message (<0.563597.0>) didn't match the connection PID held in the shard state (conn => <0.564181.0>), and that mismatch caused the Shard to blow up with a function_clause error. A sketch of the defensive handling follows the stacktrace.

Stacktrace
2024-06-23 14:07:30.419 pid=<0.2610.0> [error] ** State machine <0.2610.0> terminating
** Last event = {info,{gun_down,<0.563597.0>,ws,normal,
                                [#Ref<0.1002467860.1931214849.138526>]}}
** When server state  = {connected,#{stream =>
                                         #Ref<0.1002467860.1931214849.143966>,
                                     seq => 30,
                                     '__struct__' =>
                                         'Elixir.Nostrum.Struct.WSState',
                                     session =>
                                         <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
                                     conn => <0.564181.0>,
                                     gateway =>
                                         <<"gateway-us-east1-d.discord.gg">>,
                                     resume_gateway =>
                                         <<"wss://gateway-us-east1-d.discord.gg">>,
                                     shard_num => 0,total_shards => 1,
                                     heartbeat_ack => true,
                                     heartbeat_interval => 41250,
                                     last_heartbeat_ack =>
                                         #{microsecond => {108488,6},
                                           second => 21,
                                           calendar => 'Elixir.Calendar.ISO',
                                           month => 6,
                                           '__struct__' => 'Elixir.DateTime',
                                           utc_offset => 0,std_offset => 0,
                                           year => 2024,hour => 14,day => 23,
                                           zone_abbr => <<"UTC">>,minute => 7,
                                           time_zone => <<"Etc/UTC">>},
                                     conn_pid => <0.2610.0>,
                                     last_heartbeat_send =>
                                         #{microsecond => {164837,6},
                                           second => 34,
                                           calendar => 'Elixir.Calendar.ISO',
                                           month => 6,
                                           '__struct__' => 'Elixir.DateTime',
                                           utc_offset => 0,std_offset => 0,
                                           year => 2024,hour => 14,day => 23,
                                           zone_abbr => <<"UTC">>,minute => 6,
                                           time_zone => <<"Etc/UTC">>},
                                     compress_ctx =>
                                         #Ref<0.1002467860.1927938051.205417>}}
** Reason for termination = error:function_clause
** Callback modules = ['Elixir.Nostrum.Shard.Session']
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{'Elixir.Nostrum.Shard.Session',connected,
         [info,
          {gun_down,<0.563597.0>,ws,normal,
              [#Ref<0.1002467860.1931214849.138526>]},
          #{stream => #Ref<0.1002467860.1931214849.143966>,seq => 30,
            '__struct__' => 'Elixir.Nostrum.Struct.WSState',
            session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
            conn => <0.564181.0>,
            gateway => <<"gateway-us-east1-d.discord.gg">>,
            resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>,
            shard_num => 0,total_shards => 1,heartbeat_ack => true,
            heartbeat_interval => 41250,
            last_heartbeat_ack =>
                #{microsecond => {108488,6},
                  second => 21,calendar => 'Elixir.Calendar.ISO',month => 6,
                  '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                  std_offset => 0,year => 2024,hour => 14,day => 23,
                  zone_abbr => <<"UTC">>,minute => 7,
                  time_zone => <<"Etc/UTC">>},
            conn_pid => <0.2610.0>,
            last_heartbeat_send =>
                #{microsecond => {164837,6},
                  second => 34,calendar => 'Elixir.Calendar.ISO',month => 6,
                  '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                  std_offset => 0,year => 2024,hour => 14,day => 23,
                  zone_abbr => <<"UTC">>,minute => 6,
                  time_zone => <<"Etc/UTC">>},
            compress_ctx => #Ref<0.1002467860.1927938051.205417>}],
         [{file,"lib/nostrum/shard/session.ex"},{line,324}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
** Time-outs: {1,[{state_timeout,send_heartbeat}]}
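
For illustration, the kind of defensive clause this PR adds could look roughly like the sketch below. The handler shape and the `reconnect/1` helper are assumptions inferred from the stacktrace (a :gen_statem in state_functions mode whose connected/3 previously only matched a gun_down carrying the live connection PID), not a quote of the merged diff.

```elixir
# Before: only this clause existed, so a :gun_down from a stale
# connection PID matched no clause and crashed with function_clause.
# Repeating `conn` in both patterns asserts the PIDs are equal.
def connected(:info, {:gun_down, conn, _proto, _reason, _killed_streams}, %{conn: conn} = data) do
  # PID matches the live connection: reconnect as before.
  # (`reconnect/1` is a hypothetical stand-in for the real logic.)
  reconnect(data)
end

# After: a catch-all for non-matching PIDs ignores the stale message
# instead of taking down the shard.
def connected(:info, {:gun_down, stale_conn, _proto, _reason, _killed_streams}, _data) do
  Logger.warning("Ignoring :gun_down from stale connection #{inspect(stale_conn)}")
  :keep_state_and_data
end
```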

@jb3 jb3 merged commit e56ac19 into Kraigie:master Jun 23, 2024
10 checks passed
@jchristgit
Collaborator

jchristgit commented Jun 29, 2024 via email

@Th3-M4jor
Contributor Author

Looking at the code, we do brutally close and then flush the connection whenever it goes down or we otherwise disconnect.

What could theoretically be happening is a race condition: we close and flush the connection on our end at the same time as it's being closed on the other end, and for some odd reason the gun_down message doesn't end up in the process mailbox until after we've reconnected (see the sketch below). I'm not sure how else this could be happening.
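To make that race concrete, here is a hedged sketch of the close-and-flush pattern being described. :gun.close/1, :gun.flush/1, and :gun.open/2 are real gun API calls; the surrounding code and the hostname are illustrative only.

```elixir
# Tear down the old connection and drain any of its messages
# already sitting in our mailbox.
:ok = :gun.close(old_conn)
:ok = :gun.flush(old_conn)

# Reconnect. If a {:gun_down, old_conn, ...} only arrives after the
# flush, it carries a PID that no longer matches the connection in our
# state, which is exactly the mismatch in the stacktrace above.
{:ok, new_conn} = :gun.open(~c"gateway.discord.gg", 443)
```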

@jchristgit
Collaborator

jchristgit commented Jun 30, 2024 via email

@Th3-M4jor
Contributor Author

I've got logs for my bot going back to early May, and after looking again, it seems this specific error has only happened to me once 🤔. So it's likely this was just one of those weird flukes that wouldn't happen again for a long time.
