Handle case of non matching gun conn pids when receiving a gun_down message #611

Merged

Conversation

Th3-M4jor
Contributor

Occasionally, when my bot has to reconnect, I would see 2 ready events in quick succession. I noticed from the stacktrace below that the PID in the gun_down message (<0.563597.0>) didn't match the connection PID held in the shard state (conn => <0.564181.0>), and that mismatch caused the Shard to blow up with a function_clause error. A sketch of the defensive handling follows the stacktrace.

Stacktrace
2024-06-23 14:07:30.419 pid=<0.2610.0> [error] ** State machine <0.2610.0> terminating
** Last event = {info,{gun_down,<0.563597.0>,ws,normal,
                                [#Ref<0.1002467860.1931214849.138526>]}}
** When server state  = {connected,#{stream =>
                                         #Ref<0.1002467860.1931214849.143966>,
                                     seq => 30,
                                     '__struct__' =>
                                         'Elixir.Nostrum.Struct.WSState',
                                     session =>
                                         <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
                                     conn => <0.564181.0>,
                                     gateway =>
                                         <<"gateway-us-east1-d.discord.gg">>,
                                     resume_gateway =>
                                         <<"wss://gateway-us-east1-d.discord.gg">>,
                                     shard_num => 0,total_shards => 1,
                                     heartbeat_ack => true,
                                     heartbeat_interval => 41250,
                                     last_heartbeat_ack =>
                                         #{microsecond => {108488,6},
                                           second => 21,
                                           calendar => 'Elixir.Calendar.ISO',
                                           month => 6,
                                           '__struct__' => 'Elixir.DateTime',
                                           utc_offset => 0,std_offset => 0,
                                           year => 2024,hour => 14,day => 23,
                                           zone_abbr => <<"UTC">>,minute => 7,
                                           time_zone => <<"Etc/UTC">>},
                                     conn_pid => <0.2610.0>,
                                     last_heartbeat_send =>
                                         #{microsecond => {164837,6},
                                           second => 34,
                                           calendar => 'Elixir.Calendar.ISO',
                                           month => 6,
                                           '__struct__' => 'Elixir.DateTime',
                                           utc_offset => 0,std_offset => 0,
                                           year => 2024,hour => 14,day => 23,
                                           zone_abbr => <<"UTC">>,minute => 6,
                                           time_zone => <<"Etc/UTC">>},
                                     compress_ctx =>
                                         #Ref<0.1002467860.1927938051.205417>}}
** Reason for termination = error:function_clause
** Callback modules = ['Elixir.Nostrum.Shard.Session']
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{'Elixir.Nostrum.Shard.Session',connected,
         [info,
          {gun_down,<0.563597.0>,ws,normal,
              [#Ref<0.1002467860.1931214849.138526>]},
          #{stream => #Ref<0.1002467860.1931214849.143966>,seq => 30,
            '__struct__' => 'Elixir.Nostrum.Struct.WSState',
            session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
            conn => <0.564181.0>,
            gateway => <<"gateway-us-east1-d.discord.gg">>,
            resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>,
            shard_num => 0,total_shards => 1,heartbeat_ack => true,
            heartbeat_interval => 41250,
            last_heartbeat_ack =>
                #{microsecond => {108488,6},
                  second => 21,calendar => 'Elixir.Calendar.ISO',month => 6,
                  '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                  std_offset => 0,year => 2024,hour => 14,day => 23,
                  zone_abbr => <<"UTC">>,minute => 7,
                  time_zone => <<"Etc/UTC">>},
            conn_pid => <0.2610.0>,
            last_heartbeat_send =>
                #{microsecond => {164837,6},
                  second => 34,calendar => 'Elixir.Calendar.ISO',month => 6,
                  '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                  std_offset => 0,year => 2024,hour => 14,day => 23,
                  zone_abbr => <<"UTC">>,minute => 6,
                  time_zone => <<"Etc/UTC">>},
            compress_ctx => #Ref<0.1002467860.1927938051.205417>}],
         [{file,"lib/nostrum/shard/session.ex"},{line,324}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
** Time-outs: {1,[{state_timeout,send_heartbeat}]}
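
For illustration, the kind of defensive clause this PR adds could look roughly like the sketch below. The handler shape and the `reconnect/1` helper are assumptions inferred from the stacktrace (a :gen_statem in state_functions mode whose connected/3 previously only matched a gun_down carrying the live connection PID), not a quote of the merged diff.

```elixir
# Before: only this clause existed, so a :gun_down from a stale
# connection PID matched no clause and crashed with function_clause.
# Repeating `conn` in both patterns asserts the PIDs are equal.
def connected(:info, {:gun_down, conn, _proto, _reason, _killed_streams}, %{conn: conn} = data) do
  # PID matches the live connection: reconnect as before.
  # (`reconnect/1` is a hypothetical stand-in for the real logic.)
  reconnect(data)
end

# After: a catch-all for non-matching PIDs ignores the stale message
# instead of taking down the shard.
def connected(:info, {:gun_down, stale_conn, _proto, _reason, _killed_streams}, _data) do
  Logger.warning("Ignoring :gun_down from stale connection #{inspect(stale_conn)}")
  :keep_state_and_data
end
```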

@jb3 jb3 merged commit e56ac19 into Kraigie:master Jun 23, 2024
10 checks passed
@jchristgit
Collaborator

jchristgit commented Jun 29, 2024 via email

@Th3-M4jor
Contributor Author

Looking at the code, we do brutally close and then flush the connection whenever it goes down or we otherwise disconnect.

What could theoretically be happening is a race condition: we close and flush the connection on our end at the same time as it's being closed on the other end, and for some odd reason the gun_down message doesn't end up in the process mailbox until after we've reconnected (see the sketch below). I'm not sure how else this could be happening.
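To make that race concrete, here is a hedged sketch of the close-and-flush pattern being described. :gun.close/1, :gun.flush/1, and :gun.open/2 are real gun API calls; the surrounding code and the hostname are illustrative only.

```elixir
# Tear down the old connection and drain any of its messages
# already sitting in our mailbox.
:ok = :gun.close(old_conn)
:ok = :gun.flush(old_conn)

# Reconnect. If a {:gun_down, old_conn, ...} only arrives after the
# flush, it carries a PID that no longer matches the connection in our
# state, which is exactly the mismatch in the stacktrace above.
{:ok, new_conn} = :gun.open(~c"gateway.discord.gg", 443)
```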

@jchristgit
Collaborator

jchristgit commented Jun 30, 2024 via email

@Th3-M4jor
Contributor Author

I've got logs for my bot going back to early May, and after looking again, it seems this specific error has only happened to me once 🤔. So it's likely this was just one of those weird flukes that wouldn't happen again for a long time.
