
Configure abnormal exit reasons for DynamicSupervisor #131

Closed
AndrewDryga opened this issue Dec 7, 2016 · 18 comments

@AndrewDryga

AndrewDryga commented Dec 7, 2016

Motivation: provide a supervisor for temporary GenServers. They should be spawned when a new job is started, and exit with reason :normal after the job is finished.

From the Elixir docs it looks like the :transient restart type is the most suitable for this use case, but actually it's not, because whenever the supervisor reaches max_restart_intensity it exits with reason :shutdown, which is treated as normal, so the supervisor dies silently.

I guess a good solution would be to provide a different restart strategy that restarts the process even when it exits with reason :shutdown, or to provide a way to configure which exit reasons the supervisor treats as abnormal.

@AndrewDryga AndrewDryga changed the title Configure abnormal trasons for DynamicSupervisor Configure abnormal exit reasons for DynamicSupervisor Dec 7, 2016
@AndrewDryga
Author

Btw, are there any good ways to avoid this issue other than spawning a limited pool of workers and reusing them?

@josevalim
Member

Sorry, but why is the supervisor reaching max_restart_intensity? The value is not increased if the GenServer exits with a normal reason.

@AndrewDryga
Author

AndrewDryga commented Dec 7, 2016

In our case it happens because of bugs in our application (for example, when we receive a job from RabbitMQ and fail to process it due to some unpredicted temporary issue, e.g. an external service being unavailable).

Then the supervisor retries the job (because RabbitMQ re-delivers unacknowledged messages) and reaches max restart intensity. The supervisor dies silently, because the :shutdown reason is not abnormal, and the whole application node keeps running silently without processing any jobs.

We use Kubernetes; it will restart the whole container, and it is desirable for the application supervisor to die in these cases, because the cluster will be able to heal itself in some time. (And we will notice the container restarts.)

For these cases it would be nice to have something like a "do not restart only when the reason is :normal" option.

@josevalim
Member

@AndrewDryga Why is the supervisor child spec being specified with restart :transient then? If you want it to be restarted, shouldn't it be defined as restart: :permanent?

@AndrewDryga
Author

AndrewDryga commented Dec 7, 2016

@josevalim because we want to be able to stop it with reason :normal when the job is completed. Sample usage:

defmodule MyWorker do
  use GenServer
  require Logger

  def start_link(%{} = job, tag) do
    GenServer.start_link(__MODULE__, [job: job, tag: tag])
  end

  def init(state) do
    # Time out almost immediately so the job starts right after init.
    {:ok, state, 100}
  end

  def handle_info(:timeout, [job: job, tag: tag]) do
    Logger.debug("Started processing Gap Analyzer job: #{inspect job}. Worker: #{__MODULE__}")

    # do_task/1, produce_next_task/1 and send_ack/2 are app-specific helpers.
    job
    |> do_task()
    |> produce_next_task()
    |> send_ack(tag)

    {:stop, :normal, []}
  end

  # ..
end

@josevalim
Member

@AndrewDryga but that's not related to the exit reason the supervisor uses when reaching the max restart intensity.

@josevalim
Member

josevalim commented Dec 7, 2016

I guess it will be a good solution to provide different restart strategy that will restart process even when it exits with :shutdown reason

We already have such a restart strategy. It is called :permanent and it is the default mode.

@AndrewDryga
Author

:permanent will restart a job that exited with reason :normal; at least I don't know of any exit reason that would not trigger a restart. Or I misunderstood the docs.

@josevalim
Member

@AndrewDryga there is some confusion here because you are referring to two different processes at the same time.

In the issue, you say you want to change the exit value of a supervisor because :shutdown is not properly restarted. However, you are telling me you want to exit a job with reason :normal. Given we are talking about a supervisor and a job, you should use a restart of :permanent for the supervisor and a restart of :transient for the job. No?
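
A minimal sketch of that arrangement as child-spec maps, using the module names that appear in the logs later in this thread; the job arguments are placeholders:

# Placeholder job arguments, just for the sketch.
job = %{"portfolio_subscription_id" => 1}
tag = 1

# The workers supervisor is a :permanent child of the application
# supervisor: it is restarted on any exit reason, including the :shutdown
# it uses after reaching its own max restart intensity.
supervisor_child = %{
  id: Trader.Workers.Supervisor,
  start: {Trader.Workers.Supervisor, :start_link, []},
  restart: :permanent,
  type: :supervisor
}

# Each job worker is a :transient child of the workers supervisor: it is
# restarted only after an abnormal exit, so stopping with reason :normal
# once the job is done does not trigger a restart.
worker_child = %{
  id: Trader.Workers.Analyzer,
  start: {Trader.Workers.Analyzer, :start_link, [job, tag]},
  restart: :transient
}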

@AndrewDryga
Author

You are right. I don't know where the best place to put this option is, hence the misunderstanding. And due to the language barrier it's hard to say what I want :).

Here are logs from our test environment that describe the issue:

=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.Workers.Supervisor'}
     Context:    child_terminated
     Reason:     {#{'__exception__' => true,
                    '__struct__' => 'Elixir.Postgrex.Error',
                    connection_id => 16553,
                    message => nil,
                    postgres => #{code => undefined_column,
                      file => <<"parse_relation.c">>,
                      line => <<"3090">>,
                      message => <<"column b1.loans_invest_whole does not exist">>,
                      pg_code => <<"42703">>,
                      position => <<"350">>,
                      routine => <<"errorMissingColumn">>,
                      severity => <<"ERROR">>,
                      unknown => <<"ERROR">>}},
                  [{'Elixir.Ecto.Adapters.SQL',execute_and_cache,7,
                       [{file,"lib/ecto/adapters/sql.ex"},{line,415}]},
                   {'Elixir.Ecto.Repo.Queryable',execute,5,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,121}]},
                   {'Elixir.Ecto.Repo.Queryable',all,4,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,35}]},
                   {'Elixir.Ecto.Repo.Queryable',one,4,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,59}]},
                   {'Elixir.Trader.Workers.Analyzer',do_task,2,
                       [{file,"lib/workers/analyzer.ex"},{line,95}]},
                   {'Elixir.Trader.Workers.Analyzer',handle_info,2,
                       [{file,"lib/workers/analyzer.ex"},{line,50}]},
                   {gen_server,try_dispatch,4,
                       [{file,"gen_server.erl"},{line,615}]},
                   {gen_server,handle_msg,5,
                       [{file,"gen_server.erl"},{line,681}]}]}
     Offender:   [{pid,<0.1425.0>},
                  {id,'Elixir.Trader.Workers.Analyzer'},
                  {mfargs,
                      {'Elixir.Trader.Workers.Analyzer',start_link,
                          [#{<<"buckets">> => [#{<<"actual_volume">> => 0,<<"bucket_id">> => 1}],
                             <<"portfolio_subscription_id">> => 1},
                           1]}},
                  {restart_type,transient},
                  {shutdown,5000},
                  {child_type,worker}]
=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.Workers.Supervisor'}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.1425.0>},
                  {id,'Elixir.Trader.Workers.Analyzer'},
                  {mfargs,
                      {'Elixir.Trader.Workers.Analyzer',start_link,
                          [#{<<"buckets">> => [#{<<"actual_volume">> => 0,<<"bucket_id">> => 1}],
                             <<"portfolio_subscription_id">> => 1},
                           1]}},
                  {restart_type,transient},
                  {shutdown,5000},
                  {child_type,worker}]
=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.GapAnalyzer.Supervisor'}
     Context:    child_terminated
     Reason:     shutdown
     Offender:   [{pid,<0.1245.0>},
                  {id,'Elixir.Trader.Workers.Supervisor'},
                  {mfargs,{'Elixir.Trader.Workers.Supervisor',start_link,[]}},
                  {restart_type,permanent},
                  {shutdown,infinity},
                  {child_type,supervisor}]

How does this happen?

  1. We send a malformed task to RabbitMQ or shut down some external service.
  2. Some process within the application consumes messages from RabbitMQ and starts a job worker (Elixir.Trader.Workers.Analyzer) for each of them via the workers supervisor (Elixir.Trader.Workers.Supervisor).
  3. The worker fails to process its job and exits with an abnormal exit reason.
  4. The job gets rescheduled to the same container, and steps 1-3 repeat until the supervisor reaches max restart intensity.
  5. WorkersSupervisor exits with reason :shutdown, and we have a zombie container that lives and does nothing.

The desired behaviour:

  1. Whenever max restart intensity is reached for the workers supervisor, the application supervisor should also exit.
  2. The worker itself should be able to exit with :normal (or any other) reason without being restarted.

I will try to write some sample code for this case, but it's hard to reproduce.

@fishcakez
Member

@AndrewDryga in the above log Trader.Workers.Analyzer is crashing and causes Trader.Workers.Supervisor to reach its max restart intensity and shut down. Trader.Workers.Supervisor is a permanent child of Trader.GapAnalyzer.Supervisor and so should be restarted; I can't see it in the logs, but progress reports would show this. If you want to bubble the error higher up the supervisor tree immediately, you can set max_restarts: 0 in Trader.GapAnalyzer.Supervisor.
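
A minimal sketch of that tweak, assuming Trader.GapAnalyzer.Supervisor is a module-based Supervisor using the Supervisor.init/2 API:

defmodule Trader.GapAnalyzer.Supervisor do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    children = [
      %{
        id: Trader.Workers.Supervisor,
        start: {Trader.Workers.Supervisor, :start_link, []},
        restart: :permanent,
        type: :supervisor
      }
    ]

    # With max_restarts: 0 this supervisor gives up as soon as
    # Trader.Workers.Supervisor dies once, so the failure propagates
    # further up the tree instead of being retried here.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 0)
  end
end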

@josevalim
Member

I am closing this as I believe there is no bug or feature request per se, but we will be glad to continue the discussion. :)

@AndrewDryga
Author

AndrewDryga commented Dec 8, 2016

@fishcakez In our case Trader.Workers.Supervisor is not restarting :(. Nothing happens after the last log message (from the example above).

Here you can see the same situation in totally different application: bitwalker/distillery#118 (comment)

I can give you access to the source code, if you are willing to look into it.

@josevalim
Member

josevalim commented Dec 8, 2016 via email

@josevalim
Member

josevalim commented Dec 8, 2016 via email

@AndrewDryga
Author

I've built a sample app but wasn't able to reproduce this issue.

I guess it's not the supervisor's fault; it looks like this happens because we have a limited prefetch count (a limit on unacknowledged messages sent to a node) that is equal to the number of processes that get spawned. Once all of them are killed, RabbitMQ receives no acknowledgements and will neither reschedule the messages nor send new ones, because from its perspective the node is processing jobs at max capacity.

Are there any good practices for doing some work whenever a supervisor's children die? We need to store RabbitMQ tags and send a negative acknowledgement when this situation occurs.

Maybe a GenServer with duplicate processes that monitor the workers?
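
For reference, a minimal sketch of that monitoring idea; all names here are hypothetical, and nack/2 stands in for the actual RabbitMQ negative acknowledgement:

defmodule WorkerWatcher do
  # A GenServer that monitors each spawned worker, remembers its RabbitMQ
  # channel and delivery tag, and nacks the message if the worker dies
  # with an abnormal reason.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  def watch(watcher, worker_pid, channel, tag) do
    GenServer.cast(watcher, {:watch, worker_pid, channel, tag})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:watch, pid, channel, tag}, state) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(state, ref, {channel, tag})}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {entry, state} = Map.pop(state, ref)

    case entry do
      {channel, tag} when reason != :normal ->
        nack(channel, tag)

      _ ->
        :ok
    end

    {:noreply, state}
  end

  # Placeholder for the application's negative-acknowledgement call.
  defp nack(_channel, _tag), do: :ok
end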

@josevalim
Member

I would have each consumer execute each job inside a task and use facilities such as Task.yield to find out whether the job terminated or not. I.e. the best way to do it is to decouple the ack/nack system from the processing.
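
A rough sketch of that approach, assuming a Task.Supervisor named MyApp.TaskSupervisor is already running; Consumer, the 30-second timeout, and process_job/1, ack/2, nack/2 are placeholders for application-specific code:

defmodule Consumer do
  def handle_job(channel, tag, job) do
    # Run the job in a supervised task that is not linked to the consumer,
    # so a crashing job does not take the consumer (and its tags) down.
    task =
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        process_job(job)
      end)

    case Task.yield(task, 30_000) || Task.shutdown(task) do
      {:ok, _result} ->
        ack(channel, tag)

      {:exit, _reason} ->
        nack(channel, tag)

      nil ->
        # The job neither finished nor crashed within the timeout.
        nack(channel, tag)
    end
  end

  # Placeholders for the application's processing and RabbitMQ calls.
  defp process_job(_job), do: :ok
  defp ack(_channel, _tag), do: :ok
  defp nack(_channel, _tag), do: :ok
end

This keeps the consumer alive and in control of the delivery tag, so a crashed or timed-out job results in an explicit nack instead of a silently dying worker.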

@AndrewDryga
Author

Here is our solution for this problem: https://github.com/Nebo15/gen_task
