
Configure abnormal exit reasons for DynamicSupervisor #131

Closed
AndrewDryga opened this issue Dec 7, 2016 · 18 comments

@AndrewDryga

AndrewDryga commented Dec 7, 2016

Motivation: provide a supervisor for temporary GenServers. They should be spawned when a new job is started, and exit with reason :normal after the job is finished.

From the Elixir docs it looks like the :transient restart type is the most suitable for this use case, but actually it's not, because whenever the supervisor reaches max_restart_intensity it exits with reason :shutdown, which is treated as normal, so the supervisor dies silently.

I guess a good solution would be to provide a different restart strategy that restarts the process even when it exits with reason :shutdown, or to provide a way to configure which exit reasons the supervisor treats as abnormal.

@AndrewDryga AndrewDryga changed the title Configure abnormal trasons for DynamicSupervisor Configure abnormal exit reasons for DynamicSupervisor Dec 7, 2016
@AndrewDryga
Author

Btw, are there any good ways to avoid this issue other than spawning a limited pool of workers and reusing them?

@josevalim
Member

Sorry, but why is the supervisor reaching max_restart_intensity? The value is not increased if the GenServer exits with a normal reason.

@AndrewDryga
Author

AndrewDryga commented Dec 7, 2016

In our case it happens because of bugs in our application (for example, when we receive a job from RabbitMQ and fail to process it due to some unpredicted temporary issue, e.g. an external service being unavailable).

Then the supervisor retries the job (because RabbitMQ re-delivers unacknowledged messages) and reaches max restart intensity. The supervisor dies silently, because the :shutdown reason is not abnormal, and the whole application node keeps running silently without processing any jobs.

We use Kubernetes; it will restart the whole container, and it is desirable for the application supervisor to die in these cases, because the cluster will be able to heal itself in some time. (And we will notice the container restarts.)

For these cases it would be nice to have something like a "do not restart only when the reason is :normal" option.

@josevalim
Member

@AndrewDryga Why is the supervisor child spec being specified with restart :transient then? If you want it to be restarted, shouldn't it be defined as restart: :permanent?

@AndrewDryga
Author

AndrewDryga commented Dec 7, 2016

@josevalim because we want to be able to stop it with reason :normal when the job is completed. Sample usage:

defmodule MyWorker do
  use GenServer
  require Logger

  def start_link(%{} = job, tag) do
    GenServer.start_link(__MODULE__, [job: job, tag: tag])
  end

  def init(state) do
    # Time out almost immediately so the job starts right after init.
    {:ok, state, 100}
  end

  def handle_info(:timeout, [job: job, tag: tag]) do
    Logger.debug("Started processing Gap Analyzer job: #{inspect job}. Worker: #{__MODULE__}")

    # do_task/1, produce_next_task/1 and send_ack/2 are app-specific helpers.
    job
    |> do_task()
    |> produce_next_task()
    |> send_ack(tag)

    {:stop, :normal, []}
  end

  # ..
end

@josevalim
Member

@AndrewDryga but that's not related to the exit reason the supervisor uses when reaching the max restart intensity.

@josevalim
Member

josevalim commented Dec 7, 2016

I guess it will be a good solution to provide different restart strategy that will restart process even when it exits with :shutdown reason

We already have such a restart strategy. It is called :permanent and it is the default mode.

@AndrewDryga
Author

:permanent will restart a job that exited with reason :normal; at least I don't know of any exit reason that would not trigger a restart. Or I misunderstood the docs.

@josevalim
Member

@AndrewDryga there is some confusion here because you are referring to two different processes at the same time.

In the issue, you say you want to change the exit value of a supervisor because :shutdown is not properly restarted. However, you are telling me you want to exit a job with reason :normal. Given we are talking about a supervisor and a job, you should use a restart of :permanent for the supervisor and a restart of :transient for the job. No?
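
A minimal sketch of that arrangement as child-spec maps, using the module names that appear in the logs later in this thread; the job arguments are placeholders:

# Placeholder job arguments, just for the sketch.
job = %{"portfolio_subscription_id" => 1}
tag = 1

# The workers supervisor is a :permanent child of the application
# supervisor: it is restarted on any exit reason, including the :shutdown
# it uses after reaching its own max restart intensity.
supervisor_child = %{
  id: Trader.Workers.Supervisor,
  start: {Trader.Workers.Supervisor, :start_link, []},
  restart: :permanent,
  type: :supervisor
}

# Each job worker is a :transient child of the workers supervisor: it is
# restarted only after an abnormal exit, so stopping with reason :normal
# once the job is done does not trigger a restart.
worker_child = %{
  id: Trader.Workers.Analyzer,
  start: {Trader.Workers.Analyzer, :start_link, [job, tag]},
  restart: :transient
}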

@AndrewDryga
Author

You are right. I don't know where the best place to put this option is, hence the misunderstanding. And due to the language barrier it's hard to say what I want :).

Here are logs from our test environment that describe the issue:

=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.Workers.Supervisor'}
     Context:    child_terminated
     Reason:     {#{'__exception__' => true,
                    '__struct__' => 'Elixir.Postgrex.Error',
                    connection_id => 16553,
                    message => nil,
                    postgres => #{code => undefined_column,
                      file => <<"parse_relation.c">>,
                      line => <<"3090">>,
                      message => <<"column b1.loans_invest_whole does not exist">>,
                      pg_code => <<"42703">>,
                      position => <<"350">>,
                      routine => <<"errorMissingColumn">>,
                      severity => <<"ERROR">>,
                      unknown => <<"ERROR">>}},
                  [{'Elixir.Ecto.Adapters.SQL',execute_and_cache,7,
                       [{file,"lib/ecto/adapters/sql.ex"},{line,415}]},
                   {'Elixir.Ecto.Repo.Queryable',execute,5,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,121}]},
                   {'Elixir.Ecto.Repo.Queryable',all,4,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,35}]},
                   {'Elixir.Ecto.Repo.Queryable',one,4,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,59}]},
                   {'Elixir.Trader.Workers.Analyzer',do_task,2,
                       [{file,"lib/workers/analyzer.ex"},{line,95}]},
                   {'Elixir.Trader.Workers.Analyzer',handle_info,2,
                       [{file,"lib/workers/analyzer.ex"},{line,50}]},
                   {gen_server,try_dispatch,4,
                       [{file,"gen_server.erl"},{line,615}]},
                   {gen_server,handle_msg,5,
                       [{file,"gen_server.erl"},{line,681}]}]}
     Offender:   [{pid,<0.1425.0>},
                  {id,'Elixir.Trader.Workers.Analyzer'},
                  {mfargs,
                      {'Elixir.Trader.Workers.Analyzer',start_link,
                          [#{<<"buckets">> => [#{<<"actual_volume">> => 0,<<"bucket_id">> => 1}],
                             <<"portfolio_subscription_id">> => 1},
                           1]}},
                  {restart_type,transient},
                  {shutdown,5000},
                  {child_type,worker}]
=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.Workers.Supervisor'}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.1425.0>},
                  {id,'Elixir.Trader.Workers.Analyzer'},
                  {mfargs,
                      {'Elixir.Trader.Workers.Analyzer',start_link,
                          [#{<<"buckets">> => [#{<<"actual_volume">> => 0,<<"bucket_id">> => 1}],
                             <<"portfolio_subscription_id">> => 1},
                           1]}},
                  {restart_type,transient},
                  {shutdown,5000},
                  {child_type,worker}]
=SUPERVISOR REPORT==== 7-Dec-2016::15:42:24 ===
     Supervisor: {local,'Elixir.Trader.GapAnalyzer.Supervisor'}
     Context:    child_terminated
     Reason:     shutdown
     Offender:   [{pid,<0.1245.0>},
                  {id,'Elixir.Trader.Workers.Supervisor'},
                  {mfargs,{'Elixir.Trader.Workers.Supervisor',start_link,[]}},
                  {restart_type,permanent},
                  {shutdown,infinity},
                  {child_type,supervisor}]

How does this happen?

  1. We send a malformed task to RabbitMQ or shut down some external service.
  2. Some process within the application consumes messages from RabbitMQ and starts a job worker (Elixir.Trader.Workers.Analyzer) for each of them via the workers supervisor (Elixir.Trader.Workers.Supervisor).
  3. The worker fails to process its job and exits with an abnormal exit reason.
  4. The job gets rescheduled to the same container, and steps 1-3 repeat until the supervisor reaches max restart intensity.
  5. WorkersSupervisor exits with reason :shutdown, and we have a zombie container that lives and does nothing.

The desired behaviour:

  1. Whenever max restart intensity is reached for the workers supervisor, the application supervisor should also exit.
  2. The worker itself should be able to exit with :normal (or any other) reason without being restarted.

I will try to write some sample code for this case, but it's hard to reproduce.

@fishcakez
Member

@AndrewDryga in the above log Trader.Workers.Analyzer is crashing and causes Trader.Workers.Supervisor to reach its max restart intensity and shut down. Trader.Workers.Supervisor is a permanent child of Trader.GapAnalyzer.Supervisor and so should be restarted; I can't see it in the logs, but progress reports would show this. If you want to bubble the error higher up the supervisor tree immediately, you can set max_restarts: 0 in Trader.GapAnalyzer.Supervisor.
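
A minimal sketch of that tweak, assuming Trader.GapAnalyzer.Supervisor is a module-based Supervisor using the Supervisor.init/2 API:

defmodule Trader.GapAnalyzer.Supervisor do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    children = [
      %{
        id: Trader.Workers.Supervisor,
        start: {Trader.Workers.Supervisor, :start_link, []},
        restart: :permanent,
        type: :supervisor
      }
    ]

    # With max_restarts: 0 this supervisor gives up as soon as
    # Trader.Workers.Supervisor dies once, so the failure propagates
    # further up the tree instead of being retried here.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 0)
  end
end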

@josevalim
Member

I am closing this as I believe there is no bug or feature request per se, but we will be glad to continue the discussion. :)

@AndrewDryga
Author

AndrewDryga commented Dec 8, 2016

@fishcakez In our case Trader.Workers.Supervisor is not restarting :(. Nothing happens after the last log message (from the example above).

Here you can see the same situation in totally different application: bitwalker/distillery#118 (comment)

I can give you access to the source code, if you are willing to look into it.

@josevalim
Member

josevalim commented Dec 8, 2016 via email

@josevalim
Member

josevalim commented Dec 8, 2016 via email

@AndrewDryga
Author

I've built a sample app but wasn't able to reproduce this issue.

I guess it's not the supervisor's fault; it looks like this happens because we have a limited prefetch count (a limit on unacknowledged messages sent to a node) that is equal to the number of processes that get spawned. Once all of them are killed, RabbitMQ receives no acknowledgements and will neither reschedule the messages nor send new ones, because from its perspective the node is processing jobs at max capacity.

Are there any good practices for doing some work whenever a supervisor's children die? We need to store RabbitMQ tags and send a negative acknowledgement when this situation occurs.

Maybe a GenServer with duplicate processes that monitor the workers?
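
For reference, a minimal sketch of that monitoring idea; all names here are hypothetical, and nack/2 stands in for the actual RabbitMQ negative acknowledgement:

defmodule WorkerWatcher do
  # A GenServer that monitors each spawned worker, remembers its RabbitMQ
  # channel and delivery tag, and nacks the message if the worker dies
  # with an abnormal reason.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  def watch(watcher, worker_pid, channel, tag) do
    GenServer.cast(watcher, {:watch, worker_pid, channel, tag})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:watch, pid, channel, tag}, state) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(state, ref, {channel, tag})}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {entry, state} = Map.pop(state, ref)

    case entry do
      {channel, tag} when reason != :normal ->
        nack(channel, tag)

      _ ->
        :ok
    end

    {:noreply, state}
  end

  # Placeholder for the application's negative-acknowledgement call.
  defp nack(_channel, _tag), do: :ok
end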

@josevalim
Member

I would have each consumer execute each job inside a task and use facilities such as Task.yield to find out whether the job terminated or not. I.e. the best way to do it is to decouple the ack/nack system from the processing.
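
A rough sketch of that approach, assuming a Task.Supervisor named MyApp.TaskSupervisor is already running; Consumer, the 30-second timeout, and process_job/1, ack/2, nack/2 are placeholders for application-specific code:

defmodule Consumer do
  def handle_job(channel, tag, job) do
    # Run the job in a supervised task that is not linked to the consumer,
    # so a crashing job does not take the consumer (and its tags) down.
    task =
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        process_job(job)
      end)

    case Task.yield(task, 30_000) || Task.shutdown(task) do
      {:ok, _result} ->
        ack(channel, tag)

      {:exit, _reason} ->
        nack(channel, tag)

      nil ->
        # The job neither finished nor crashed within the timeout.
        nack(channel, tag)
    end
  end

  # Placeholders for the application's processing and RabbitMQ calls.
  defp process_job(_job), do: :ok
  defp ack(_channel, _tag), do: :ok
  defp nack(_channel, _tag), do: :ok
end

This keeps the consumer alive and in control of the delivery tag, so a crashed or timed-out job results in an explicit nack instead of a silently dying worker.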

@AndrewDryga
Author

Here is our solution for this problem: https://github.com/Nebo15/gen_task
