
fluentd gives unexpected error when sending SIGHUP signal #1674

Closed
mhietamaki opened this issue Aug 29, 2017 · 7 comments

@mhietamaki

The following scenario was tested with the official 0.14.19 and 0.14.20 fluentd Docker images.

Description:
We have used the SIGHUP signal to reload the configuration each time it changes. This worked perfectly in our use case until 0.14.20. Apparently something changed in 0.14.20, because we now get the following error each time we send SIGHUP to the fluentd process.

2017-08-29 06:16:25 +0000 [error]: #0 unexpected error error_class=SignalException error="SIGHUP"
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/engine.rb:228:in `sleep'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/engine.rb:228:in `run'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:772:in `run_engine'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:521:in `block in run_worker'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:697:in `main_process'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:516:in `run_worker'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/bin/fluentd:5:in `<top (required)>'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'

How to reproduce:

docker run -d --name signal_test fluent/fluentd:v0.14.20
docker kill -s "SIGHUP" signal_test
docker logs signal_test

Complete log

@repeatedly
Member

We recently changed the signal handling of the Docker image: fluent/fluentd-docker-image#93

Could you check whether this is the cause?
We didn't change fluentd's internal signal handling recently.

@kkurten

kkurten commented Aug 31, 2017

Updated with additional information
I did some further testing, and dumb-init (which is configured in entrypoint.sh) by default sends signals to the entire process group. What happens is that SIGHUP is first sent to the fluentd worker process, which causes engine.rb to rescue and re-raise the SignalException. After that, the supervisor process receives the SIGHUP signal and tries to kill the worker. dumb-init can be configured to send the signal only to the direct child; see the "session behavior" section in https://github.com/Yelp/dumb-init/blob/master/README.md.

Note: Docker 1.13+ has native support for Tini (https://github.com/krallin/tini), which provides essentially the same functionality as dumb-init. It can be enabled by passing the --init flag to the docker run command.
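To illustrate the process-group delivery described above, here is a small Ruby sketch (hypothetical, not fluentd or dumb-init code): a child is spawned as the leader of a new process group, and signaling the negative of the group ID delivers the signal to every member of that group, which is what dumb-init does by default.

```ruby
# Hypothetical illustration of process-group signal delivery,
# dumb-init's default mode. Not actual fluentd code.
pid = Process.spawn("sleep", "60", pgroup: true) # child leads a new group

# A negative PID targets the entire process group, so every member
# (here just the one child) receives the signal at once.
Process.kill(:TERM, -pid)

Process.wait(pid)
status = $?
```

With --single-child (or DUMB_INIT_SETSID=0), dumb-init instead signals only the direct child, i.e. the equivalent of `Process.kill(:TERM, pid)`.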


I did some quick debugging to find the root cause of the issue (I work on the same team as mikkohoo). dumb-init sends the SIGHUP signal correctly to the fluentd supervisor, so that part seems to work fine. However, I noticed a couple of issues in the signal-handling code that seem to cause the problem.

  1. ~~https://github.com/fluent/fluentd/blob/master/lib/fluent/engine.rb#L231 was changed at some point to catch all Exceptions. The problem is that it also catches the SignalException, so the worker is not shut down properly.~~ This is caused by dumb-init (see above); fluentd is working correctly.
  2. When the Fluentd supervisor receives SIGHUP, the kill_worker function (https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L174) is called and it kills all worker processes. However, when a worker is killed, its PID is not removed from the config[:worker_pid] array. When you send another SIGHUP, it always fails because the supervisor tries to kill a worker process that no longer exists, and you get an "Unexpected error No such process" error.

Note that I'm not familiar with the Fluentd codebase and I also rarely do any Ruby programming, so take the above with a grain of salt. :)
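The failure mode in point 2 can be sketched in a few lines of Ruby (hypothetical code, not the actual supervisor): signaling a PID that has already exited raises Errno::ESRCH, which is exactly the "No such process" error in the log.

```ruby
# Hypothetical sketch of the stale-PID problem from point 2 above.
pid = Process.spawn("true") # short-lived stand-in for a worker
Process.wait(pid)           # the worker has exited and been reaped

# A second SIGHUP would make the supervisor signal this stale PID,
# because it was never removed from the tracked array:
result =
  begin
    Process.kill(:HUP, pid)
    :killed
  rescue Errno::ESRCH # "No such process"
    :no_such_process
  end
```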

@repeatedly
Member

I see. Thanks for further investigation.
So does adding --single-child to this line fix the problem?
https://github.com/fluent/fluentd-docker-image/blob/943ce25fa978a8933947fe2c62b89bf9dc461d17/entrypoint.sh.erb#L2

@kkurten

kkurten commented Sep 1, 2017

I think the entrypoint script will break if the parameter is added to the shebang line, because parameter handling in shebang lines is a bit odd. It would probably be cleaner to add ENV DUMB_INIT_SETSID=0 to the Dockerfile; this has the same effect as adding the --single-child parameter to the shebang.

This change would fix the dumb-init issue, but not the config[:worker_pid] cleanup issue in the supervisor implementation. That one is easy to reproduce by sending SIGHUP to Fluentd twice: you should always get the "Unexpected error No such process" error after the second SIGHUP.

Older Fluentd versions (such as 0.14.15) probably work because config[:worker_pid] was a single value, not an array. When the supervisor received SIGHUP, it killed the old worker and overwrote config[:worker_pid] with the new worker's PID. In commit ed5f0cb, config[:worker_pid] was changed to an array, but nothing seems to remove a worker's PID from the array when the worker is killed.
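The missing cleanup step can be sketched as follows (hypothetical code, not the actual patch): once the workers have been killed, the tracked PIDs should be dropped so a later SIGHUP never signals stale processes.

```ruby
# Hypothetical sketch of the fix direction: forget killed PIDs so a
# second SIGHUP does not try to signal processes that no longer exist.
def kill_workers(pids)
  pids.each do |pid|
    begin
      Process.kill(:TERM, pid)
      Process.wait(pid) # reap the worker
    rescue Errno::ESRCH, Errno::ECHILD
      # worker already gone; nothing to do
    end
  end
  pids.clear # the missing step: clear the array after killing
end

# worker_pids stands in for config[:worker_pid]
worker_pids = 2.times.map { Process.spawn("sleep", "60") }
kill_workers(worker_pids)
```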

@repeatedly
Member

Patch for this issue:

repeatedly added a commit that referenced this issue Sep 6, 2017
supervisor: Clear previous worker pids when receives kill signals. ref #1674
@repeatedly
Member

repeatedly commented Sep 8, 2017

Built new fluentd images with the above two patches.
Try it!

@mhietamaki
Author

Working as expected, good job!
