
fluentd gives unexpected error when sending SIGHUP signal #1674

Closed
mhietamaki opened this issue Aug 29, 2017 · 7 comments

@mhietamaki

The following scenario was tested with the official 0.14.19 and 0.14.20 fluentd Docker images.

Description:
We have used the SIGHUP signal to reload the configuration each time it changes. This worked perfectly in our use case until 0.14.20. Apparently something changed in 0.14.20, because we now get the following error each time we send SIGHUP to the fluentd process.

2017-08-29 06:16:25 +0000 [error]: #0 unexpected error error_class=SignalException error="SIGHUP"
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/engine.rb:228:in `sleep'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/engine.rb:228:in `run'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:772:in `run_engine'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:521:in `block in run_worker'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:697:in `main_process'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/supervisor.rb:516:in `run_worker'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.20/bin/fluentd:5:in `<top (required)>'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
  2017-08-29 06:16:25 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'

How to reproduce:

docker run -d --name signal_test fluent/fluentd:v0.14.20
docker kill -s "SIGHUP" signal_test
docker logs signal_test

Complete log

@repeatedly
Member

We recently changed the signal handling of the Docker image: fluent/fluentd-docker-image#93

Could you check whether this is the cause?
We didn't change fluentd's internal signal handling recently.

@kkurten

kkurten commented Aug 31, 2017

Updated with additional information
I did some further testing, and dumb-init (which is configured in entrypoint.sh) by default sends signals to the entire process group. What happens is that SIGHUP is first sent to the fluentd worker process, which causes engine.rb to rescue and re-raise the SignalException. After that, the supervisor process receives the SIGHUP signal and tries to kill the worker. dumb-init can be configured to send the signal only to the direct child; see the "session behavior" section in https://github.com/Yelp/dumb-init/blob/master/README.md.

Note: Docker 1.13+ has native support for Tini (https://github.com/krallin/tini), which provides essentially the same functionality as dumb-init. It can be enabled by passing the --init flag to the docker run command.
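To illustrate the process-group delivery described above, here is a small Ruby sketch (hypothetical, not fluentd or dumb-init code): a child is spawned as the leader of a new process group, and signaling the negative of the group ID delivers the signal to every member of that group, which is what dumb-init does by default.

```ruby
# Hypothetical illustration of process-group signal delivery,
# dumb-init's default mode. Not actual fluentd code.
pid = Process.spawn("sleep", "60", pgroup: true) # child leads a new group

# A negative PID targets the entire process group, so every member
# (here just the one child) receives the signal at once.
Process.kill(:TERM, -pid)

Process.wait(pid)
status = $?
```

With --single-child (or DUMB_INIT_SETSID=0), dumb-init instead signals only the direct child, i.e. the equivalent of `Process.kill(:TERM, pid)`.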


I did some quick debugging to find the root cause of the issue (I work on the same team as mikkohoo). dumb-init sends the SIGHUP signal correctly to the fluentd supervisor, so that part seems to work fine. However, I noticed a couple of issues in the signal-handling code that seem to cause the problem.

  1. ~~https://github.com/fluent/fluentd/blob/master/lib/fluent/engine.rb#L231 was changed at some point to catch all Exceptions. The problem is that it also catches the SignalException, so the worker is not shut down properly.~~ This is caused by dumb-init (see above); fluentd is working correctly.
  2. When the Fluentd supervisor receives SIGHUP, the kill_worker function (https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L174) is called and it kills all worker processes. However, when a worker is killed, its PID is not removed from the config[:worker_pid] array. When you send another SIGHUP, it always fails because the supervisor tries to kill a worker process that no longer exists, and you get an "Unexpected error No such process" error.

Note that I'm not familiar with the Fluentd codebase and I also rarely do any Ruby programming, so take the above with a grain of salt. :)
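The failure mode in point 2 can be sketched in a few lines of Ruby (hypothetical code, not the actual supervisor): signaling a PID that has already exited raises Errno::ESRCH, which is exactly the "No such process" error in the log.

```ruby
# Hypothetical sketch of the stale-PID problem from point 2 above.
pid = Process.spawn("true") # short-lived stand-in for a worker
Process.wait(pid)           # the worker has exited and been reaped

# A second SIGHUP would make the supervisor signal this stale PID,
# because it was never removed from the tracked array:
result =
  begin
    Process.kill(:HUP, pid)
    :killed
  rescue Errno::ESRCH # "No such process"
    :no_such_process
  end
```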

@repeatedly
Member

I see. Thanks for further investigation.
So does adding --single-child to this line fix the problem?
https://github.com/fluent/fluentd-docker-image/blob/943ce25fa978a8933947fe2c62b89bf9dc461d17/entrypoint.sh.erb#L2

@kkurten

kkurten commented Sep 1, 2017

I think the entrypoint script will break if the parameter is added to the shebang line, because parameter handling in shebang lines is a bit odd. It would probably be cleaner to add ENV DUMB_INIT_SETSID=0 to the Dockerfile; this has the same effect as adding the --single-child parameter to the shebang.

This change would fix the dumb-init issue, but not the config[:worker_pid] cleanup issue in the supervisor implementation. That one is easy to reproduce by sending SIGHUP to Fluentd twice: you should always get the "Unexpected error No such process" error after the second SIGHUP.

Older Fluentd versions (such as 0.14.15) probably work because config[:worker_pid] was a single value, not an array. When the supervisor received SIGHUP, it killed the old worker and overwrote config[:worker_pid] with the new worker's PID. In commit ed5f0cb, config[:worker_pid] was changed to an array, but nothing seems to remove a worker's PID from the array when the worker is killed.
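The missing cleanup step can be sketched as follows (hypothetical code, not the actual patch): once the workers have been killed, the tracked PIDs should be dropped so a later SIGHUP never signals stale processes.

```ruby
# Hypothetical sketch of the fix direction: forget killed PIDs so a
# second SIGHUP does not try to signal processes that no longer exist.
def kill_workers(pids)
  pids.each do |pid|
    begin
      Process.kill(:TERM, pid)
      Process.wait(pid) # reap the worker
    rescue Errno::ESRCH, Errno::ECHILD
      # worker already gone; nothing to do
    end
  end
  pids.clear # the missing step: clear the array after killing
end

# worker_pids stands in for config[:worker_pid]
worker_pids = 2.times.map { Process.spawn("sleep", "60") }
kill_workers(worker_pids)
```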

@repeatedly
Member

Patch for this issue:

repeatedly added a commit that referenced this issue Sep 6, 2017
supervisor: Clear previous worker pids when receives kill signals. ref #1674
@repeatedly
Member

repeatedly commented Sep 8, 2017

Built new fluentd images with the above two patches.
Try it!

@mhietamaki
Author

Working as expected, good job!
