Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluentd containers do not stop gracefully #92

Closed
ghost opened this issue Aug 1, 2017 · 4 comments
Closed

Fluentd containers do not stop gracefully #92

ghost opened this issue Aug 1, 2017 · 4 comments

Comments

@ghost
Copy link

ghost commented Aug 1, 2017

When a container using any of the images v0.12.39, v0.12.39-debian, v0.14.20-debian and v0.14.20-debian is stopped, the fluentd process inside receives a SIGKILL. It does not terminate gracefully. It does not wait for buffers to flush. When stopped fluentd should receive SIGTERM, terminate gracefully, flush it's buffers and kick a goal.

Today I set out to verify that when I configure flush_at_shutdown true, I can see that my buffer was flushed. I was not able to observe this! Why?

Q. What does the container's process tree look like?

Let's create a simple fluentd container:

# docker run -d --name fluentd fluent/fluentd:v0.12.39

And ask it to print it's process tree:

# docker exec fluentd pstree -p
entrypoint.sh(1)---sh(7)---sh(15)---fluentd(16)---fluentd(18)

There are two sh processes between PID 1 and fluentd. Maybe SIGTERM is not being forwarded correctly.

Q. What signals are processes in our container getting when we stop it?

Let's start the container again:

# docker start fluentd

And list the processes inside:

# docker top fluentd
PID                 USER                TIME                COMMAND
24847               root                0:00                {entrypoint.sh} /usr/bin/dumb-init /bin/sh /bin/entrypoint...
24870               root                0:00                /bin/sh /bin/entrypoint.sh /bin/sh -c fluentd -c ...
24879               dockrema            0:00                /bin/sh -c fluentd -c /fluentd/etc/${FLUENTD_CONF} ...
24880               dockrema            0:00                {fluentd} /usr/bin/ruby /usr/bin/fluentd -c ...
24882               dockrema            0:00                {fluentd} /usr/bin/ruby /usr/bin/fluentd -c ...

Now we'll trace signals for all the processes in the container:

/ # strace -e trace=signal $(docker top fluentd | tail -n +2 | awk ' { x=x" -p "
$1 } END { print x }') 2>&1 | grep "+++"

Let's stop the container again:

# docker stop fluentd

Now we can inspect the strace output:

[pid 24870] +++ killed by SIGTERM +++
[pid 24879] +++ killed by SIGKILL +++
[pid 24880] +++ killed by SIGKILL +++
[pid 24882] +++ killed by SIGKILL +++
+++ exited with 143 +++

We can see that most of the processes in the container actually get a SIGKILL when we stop the container. Most importantly, the fluentd processes received SIGKILL.

Q. Why are the fluentd processes receiving SIGKILL?

It looks like dumb-init is forwarding on the SIGTERM, but the shell scripts it has forked have not. I guess it's because the shell scripts weren't made to forward signals like dumb-init was.

The first shell script is /bin/entrypoint.sh. It doesn't contain any signal handling code.

The second shell script comes from the container image's command:

# docker inspect fluentd | jq .[0].Config.Cmd
[
  "/bin/sh",
  "-c",
  "fluentd -c /fluentd/etc/${FLUENTD_CONF} -p /fluentd/plugins $FLUENTD_OPT"
]

This doesn't have any signal handling code either.

Both of these shell scripts seem to only needed when bootstrapping the fluentd process, so it would be fine for them to exec and be replaced by their children. This is what dumb-init's README recommends.

Q. What other images does this issue affect?

fluent/fluentd:v0.12.39-debian confirmed:

# docker run -d --name fluentd fluent/fluentd:v0.12.39-debian
53b48119d744a03a841af1f21907ed3765ab2c9f5719e01e2195e70f2c813c0a
# strace -e trace=signal $(docker top fluentd | tail -n +2 | awk ' { x=x" -p "
$1 } END { print x }') 2>&1 | grep "+++" > /tmp/trace &
# docker stop fluentd
fluentd
[1]+  Done                       strace -e trace=signal $(...) 2>&1 | grep "+++" 1>/tmp/trace
# cat /tmp/trace 
[pid 26332] +++ killed by SIGTERM +++
[pid 26322] +++ killed by SIGTERM +++
[pid 26336] +++ killed by SIGKILL +++
[pid 26338] +++ killed by SIGKILL +++
+++ exited with 143 +++

fluent/fluentd:v0.14.20-debian confirmed:

# docker run -d --name fluentd fluent/fluentd:v0.14.20-debian
fluentd
# strace -e trace=signal $(docker top fluentd | tail -n +2 | awk ' { x=x" -p "
$1 } END { print x }') 2>&1 | grep "+++" > /tmp/trace &
# docker stop fluentd
fluentd
[1]+  Done                       strace -e trace=signal $(...) 2>&1 | grep "+++" 1>/tmp/trace
# cat /tmp/trace 
[pid 25941] +++ killed by SIGTERM +++
[pid 25924] +++ killed by SIGTERM +++
[pid 25945] +++ killed by SIGKILL +++
[pid 25949] +++ killed by SIGKILL +++
+++ exited with 143 +++

fluent/fluentd:v0.14.20 confirmed:

# docker run -d --name fluentd fluent/fluentd:v0.14.20
c156be46a24a081f87d384c9590c3350171ff8bba5747d7ddb104d42118fa22b
# strace -e trace=signal $(docker top fluentd | tail -n +2 | awk ' { x=x" -p "
$1 } END { print x }') 2>&1 | grep "+++" > /tmp/trace &
# docker stop fluentd
fluentd
[1]+  Done                       strace -e trace=signal $(...) 2>&1 | grep "+++" 1>/tmp/trace
# cat /tmp/trace 
[pid 26112] +++ killed by SIGTERM +++
[pid 26105] +++ killed by SIGTERM +++
[pid 26113] +++ killed by SIGKILL +++
[pid 26117] +++ killed by SIGKILL +++
+++ exited with 143 +++
@ghost ghost mentioned this issue Aug 1, 2017
@ghost
Copy link
Author

ghost commented Aug 4, 2017

Seems related to #86.

@ghost
Copy link
Author

ghost commented Aug 21, 2017

Thank you for fixing this!

I see that build b7ay9f2auephqunq7z2bprz of v0.14-debian failed.

It looks as though the other images are fixed. I spot checked v0.12 and it looked good:

# docker run -d --name fluentd fluent/fluentd:v0.12@sha256:52dae638b897a1beac3b3e370c56cad3560367a2b5f65d4cad9c438db795358a
eac0340051ce05151434abcded582371b8325ac2ec574d4d4391e2ef7c84ac40
# strace -e trace=signal $(docker top fluentd | tail -n +2 | awk ' { x=x" -p " $2 } END { print x }') 2>&1 | grep "+++" > /tmp/trace &
[1] 17177 17178
# docker stop fluentd                                                                                                                      
[1]  + 17177 done       sudo strace -e trace=signal  2>&1 | 
       17178 done       grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn} "+++" > /tmp/trace
fluentd
# cat /tmp/trace 
[pid 17082] +++ exited with 0 +++
[pid 17065] +++ exited with 0 +++
+++ exited with 0 +++

@repeatedly
Copy link
Member

It seems the problem is installing gosu.

gpg: keybox '/tmp/tmp.KJIXxDfDhJ/pubring.kbx' created
�[0m
�[91mgpg: keyserver receive failed: No name
�[0m
Removing intermediate container 9820e6727252

But I'm not sure why only v0.14-debian / v0.14-debian-onbuild have this problem...

@repeatedly
Copy link
Member

repeatedly commented Sep 8, 2017

Rebuild images until success resolves build error. Closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant