setpgid(0,0) fails if dcron is process group leader #13

Open
ghost opened this issue Aug 28, 2016 · 11 comments

@ghost

ghost commented Aug 28, 2016

Using s6-rc, dcron dies because setpgid(0,0) fails.

My s6-rc/dcron/ directory has the following structure

$ ls
dependencies
pipeline-name
producer-for
run
type
$ cat ./producer-for
dcron-log
$ cat ./pipeline-name
dcron-pipeline
$ cat ./dependencies
fsck
$ cat ./type
longrun
$ cat ./run
#!/bin/execlineb -P
fdmove -c 2 1
exec -c
/usr/sbin/crond -M /bin/true -f

Running s6-rc -u change dcron results in dcron-log capturing the
following:

setpgid: Operation not permitted

However, running sudo /usr/sbin/crond -M /bin/true -f in a TTY works
just fine. After discussion with the s6 folks, it appears this is because
s6 makes the service it supervises the session and group leader, and
indeed setpgid(2) will fail with EPERM if the process is currently a
session leader.

I'm not very familiar with the details of why dcron creates a new
group, but from the comments, it looks like EPERM isn't necessarily
an error condition, since dcron is already in the state it wants to
change to. Replacing the relevant code (around main.c:272) with

if (getsid(0) != getpid()) {
        if (setpgid(0,0)) {
                perror("setpgid");
                exit(1);
        }
}       

might do what's desired (or perhaps getpgrp() in place of getsid(0),
depending on exactly what dcron needs?). I'm unsure enough of the inner
workings of dcron that I'm not submitting that as a pull request, however.
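
The session-leader explanation can be checked in isolation. The following is a minimal sketch in C, not code from dcron: setpgid_result and check_setpgid_cases are hypothetical names introduced here. It forks a child, optionally calls setsid() in it, and reports what setpgid(0,0) returns there. It assumes a POSIX system where the errno value fits in an exit status.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, optionally make it a session
 * leader, and call setpgid(0,0) in it. Returns 0 if setpgid succeeded,
 * the errno value if it failed, or -1 if the experiment could not run.
 */
int setpgid_result(int become_session_leader)
{
        pid_t child = fork();
        int status;

        if (child < 0)
                return -1;
        if (child == 0) {
                /* a fresh child is not a group leader, so setsid() is allowed */
                if (become_session_leader && setsid() < 0)
                        _exit(255);
                /* pass the outcome back through the exit status */
                _exit(setpgid(0, 0) == 0 ? 0 : errno);
        }
        if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}

/* The two cases from the report, written as checks. */
void check_setpgid_cases(void)
{
        assert(setpgid_result(1) == EPERM);     /* session leader: refused */
        assert(setpgid_result(0) == 0);         /* ordinary child: allowed */
}
```

If both checks pass, the failure depends on the process being a session leader when it calls setpgid(0,0), which matches the s6 case described above.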

@dubiousjim
Owner

Hi, thanks for this report. Offhand, your diagnosis sounds right to me.
But I'll do more research. I'm afraid I won't be able to get to this
right away, as I'm overwhelmed with other responsibilities at the
moment. But I hope to get to it sometime this fall.

@dcharbonnier

This seems important for running dcron in a Docker container: exec cron -f fails, so crond can't become PID 1 via an entrypoint script.

@frjaraur

frjaraur commented Feb 8, 2017

Same problem here ;)

@LordVeovis

LordVeovis commented Apr 9, 2017

Same problem with alpine linux 3.5.2 and crond 4.5
Alpine Linux is a very light distribution based on busybox, musl and openrc.

I have observed the following:
If crond is launched without the -f flag, it backgrounds itself. OpenRC loses track of crond's pid, so rc-status shows crond as crashed.
If crond is launched with the -f flag, start-stop-daemon can grab the correct pid and puts the daemon into the background, but crond crashes.

I straced it and reached the same conclusion about the fatal setpgid:

apu:~# strace -f start-stop-daemon -v --start --background --make-pidfile --pidfile /var/run/crond.pid --exec /usr/sbin/crond -- -c /etc/crontabs -f
[...uninteresting stuff...]
close(9)                                = -1 EBADF (Bad file descriptor)
close(8)                                = -1 EBADF (Bad file descriptor)
close(7)                                = -1 EBADF (Bad file descriptor)
close(6)                                = -1 EBADF (Bad file descriptor)
close(5)                                = -1 EBADF (Bad file descriptor)
close(4)                                = 0
close(3)                                = -1 EBADF (Bad file descriptor)
setsid()                                = 7011
execve("/usr/sbin/crond", ["/usr/sbin/crond", "-c", "/etc/crontabs", "-f"], [/* 19 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x7d500e02bb48) = 0
set_tid_address(0x7d500e02bb80)         = 7011
mprotect(0x7d500e028000, 4096, PROT_READ) = 0
mprotect(0xa38e79f1000, 4096, PROT_READ) = 0
getuid()                                = 0
close(0)                                = 0
close(1)                                = 0
open("/dev/null", O_RDWR)               = 0
dup2(0, 0)                              = 0
dup2(0, 1)                              = 1
mkdir("/tmp/cron.FmJlEk", 0700)         = 0
chmod("/tmp/cron.FmJlEk", 0755)         = 0
setpgid(0, 0)                           = -1 EPERM (Operation not permitted)
writev(2, [{iov_base="", iov_len=0}, {iov_base="setpgid", iov_len=7}], 2) = 7
writev(2, [{iov_base="", iov_len=0}, {iov_base=":", iov_len=1}], 2) = 1
writev(2, [{iov_base="", iov_len=0}, {iov_base=" ", iov_len=1}], 2) = 1
writev(2, [{iov_base="", iov_len=0}, {iov_base="Operation not permitted", iov_len=23}], 2) = 23
writev(2, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
exit_group(1)                           = ?

I don't understand the purpose of setpgid/getpgid, nor why dcron needs it, so no patch from me.
In my case, I just needed the correct pid reported to OpenRC, so I made a very dirty hack to the init.d file by adding a start_post function:

#!/sbin/openrc-run

name="busybox $SVCNAME"
command="/usr/sbin/$SVCNAME"
pidfile="/var/run/$SVCNAME.pid"
command_args="$CRON_OPTS"
#command_background=yes
#start_stop_daemon_args="-b -m"

depend() {
        need localmount
        need logger
}

start_post() {
        local pids pid ppid

        pids=`pidof $SVCNAME`
        pid=0

        for pid in $pids; do
                ppid=`grep '^PPid:' /proc/$pid/status | grep -o '[0-9]*'`

                if [ "$ppid" = '1' ]; then
                        echo "$pid" > $pidfile
                        return 0;
                fi
        done
}

@wonderbeyond

wonderbeyond commented Oct 30, 2017

A docker ENTRYPOINT script like below can get around this problem:

#!/bin/sh
set -e

# see: https://github.com/dubiousjim/dcron/issues/13
# deliberately not using `exec` for `dcron`, so that it gets a pid other than `1`
# exec "$@"
"$@"

@loreb

loreb commented Dec 19, 2018

Imho it should just specialcase EPERM.
From https://pubs.opengroup.org/onlinepubs/9699919799/functions/setpgid.html, EPERM can happen only if:

[EPERM]
The process indicated by the pid argument is a session leader.
Our case, which is ok - we already are a session leader, it's like the proverbial robot trying to open a door that's already open.

[EPERM]
The value of the pid argument matches the process ID of a child process of the calling process and the child process is not in the same session as the calling process.
(0,0) is treated specially, this can't happen ("As a special case, if pid is 0, the process ID of the calling process shall be used. Also, if pgid is 0, the process ID of the indicated process shall be used.")

[EPERM]
The value of the pgid argument is valid but does not match the process ID of the process indicated by the pid argument and there is no process with a process group ID that matches the value of the pgid argument in the same session as the calling process.
(0,0) same as above
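
Given that only the first EPERM case can apply to a (0,0) call, a stricter variant than ignoring EPERM outright is to confirm the session-leader condition before accepting the error. The following is a minimal sketch under that assumption, combining the getsid(0) check proposed earlier in this thread with the EPERM special case. ensure_own_process_group is a hypothetical helper name, not a function in dcron.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of dcron: put the calling process in
 * its own process group. Returns 0 on success, -1 with errno set.
 */
int ensure_own_process_group(void)
{
        int saved_errno;

        if (setpgid(0, 0) != 0) {
                saved_errno = errno;
                /*
                 * Tolerate EPERM only when we really are the session
                 * leader; a session leader already leads its own group.
                 */
                if (saved_errno != EPERM || getsid(0) != getpid()) {
                        errno = saved_errno;
                        return -1;
                }
        }
        /* either way, we should now be our own process group leader */
        assert(getpgrp() == getpid());
        return 0;
}
```

A caller would keep the existing perror("setpgid") and exit(1) path for the -1 case, so any other failure is still reported as before.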

@powerman

In short: dcron works if it's started as the container's PID 1, works if it's started as a non-PID 1 process, but fails if some other process is started as PID 1 and then execs into dcron.

Easy workaround: start container using docker's init as PID1 (docker run --init or init: true in docker-compose.yml).
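
Both workarounds in this thread (an init as PID 1, or an entrypoint script that avoids exec) have the same effect: crond runs as a child of some other process instead of taking over a process that may already be a session leader. A minimal sketch of that idea in C follows. run_as_child is a hypothetical name, and this is not how docker's init is implemented; in particular it does no signal forwarding and reaps only its own child.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical launcher: run argv[0] as a child process and wait for
 * it. The child shares our session but is not its leader, so a later
 * setpgid(0,0) inside the child is permitted.
 * Returns the child's exit status, or 127 if it could not be started
 * or did not exit normally.
 */
int run_as_child(char *const argv[])
{
        pid_t child;
        int status;

        assert(argv != NULL && argv[0] != NULL);

        child = fork();
        if (child < 0)
                return 127;
        if (child == 0) {
                execvp(argv[0], argv);
                _exit(127);     /* only reached if exec failed */
        }
        if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
                return 127;
        return WEXITSTATUS(status);
}
```

For example, passing an argv of /usr/sbin/crond with -f would start crond in the foreground as a child of the launcher, much like the non-exec entrypoint script does with the shell.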

@njdoyle

njdoyle commented Jul 17, 2023

Having run into this myself, I figured I'd share the patch I use in case anyone else runs into this problem in the future and wants a fix. It just special-cases EPERM, since the only way that error is possible with (0,0) input is when we're already the session leader, which is the state we want to be in anyway (going by the analysis in the comments above and the manpage).

diff -Naur dcron-4.5.orig/main.c dcron-4.5.setpgid/main.c
--- dcron-4.5.orig/main.c	2011-05-01 19:00:17.000000000 -0400
+++ dcron-4.5.setpgid/main.c	2023-07-14 13:31:39.470797778 -0400
@@ -270,8 +270,10 @@
 
 		/* stay in existing session, but start a new process group */
 		if (setpgid(0,0)) {
-			perror("setpgid");
-			exit(1);
+			if (EPERM != errno) {
+				perror("setpgid");
+				exit(1);
+			}
 		}
 
 		/* stderr stays open, start SIGHUP ignoring, SIGCHLD handling */

This happens a lot because many init systems and service managers start services as session leaders (for example via setsid(), as in the strace output above); it isn't specifically a PID 1 issue.

@x-yuri

x-yuri commented Jul 6, 2024

I was going to file an issue in the supposedly maintained fork. But the issues there are disabled. I wonder if @ptchinster will notice my message here.

Meanwhile I'll leave the information here:

Dockerfile:

FROM alpine:3.20
RUN apk add git build-base \
    && git clone https://github.com/ptchinster/dcron \
    && cd dcron \
    && make

a.sh:

#!/bin/sh -eu
dcron/crond -f

b.sh:

#!/bin/sh -eu
exec dcron/crond -f

$ docker build -t i .

$ docker run --rm i dcron/crond -f
setpgid: Operation not permitted

$ docker run --rm --init i dcron/crond -f
dcron/crond 4.5 dillon's cron daemon, started with loglevel notice
unable to scan directory /etc/cron.d

// I believe sh exec'es dcron/crond in this case
$ docker run --rm i sh -c 'dcron/crond -f'
setpgid: Operation not permitted

// echo prevents exec
$ docker run --rm i sh -c 'dcron/crond -f; echo test'
dcron/crond 4.5 dillon's cron daemon, started with loglevel notice
unable to scan directory /etc/cron.d

$ docker run --rm -v "$PWD:/app" -w /app i ./a.sh
dcron/crond 4.5 dillon's cron daemon, started with loglevel notice
unable to scan directory /etc/cron.d

$ docker run --rm -v "$PWD:/app" -w /app i ./b.sh
setpgid: Operation not permitted

With the following patch (credit):

patch:

--- main.c
+++ main.c
@@ -270,8 +270,10 @@
 
 		/* stay in existing session, but start a new process group */
 		if (setpgid(0,0)) {
-			perror("setpgid");
-			exit(1);
+			if (errno != EPERM) {
+				perror("setpgid");
+				exit(1);
+			}
 		}
 
 		/* stderr stays open, start SIGHUP ignoring, SIGCHLD handling */

Dockerfile:

FROM alpine:3.20
COPY patch .
RUN apk add git build-base \
    && git clone https://github.com/ptchinster/dcron \
    && cd dcron \
    && patch < /patch \
    && make

all the cases work except for the last one for some reason.

For now running it with --init looks like the best workaround.

Also at some point I tried different alternatives. The program I liked the most is supercronic, considering that I needed to run it in a docker container. See this and the linked answer.

@powerman

powerman commented Jul 6, 2024

The program I liked the most is supercronic, considering that I needed to run it in a docker container.

It looks interesting, but… The problem is that supercronic drops cron's email reporting feature. Sure, it makes sense to use other ways of reporting in Docker (e.g. Prometheus metrics), but cron tasks often use email not just to report failures but also to report success/stats/etc. This means that nowadays they should use something like a Pushgateway to report metrics instead. It would be nice if supercronic provided such a feature for cron tasks as a replacement for the dropped emailing feature. Also, the metrics reported by supercronic about task failures should be configurable by task name (I suppose this isn't supported yet just because the cron format has no special field for a task name); otherwise it would be hard to know which task failed, which results in overhead like "run one supercronic container per cron task".

P.S. Actually, one supercronic container per cron task (or per several tightly related tasks) is probably a good idea anyway in the Docker world. It separates each task's metrics/alerts and also their logs.

@ptchinster

Issues are now open on the fork
