Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancelling a job doesn't run post hooks or upload artifacts #119

Closed
keithpitt opened this issue Mar 27, 2015 · 12 comments
Closed

Cancelling a job doesn't run post hooks or upload artifacts #119

keithpitt opened this issue Mar 27, 2015 · 12 comments

Comments

@keithpitt
Copy link
Contributor

When we cancel a job, we send a KILL signal to the bootstrap.sh process. This is troublesome because it doesn't allow us to upload artifacts, or run any post-script hooks. So, the idea I have is to send the kill signals to the underlying script process, not the bootstrap one.

To do this, first we need to find out the PID, I did some bash and came up with this:

#bootstrap.sh

./script.sh &
PID=$!
echo "(PID is: $PID) Waiting for it..."
wait $PID
echo "Run the hooks here"
#script.sh

function say-goodbye {
  echo "Goodbye!"
}

trap say-goodbye EXIT
SLEEP=15

echo "This is my current PID: $$"
echo "And now I'm sleeping for $SLEEP seconds"
sleep $SLEEP

Now the problem is communicating the PID back to the agent, but I think this can be done with buildkite-agent meta-data or something. Once we have that PID, the canceller just kills the process PID, not the boostrap one. And all the after script tasks + cleanup should all "just work"...I think!

@graemej
Copy link

graemej commented Apr 2, 2015

fyi: SIGTERM might be a better choice of signal so that it's catchable. In our usecase we're launching Docker containers as we have a few custom volume mounts and a SIGKILL doesn't reliably disconnect the docker run process from the process owned by the docker daemon when cancelling.

@toolmantim
Copy link
Contributor

@graemej we do in fact send a SIGTERM (see

j.process.Kill()
and
err := p.signal(syscall.SIGTERM)
). We only send a SIGKILL if it's still alive 10 seconds after sending a SIGTERM.

This issue is about having the agent kill the process that's invoked by the bootstrap.sh, rather than the bootstrap.sh process itself.

@toolmantim
Copy link
Contributor

@graemej I've seen that with docker though! Especially if you don't use the [] syntax in the CMD of the Docker file, because the docker run command will be running using bash eval.

@lox
Copy link
Contributor

lox commented May 31, 2017

This should be fixed now!

@lox lox closed this as completed May 31, 2017
@keithpitt
Copy link
Contributor Author

keithpitt commented May 31, 2017

@lox oh right! Since the bootstrap swallows signals and just forwards them onto the child process, it means the bootstrap has a chance to do all the things?

@lox
Copy link
Contributor

lox commented May 31, 2017

Urgh, y'know, I mis-read this. I was actually looking for specifically this bug report to mention that pre-exit needs to be added to the list of things that don't run after cancel.

@lox lox reopened this May 31, 2017
@lox
Copy link
Contributor

lox commented May 31, 2017

As it stands, cancel will send a SIGTERM to the bootstrap, which will send it to any sub-process-groups it has created and then exit. We need to add some handling to run specific hooks in the bootstrap post-killing of it's subprocesses but before it exits itself.

Which hooks should run? I'm kind of tempted to keep the current behaviour given it's been the status quo for so long. Perhaps we should add a pre-cancel and post-cancel hook?

@lox lox added the enhancement label Nov 4, 2017
@lox lox added this to the v3.0.0 milestone Nov 4, 2017
@lox
Copy link
Contributor

lox commented Jan 16, 2018

I think unfortunately this one is going to be bumped to 3.1.0.

@lox lox modified the milestones: v3.0.0, v3.1.0 Jan 16, 2018
@keithpitt keithpitt modified the milestones: v3.1.0, v4.0.0 Mar 13, 2018
@ticky
Copy link
Contributor

ticky commented Mar 21, 2018

Just going to put this here because it might be related: do we have any plans to make the force-kill timeout user-configurable? It seems that ten seconds is not enough time for some applications to clean up after themselves.

@amitsaha
Copy link
Contributor

As it stands, cancel will send a SIGTERM to the bootstrap, which will send it to any sub-process-groups it has created and then exit. We need to add some handling to run specific hooks in the bootstrap post-killing of it's subprocesses but before it exits itself.

@lox it will be useful to consider windows OS when discussing this, since on Windows:

  • No concept of signals
  • AFAIK, no idea of process groups

Hence, I think issue #794 may be worth considering.

Which hooks should run? I'm kind of tempted to keep the current behaviour given it's been the status quo for so long. Perhaps we should add a pre-cancel and post-cancel hook?

These separate hooks would be very useful to have. This in addition to #794 would improve the experience on Windows greatly.

@lox
Copy link
Contributor

lox commented Jul 15, 2018

It turns out that we do run pre-exit after cancellation! We added a test recently in #789. It looks like it's not acting as hoped under windows though.

@lox
Copy link
Contributor

lox commented Jan 9, 2019

It's now acting correctly under windows in #879.

@lox lox closed this as completed Jan 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants