-
Notifications
You must be signed in to change notification settings - Fork 6
Recent agent changes broke postgres CI on windows #218
Comments
Sorry about that! There were some changes in #217 related to how scripts are executed and maybe this particular line might affected your build. We'll try to investigate. In the meantime you can add |
Yea, I was guessing it might be related, and staring at the code trying to figure out what it's trying to do ;) It'd be good if stuff like killing processes were noted in the debug log.
Ah, neat. |
Ah, found it, I think: On windows it looks like you create a job for each script: https://github.com/cirruslabs/cirrus-ci-agent/blob/master/internal/executor/shellcommands_windows.go#L39 In contrast to before now the entire job is killed, which includes all child processes. Basically undoing #179 afaics? |
And because the job is created without JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK, we can't even escape the job easily... |
Will be reverted once fixed on cirrus's side. See also cirruslabs/cirrus-ci-agent#218 Discussion: https://postgr.es/m/CA+hUKGKx7k14n2nAALSvv6M_AB6oHasNBA65X6Dvo8hwfi9y0A@mail.gmail.com
Hi Andres! I'm sorry that these changes have broke your build. However, I don't currently see any other way around this, as we can't guarantee both (1) By the way, the underlying issue for these fixes we've had to implement in the agent is explained in golang/go#23019. Have you considered switching to background script instruction instead? |
The problem isn't caused by killing the direct child processes. They're all dead already. The problem is that on windows you're killing the entire job, including children of child processes, and you're not allowing commands to "escape" the job (see comment about JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK above). Note that on !windows you're just killing the process group, rather than something like a job (which'd be more equivalent to a cgroup). You actually can create child processes as a process group on windows too. Have you considered creating the shell with CREATE_NEW_PROCESS_GROUP and then killing that process group, rather than a job?
Yes, doesn't work without adding a fair bit of added complication / fragility. The command only returns when we're ready to execute the next steps. |
Hi Andres! Sorry for the late reply. I've managed to reproduce your issue and I think that I understand the problem you're facing better now. I understand that in your case the the simplest solution would be to use a blocking The problem with this is that it is not ideal for non-ephemeral environments with Persistent Worker, where a single compute environment is re-used across build runs. To me, it still seems that the proper solution would be to call However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. cache instruction's |
Hi, Thanks for working on this.
That makes sense - but the !windows case currently isn't protected against that at all? Process groups are obviously easy to "escape" from, and a lot of software does so (including pg_ctl). Seems that for windows the right approach would be to have one job across the whole testrun and kill that at the end? For linux you'd presumably need a cgroup. No idea for BSDs or macOS.
That'd kind of work, but is a fair bit more complicated than desirable :/. Besides needing to loop etc, there's an added complications: Currently the code to drop admin rights etc is in pg_ctl, which will return after starting postgres. The daemon itself refuses to run with admin rights. We don't currently have tools to wait until the server is shut down. Partially that's something we should improve in our tooling, admittedly...
Sounds good. |
A run for a branch of postgres that doesn't specify the agent version ended up using "agent version 1.74.0-b55ebe7", which contains #220 afaics, still fails: https://cirrus-ci.com/task/6021634623537152 Reading through later iterations of that change, I'd have to set CIRRUS_ESCAPING_PROCESSES? Would be nice to document that, if so. I still don't understand the equivalence between killing jobs on windows and killing process groups everywhere else. Process groups can be escaped (and are by typical daemons) with a single setsid(), jobs not. |
You're right: the decision to use the jobs on Windows was due to convenience, because the process group implementation is lacking (see nodejs/node#3617). After, I've rationalized this as a way towards better resource controls, but since it's not possible to use cgroups conveniently across different UNIX-like OSes, the full process controls are currently lacking.
Unfortunately this wouldn't work, because we need to ensure that there are no processes holding their end of our pipe FD at the end of each instructions, otherwise we'd wait for the pipe to EOF indefinitely. This pipe is used to collect the standard output/error and is crucial for cache fingerprint scripts, for example (because missing output will result in a different hash). |
That's right. Docs on the way: cirruslabs/cirrus-ci-docs#984. BTW, regarding Have you meant And as for the |
Just FTR: CIRRUS_ESCAPING_PROCESSES did fix the issue. |
Hi,
Something between agent version 1.73.2-f0bbc44 and 1.73.6-6fc7e4b broke our windows tests.
Successful run:
https://cirrus-ci.com/task/5737468287778816
Failed rerun:
https://cirrus-ci.com/task/5239235262283776
It looks like processes started in one script are now terminated. It might be that postgres doesn't do the right magic dance to fully detach from the starting console on windows - but it's not obvious what would be required, we've experimented with different CreateProcess() flags without success so far.
Regards,
Andres
The text was updated successfully, but these errors were encountered: