Add '--pid' to the singularity command #4

Open
lgorenstein opened this issue Mar 8, 2021 · 5 comments

@lgorenstein
Contributor

I noticed that sometimes, if the application run is interrupted (e.g. Ctrl-C'd), it leaves behind some of its processes (mpiexec.hydra, python, etc.). I discovered this while playing with a Singularity image of the NGC GAMESS container, but it is definitely not limited to it.

Here's a simple reproduction using the RAPIDS AI interactive example:

$ module use ngc-container-environment-modules
$ module load rapidsai/0.17
$ jupyter notebook --ip 0.0.0.0 --no-browser --notebook-dir /rapids/notebooks
   .... Jupyter starts ....
   .... I can open the browser, use the notebook, everything's great ....

Now if I hit Ctrl-C, everything shuts down as expected and I get my prompt back.
But there are ghosts left behind:

$ ps uxww | grep '[c]onda'
lev      190040  0.5  0.0 2675596 86400 pts/105 S    20:09   0:02 /opt/conda/envs/rapids/bin/python3.7 /opt/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=

Changing the container_launch definition to ..... run --nv --pid ......
fixes the problem and eliminates the ghost processes. Our Singularity is 3.6.4; I'm not sure how this plays out on other versions.

@samcmill
Contributor

Can you please submit an issue for this in the Singularity GitHub? I don't think --pid should be required to clean up after a SIGINT.

--pid can also have some undesired side effects; for instance, it breaks NCCL, which is used by the DL containers. So I'd rather see this fixed in Singularity than add this workaround here.

@lgorenstein
Contributor Author

Sure, submitted.

@lgorenstein
Contributor Author

Scott, please see this comment: apptainer/singularity#5884 (comment)

Looks like --pid is indeed needed because of the way the container starts jupyter-lab with nohup ... &.
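
To illustrate the general mechanism (not the container's actual entrypoint), here is a minimal sketch showing that a process started with nohup ... & outlives the shell that launched it. The `sleep 30` command is just a stand-in for jupyter-lab, and the variable names are made up for this example:

```shell
# Start a stand-in for the backgrounded service from a short-lived child shell.
# Output is redirected so the command substitution does not wait on the background job.
pid=$(sh -c 'nohup sleep 30 >/dev/null 2>&1 & echo $!')

# The launching shell has already exited at this point; give things a moment to settle.
sleep 1

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    status="still running"
else
    status="gone"
fi
echo "background process $pid is $status"

# Clean up the stand-in process so the demo itself leaves no ghost behind.
kill "$pid" 2>/dev/null
```

Without a separate PID namespace there is nothing that tears such a detached process down when the launcher goes away, which is consistent with --pid making the ghosts disappear.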

Still leaves a question of why I saw mpiexec.hydra's... Might be worth adding both TINI_SUBREAPER=1 and TINI_KILL_PROCESS_GROUP=1 to the modules for all containers that use tini.
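
A minimal sketch of what setting those could look like from the host side, assuming the variables are forwarded into the container through Singularity's SINGULARITYENV_ prefix. This is untested against these modules, and the modules themselves may prefer to set the variables another way:

```shell
# Assumption: Singularity passes SINGULARITYENV_<NAME> into the container as <NAME>.

# Ask tini to register as a child subreaper, so orphaned descendants are re-parented to it.
export SINGULARITYENV_TINI_SUBREAPER=1

# Ask tini to send signals to the child's whole process group, not only to the child itself.
export SINGULARITYENV_TINI_KILL_PROCESS_GROUP=1
```

Whether this is enough to catch a process detached with nohup ... & is a separate question; it only affects containers whose entrypoint actually runs tini.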

@samcmill
Contributor

To my knowledge, only the Rapids container uses tini. It seems like --pid may be appropriate there (although I'm still concerned about NCCL), but I'm not sure if it should be applied globally?

@lgorenstein
Contributor Author

That's fair. I played with a couple of other containers and they don't seem to be affected. The one exception is the GAMESS-17 container, but a) it's a bit of a problem child, and b) that is why I have kept '--pid' there.
