Argocd-reposerver spawns too many git zombie processes in the host node #3611
Comments
I would like to contribute to this issue. Where should I start?
Hi @kanapuli, thanks for your bug report and the interest in contributing! Much appreciated! Please check our documentation at https://argoproj.github.io/argo-cd/developer-guide/contributing/ on how to get started. If you have any questions after reading this document, feel free to ask. Also, if there's something missing or wrongly documented, please let us know! As for this particular problem, I think the root cause might be found in the package we use for executing external commands, at https://github.com/argoproj/pkg (or more specifically, https://github.com/argoproj/pkg/tree/master/exec), but it might also be elsewhere (I haven't dug any deeper yet).
Thanks @jannfis. Let me go through the code and documentation and ask here if I have a question.
Any updates? I've encountered the same error.
I've seen this issue on multiple occasions. We had to reboot k8s nodes because they became unresponsive due to the enormous number of zombie processes. I assume the argocd-reposerver does not properly kill the git process after execution or doesn't wait for the child process to return its exit code.
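To make the "doesn't wait for the child process" hypothesis concrete, here is a minimal Python sketch of how a zombie appears and disappears. It is only an illustration and is not ArgoCD code. The helper names `process_state` and `zombie_demo` are made up for this example, and reading `/proc/<pid>/stat` assumes a Linux host.

```python
import subprocess
import sys
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None.

    Linux-specific assumption: None means the /proc entry does not exist.
    """
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            stat = stat_file.read()
    except OSError:
        return None
    # The state is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def zombie_demo(timeout=5.0):
    """Start a short-lived child, observe it before and after reaping."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])

    # Do NOT call wait() yet: once the child exits it stays in the
    # process table as a zombie ("Z") until its parent collects it.
    deadline = time.monotonic() + timeout
    state = process_state(child.pid)
    while state not in ("Z", None) and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(child.pid)
    before = state

    # wait() collects the exit status, which removes the zombie entry.
    child.wait()
    after = process_state(child.pid)
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

On a typical Linux host the first value should be `Z` and the second `None`. A parent that never reaches the `wait()` step leaves the entry behind for as long as it keeps running.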
Same here on version 1.5.3 (duplicate report: #3694).
Hey guys, just a question while I search for the root cause of this issue and try to reproduce it reliably: Are you making use of Kustomize for (some of) your applications?
No, I just use plain Kubernetes YAML manifests for my application.
Just a heads-up: It seems I can reliably reproduce it now. I'll be digging for the root cause.
I did some more research, and it turns out that this issue only comes to light when run in Kubernetes. This is most likely due to the fact that in a Kubernetes pod, there is no init-like process which reaps terminating processes that do not have a parent process anymore. The issue is not reproducible when running ArgoCD outside a K8s cluster, no matter whether run in a Docker container or not. I took some time to audit the functions used by ArgoCD to spawn external processes, and in fact, we are correctly waiting for any child processes to exit; thus, ArgoCD should not leave unreaped zombie processes behind. But as we can observe, it happens. One can easily reproduce it by creating a Kustomize application, which uses a non-accessible remote base, i.e. a Git repository that requires a different authentication than the repository where the original I think what is going on is the following:
In the UNIX world, orphaned processes will be reassigned to PID 1 as their parent, which is usually I see three possible solutions to solve this problem:
Solution 1 feels like reinventing the wheel, so I did a small PoC using solution 2, adapting ArgoCD's Docker entry point script to execute the Solution 3 seems the most native one; however, it might not be available in early versions of K8s that could still be found in the wild (it became stable with K8s v1.17). I think it was introduced as an alpha feature in v1.10 and has been beta since at least v1.14, from which point it is enabled by default. According to my research, the Kubernetes people adopted Long story short: The workaround for this issue is to set
This was an excellent summary. For option 3, enabling shareProcessNamespace, we would enable the mode with an assumption that the pause container is really a
It sounds like we should either do option 2 or 3 for the repo-server, since it does spawn child processes. But for argocd-server and argocd-application-controller, I might avoid it since we don't spawn child processes AFAIK. I'm in favor of doing 2, since 3 makes assumptions about the Kubernetes version and about the underlying implementation using tini.
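For readers unfamiliar with what an init-like process such as tini does in option 2, the core of it is a loop that collects every exited child. Below is a minimal Python sketch of that idea. It is not tini's implementation, the function names `reap_children` and `spawn_and_reap` are hypothetical, and it only reaps its own direct children. A real init running as PID 1 would also be handed orphans by the kernel and would typically run this loop whenever it receives SIGCHLD.

```python
import os
import time


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped


def spawn_and_reap(exit_codes, timeout=5.0):
    """Fork one child per exit code, then reap them all.

    Returns the collected exit codes in spawn order.
    """
    pids = []
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: exit immediately with the given code
        pids.append(pid)

    collected = {}
    deadline = time.monotonic() + timeout
    while any(p not in collected for p in pids) and time.monotonic() < deadline:
        collected.update(reap_children())
        time.sleep(0.01)
    return [collected.get(p) for p in pids]


if __name__ == "__main__":
    print(spawn_and_reap([0, 1, 2]))
```

Running the repo-server under such a reaper means the reaper, rather than the application, sits at PID 1 and collects whatever gets reparented to it, which is why option 2 does not depend on the Kubernetes version.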
) (#3721) * fix: Reap orphaned ("zombie") processes in argocd-repo-server pod
…466) * fix(argocd): Unconditionally start reposerver with uid_entrypoint.sh While uid_entrypoint.sh contains the OpenShift specific manipulation of /etc/passwd it also starts the reposerver via tini and so ensures that any zombies produced by reposerver and its descendants are collected. This matches the behaviour from the manifests included with the main ArgoCD project. See: * https://github.com/argoproj/argo-cd/blob/f93da5346c3dfe0ec75549fd78b2d30ce7d5cfad/manifests/base/repo-server/argocd-repo-server-deployment.yaml#L24 * argoproj/argo-cd#3721 * argoproj/argo-cd#3611 * chore: Bumping minor semver as this feels like a bit more than a patch change.
We're currently seeing this issue (well, almost: the defunct processes are mostly 'ssh').
I don't have anything special to add, but want to share my experience when debugging a similar problem, completely unrelated to argo-cd (never used it). Our Java application was running as PID 1 in a Docker container and was running git clone commands as subprocesses. A simple command like this would reproduce the problem (if you don't have an SSH key set up for git)
It fails with a permissions error leaving an A similar thing happens with https-style URLs, but in this case, it keeps around It feels like a bug in git where they don't reap child processes if a git clone subprocess terminates with an error.
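The mechanism described here (a helper process outliving the command that started it) can be shown without git. The Python sketch below is only an illustration under stated assumptions: the names `CHILD`, `GRANDCHILD` and `orphan_demo` are invented for this example, the child stands in for a command like `git clone`, and the grandchild stands in for a helper it spawns, such as `ssh`. It is not a claim about how git itself behaves.

```python
import subprocess
import sys
import textwrap

# Grandchild: waits until its original parent is gone, then reports
# the original parent PID and the PID of whoever adopted it.
GRANDCHILD = textwrap.dedent("""
    import os, sys, time
    original = int(sys.argv[1])
    deadline = time.monotonic() + 5
    while os.getppid() == original and time.monotonic() < deadline:
        time.sleep(0.01)
    print(original, os.getppid(), flush=True)
""")

# Child: starts the grandchild and exits straight away without
# waiting for it, which turns the grandchild into an orphan.
CHILD = textwrap.dedent("""
    import os, subprocess, sys
    subprocess.Popen([sys.executable, "-c", sys.argv[1], str(os.getpid())])
""")


def orphan_demo():
    """Return (original_parent_pid, adopting_parent_pid) of the orphan."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, GRANDCHILD],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    original, adopted = map(int, result.stdout.split())
    return original, adopted


if __name__ == "__main__":
    print(orphan_demo())
```

The second PID differs from the first because the orphan has been handed to another process, usually PID 1 or a subreaper. When the orphan later exits, only that adopting process can collect it. If an application that never reaps unknown children is the one at PID 1 in the container, the orphan stays defunct.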
If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the argocd Slack channel.
Checklist:
argocd version.
Describe the bug
I am running argocd as a pod on an AWS EC2 machine. The host machine where the argocd-repo-server pod runs seems to have a large number of git zombie processes.
To Reproduce
I am not exactly sure how to reproduce this. Just running the argocd-repo-server pod for a couple of days can create the git zombies.
Expected behavior
If a git process is invoked by argocd to check the application status, it has to be properly terminated and reaped by argocd.
Screenshots
Version
Logs