fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes #20785, #18478) #22713
| if err != nil { | ||
| timeout = 90 * time.Second | ||
| } | ||
| fatalTimeout, err = time.ParseDuration(os.Getenv("ARGOCD_EXEC_FATAL_TIMEOUT")) |
I don't think this should be exposed to users, but in a follow-up I think we should add metrics for when the fatal SIGKILL is sent. My guess is that the SIGTERM stall is either some kind of race-condition corruption in argo multithreading or an actual rare race condition in git itself, but I don't have a good idea of what to debug without understanding the issue better.
argoproj#20785, argoproj#18478) Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
| ShouldWait: true, | ||
| }, | ||
| } | ||
| // The returned error string in this case should contain a "fatal" in this case |
Added a test for the new error flow. Previously this cmd would have forced argocd to deadlock.
| // The returned error string in this case should contain a "fatal" in this case | ||
| _, err := RunWithExecRunOpts(exec.Command("sh", "-c", "trap 'trap - 15 && echo captured && sleep 10000' 15 && sleep 2"), opts) | ||
| assert.ErrorContains(t, err, "failed timeout after 200ms") | ||
| // The expected timeout is ARGOCD_EXEC_TIMEOUT + ARGOCD_EXEC_FATAL_TIMEOUT = 200ms + 100ms = 300ms |
Maybe ARGOCD_EXEC_FATAL_TIMEOUT is a slightly confusing name, since this timeout is in addition to ARGOCD_EXEC_TIMEOUT. Open to suggestions on a better name here.
| if opts.CaptureStderr { | ||
| output += stderr.String() | ||
| } | ||
| logCtx.WithFields(logrus.Fields{"duration": time.Since(start)}).Debug(redactor(output)) |
There is some overlap here with the return logic below, but with a different error code. I'm open to refactoring the exit into a new function if desired, but I'm also fine leaving it as is since it's not too bad.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #22713   +/-   ##
=========================================
  Coverage          ?   59.94%
=========================================
  Files             ?      342
  Lines             ?    58648
  Branches          ?        0
=========================================
  Hits              ?    35158
  Misses            ?    20638
  Partials          ?     2852
hey @crenshaw-dev lmk what you need from me to help get this merged. I think all the core logic and tests are there. thanks
todaywasawesome left a comment
I think we should add docs on how to change the default fatal timeout, and then this will be ready to go.
@todaywasawesome added docs and linked screenshots and PR diff of the documentation changes. lmk if there's anything else you need. thanks!
todaywasawesome left a comment
One last suggestion, looks great. Good work.
| !!! note | ||
| If a CMP renders blank manifests, and `prune` is set to `true`, Argo CD will automatically remove resources. CMP plugin authors should ensure errors are part of the exit code. Commonly something like `kustomize build . | cat` won't pass errors because of the pipe. Consider setting `set -o pipefail` so anything piped will pass errors on failure. | ||
| !!! note | ||
| Although this should never happen, if a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`. This is an implementation detail that should generally not concern end users. |
| Although this should never happen, if a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`. This is an implementation detail that should generally not concern end users. | |
| If a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`. |
I don't think we need so much commentary there. Here people are making CMPs so they're buying their own problems.
updated PR body with a new link to the docs @todaywasawesome
ishitasequeira left a comment
Thanks @hazel-sudz for the PR. LGTM! @todaywasawesome do you have any open concerns?
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
argoproj#20785, argoproj#18478) (argoproj#22713) Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu> Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com> Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Thank you @hazel-sudz @crenshaw-dev for this fix. Are you planning to get it released in version
Closes #20785 and #18478. Multiple people have reported that calls to git sometimes deadlock and do not respect SIGTERM. The root cause is not currently known; based on the reports, it seems to happen when the git remote is very slow or flaky. If a command does not respect SIGTERM, the expected behavior should be to upgrade to SIGKILL after an additional timeout interval. Otherwise the entire argocd repo-server stalls, which is an objectively worse outcome.
New Documentation
docs build for PR: https://argo-cd--22713.org.readthedocs.build/en/22713/operator-manual/config-management-plugins/#using-a-config-management-plugin-with-an-application