Skip to content

fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes #20785, #18478)#22713

Merged
agaudreault merged 11 commits intoargoproj:masterfrom
hazel-sudz:metrics-git-stall
Jun 13, 2025
Merged

fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes #20785, #18478)#22713
agaudreault merged 11 commits intoargoproj:masterfrom
hazel-sudz:metrics-git-stall

Conversation

@hazel-sudz
Copy link

@hazel-sudz hazel-sudz commented Apr 17, 2025

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Closes #20785 #18478. Based on what multiple people have reported it appears as though calls to git may sometimes deadlock and not respect SIGTERM. The root cause of this is not currently known. Based on reports it seems to be happen when git remote is really slow/flaky. If cmds do not respect SIGTERM the expected behavior should be to upgrade to SIGKILL after a given additional timeout interval. Otherwise, the entire argocd repo-server will stall which is an objectively worse outcome.

New Documentation

image
image

docs build for PR: https://argo-cd--22713.org.readthedocs.build/en/22713/operator-manual/config-management-plugins/#using-a-config-management-plugin-with-an-application

@hazel-sudz hazel-sudz requested a review from a team as a code owner April 17, 2025 23:09
@bunnyshell
Copy link

bunnyshell bot commented Apr 17, 2025

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

if err != nil {
timeout = 90 * time.Second
}
fatalTimeout, err = time.ParseDuration(os.Getenv("ARGOCD_EXEC_FATAL_TIMEOUT"))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is something that should be exposed to users but in a followup I think we should add metrics for when the fatal SIGKILL is sent. My guess is that the SIGTERM stall is either some kind of race-condition corruption in argo multithreading or an actual rare race condition in git itself but I don't have a good idea of what to debug without further understanding of the issue.

@hazel-sudz hazel-sudz marked this pull request as draft April 17, 2025 23:44
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Hazel Sudzilouski and others added 2 commits April 17, 2025 17:33
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
ShouldWait: true,
},
}
// The returned error string in this case should contain a "fatal" in this case
Copy link
Author

@hazel-sudz hazel-sudz Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test for new error flow case. previously this cmd would have forced argocd to deadlock.

// The returned error string in this case should contain a "fatal" in this case
_, err := RunWithExecRunOpts(exec.Command("sh", "-c", "trap 'trap - 15 && echo captured && sleep 10000' 15 && sleep 2"), opts)
assert.ErrorContains(t, err, "failed timeout after 200ms")
// The expected timeout is ARGOCD_EXEC_TIMEOUT + ARGOCD_EXEC_FATAL_TIMEOUT = 200ms + 100ms = 300ms
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe ARGOCD_EXEC_FATAL_TIMEOUT is a little bit confusing naming since this is timeout in addition too ARGOCD_EXEC_TIMEOUT. open to suggestions on a better name here

if opts.CaptureStderr {
output += stderr.String()
}
logCtx.WithFields(logrus.Fields{"duration": time.Since(start)}).Debug(redactor(output))
Copy link
Author

@hazel-sudz hazel-sudz Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some overlap here with the return logic below but with different error code. I'm open to refactoring the exit to a new function if this is desired but also fine leaving it as is since it's not too bad

@hazel-sudz hazel-sudz marked this pull request as ready for review April 18, 2025 01:40
@hazel-sudz hazel-sudz marked this pull request as draft April 18, 2025 01:50
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
@codecov
Copy link

codecov bot commented Apr 18, 2025

Codecov Report

❌ Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@f4edcf7). Learn more about missing BASE report.
⚠️ Report is 422 commits behind head on master.

Files with missing lines Patch % Lines
util/exec/exec.go 90.62% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #22713   +/-   ##
=========================================
  Coverage          ?   59.94%           
=========================================
  Files             ?      342           
  Lines             ?    58648           
  Branches          ?        0           
=========================================
  Hits              ?    35158           
  Misses            ?    20638           
  Partials          ?     2852           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hazel-sudz hazel-sudz marked this pull request as ready for review April 18, 2025 15:47
@hazel-sudz
Copy link
Author

hey @crenshaw-dev lmk what you need from me to help get this merged. I think all the core logic and tests are there. thanks

Copy link
Contributor

@todaywasawesome todaywasawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add docs to how how to change the default fataltimeout and this will be ready to go.

@hazel-sudz hazel-sudz requested a review from a team as a code owner April 22, 2025 22:00
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
@hazel-sudz
Copy link
Author

@todaywasawesome added docs and linked screenshots and PR diff of the documentation changes. lmk if there's anything else you need. thanks!

Copy link
Contributor

@todaywasawesome todaywasawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last suggestion, looks great. Good work.

!!! note
If a CMP renders blank manfiests, and `prune` is set to `true`, Argo CD will automatically remove resources. CMP plugin authors should ensure errors are part of the exit code. Commonly something like `kustomize build . | cat` won't pass errors because of the pipe. Consider setting `set -o pipefail` so anything piped will pass errors on failure.
!!! note
Although this should never happen, if a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`. This is an implementation detail that should generally not concern end users.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Although this should never happen, if a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`. This is an implementation detail that should generally not concern end users.
If a CMP command fails to gracefully exit on `ARGOCD_EXEC_TIMEOUT`, it will be forcefully killed after an additional timeout of `ARGOCD_EXEC_FATAL_TIMEOUT`.

I don't think we need so much commentary there. Here people are making CMPs so they're buying their own problems.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks no worries pushed on the last commit a615cf4

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR body with a new link to the docs @todaywasawesome

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
@crenshaw-dev crenshaw-dev added this to the v3.1 milestone May 21, 2025
Copy link
Member

@ishitasequeira ishitasequeira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hazel-sudz for the PR. LGTM! @todaywasawesome do you have any open concerns?

Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
@agaudreault agaudreault enabled auto-merge (squash) June 13, 2025 15:15
@agaudreault agaudreault merged commit 44fce0e into argoproj:master Jun 13, 2025
26 of 27 checks passed
dsuhinin pushed a commit to dsuhinin/argo-cd that referenced this pull request Jun 16, 2025
 argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
jcogilvie pushed a commit to Sanyaku/argo-cd that referenced this pull request Jun 16, 2025
 argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Signed-off-by: Jonathan Ogilvie <jonathan.ogilvie@sumologic.com>
@prathameshnyt
Copy link

prathameshnyt commented Jun 16, 2025

Thank you @hazel-sudz @crenshaw-dev for this fix. Are you planning to get it release in version 2.x?

reggie-k pushed a commit to codefresh-io/argo-cd that referenced this pull request Jul 28, 2025
 argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
reggie-k added a commit to codefresh-io/argo-cd that referenced this pull request Aug 5, 2025
…L to ARGO_EXEC_TIMEOUT (#419)

* chore: move pkg/exec in-tree (argoproj#22175) (argoproj#22460)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Hazel Sudzilouski <t-danielsu@microsoft.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
enneitex pushed a commit to enneitex/argo-cd that referenced this pull request Aug 24, 2025
 argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Signed-off-by: enneitex <etienne.divet@gmail.com>
ppapapetrou76 pushed a commit to codefresh-io/argo-cd that referenced this pull request Sep 16, 2025
…L to ARGO_EXEC_TIMEOUT (#419)

* chore: move pkg/exec in-tree (argoproj#22175) (argoproj#22460)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Hazel Sudzilouski <t-danielsu@microsoft.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
ppapapetrou76 added a commit to codefresh-io/argo-cd that referenced this pull request Sep 18, 2025
* chore: sync all codefresh code changes into v3.0.2 (#397)

* chore: sync all codefresh code changes into v3.0.2 without event-reporter related changes

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* removed cf script

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* upgraded git-lfs to 3.6.1 in Dockerfile (#386)

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>
Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* fixed webstorm go.mod issue

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* e2e: improved error logs

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* fixed changes on generated files

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* chore: replace heptio-images with argocd-e2e-container (argoproj#23040)

Signed-off-by: nitishfy <justnitish06@gmail.com>
Signed-off-by: Nitish Kumar <justnitish06@gmail.com>

(cherry picked from commit 309acd1)
Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* feat: upgraded github.com/expr-lang/expr from 0.16.9 to 0.17.0

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* e2e [TestTrackAppStateAndSyncApp / TestNewStyleResourceActionMixedOk / TestNewStyleResourceActionPermitted / TestNamespacedPermissions]: added wait for sync operation

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

---------

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>
Signed-off-by: reggie-k <regina.voloshin@codefresh.io>
Co-authored-by: Regina Voloshin <regina.voloshin@codefresh.io>
Co-authored-by: Nitish Kumar <justnitish06@gmail.com>
# Conflicts:
#	.github/workflows/ci-build.yaml
#	cmd/argocd/commands/app_test.go
#	go.mod
#	go.sum
#	manifests/base/kustomization.yaml
#	manifests/core-install-with-hydrator.yaml
#	manifests/core-install.yaml
#	manifests/core-install/kustomization.yaml
#	manifests/ha/base/kustomization.yaml
#	manifests/ha/install-with-hydrator.yaml
#	manifests/ha/install.yaml
#	manifests/ha/namespace-install-with-hydrator.yaml
#	manifests/ha/namespace-install.yaml
#	manifests/install-with-hydrator.yaml
#	manifests/install.yaml
#	manifests/namespace-install-with-hydrator.yaml
#	manifests/namespace-install.yaml
#	pkg/apiclient/application/application.pb.go
#	pkg/apiclient/application/application.pb.gw.go
#	pkg/apis/application/v1alpha1/generated.pb.go
#	reposerver/apiclient/mocks/RepoServerServiceClient.go
#	reposerver/apiclient/repository.pb.go
#	server/application/application.proto
#	util/git/mocks/Client.go

* fix(validateDestination query): as we moved to argo.GetDestinationCluster, we can simply rely on error returned from this request (#405)

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>

* removed curl from image (#406)

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* feat: Add GitHub API metrics (#404)

* added github api metrics

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* fix(docs): fix applicationsetcontroller.enable.github.api.metrics to false in docs cm (argoproj#23516)

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* fix: Account for batch event processing in e2e tests (argoproj#22356)

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

---------

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>
Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Co-authored-by: Andrii Korotkov <137232734+andrii-korotkov-verkada@users.noreply.github.com>

* cherry-pick 1b48f36 Upgrade ubuntu base image to latest 25.04 digest (#407)

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>
Co-authored-by: dudinea <eugene.doudine@octopus.com>

* feat: CR-29912 manual cherry pick app set pr generator return 0 results if the repo does not exist (#409)

* manually added the changes

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* pull request functionality

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* pull request functionality

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

---------

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* feat: move pkg/exec in-tree and add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (#419)

* chore: move pkg/exec in-tree (argoproj#22175) (argoproj#22460)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix: add fatal timeout upgrade with SIGKILL to ARGO_EXEC_TIMEOUT (closes argoproj#20785, argoproj#18478) (argoproj#22713)

Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Hazel Sudzilouski <t-danielsu@microsoft.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>

* feat: CR-30512 stop using bitnami images (#420)

* removed references from all the images except for astra healthcheck

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* removed reposerver/repository/testdata/helm-with-local-dependency/.argocd-helm-dep-up

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* reverted health check references since they are treated as text

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* reverted health check references since they are treated as text

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

---------

Signed-off-by: reggie-k <regina.voloshin@codefresh.io>

* chore: bumps redis to 8.x (#422)

* bumps Docker test container to redis 8

* bumps redis version to 8.2.1

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

* use a previous version of go-redis

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

---------

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

* upgrade sqlite in docker image to address CVE-2025-6965 (#425)

* final changes after rebase

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

* final changes after rebase

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

* address new linter issues

Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>

---------

Signed-off-by: oleksandr-codefresh <oleksandr.saulyak@octopus.com>
Signed-off-by: reggie-k <regina.voloshin@codefresh.io>
Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Hazel Sudzilouski <dsudzilouski@olin.edu>
Signed-off-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>
Co-authored-by: Oleksandr Saulyak <oleksandr.saulyak@octopus.com>
Co-authored-by: Regina Voloshin <regina.voloshin@codefresh.io>
Co-authored-by: Nitish Kumar <justnitish06@gmail.com>
Co-authored-by: Andrii Korotkov <137232734+andrii-korotkov-verkada@users.noreply.github.com>
Co-authored-by: dudinea <eugene.doudine@octopus.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Hazel Sudzilouski <t-danielsu@microsoft.com>
Co-authored-by: Alexandre Gaudreault <alexandre_gaudreault@intuit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Applications are stuck in refreshing

6 participants