
Argo CD synchronization lasts incredibly long #3663

Closed
AlehB opened this issue May 28, 2020 · 8 comments
Labels
bug/priority:high Should be fixed in the next patch release bug/severity:criticial A critical bug in ArgoCD, possibly resulting in data loss or severe degraded overall functionality bug Something isn't working component:core Syncing, diffing, cluster state cache

Comments

@AlehB

AlehB commented May 28, 2020

Describe the bug

Hello team,

We are trying to install prometheus-operator helm chart (https://github.com/helm/charts/tree/master/stable/prometheus-operator) in our Kubernetes cluster with Argo CD

We encountered two problems:

  • After the chart is added in the Argo CD dashboard and a manual sync is started, it took about an hour for Argo CD to actually begin syncing (creating Kubernetes resources)

  • Synchronization lasts incredibly long. It has already been running for about 17 hours and the application is still not fully launched (no events and no errors in the Argo CD dashboard)

We've tried this several times with different Helm chart versions
For other Helm charts (very small ones) our Argo CD installation works fine

Is there any option to speed up the start of the application? Right now Argo CD looks like an unsuitable option for Helm charts like this one

To Reproduce

Expected behavior

Argo CD begins to create Kubernetes resources immediately
It takes a reasonable amount of time to get everything ready

Screenshots

Sync status at the moment: [screenshot]

As an example, CustomResourceDefinitions are not ready: [screenshot]

Version

v1.5.3+095c5d6
@AlehB AlehB added the bug Something isn't working label May 28, 2020
@AlehB AlehB changed the title Argocd synchronization lasts incredibly long Argo CD synchronization lasts incredibly long May 28, 2020
@ghostsquad

I'm willing to guess that Argo CD doesn't know how to check the "status" of CRDs, but I'm not sure exactly. This is a common problem with Spinnaker too.

@ventris

ventris commented Jun 2, 2020

I'm having the same issue with v1.5.5+0fdef48. I upgraded from 1.4.x and that's when the issues started for me.

@aimbot31

aimbot31 commented Jun 2, 2020

Same issue when I'm trying to install Istio; Argo CD gets stuck on the CRDs.
In the argocd-application-controller, I get this error: Failed to convert resource: no kind \"CustomResourceDefinition\" is registered for version \"apiextensions.k8s.io/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"

k8s version: 1.16.8
argocd version: 1.5.5

@alexmt alexmt self-assigned this Jun 3, 2020
@alexmt alexmt added this to the v1.7 milestone Jun 22, 2020
@alexmt alexmt added bug/priority:high Should be fixed in the next patch release bug/severity:criticial A critical bug in ArgoCD, possibly resulting in data loss or severe degraded overall functionality component:core Syncing, diffing, cluster state cache labels Jun 22, 2020
@alexmt alexmt modified the milestones: v1.7, v1.8 Aug 25, 2020
@alexmt alexmt removed their assignment Sep 17, 2020
@jgwest
Member

jgwest commented Oct 20, 2020

Short version

  • This issue references a number of behaviours (which I carefully work through below), but with the possible exception of A, none of them are bugs in Argo CD.
  • As part of this issue, @alexmt has added additional improvements over the last few months to make it more obvious that a sync operation is in progress, which should reduce any confusion over what Argo CD is up to during a sync operation.

I recommend closing this issue, and (if one does not already exist) opening a new enhancement to examine how to better handle the A) case below, where Argo CD crashes/is restarted/stopped while a sync operation is in progress. (Another option is to repurpose this current issue to handle A), but IMHO a clean slate is better)

Long version

There are a few different behaviours being described here, which I'll address one at a time:

A) Synchronization takes a 'really long time'

It is currently possible for an Argo CD application's sync operation state to appear to get "stuck" in a running state, which can make it look like it is taking 'a really long time', when in fact no sync operation is taking place. When this happens, Argo CD thinks an operation is in progress (for example, reporting in the web UI that an operation is ongoing) when in fact it is not.

This has the potential to occur any time the Argo CD controller process is prematurely stopped (for example, due to a Argo CD controller crash). (I personally see this 'stuck operation' during Argo CD development, where, during debug, I kill and restart the Argo CD controller container when it is in the middle of a long-running sync operation.)

This behaviour is due to the way an operation's state is stored by Argo CD: it is kept in the Argo CD Application custom resource in Kubernetes (backed by etcd):

   "operationState": {
        "message": "waiting for completion of hook batch/Job/jgw-prometheus-kube-promet-admission-create",
        "operation": {
            "initiatedBy": {},
            "retry": {},
            "sync": {
                "revision": "9e806d0691adcdfc812f9b2a714ecb9961367e94",
                "syncStrategy": {
                    "hook": {}
                }
            }
        },
        "phase": "Running",
        "startedAt": "2020-10-20T04:41:06Z",
    (...)

The Argo CD controller keeps track of which operation is running, and updates the 'operationState' field as that operation progresses. However, if the Argo CD controller process is restarted, it does not appear to have a way to detect that 'operationState' is no longer valid, and thus the operationState field will remain in the 'running' state until the operation is manually terminated.

You can terminate an operation in this state from within the UI, by clicking on Syncing and then Terminate, or via the CLI (argocd app terminate-op <app-name>).

In practice, this shouldn't happen except in rare cases where the controller dies unexpectedly during a sync (and since there are no log files attached to this issue, I don't think we can investigate the specific trigger here).

B) 'CustomResourceDefinitions are not ready' for prometheus-operator chart

It turns out this is not an Argo CD issue, but rather due to the behaviour of the prometheus-operator itself.

When you look at the difference between what Argo CD expects (desired state), and what it finds (live state), you will see the only difference is this:

 annotations:
    helm.sh/hook: crd-install  # <---- this field is missing from 'annotations' in live state

Argo CD expects to find the above field in annotations, but the CRD itself does not contain it. Why is that?

Well, Argo CD is applying the correct version of the CRD manifest containing this field:

  • If you examine the Argo CD logs, you can see that the k8s CRD resource that is being applied to k8s does correctly contain this field
  • Likewise, if you set up a breakpoint in Argo CD, you can see that the correct version of the CRD is applied to the cluster (but then overwritten by a subsequent sync operation)

So who is doing the overwriting of the "good" desired version of the CRD, with the "bad" live version of the CRD?

The prometheus-operator deployment itself! Before version v0.39.0 of the prometheus-operator, the operator Helm chart started the operator with the following parameter: --manage-crds (source).

This parameter, --manage-crds, tells the operator to replace the existing CRDs (the 'good' version) with an operator-managed version (source), which causes the CRD k8s resource to diverge from what Argo CD expects.

To confirm whether this is the issue you are seeing, you can run kubectl logs deployment/prometheus-kube-promet-operator -n <namespace> (here the namespace is kube-prometheus-stack) against the operator deployment, and you will see the following in the logs:

level=info ts=2020-10-20T04:41:50.240287866Z caller=operator.go:294 component=thanosoperator msg="connection established" cluster-version=v1.18.6+k3s1
level=info ts=2020-10-20T04:41:50.620988059Z caller=operator.go:701 component=thanosoperator msg="CRD updated" crd=ThanosRuler
level=info ts=2020-10-20T04:41:50.659110663Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=Prometheus
level=info ts=2020-10-20T04:41:50.682323905Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=ServiceMonitor
level=info ts=2020-10-20T04:41:50.699229523Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=PodMonitor
level=info ts=2020-10-20T04:41:50.707645936Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=PrometheusRule
level=info ts=2020-10-20T04:41:51.031850543Z caller=operator.go:655 component=alertmanageroperator msg="CRD updated" crd=Alertmanager

(notice the 'CRD updated' messages as the operator updates the CRDs one by one)

Fortunately, it appears that this parameter is no longer in use in newer versions of the prometheus-operator, so you may have better luck with those versions.
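
If you are pinned to an older chart version, the chart exposed a value to switch this flag off. A minimal values.yaml sketch, assuming the key name prometheusOperator.manageCrds from the older stable/prometheus-operator chart (check the values.yaml of your chart version for the exact name):

    # values.yaml excerpt (sketch; key name assumed from older stable/prometheus-operator charts)
    prometheusOperator:
      # do not pass --manage-crds to the operator, so it stops rewriting the
      # CRDs that Argo CD applied (and stripping the helm.sh/hook annotation)
      manageCrds: false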

In any case, this is not an Argo CD issue, and mechanisms exist in Argo CD to ignore differences like this.
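
As one example of such a mechanism, the Application spec supports ignoreDifferences. A minimal sketch that ignores this particular annotation on CRDs (adapt the group/kind and JSON pointer to whatever fields your operator rewrites):

    # Application spec excerpt (sketch)
    spec:
      ignoreDifferences:
        - group: apiextensions.k8s.io
          kind: CustomResourceDefinition
          jsonPointers:
            # a "/" inside a key is escaped as "~1" in a JSON Pointer,
            # so this targets metadata.annotations["helm.sh/hook"]
            - /metadata/annotations/helm.sh~1hook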

C) Log message: Failed to convert resource: no kind \"CustomResourceDefinition\" is registered for version \"apiextensions.k8s.io/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"

This message is much less scary than it sounds; it's just a debug-level log message indicating that the particular group/version/kind of the resource was not recognized by Argo CD (likely due to an older dependency), and in that scenario Argo CD will just acquire the resource from the cluster directly.

More discussion of this previously: #3670

Other projects seeing the same issue, due to Kubernetes API changes: prometheus-community/helm-charts#202

@rbreeze
Member

rbreeze commented Feb 2, 2021

@AlehB can you confirm that this issue still exists on 1.8?

@jessesuen
Member

For large applications, the v1.9 feature to only apply objects which are out of sync will help here.
https://argo-cd.readthedocs.io/en/latest/user-guide/sync-options/#selective-sync
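
A minimal sketch of enabling that option on an Application, per the selective-sync docs linked above (option name as documented there):

    # Application spec excerpt (sketch)
    spec:
      syncPolicy:
        syncOptions:
          # only (re)apply resources that are OutOfSync, instead of
          # re-applying every object in a large chart on each sync
          - ApplyOutOfSyncOnly=true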

@jessesuen jessesuen removed this from the v1.9 milestone Feb 11, 2021
@threeseed

@rbreeze: looks related to this.

Pretty sure Argo CD is broken with kube-prometheus, which is a worry as it's a very popular component.

@blakepettersson
Member

This should have been resolved with the introduction of Server-Side Apply; feel free to re-open if that is not the case.
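
For reference, a minimal sketch of opting an Application into Server-Side Apply (exposed as the ServerSideApply=true sync option in newer Argo CD releases; see the sync-options docs):

    # Application spec excerpt (sketch)
    spec:
      syncPolicy:
        syncOptions:
          # sync using Kubernetes server-side apply instead of client-side apply,
          # which avoids the client-side limits that very large CRDs run into
          - ServerSideApply=true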
