
Argo CD synchronization lasts incredibly long #3663

Closed
AlehB opened this issue May 28, 2020 · 8 comments
Labels
bug/priority:high Should be fixed in the next patch release bug/severity:criticial A critical bug in ArgoCD, possibly resulting in data loss or severe degraded overall functionality bug Something isn't working component:core Syncing, diffing, cluster state cache

Comments

@AlehB

AlehB commented May 28, 2020

Describe the bug

Hello team,

We are trying to install prometheus-operator helm chart (https://github.com/helm/charts/tree/master/stable/prometheus-operator) in our Kubernetes cluster with Argo CD

We encountered two problems:

  • After the chart is added in the Argo CD dashboard and a manual sync is started, it took about an hour for Argo CD to actually begin syncing (creating Kubernetes resources)

  • Synchronization lasts incredibly long. It has already been running for about 17 hours and the application is still not fully launched (no events and no errors in the Argo CD dashboard)

We've tried this several times with different Helm chart versions
For other Helm charts (very small ones) our Argo CD installation works fine

Is there any option to speed up the start of the application? Right now Argo CD looks like an unsuitable option for Helm charts like this one

To Reproduce

Expected behavior

Argo CD begins to create Kubernetes resources immediately
It takes a reasonable amount of time to get everything ready

Screenshots

Sync status at the moment: [screenshot]

As an example, CustomResourceDefinitions are not ready: [screenshot]

Version

v1.5.3+095c5d6
@AlehB AlehB added the bug Something isn't working label May 28, 2020
@AlehB AlehB changed the title Argocd synchronization lasts incredibly long Argo CD synchronization lasts incredibly long May 28, 2020
@ghostsquad

I'm willing to guess that Argo CD doesn't know how to check the "status" of CRDs, but I'm not sure exactly. This is a common problem with Spinnaker too.

@ventris

ventris commented Jun 2, 2020

I'm having the same issue with v1.5.5+0fdef48. I upgraded from 1.4.x and that's when the issues started for me.

@aimbot31

aimbot31 commented Jun 2, 2020

Same issue when I'm trying to install Istio; Argo CD gets stuck on the CRDs.
In the argocd-application-controller, I get this error: Failed to convert resource: no kind \"CustomResourceDefinition\" is registered for version \"apiextensions.k8s.io/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"

k8s version: 1.16.8
argocd version: 1.5.5

@alexmt alexmt self-assigned this Jun 3, 2020
@alexmt alexmt added this to the v1.7 milestone Jun 22, 2020
@alexmt alexmt added bug/priority:high Should be fixed in the next patch release bug/severity:criticial A critical bug in ArgoCD, possibly resulting in data loss or severe degraded overall functionality component:core Syncing, diffing, cluster state cache labels Jun 22, 2020
@alexmt alexmt modified the milestones: v1.7, v1.8 Aug 25, 2020
@alexmt alexmt removed their assignment Sep 17, 2020
@jgwest
Member

jgwest commented Oct 20, 2020

Short version

  • This issue references a number of behaviours (which I carefully work through below), but with the possible exception of A, none of them are bugs in Argo CD.
  • As part of this issue, @alexmt has added additional improvements over the last few months to make it more obvious that a sync operation is in progress, which should reduce any confusion over what Argo CD is up to during a sync operation.

I recommend closing this issue, and (if one does not already exist) opening a new enhancement to examine how to better handle the A) case below, where Argo CD crashes/is restarted/stopped while a sync operation is in progress. (Another option is to repurpose this current issue to handle A), but IMHO a clean slate is better)

Long version

There are a few different behaviours being described here, which I'll address one at a time:

A) Synchronization takes a 'really long time'

It is currently possible for an Argo CD application's sync operation state to appear to get "stuck" in a running state, which can make it look like it is taking 'a really long time', when in fact no sync operation is taking place. When this happens, Argo CD thinks an operation is in progress (for example, reporting in the web UI that an operation is ongoing) when in fact it is not.

This has the potential to occur any time the Argo CD controller process is prematurely stopped (for example, due to a Argo CD controller crash). (I personally see this 'stuck operation' during Argo CD development, where, during debug, I kill and restart the Argo CD controller container when it is in the middle of a long-running sync operation.)

This behaviour is due to the way an operation's state is stored by Argo CD: it is kept in the Argo CD Application custom resource in Kubernetes (backed by etcd):

   "operationState": {
        "message": "waiting for completion of hook batch/Job/jgw-prometheus-kube-promet-admission-create",
        "operation": {
            "initiatedBy": {},
            "retry": {},
            "sync": {
                "revision": "9e806d0691adcdfc812f9b2a714ecb9961367e94",
                "syncStrategy": {
                    "hook": {}
                }
            }
        },
        "phase": "Running",
        "startedAt": "2020-10-20T04:41:06Z",
    (...)

The Argo CD controller keeps track of which operation is running, and updates the 'operationState' field as that operation progresses. However, if the Argo CD controller process is restarted, it does not appear to have a way to detect that 'operationState' is no longer valid, and thus the operationState field will remain in the 'running' state until the operation is manually terminated.

You can terminate an operation in this state from within the UI, by clicking on Syncing and then Terminate, or via the CLI (argocd app terminate-op <app-name>).

In practice, this shouldn't happen except in rare cases where the controller dies unexpectedly during a sync (and since there are no log files attached to this issue, I don't think we can investigate the specific trigger here).

B) 'CustomResourceDefinitions are not ready' for prometheus-operator chart

It turns out this is not an Argo CD issue, but rather due to the behaviour of the prometheus-operator itself.

When you look at the difference between what Argo CD expects (desired state), and what it finds (live state), you will see the only difference is this:

 annotations:
    helm.sh/hook: crd-install  # <---- this field is missing from 'annotations' in live state

Argo CD expects to find the above field in annotations, but the CRD itself does not contain it. Why is that?

Well, Argo CD is applying the correct version of the CRD manifest containing this field:

  • If you examine the Argo CD logs, you can see that the k8s CRD resource that is being applied to k8s does correctly contain this field
  • Likewise, if you set up a breakpoint in Argo CD, you can see that the correct version of the CRD is applied to the cluster (but then overwritten by a subsequent sync operation)

So who is doing the overwriting of the "good" desired version of the CRD, with the "bad" live version of the CRD?

The prometheus-operator deployment itself! Before version v0.39.0 of the prometheus-operator, the operator Helm chart started the operator with the following parameter: --manage-crds (source).

This parameter, --manage-crds, tells the operator to replace the existing CRDs (the 'good' version) with an operator-managed version (source), which causes the CRD k8s resource to diverge from what Argo CD expects.

To confirm whether this is the issue you are seeing, you can run kubectl logs deployment/prometheus-kube-promet-operator -n <namespace> (here the namespace is kube-prometheus-stack) against the operator deployment, and you will see the following in the logs:

level=info ts=2020-10-20T04:41:50.240287866Z caller=operator.go:294 component=thanosoperator msg="connection established" cluster-version=v1.18.6+k3s1
level=info ts=2020-10-20T04:41:50.620988059Z caller=operator.go:701 component=thanosoperator msg="CRD updated" crd=ThanosRuler
level=info ts=2020-10-20T04:41:50.659110663Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=Prometheus
level=info ts=2020-10-20T04:41:50.682323905Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=ServiceMonitor
level=info ts=2020-10-20T04:41:50.699229523Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=PodMonitor
level=info ts=2020-10-20T04:41:50.707645936Z caller=operator.go:1918 component=prometheusoperator msg="CRD updated" crd=PrometheusRule
level=info ts=2020-10-20T04:41:51.031850543Z caller=operator.go:655 component=alertmanageroperator msg="CRD updated" crd=Alertmanager

(notice the 'CRD updated' messages as the operator updates the CRDs one by one)

Fortunately, it appears that this parameter is no longer in use in newer versions of the prometheus-operator, so you may have better luck with those versions.
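
If you are pinned to an older chart version, the chart exposed a value to switch this flag off. A minimal values.yaml sketch, assuming the key name prometheusOperator.manageCrds from the older stable/prometheus-operator chart (check the values.yaml of your chart version for the exact name):

    # values.yaml excerpt (sketch; key name assumed from older stable/prometheus-operator charts)
    prometheusOperator:
      # do not pass --manage-crds to the operator, so it stops rewriting the
      # CRDs that Argo CD applied (and stripping the helm.sh/hook annotation)
      manageCrds: false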

In any case, this is not an Argo CD issue, and mechanisms exist in Argo CD to ignore differences like this.
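
As one example of such a mechanism, the Application spec supports ignoreDifferences. A minimal sketch that ignores this particular annotation on CRDs (adapt the group/kind and JSON pointer to whatever fields your operator rewrites):

    # Application spec excerpt (sketch)
    spec:
      ignoreDifferences:
        - group: apiextensions.k8s.io
          kind: CustomResourceDefinition
          jsonPointers:
            # a "/" inside a key is escaped as "~1" in a JSON Pointer,
            # so this targets metadata.annotations["helm.sh/hook"]
            - /metadata/annotations/helm.sh~1hook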

C) Log message: Failed to convert resource: no kind \"CustomResourceDefinition\" is registered for version \"apiextensions.k8s.io/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"

This message is much less scary than it sounds; it's just a debug-level log message indicating that the particular group/version/kind of the resource was not recognized by Argo CD (likely due to an older dependency), and in that scenario Argo CD will just acquire the resource from the cluster directly.

More discussion of this previously: #3670

Other projects seeing the same issue, due to Kubernetes API changes: prometheus-community/helm-charts#202

@rbreeze
Member

rbreeze commented Feb 2, 2021

@AlehB can you confirm that this issue still exists on 1.8?

@jessesuen
Member

For large applications, the v1.9 feature to only apply objects which are out of sync will help here.
https://argo-cd.readthedocs.io/en/latest/user-guide/sync-options/#selective-sync
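
A minimal sketch of enabling that option on an Application, per the selective-sync docs linked above (option name as documented there):

    # Application spec excerpt (sketch)
    spec:
      syncPolicy:
        syncOptions:
          # only (re)apply resources that are OutOfSync, instead of
          # re-applying every object in a large chart on each sync
          - ApplyOutOfSyncOnly=true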

@jessesuen jessesuen removed this from the v1.9 milestone Feb 11, 2021
@threeseed

@rbreeze: looks related to this.

Pretty sure Argo CD is broken with kube-prometheus, which is a worry as it's a very popular component.

@blakepettersson
Member

This should have been resolved with the introduction of Server-Side Apply; feel free to re-open if that is not the case.
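
For reference, a minimal sketch of opting an Application into Server-Side Apply (exposed as the ServerSideApply=true sync option in newer Argo CD releases; see the sync-options docs):

    # Application spec excerpt (sketch)
    spec:
      syncPolicy:
        syncOptions:
          # sync using Kubernetes server-side apply instead of client-side apply,
          # which avoids the client-side limits that very large CRDs run into
          - ServerSideApply=true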
