v0.13.5 trying to create already existing record #3754

Open
dmitriishaburov opened this issue Jun 29, 2023 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dmitriishaburov

What happened:

After updating to v0.13.5, external-dns goes into CrashLoopBackOff while trying to create an already existing DNS record in Route53 that it already manages.

What you expected to happen:

Not crashing.

How to reproduce it (as minimally and precisely as possible):

Probably:

  • Create a DNS record for a LoadBalancer in Route53 via v0.13.4, then update to v0.13.5

Anything else we need to know?:

We have the following DNS records for an EKS LoadBalancer, which were created by external-dns v0.13.4:

cname-tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
tempo-distributor.prelive.domain A Simple - Yes k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.
tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"

After updating to v0.13.5, external-dns tries to recreate them and fails:

time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE a-tempo-distributor.prelive.domain TXT [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE tempo-distributor.prelive.domain  A [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE tempo-distributor.prelive.domain  TXT [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=error msg="Failure in zone domain. [Id: /hostedzone/ID] when submitting change batch: InvalidChangeBatch: [Tried to create resource record set [name='tempo-distributor.prelive.domain.', type='A'] but it already exists, Tried to create resource record set [name='tempo-distributor.prelive.domain.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: ID"
time="2023-06-29T11:27:48Z" level=fatal msg="failed to submit all changes for the following zones: [/hostedzone/ID]"
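
For reference, the pre-existing records that the change batch collides with can be confirmed directly with the AWS CLI; a rough sketch (the zone ID is a placeholder, the record name is taken from the log above):

    # list everything Route53 already has for the name external-dns is trying to CREATE
    aws route53 list-resource-record-sets \
      --hosted-zone-id ZONEID \
      --query "ResourceRecordSets[?Name=='tempo-distributor.prelive.domain.']"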

Command line args (for both versions):

    Args:
      --log-level=info
      --log-format=text
      --interval=1m
      --source=service
      --source=ingress
      --policy=upsert-only
      --registry=txt
      --provider=aws
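
Note: the "external-dns/owner=default" value in the heritage TXT records above comes from the --txt-owner-id flag, which we do not set, so it presumably falls back to its default; setting it explicitly would look roughly like:

      # hypothetical explicit value; "default" appears to be what external-dns uses when the flag is unset
      --txt-owner-id=default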

Environment:

  • External-DNS version (use external-dns --version): v0.13.5
  • DNS provider: Route53
@dmitriishaburov dmitriishaburov added the kind/bug Categorizes issue or PR as related to a bug. label Jun 29, 2023
@zeqk

zeqk commented Jun 30, 2023

May be related to these, which look like a similar problem:
#3484
#3706

@iodeslykos

iodeslykos commented Jul 4, 2023

We encountered the same issue described by the OP after upgrading the Helm Chart to version 1.13.0 (external-dns v0.13.5) from 1.12.2 (external-dns v0.13.4): Pods entered CrashLoopBackOff after repeated failures to create records that already existed in the target Route53 Hosted Zone.

The current working solution is to downgrade back to Helm Chart v1.12.2 (external-dns v0.13.4).
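
A minimal sketch of the rollback, assuming the chart is installed from the kubernetes-sigs Helm repository (release name and namespace are placeholders):

    # pin the chart back to 1.12.2 while keeping the existing release values
    helm upgrade external-dns external-dns/external-dns \
      --namespace external-dns \
      --version 1.12.2 \
      --reuse-values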

@aardbol

aardbol commented Jul 5, 2023

Same issue with Google Cloud DNS. Downgrading to Helm Chart v1.12.2 (v0.13.4) works.

@jbilliau-rcd

We are having the same issue: the pod outright crashes on a routine error we have seen all the time, through multiple version upgrades over the years.

@alfredkrohmer
Contributor

This seems to be caused by these changes: #3009

@szuecs
Contributor

szuecs commented Sep 8, 2023

cname-tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
tempo-distributor.prelive.domain A Simple - Yes k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.
tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"

What do you mean by Yes/No?

@iodeslykos

cname-tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
tempo-distributor.prelive.domain A Simple - Yes k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.
tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"

What do you mean by Yes/No?

Yes and No are part of the record as returned by the AWS CLI for Route53 and indicate whether or not the record is an Alias, a Route53-specific extension to standard DNS functionality.
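
If it helps, the Alias flag can also be checked from the CLI; something like the following (zone ID is a placeholder) should return an AliasTarget block for alias records and nothing for plain ones:

    # project only the AliasTarget of the matching record set; non-alias records produce no output
    aws route53 list-resource-record-sets \
      --hosted-zone-id ZONEID \
      --query "ResourceRecordSets[?Name=='tempo-distributor.prelive.domain.'].AliasTarget"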

@CAR6807

CAR6807 commented Nov 16, 2023

Any update on this?
Is this fixed in 1.14.0?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2024
@krmichelos

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2024
@CAR6807

CAR6807 commented Mar 13, 2024

Bump

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2024
@FernandoMiguel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2024
@iodeslykos

To ensure that the maintainers understand this is still an issue: it is.

We did upgrade past v1.12.x, but it required a significant number of record deletions in order to allow external-dns to recreate the records it had previously managed without issue. Now everything is working fine.
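
Roughly, the registry TXT records we had to review before deleting anything can be listed with something like this (zone ID is a placeholder):

    # dump all TXT record sets in the zone; the external-dns ownership records carry "heritage=external-dns" values
    aws route53 list-resource-record-sets \
      --hosted-zone-id ZONEID \
      --query "ResourceRecordSets[?Type=='TXT']"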

If there is a conflict in DNS we would expect the error message, but not for the entire external-dns deployment to enter CrashLoopBackOff.

@CAR6807

CAR6807 commented Jun 19, 2024

Bump.
This is preventing us from addressing high-severity vulnerabilities that are fixed in newer versions.
We would like to avoid having to delete existing records, to prevent potential outages.
Record conflicts should not cause the external-dns controller to crash outright.

@mlazowik

I'm guessing this is fixed, at least for some providers, by #4166?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 9, 2024
@aardbol

aardbol commented Dec 21, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2024