
dns: refactor DNS handling #121

Merged
openshift-merge-robot merged 2 commits into openshift:master from ironcladlou:dns-api-refactor on Feb 14, 2019
Conversation

@ironcladlou (Contributor) commented Feb 13, 2019

Depends on openshift/installer#1233.
Blocks openshift/installer#1169.

Controller changes

These changes are primarily in support of cluster DNS management API changes, but also include the beginnings of a larger refactor of the controller into a more discrete desired/current/apply-changes model. The general pattern being promoted is:

  • Desired: compute the desired state of a resource.
  • Current: get the current state of a resource.
  • Ensure: given the desired and current state of a resource in our managed graph, compute and apply any necessary changes.
  • Finalize: clean up any external resources which depend on the current resource.

This change is meant to be a minimal, incremental refactor to help us start reining in complexity, with the goal of being able to support more sophisticated change management. Hopefully, with enough such refactors, new useful patterns will emerge; in the meantime there is a benefit in testability and legibility from this sort of decomposition.
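
As a rough sketch of the split (not the exact code in this PR; the currentLoadBalancerService helper and the simplified signatures are assumptions), the LB service pieces might look like:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// currentLoadBalancerService returns the current LB service, or nil if it
// does not exist.
func currentLoadBalancerService(cl client.Client, name types.NamespacedName) (*corev1.Service, error) {
	svc := &corev1.Service{}
	if err := cl.Get(context.TODO(), name, svc); err != nil {
		if errors.IsNotFound(err) {
			return nil, nil
		}
		return nil, err
	}
	return svc, nil
}

// ensureLoadBalancerService compares desired state with current state and
// applies any necessary change; the only change handled in this sketch is
// creating the service when it is desired but absent.
func ensureLoadBalancerService(cl client.Client, desired, current *corev1.Service) (*corev1.Service, error) {
	if desired != nil && current == nil {
		if err := cl.Create(context.TODO(), desired); err != nil {
			return nil, err
		}
		return desired, nil
	}
	return current, nil
}
```

The point of the split is that desired-state computation, current-state lookups, and change application can each be reasoned about and tested in isolation.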

Note about manifests.go: another important goal is to eliminate the ManifestFactory type and replace it with stateless functions which simply load and return manifest content — any additional processing of a manifest to compute desired state should be moved into controller files. This is to contain sprawl.
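
For illustration, a stateless loader in this spirit could look like the following sketch; the real function takes no arguments and reads an embedded asset, so passing the manifest bytes in here is a simplification to keep the sketch self-contained:

```go
package manifests

import (
	"bytes"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

// LoadBalancerService decodes an LB service manifest and returns it as-is;
// any computation of desired state (annotations, selectors, owner references)
// is left to the controller.
func LoadBalancerService(manifest []byte) (*corev1.Service, error) {
	svc := &corev1.Service{}
	if err := yaml.NewYAMLOrJSONDecoder(bytes.NewReader(manifest), 4096).Decode(svc); err != nil {
		return nil, err
	}
	return svc, nil
}
```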

Note about resource naming: when trying to get the current state of a resource, instead of computing the desired state just to get the name of the resource to fetch, start extracting functions to compute names of resources which can take operator config into account (e.g. operator/operand namespaces) and which can be reused for both current and desired computations.
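
A name helper of that shape might look like this sketch; the namespace and naming convention shown are assumptions for illustration:

```go
package controller

import (
	"k8s.io/apimachinery/pkg/types"

	ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
)

// LoadBalancerServiceName computes the name of the LB service for a given
// ClusterIngress so that current and desired lookups can share it.
func LoadBalancerServiceName(ci *ingressv1alpha1.ClusterIngress) types.NamespacedName {
	return types.NamespacedName{
		Namespace: "openshift-ingress",
		Name:      "router-" + ci.Name,
	}
}
```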

pkg/operator/controller/controller.go

Extract LB service management into a separate file, and add LB service finalization to the ingress deletion logic.

Extract ServiceMonitor management into a separate file. Eliminate the dependency on the prometheus project and switch to using Unstructured for managing ServiceMonitor resources; this fixes a race condition where the ServiceMonitor type may not exist yet when we construct our client, which uses a caching discovery client that won't ever refresh and pick up the type later.
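
A minimal sketch of the Unstructured approach (the function name and the elided spec fields are illustrative):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// desiredServiceMonitor builds a ServiceMonitor as an Unstructured object so
// that no typed prometheus-operator client is needed and the scheme does not
// have to know about the CRD up front. Populating the spec is elided here.
func desiredServiceMonitor(namespace, name string) *unstructured.Unstructured {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "monitoring.coreos.com",
		Version: "v1",
		Kind:    "ServiceMonitor",
	})
	sm.SetNamespace(namespace)
	sm.SetName(name)
	return sm
}
```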

pkg/operator/controller/controller_lb.go

This file contains new and extracted pieces from the main controller file which support dealing with the LB service.

pkg/operator/controller/controller_dns.go

This file contains new functions to handle applying DNS records associated with an LB service. Note that dealing with current resources isn't yet supported; for now, always try to apply the desired state and rely on the DNS manager to manage no-ops efficiently.

pkg/manifests/manifests.go

Remove the RouterServiceCloud() function and replace it with a stateless LoadBalancerService() function which only loads contents of the manifest. Move all the logic for computing desired state into the controller.

Do the same for MetricsServiceMonitor().

DNS manager changes

pkg/dns/dns.go

Add new types to represent DNS records (taking us further down the slippery slope of reinventing external-dns). Use the new types with expanded DNS interface methods for Ensure() and Delete() to allow more sophisticated DNS management.
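
A minimal sketch of what such types and the expanded interface could look like; names beyond Type, Alias, Ensure, and Delete (which appear in this PR or its diff) are approximations:

```go
package dns

import (
	configv1 "github.com/openshift/api/config/v1"
)

// RecordType is the DNS record type.
type RecordType string

const (
	// ALIASRecord is a DNS ALIAS record.
	ALIASRecord RecordType = "ALIAS"
)

// Record represents a DNS record in a specific zone.
type Record struct {
	// Zone identifies the zone the record belongs to, by ID or by tags.
	Zone configv1.DNSZone
	// Type is the DNS record type.
	Type RecordType
	// Alias is the options for an ALIAS record.
	Alias *AliasRecord
}

// AliasRecord aliases Domain to Target.
type AliasRecord struct {
	Domain string
	Target string
}

// Manager knows how to manage DNS records for a provider.
type Manager interface {
	// Ensure creates the record if it does not already exist.
	Ensure(record *Record) error
	// Delete removes the record if it exists.
	Delete(record *Record) error
}
```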

pkg/dns/aws/dns.go

Refactor away all assumptions about public and private zones and installer topology. Instead, each call to Ensure() or Delete() works with a single Record as input and uses the zone information encoded in the Record. This eliminates the need for any zone discovery. When a record has a zone addressed by tags rather than by ID, cache the ID we find for the zone.
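
Roughly, the tag-to-ID caching could be structured like this sketch; the Manager fields, the injected lookup function, and the zoneID helper are illustrative stand-ins for the real AWS tag-based lookup:

```go
package aws

import (
	"sort"
	"strings"
	"sync"

	configv1 "github.com/openshift/api/config/v1"
)

// Manager caches the zone IDs it discovers for tag-addressed zones so the
// tag-based lookup happens at most once per zone. The lookup is injected here
// to keep the sketch self-contained.
type Manager struct {
	lookupZoneIDByTags func(tags map[string]string) (string, error)

	mu       sync.Mutex
	idsCache map[string]string
}

// zoneID resolves a DNSZone to a concrete zone ID, caching tag-based lookups.
func (m *Manager) zoneID(zone configv1.DNSZone) (string, error) {
	if zone.ID != "" {
		return zone.ID, nil
	}
	key := tagsKey(zone.Tags)
	m.mu.Lock()
	defer m.mu.Unlock()
	if id, ok := m.idsCache[key]; ok {
		return id, nil
	}
	id, err := m.lookupZoneIDByTags(zone.Tags)
	if err != nil {
		return "", err
	}
	if m.idsCache == nil {
		m.idsCache = map[string]string{}
	}
	m.idsCache[key] = id
	return id, nil
}

// tagsKey builds a deterministic cache key from a tag set.
func tagsKey(tags map[string]string) string {
	pairs := make([]string, 0, len(tags))
	for k, v := range tags {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}
```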

cmd/cluster-ingress-operator/main.go

Unwire ingress and cluster version config from DNS manager, replacing their usage with DNS config. This allows for a much simpler and more reliable DNS management implementation.

Misc

hack/uninstall.sh

Remove only managed components by default rather than the entire operator infra; this is better when trying to run a local operator process against a remote cluster.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2019
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 13, 2019
//
// 1. A public zone shared by all clusters with a domain name equal to the cluster base domain
// 2. A private zone for the cluster with a domain name equal to the cluster base domain
// Manager provides AWS DNS record management. In this implmentation, calling
Contributor:

Typo: "implmentation".

Contributor Author:

Fixed

pkg/dns/dns.go Outdated
// Type is the DNS record type.
Type RecordType

// Alias are the options for an ALIAS record.
Contributor:

"Alias is".

Contributor Author:

Fixed

"time"

ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
"github.com/openshift/cluster-ingress-operator/pkg/manifests"
Contributor:

This is an import cycle (self-import); can just delete it.

Contributor Author:

Fixed

if _, err := f.RouterServiceCloud(ci); err != nil {
t.Errorf("invalid RouterServiceCloud: %v", err)
}
manifests.LoadBalancerService()
Contributor:

Delete the "manifests." qualifier (this call is in the manifests package).

Contributor Author:

Fixed

// TODO: This will need revisited when we stop defaulting .spec.ingressDomain
// and .spec.highAvailability as both can be nil but used with an effective
// system-provided default reported on status.
if ci.Spec.HighAvailability.Type != ingressv1alpha1.CloudClusterIngressHA || ci.Spec.IngressDomain == nil {
Contributor:

We do indeed want eventually to replace ci.Spec.HighAvailability.Type with an effective value, but as long as we are using ci.Spec.HighAvailability, we should still check whether it is nil.

Contributor Author:

Fixed

// desired if the high availability type is Cloud. An LB service will declare an
// owner reference to the given deployment.
func desiredLoadBalancerService(ci *ingressv1alpha1.ClusterIngress, infra *configv1.Infrastructure, deployment *appsv1.Deployment) (*corev1.Service, error) {
if ci.Spec.HighAvailability.Type != ingressv1alpha1.CloudClusterIngressHA {
Contributor:

Need to check whether ci.Spec.HighAvailability is nil.

Contributor Author:

Fixed
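
For illustration, a guard along the lines both comments ask for might look like this sketch (the helper name is hypothetical):

```go
import (
	ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
)

// wantsLoadBalancerService treats a nil HighAvailability or IngressDomain as
// "no LB service desired" instead of dereferencing either field.
func wantsLoadBalancerService(ci *ingressv1alpha1.ClusterIngress) bool {
	if ci.Spec.HighAvailability == nil || ci.Spec.IngressDomain == nil {
		return false
	}
	return ci.Spec.HighAvailability.Type == ingressv1alpha1.CloudClusterIngressHA
}
```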


// ensureLoadBalancerService creates an LB service if one is desired but absent.
// Always returns the current LB service if one exists (whether it already
// existed or was created during the course of the function).
Contributor:

This means ensureRouterForIngress will call ensureDNS for an existing LB service even if none is desired, right? Do we want to do that?

Contributor Author:

If one exists at all, then ensureDNS will try to create any DNS for it. If later on the clusteringress is deleted, we'll tear it all down.

If the clusteringress was deleted we won't even be in this code path in the first place.

Not sure there's an error here yet. However, I think this decomposition lets us at least start reasoning about the scenarios...

@Miciah (Contributor) commented Feb 13, 2019

Should getZoneID check whether the zone is private? I guess before I assumed it was not necessary to do so because we were coding for a specific use-case in which the public zone was not tagged with the cluster id, but as we make the zone look-up more generic, I believe we should revisit that assumption.

Edit: Never mind, we use getZoneID to look up both the private zone and the public zone. However, it might be worth documenting in the API that consumers of the DNS config's .spec.publicZone and .spec.privateZone are not responsible for filtering out private or public zones, respectively, which means that the DNS config must specify ids or tags that identify only public and private zones, respectively.

@ironcladlou (Contributor Author):

> @Miciah: Should getZoneID check whether the zone is private? I guess before I assumed it was not necessary to do so because we were coding for a specific use-case in which the public zone was not tagged with the cluster id, but as we make the zone look-up more generic, I believe we should revisit that assumption.
>
> Edit: Never mind, we use getZoneID to look up both the private zone and the public zone. However, it might be worth documenting in the API that consumers of the DNS config's .spec.publicZone and .spec.privateZone are not responsible for filtering out private or public zones, respectively, which means that the DNS config must specify ids or tags that identify only public and private zones, respectively.

I thought the intent of the DNS API was to allow us to specifically address a unique zone, regardless of its scope in the underlying platform (e.g. public/private, which is an implementation detail). Put another way, two zones can't have the same ID (or tag set as a proxy for ID) but different scopes. @abhinavdahiya, do you agree with that?


// Set up the DNS manager.
- dnsManager, err := createDNSManager(kubeClient, operatorNamespace, infraConfig, dnsConfig, clusterVersionConfig)
+ dnsManager, err := createDNSManager(kubeClient, operatorNamespace, infraConfig, dnsConfig)
Contributor:

We can also delete the get at line 80, and the RBAC rule in 00-cluster-role.yaml.

Contributor Author:

Fixed

@Miciah (Contributor) commented Feb 13, 2019

> Put another way, two zones can't have the same ID (or tag set as a proxy for ID) but different scopes.

The scenario I have in mind is that you have a public zone and a private zone and specify tags for each that select both. Presumably that is an invalid configuration, right? The API doesn't make that obvious.

@ironcladlou ironcladlou changed the title WIP: dns: refactor DNS handling dns: refactor DNS handling Feb 13, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2019
* clean up constraint/override stanzas
* pin to kubernetes 1.12.5 (to match OKD)
* pin to controller-runtime v0.1.9 (the latest compatible with kubernetes 1.12.5)
* pin to aws v1.15.72
* upgrade openshift/api to pick up new config types
@ironcladlou (Contributor Author):

@pravisankar I threw in the ServiceMonitor change last minute to help with flakes since I was already fixing dependencies and doing refactoring around the area.

MetricsClusterRoleBinding = "assets/router/metrics/cluster-role-binding.yaml"
MetricsRole = "assets/router/metrics/role.yaml"
MetricsRoleBinding = "assets/router/metrics/role-binding.yaml"
MetricsServiceMonitorAsset = "assets/router/metrics/service-monitor.yaml"
Contributor:
Looks like MetricsServiceMonitorAsset can be deleted now.

@Miciah (Contributor) commented Feb 14, 2019

Woe:

time="2019-02-14T00:25:47Z" level=error msg="failed to reconcile: failed to ensure clusteringress openshift-ingress-operator/default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/121/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws/531/artifacts/e2e-aws/pods/openshift-ingress-operator_ingress-operator-5c4dfcfd94-kzw8k_ingress-operator.log.gz

@abhinavdahiya (Contributor):

/retest

@ironcladlou (Contributor Author):

@Miciah was that before my service monitor patch? Such an error should now lead to retries

@ironcladlou (Contributor Author):

/retest

@Miciah (Contributor) commented Feb 14, 2019

> @Miciah was that before my service monitor patch? Such an error should now lead to retries

It was after.

@Miciah (Contributor) commented Feb 14, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2019
@ironcladlou (Contributor Author):

/retest

@ironcladlou (Contributor Author):

Router failure in the last failed run was due to the cluster ID mismatch bug (openshift/installer#762)

@ironcladlou (Contributor Author):

/retest

@pravisankar left a comment:

/lgtm

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah, pravisankar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [Miciah,ironcladlou,pravisankar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm Indicates that a PR is ready to be merged.
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
