
dns: refactor DNS handling #121

Merged
openshift-merge-robot merged 2 commits into openshift:master from ironcladlou:dns-api-refactor on Feb 14, 2019
Conversation

@ironcladlou (Contributor) commented Feb 13, 2019

Depends on openshift/installer#1233.
Blocks openshift/installer#1169.

Controller changes

These changes are primarily in support of cluster DNS management API changes, but also include the beginnings of a larger refactor of the controller into a more discrete desired/current/apply-changes model. The general pattern being promoted is:

  • Desired: compute the desired state of a resource.
  • Current: get the current state of a resource.
  • Ensure: given the desired and current state of a resource in our managed graph, compute and apply any necessary changes.
  • Finalize: clean up any external resources which depend on the current resource.

This change is meant to be a minimal, incremental refactor to help us start reining in complexity, with the goal of being able to support more sophisticated change management. Hopefully, with enough such refactors, new useful patterns will emerge; in the meantime there is a benefit in testability and legibility from this sort of decomposition.
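
As a rough sketch of the split (not the exact code in this PR; the currentLoadBalancerService helper and the simplified signatures are assumptions), the LB service pieces might look like:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// currentLoadBalancerService returns the current LB service, or nil if it
// does not exist.
func currentLoadBalancerService(cl client.Client, name types.NamespacedName) (*corev1.Service, error) {
	svc := &corev1.Service{}
	if err := cl.Get(context.TODO(), name, svc); err != nil {
		if errors.IsNotFound(err) {
			return nil, nil
		}
		return nil, err
	}
	return svc, nil
}

// ensureLoadBalancerService compares desired state with current state and
// applies any necessary change; the only change handled in this sketch is
// creating the service when it is desired but absent.
func ensureLoadBalancerService(cl client.Client, desired, current *corev1.Service) (*corev1.Service, error) {
	if desired != nil && current == nil {
		if err := cl.Create(context.TODO(), desired); err != nil {
			return nil, err
		}
		return desired, nil
	}
	return current, nil
}
```

The point of the split is that desired-state computation, current-state lookups, and change application can each be reasoned about and tested in isolation.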

Note about manifests.go: another important goal is to eliminate the ManifestFactory type and replace it with stateless functions which simply load and return manifest content — any additional processing of a manifest to compute desired state should be moved into controller files. This is to contain sprawl.
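
For illustration, a stateless loader in this spirit could look like the following sketch; the real function takes no arguments and reads an embedded asset, so passing the manifest bytes in here is a simplification to keep the sketch self-contained:

```go
package manifests

import (
	"bytes"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

// LoadBalancerService decodes an LB service manifest and returns it as-is;
// any computation of desired state (annotations, selectors, owner references)
// is left to the controller.
func LoadBalancerService(manifest []byte) (*corev1.Service, error) {
	svc := &corev1.Service{}
	if err := yaml.NewYAMLOrJSONDecoder(bytes.NewReader(manifest), 4096).Decode(svc); err != nil {
		return nil, err
	}
	return svc, nil
}
```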

Note about resource naming: when trying to get the current state of a resource, instead of computing the desired state just to get the name of the resource to fetch, start extracting functions to compute names of resources which can take operator config into account (e.g. operator/operand namespaces) and which can be reused for both current and desired computations.
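
A name helper of that shape might look like this sketch; the namespace and naming convention shown are assumptions for illustration:

```go
package controller

import (
	"k8s.io/apimachinery/pkg/types"

	ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
)

// LoadBalancerServiceName computes the name of the LB service for a given
// ClusterIngress so that current and desired lookups can share it.
func LoadBalancerServiceName(ci *ingressv1alpha1.ClusterIngress) types.NamespacedName {
	return types.NamespacedName{
		Namespace: "openshift-ingress",
		Name:      "router-" + ci.Name,
	}
}
```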

pkg/operator/controller/controller.go

Extract LB service management into a separate file, and add LB service finalization to the ingress deletion logic.

Extract ServiceMonitor management into a separate file. Eliminate the dependency on the prometheus project and switch to using Unstructured for managing ServiceMonitor resources; this fixes a race condition where the ServiceMonitor type may not exist yet when we construct our client, which uses a caching discovery client that won't ever refresh and pick up the type later.
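
A minimal sketch of the Unstructured approach (the function name and the elided spec fields are illustrative):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// desiredServiceMonitor builds a ServiceMonitor as an Unstructured object so
// that no typed prometheus-operator client is needed and the scheme does not
// have to know about the CRD up front. Populating the spec is elided here.
func desiredServiceMonitor(namespace, name string) *unstructured.Unstructured {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "monitoring.coreos.com",
		Version: "v1",
		Kind:    "ServiceMonitor",
	})
	sm.SetNamespace(namespace)
	sm.SetName(name)
	return sm
}
```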

pkg/operator/controller/controller_lb.go

This file contains new and extracted pieces from the main controller file which support dealing with the LB service.

pkg/operator/controller/controller_dns.go

This file contains new functions to handle applying DNS records associated with an LB service. Note that dealing with current resources isn't yet supported; for now, always try to apply the desired state and rely on the DNS manager to manage no-ops efficiently.

pkg/manifests/manifests.go

Remove the RouterServiceCloud() function and replace it with a stateless LoadBalancerService() function which only loads contents of the manifest. Move all the logic for computing desired state into the controller.

Do the same for MetricsServiceMonitor().

DNS manager changes

pkg/dns/dns.go

Add new types to represent DNS records (taking us further down the slippery slope of reinventing external-dns). Use the new types with expanded DNS interface methods for Ensure() and Delete() to allow more sophisticated DNS management.
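
A minimal sketch of what such types and the expanded interface could look like; names beyond Type, Alias, Ensure, and Delete (which appear in this PR or its diff) are approximations:

```go
package dns

import (
	configv1 "github.com/openshift/api/config/v1"
)

// RecordType is the DNS record type.
type RecordType string

const (
	// ALIASRecord is a DNS ALIAS record.
	ALIASRecord RecordType = "ALIAS"
)

// Record represents a DNS record in a specific zone.
type Record struct {
	// Zone identifies the zone the record belongs to, by ID or by tags.
	Zone configv1.DNSZone
	// Type is the DNS record type.
	Type RecordType
	// Alias is the options for an ALIAS record.
	Alias *AliasRecord
}

// AliasRecord aliases Domain to Target.
type AliasRecord struct {
	Domain string
	Target string
}

// Manager knows how to manage DNS records for a provider.
type Manager interface {
	// Ensure creates the record if it does not already exist.
	Ensure(record *Record) error
	// Delete removes the record if it exists.
	Delete(record *Record) error
}
```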

pkg/dns/aws/dns.go

Refactor away all assumptions about public and private zones and installer topology. Instead, each call to Ensure() or Delete() works with a single Record as input and uses the zone information encoded in the Record. This eliminates the need for any zone discovery. When a record has a zone addressed by tags rather than by ID, cache the ID we find for the zone.
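
Roughly, the tag-to-ID caching could be structured like this sketch; the Manager fields, the injected lookup function, and the zoneID helper are illustrative stand-ins for the real AWS tag-based lookup:

```go
package aws

import (
	"sort"
	"strings"
	"sync"

	configv1 "github.com/openshift/api/config/v1"
)

// Manager caches the zone IDs it discovers for tag-addressed zones so the
// tag-based lookup happens at most once per zone. The lookup is injected here
// to keep the sketch self-contained.
type Manager struct {
	lookupZoneIDByTags func(tags map[string]string) (string, error)

	mu       sync.Mutex
	idsCache map[string]string
}

// zoneID resolves a DNSZone to a concrete zone ID, caching tag-based lookups.
func (m *Manager) zoneID(zone configv1.DNSZone) (string, error) {
	if zone.ID != "" {
		return zone.ID, nil
	}
	key := tagsKey(zone.Tags)
	m.mu.Lock()
	defer m.mu.Unlock()
	if id, ok := m.idsCache[key]; ok {
		return id, nil
	}
	id, err := m.lookupZoneIDByTags(zone.Tags)
	if err != nil {
		return "", err
	}
	if m.idsCache == nil {
		m.idsCache = map[string]string{}
	}
	m.idsCache[key] = id
	return id, nil
}

// tagsKey builds a deterministic cache key from a tag set.
func tagsKey(tags map[string]string) string {
	pairs := make([]string, 0, len(tags))
	for k, v := range tags {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}
```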

cmd/cluster-ingress-operator/main.go

Unwire ingress and cluster version config from DNS manager, replacing their usage with DNS config. This allows for a much simpler and more reliable DNS management implementation.

Misc

hack/uninstall.sh

Remove only managed components by default rather than the entire operator infra; this is better when trying to run a local operator process against a remote cluster.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2019
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 13, 2019
//
// 1. A public zone shared by all clusters with a domain name equal to the cluster base domain
// 2. A private zone for the cluster with a domain name equal to the cluster base domain
// Manager provides AWS DNS record management. In this implmentation, calling
Contributor:

Typo: "implmentation".

Contributor Author:

Fixed

pkg/dns/dns.go Outdated
// Type is the DNS record type.
Type RecordType

// Alias are the options for an ALIAS record.
Contributor:

"Alias is".

Contributor Author:

Fixed

"time"

ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
"github.com/openshift/cluster-ingress-operator/pkg/manifests"
Contributor:

This is an import cycle (self-import); can just delete it.

Contributor Author:

Fixed

if _, err := f.RouterServiceCloud(ci); err != nil {
t.Errorf("invalid RouterServiceCloud: %v", err)
}
manifests.LoadBalancerService()
Contributor:

Delete the "manifests." qualifier (this call is in the manifests package).

Contributor Author:

Fixed

// TODO: This will need revisited when we stop defaulting .spec.ingressDomain
// and .spec.highAvailability as both can be nil but used with an effective
// system-provided default reported on status.
if ci.Spec.HighAvailability.Type != ingressv1alpha1.CloudClusterIngressHA || ci.Spec.IngressDomain == nil {
Contributor:

We do indeed want eventually to replace ci.Spec.HighAvailability.Type with an effective value, but as long as we are using ci.Spec.HighAvailability, we should still check whether it is nil.

Contributor Author:

Fixed

// desired if the high availability type is Cloud. An LB service will declare an
// owner reference to the given deployment.
func desiredLoadBalancerService(ci *ingressv1alpha1.ClusterIngress, infra *configv1.Infrastructure, deployment *appsv1.Deployment) (*corev1.Service, error) {
if ci.Spec.HighAvailability.Type != ingressv1alpha1.CloudClusterIngressHA {
Contributor:

Need to check whether ci.Spec.HighAvailability is nil.

Contributor Author:

Fixed
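
For illustration, a guard along the lines both comments ask for might look like this sketch (the helper name is hypothetical):

```go
import (
	ingressv1alpha1 "github.com/openshift/cluster-ingress-operator/pkg/apis/ingress/v1alpha1"
)

// wantsLoadBalancerService treats a nil HighAvailability or IngressDomain as
// "no LB service desired" instead of dereferencing either field.
func wantsLoadBalancerService(ci *ingressv1alpha1.ClusterIngress) bool {
	if ci.Spec.HighAvailability == nil || ci.Spec.IngressDomain == nil {
		return false
	}
	return ci.Spec.HighAvailability.Type == ingressv1alpha1.CloudClusterIngressHA
}
```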


// ensureLoadBalancerService creates an LB service if one is desired but absent.
// Always returns the current LB service if one exists (whether it already
// existed or was created during the course of the function).
Contributor:

This means ensureRouterForIngress will call ensureDNS for an existing LB service even if none is desired, right? Do we want to do that?

Contributor Author:

If one exists at all, then ensureDNS will try to create any DNS for it. If later on the clusteringress is deleted, we'll tear it all down.

If the clusteringress was deleted we won't even be in this code path in the first place.

Not sure there's an error here yet. However, I think this decomposition lets us at least start reasoning about the scenarios...

@Miciah (Contributor) commented Feb 13, 2019

Should getZoneID check whether the zone is private? I guess before I assumed it was not necessary to do so because we were coding for a specific use-case in which the public zone was not tagged with the cluster id, but as we make the zone look-up more generic, I believe we should revisit that assumption.

Edit: Never mind, we use getZoneID to look up both the private zone and the public zone. However, it might be worth documenting in the API that consumers of the DNS config's .spec.publicZone and .spec.privateZone are not responsible for filtering out private or public zones, respectively, which means that the DNS config must specify ids or tags that identify only public and private zones, respectively.

@ironcladlou (Contributor Author):

> @Miciah: Should getZoneID check whether the zone is private? I guess before I assumed it was not necessary to do so because we were coding for a specific use-case in which the public zone was not tagged with the cluster id, but as we make the zone look-up more generic, I believe we should revisit that assumption.
>
> Edit: Never mind, we use getZoneID to look up both the private zone and the public zone. However, it might be worth documenting in the API that consumers of the DNS config's .spec.publicZone and .spec.privateZone are not responsible for filtering out private or public zones, respectively, which means that the DNS config must specify ids or tags that identify only public and private zones, respectively.

I thought the intent of the DNS API was to allow us to specifically address a unique zone, regardless of its scope in the underlying platform (e.g. public/private, which is an implementation detail). Put another way, two zones can't have the same ID (or tag set as a proxy for ID) but different scopes. @abhinavdahiya, do you agree with that?


// Set up the DNS manager.
- dnsManager, err := createDNSManager(kubeClient, operatorNamespace, infraConfig, dnsConfig, clusterVersionConfig)
+ dnsManager, err := createDNSManager(kubeClient, operatorNamespace, infraConfig, dnsConfig)
Contributor:

We can also delete the get at line 80, and the RBAC rule in 00-cluster-role.yaml.

Contributor Author:

Fixed

@Miciah (Contributor) commented Feb 13, 2019

> Put another way, two zones can't have the same ID (or tag set as a proxy for ID) but different scopes.

The scenario I have in mind is that you have a public zone and a private zone and specify tags for each that select both. Presumably that is an invalid configuration, right? The API doesn't make that obvious.

@ironcladlou ironcladlou changed the title WIP: dns: refactor DNS handling dns: refactor DNS handling Feb 13, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2019
* clean up constraint/override stanzas
* pin to kubernetes 1.12.5 (to match OKD)
* pin to controller-runtime v0.1.9 (the latest compatible with kubernetes 1.12.5)
* pin to aws v1.15.72
* upgrade openshift/api to pick up new config types
@ironcladlou (Contributor Author):

@pravisankar I threw in the ServiceMonitor change last minute to help with flakes since I was already fixing dependencies and doing refactoring around the area.

MetricsClusterRoleBinding = "assets/router/metrics/cluster-role-binding.yaml"
MetricsRole = "assets/router/metrics/role.yaml"
MetricsRoleBinding = "assets/router/metrics/role-binding.yaml"
MetricsServiceMonitorAsset = "assets/router/metrics/service-monitor.yaml"
Contributor:
Looks like MetricsServiceMonitorAsset can be deleted now.

@Miciah (Contributor) commented Feb 14, 2019

Woe:

time="2019-02-14T00:25:47Z" level=error msg="failed to reconcile: failed to ensure clusteringress openshift-ingress-operator/default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/121/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws/531/artifacts/e2e-aws/pods/openshift-ingress-operator_ingress-operator-5c4dfcfd94-kzw8k_ingress-operator.log.gz

@abhinavdahiya (Contributor):

/retest

@ironcladlou (Contributor Author):

@Miciah was that before my service monitor patch? Such an error should now lead to retries

@ironcladlou (Contributor Author):

/retest

@Miciah (Contributor) commented Feb 14, 2019

> @Miciah was that before my service monitor patch? Such an error should now lead to retries

It was after.

@Miciah (Contributor) commented Feb 14, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2019
@ironcladlou (Contributor Author):

/retest

@ironcladlou (Contributor Author):

Router failure in the last failed run was due to the cluster ID mismatch bug (openshift/installer#762)

@ironcladlou (Contributor Author):

/retest

@pravisankar left a comment:

/lgtm

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah, pravisankar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [Miciah,ironcladlou,pravisankar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm Indicates that a PR is ready to be merged.
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
