Conversation

@jim-minter (Member) commented Feb 6, 2021

Our/OCP's use of a private DNS zone prevents customers from overriding the vnet DNS. This PR removes private DNS zones from our architecture.

Current status:

  • feedback welcome
  • deployment manually tested in default/non-default domain, public/private API server and public/private console contexts
  • upgrade of existing clusters implemented and tested
  • unit tests are a WIP

Design ideas:

  1. at node level, interpose dnsmasq between clients and the vnet resolver, injecting additional domains as necessary. This should cover all node-level and pod-level DNS requests. I quite like this concept as we could later build on it for egress lockdown work. The main difficulty is populating the dnsmasq configuration inside the MachineConfig, hence:
  2. preallocate the api, api-int and default router IPs much earlier in our cluster creation process so that we can lay down the above dnsmasq configuration in the install graph. The main wrinkle here is that a race condition is introduced when the default router has private visibility, but I think we can probably get away with it. There is also upside here: by predefining these IPs up front, I think we could later refactor/simplify/speed up our post-install processes (e.g. pre-configure certificates and maybe avoid the associated API server rotation?)
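As an illustration of design idea 1, the per-node dnsmasq configuration might look something like the fragment below. This is a sketch only: the domain and IPs are invented, the real file is generated per-cluster inside the MachineConfig, and only 168.63.129.16 (Azure's well-known platform resolver) is a real value. `address=` answers a name and all of its subdomains locally; everything else falls through to the upstream resolvers.

```
# Sketch of a node-level dnsmasq config (illustrative values).

# Answer the cluster's own names locally from the preallocated IPs...
address=/api.aro.example.com/10.0.0.5
address=/api-int.aro.example.com/10.0.0.5
address=/apps.aro.example.com/10.0.0.7

# ...and forward everything else to the resolvers the vnet handed out
# via DHCP (which the customer is now free to override), e.g. a copy
# of the DHCP-provided resolv.conf, or Azure's default resolver:
# server=168.63.129.16
resolv-file=/etc/resolv.conf.dnsmasq
```

The `address=/apps.aro.example.com/...` line is what lets `*.apps` router names resolve without a private DNS zone.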

Implementation summary:

  1. Installer changes
    • add the bootkube.ARODNSConfig type to pass IPs into the installer (a bit clunky)
    • populate the openshift-ingress/router-default LoadBalancer Service early, configuring it to use the IP we've already allocated. Wrinkle: currently the ingress operator fails to adopt the Service later; I'm not sure if that matters or not
    • configure and enable dnsmasq on bootstrap
    • configure and enable dnsmasq on masters and workers via MachineConfigs
    • modify the machine config server URL and certificate to use the api-int IP address as well as DNS, as the DNS name is not resolvable at the point that non-bootstrap ignition calls the API
    • disable ingress operator dynamic DNS (prevent trying to write to the private DNS zone which no longer exists).
  2. This PR:
    • create network resources (public IP(s), LBs, PLS for consistency) before install graph creation, so we can discover their IPs
    • if the default router has public visibility, create its public IP up front; it will be adopted later by the early populated openshift-ingress/router-default LoadBalancer Service
    • if the default router has private visibility, discover the highest unused subnet IP address; it will be adopted later by the early populated openshift-ingress/router-default LoadBalancer Service. We hope that no-one else takes the IP address in the interim
    • add api-int IP to internal model (I'm not actually sure this is necessary or desirable)
    • pass discovered IPs to installer
    • remove dynamic vnet DNS validation and upgrade geneva action (this may be premature, depending on the upgrade story)
    • remove deployment of private DNS zones (but not deletion, at least for now)
    • implement upgrade
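The "highest unused subnet IP" step for a private-visibility default router could be sketched roughly as below. This is an illustrative Python sketch, not the PR's Go implementation; it assumes the in-use addresses have already been fetched from the Azure API, and relies on the fact that Azure reserves five addresses per subnet (the network address, the first three host addresses, and the broadcast address).

```python
import ipaddress

def highest_unused_ip(cidr: str, in_use: list[str]) -> str:
    """Pick the highest assignable IP in `cidr` that is not in `in_use`.

    Azure reserves five addresses per subnet: the network address,
    the first three host addresses, and the broadcast address.
    """
    net = ipaddress.ip_network(cidr)
    used = {ipaddress.ip_address(ip) for ip in in_use}
    # hosts() already excludes the network and broadcast addresses;
    # also drop Azure's three reserved leading host addresses.
    candidates = list(net.hosts())[3:]
    # Walk from the top of the subnet downwards.
    for ip in reversed(candidates):
        if ip not in used:
            return str(ip)
    raise RuntimeError(f"no free IP left in {cidr}")
```

Note the race the PR acknowledges: nothing reserves this address between discovery and the LoadBalancer Service adopting it, so another allocation in the interim would collide.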

Test plan:

  • create new-style cluster
  • admin upgrade new-style cluster (no-op)
  • validate new controller behaviour (modify/delete configs and inspect)
  • create and upgrade old-style cluster
  • scale up upgraded cluster
  • redeploy an old cluster node on an upgraded cluster
  • perform an OpenShift upgrade of the cluster
  • create new-style cluster with non-standard dns server, non-standard cluster domain
  • validate node hostnames
  • perform an OpenShift upgrade of the cluster
  • create new-style cluster with non-standard dns server, non-standard cluster domain, private API server and ingress
  • perform an OpenShift upgrade of the cluster

@jim-minter jim-minter marked this pull request as draft February 6, 2021 18:33
@mjudeikis (Contributor)

@jim-minter I think this is a bit over the line in terms of how much we do to make OCP into ARO.
I think it would be good to get upstream alignment on how this is going to play into long-term OCP plans. This could easily backfire for ARO in the long term if we implement something which contradicts wider product development.

It's fine if the overall product direction will be similar (dnsmasq for a start); otherwise this sounds like 3.11 all over again.

Contributor
pretty cool that this PR also cleans up KubeActions a bit

@cgwalters

This seems strongly related to openshift/machine-config-operator#2258

@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Feb 16, 2021
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Feb 16, 2021
@jim-minter jim-minter force-pushed the lbip branch 5 times, most recently from a7a417b to 28a17c5 Compare February 19, 2021 23:29
@jim-minter jim-minter changed the title Exploratory: remove private DNS zones from architecture Remove private DNS zones from architecture Feb 19, 2021
@jim-minter jim-minter force-pushed the lbip branch 6 times, most recently from c9fd165 to 4176b59 Compare February 21, 2021 21:20
@jim-minter jim-minter marked this pull request as ready for review February 21, 2021 22:28
@mjudeikis (Contributor) left a comment

This is something :)
I'm OK with most of the changes, except a few cosmetic ones.

Contributor

Can you elaborate why we need this fixup?

Member Author

For existing clusters, the VM userdata (read by ignition on the VM at first boot) points to https://api-int:22623. Without changing this, scale up will fail because new VMs can't resolve api-int.
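For context, the pointer ignition config stored in VM userdata looks roughly like the sketch below (Ignition spec 3.x; the IP, path and CA payload are invented for illustration). The fixup rewrites `source` to carry the api-int IP directly; this only works because the machine config server certificate is also extended to include that IP as a SAN.

```json
{
  "ignition": {
    "version": "3.1.0",
    "config": {
      "merge": [
        {"source": "https://10.0.0.4:22623/config/worker"}
      ]
    },
    "security": {
      "tls": {
        "certificateAuthorities": [
          {"source": "data:text/plain;charset=utf-8;base64,LS0tLS1..."}
        ]
      }
    }
  }
}
```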

Contributor

Does not look related to this PR

Member Author

Correct; the first few commits in this PR are like that.

Contributor

A README covering what this does and why would be great.

Contributor

The reconcileRole name is very confusing. Which roles are these? Kubernetes roles? Node roles? Azure roles? It took me a while reading down the rabbit hole to understand.

Member Author

Agree, will fix and document

@mjudeikis (Contributor)

A few other questions:

  1. At what point did we add the ARO worker registry?
    jim-minter/installer@6fb6f85
  2. From the installer: is &bootkube.ARODNSConfig{} worth making more general, e.g. AROConfig, for future re-usability and to try to upstream it?
  3. loggingConfig, aroDNSConfig: could these be covered by (2) and simplified?
  4. It is not clear why existing behaviour for non-ARO clusters changes:
    jim-minter/installer@9f4cb34#diff-2d86052ba31adcc485733d72fc5da379a06d23af4b276d92b472b0e5a5461b5cR52

@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Feb 22, 2021
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Feb 22, 2021
4 participants