@JM1 JM1 commented Sep 21, 2023

OKD/FCOS uses plain FCOS images, without OKD/OCP code, as boot images for IPI, SNO and the Agent-based Installer. FCOS uses systemd-resolved to handle DNS queries.

During installation, the bootstrap node or rendezvous host runs the bootkube.sh script, which queries the cluster's api.* and api-int.* endpoints to detect when the in-cluster control plane has come up. DNS resolution of these endpoints is supposed to be handled by CoreDNS, which listens on 0.0.0.0:53 on the same node and is registered as 127.0.0.1 in /etc/resolv.conf.

Some tools called via bootkube.sh, such as curl, do not read /etc/resolv.conf directly but go through the Name Service Switch (NSS), which delegates queries to systemd-resolved. systemd-resolved uses split DNS to route DNS queries to specific nameservers depending on the DNS domain settings of the network interfaces.

When the bootstrap node or rendezvous host boots FCOS, it retrieves its network configuration via DHCP. The domain record received via DHCP (which is part of the OKD cluster domain) is associated with the network interface that handled the DHCP exchange. This causes systemd-resolved to send all DNS queries for the OKD cluster domain to the network DNS server, and never to CoreDNS.

OKD and OCP do not require the network DNS server to resolve the api-int.* endpoint on bare-metal servers for HA deployments with IPI or the Agent-based Installer (for SNO it is required). When the network DNS servers cannot resolve api-int.*, services such as bootkube.sh on the rendezvous host (Agent-based Installer) cannot query the in-cluster control plane and the installation stalls, because split DNS keeps systemd-resolved from using CoreDNS.

With this change, the cluster domain is associated with the DNS server at 127.0.0.1, which is CoreDNS. systemd-resolved then queries CoreDNS when resolving api-int.*, and the Agent-based Installer can finish successfully.
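
For readers unfamiliar with split DNS, the routing decision described above can be approximated with a small Python model. This is only a sketch under stated assumptions, not systemd-resolved's actual implementation: the `Scope` class, the `route` function, the cluster domain `mycluster.example.com` and the uplink resolver address `192.0.2.53` are all hypothetical, and the real resolver has additional rules (default routes, search domains, per-protocol scopes) that are left out. It shows a query going to the nameserver(s) whose routing domain is the longest matching suffix, so a domain attached only to the DHCP interface never reaches CoreDNS.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """A nameserver plus the routing domains attached to it (simplified)."""
    nameserver: str
    domains: list = field(default_factory=list)


def labels_matched(query: str, domain: str) -> int:
    """Number of labels in `domain` if `query` equals or lies under it, else 0."""
    q = query.rstrip(".").lower().split(".")
    d = domain.rstrip(".").lower().split(".")
    return len(d) if q[-len(d):] == d else 0


def route(query: str, scopes: list) -> list:
    """Return the nameservers whose routing domain matches `query` best."""
    scores = [
        max((labels_matched(query, d) for d in s.domains), default=0)
        for s in scopes
    ]
    best = max(scores, default=0)
    if best == 0:
        return []  # a real resolver would fall back to its default-route scopes
    return [s.nameserver for s, score in zip(scopes, scores) if score == best]


CLUSTER = "mycluster.example.com"  # hypothetical cluster domain
UPLINK = "192.0.2.53"              # hypothetical DHCP-provided DNS server
COREDNS = "127.0.0.1"

# Before the change: only the DHCP interface carries the cluster domain.
before = [Scope(UPLINK, [CLUSTER]), Scope(COREDNS, [])]
# After the change: the cluster domain is also associated with 127.0.0.1.
after = [Scope(UPLINK, [CLUSTER]), Scope(COREDNS, [CLUSTER])]

print(route("api-int." + CLUSTER, before))  # only the network DNS server
print(route("api-int." + CLUSTER, after))   # network DNS server and CoreDNS
```

In this model, equally good matches are all queried, so after the change a host that also received the cluster domain via DHCP asks both the network DNS server and CoreDNS.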

cc @vrutkovs @andfasano @LorbusChris

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Sep 21, 2023
@openshift-ci-robot
Contributor

@JM1: This pull request references Jira Issue OCPBUGS-19552, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LorbusChris
Contributor

great find and commit msg 👍

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Sep 21, 2023
@openshift-ci-robot
Contributor

@JM1: This pull request references Jira Issue OCPBUGS-19552, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

@JM1 JM1 force-pushed the okd-split-dns-fix branch from c4861fb to b6cc933 September 21, 2023 17:22
@openshift-ci openshift-ci bot removed the lgtm label Sep 21, 2023
@JM1
Contributor Author

JM1 commented Sep 21, 2023

Last force-push fixes the commit message only.

@LorbusChris
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Sep 21, 2023
…md-resolved

OKD/FCOS uses FCOS without OKD/OCP code as bootimages for IPI, SNO
and Agent-based Installer. FCOS uses systemd-resolved for handling
DNS queries.

During installation, the bootstrap node or rendezvous host executes
a bootkube.sh script which queries the cluster api.* and api-int.*
endpoints to detect when the in-cluster control plane has come up.
DNS resolution of these endpoints is supposed to be handled by
CoreDNS, which listens on 0.0.0.0:53 on the same node and is
registered as 127.0.0.1 in /etc/resolv.conf.

Some tools such as curl which are called via bootkube.sh [0]
do not query /etc/resolv.conf but use Name Service Switch (NSS) [1]
which delegates queries to systemd-resolved. systemd-resolved uses
split DNS [2] to route DNS queries to specific nameservers depending
on the DNS domain settings of the network interfaces.

When the bootstrap node or rendezvous host is booted with FCOS, it
will retrieve its network configuration via DHCP. The domain record
which is received via DHCP (and is part of the OKD cluster domain)
is associated with the network interface which handled the DHCP
exchange. This causes systemd-resolved to send all DNS queries for
the OKD cluster domain through the network DNS server, but never to
CoreDNS.

OKD and OCP do not require the network DNS server to be able to
resolve the api-int.* endpoint on bare-metal servers for HA
deployments with IPI or Agent-based Installer (for SNO it is
required). When api-int.* is not resolved by the network DNS
servers, then services such as bootkube.sh at the rendezvous host
(from Agent-based Installer) cannot query the in-cluster control
plane and installation will stall (because systemd-resolved will
not use CoreDNS due to Split DNS).

With this change, the cluster domain will be associated with the
DNS server at 127.0.0.1 which is CoreDNS. systemd-resolved will
then properly query CoreDNS when resolving api-int.* and
Agent-based Installer can finish successfully.

[0] https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootstrap-verify-api-server-urls.sh
[1] https://www.mankier.com/5/nsswitch.conf
[2] https://fedoramagazine.org/systemd-resolved-introduction-to-split-dns/
@JM1 JM1 force-pushed the okd-split-dns-fix branch from b6cc933 to ed594dc September 22, 2023 07:15
@openshift-ci openshift-ci bot removed the lgtm label Sep 22, 2023
@JM1
Contributor Author

JM1 commented Sep 22, 2023

FYI Last patch is rebased on top of master and resolves merge conflicts.

@LorbusChris
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Sep 22, 2023
@JM1
Contributor Author

JM1 commented Sep 23, 2023

/retest-required

@andfasano
Contributor

/approve

@andfasano
Contributor

Note: the CI still has some issues (not related to this patch), and they are affecting #7505 as well.

@openshift-ci
Contributor

openshift-ci bot commented Oct 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano

@openshift-ci openshift-ci bot added the approved label Oct 3, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD a1cfa18 and 2 for PR HEAD ed594dc in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 8783558 and 1 for PR HEAD ed594dc in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD bc15daa and 0 for PR HEAD ed594dc in total

@vrutkovs
Contributor

vrutkovs commented Oct 4, 2023

/hold cancel
/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Oct 4, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD bc15daa and 2 for PR HEAD ed594dc in total

@JM1
Contributor Author

JM1 commented Oct 4, 2023

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 732271d and 1 for PR HEAD ed594dc in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 3a738d8 and 0 for PR HEAD ed594dc in total

@openshift-ci-robot
Contributor

/hold

Revision ed594dc was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold label Oct 5, 2023
@andfasano
Contributor

Note: this will require #7484 for proper agent testing.

@andfasano
Contributor

/retest-required

@cybertron
Member

/lgtm

Makes sense. This shouldn't affect the non-okd jobs since ocp doesn't use systemd-resolved for resolution.

@andfasano
Contributor

/lgtm

> Makes sense. This shouldn't affect the non-okd jobs since ocp doesn't use systemd-resolved for resolution.

Thanks @cybertron for the feedback

@openshift-ci
Contributor

openshift-ci bot commented Oct 5, 2023

@JM1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | ed594dc | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-metal-single-node-live-iso | ed594dc | link | false | /test e2e-metal-single-node-live-iso |

@sadasu
Contributor

sadasu commented Oct 10, 2023

/retest-required

@LorbusChris
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Oct 11, 2023
@openshift-ci openshift-ci bot merged commit a7f89b2 into openshift:master Oct 11, 2023
@openshift-ci-robot
Contributor

@JM1: Jira Issue OCPBUGS-19552: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19552 has been moved to the MODIFIED state.

JM1 added a commit to JM1/openshift-installer that referenced this pull request Oct 27, 2023
OCP requires the DNS records api.<cluster_domain> and
*.apps.<cluster_domain> to be externally resolvable (<cluster_domain>
is <cluster_name>.<base_domain>). For SNO this list also includes the
DNS record api-int.<cluster_domain>.

However, OCP does not enforce ownership of all subdomains of
<cluster_domain>. For example, it is allowed to host a disconnected
image registry at <registry_hostname>.<cluster_domain>, and OCP must
be able to resolve it using the user-supplied external DNS resolver.

PR openshift#7516 changed the systemd-resolved config of the bootstrap node /
rendezvous host to associate the complete <cluster_domain> with the
DNS server at 127.0.0.1 where CoreDNS is supposed to be listening.

When a disconnected image registry is used for cluster installation,
the registry is hosted at <registry_hostname>.<cluster_domain>, and
the bootstrap node / rendezvous host does not retrieve its domain
from the DHCP server, the registry's DNS name cannot be resolved.
That is because the disconnected registry must be contacted in order
to pull the CoreDNS image, but the split DNS mechanism of
systemd-resolved sends DNS requests for
<registry_hostname>.<cluster_domain> to 127.0.0.1, where CoreDNS is
expected to be running but is not yet.

When a bootstrap node / rendezvous host retrieves its domain
<cluster_domain> from a DHCP server (e.g. dnsmasq's '--domain'
option), systemd-resolved associates <cluster_domain> not only with
127.0.0.1 but also with the physical network interface, causing DNS
requests for <registry_hostname>.<cluster_domain> to be sent to
127.0.0.1 as well as to the external DNS resolver.

This patch mitigates the DNS issue for other network setups. It
changes the systemd-resolved config to forward DNS requests to
CoreDNS only for domains which are resolvable by CoreDNS:

* api.<cluster_domain>
* api-int.<cluster_domain>
* apps.<cluster_domain>

DNS requests for <registry_hostname>.<cluster_domain> and other
subdomains of <cluster_domain> will be sent to the external
DNS resolver.

Fixes openshift#7516
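
The narrowed routing described in this commit message can be illustrated with a short Python sketch. It is a simplified model and not the installer's code: the function names, the cluster domain `mycluster.example.com` and the hostnames are hypothetical. It shows which names would be forwarded to CoreDNS once only the three listed domains, rather than the whole cluster domain, are associated with 127.0.0.1.

```python
def coredns_domains(cluster_domain: str) -> list:
    """Domains routed to CoreDNS at 127.0.0.1, per the commit message."""
    return [f"{prefix}.{cluster_domain}" for prefix in ("api", "api-int", "apps")]


def goes_to_coredns(hostname: str, cluster_domain: str) -> bool:
    """True if `hostname` equals or lies under one of the CoreDNS domains."""
    name = hostname.rstrip(".").lower()
    return any(
        name == domain or name.endswith("." + domain)
        for domain in coredns_domains(cluster_domain)
    )


CLUSTER = "mycluster.example.com"  # hypothetical cluster domain

for host in (
    "api-int." + CLUSTER,       # resolved by CoreDNS
    "console.apps." + CLUSTER,  # under apps.<cluster_domain>, also CoreDNS
    "registry." + CLUSTER,      # left to the external DNS resolver
):
    target = "CoreDNS (127.0.0.1)" if goes_to_coredns(host, CLUSTER) else "external resolver"
    print(f"{host} -> {target}")
```

Matching on a leading dot keeps unrelated names that merely share a textual suffix, such as a hypothetical `xapi.<cluster_domain>`, out of the CoreDNS set.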
JM1 added a commit to JM1/openshift-installer that referenced this pull request Jan 22, 2024

(cherry picked from commit 5380ad9)