Conversation

@jcpowermac (Contributor) opened this pull request.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 11, 2019
@jcpowermac jcpowermac force-pushed the vmw_m3_ops_networking branch 2 times, most recently from 6245021 to 66ab0f5 Compare December 11, 2019 17:30
@russellb (Contributor):

The proposal section was slightly modified from: https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md

What were the modifications? Can you add a link to this to the doc itself, as well?

@jcpowermac (Contributor, author):

The proposal section was slightly modified from: https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md

What were the modifications? Can you add a link to this to the doc itself, as well?

Sure, I can add a link to the original document. I modified this text since it has been moved to the MCO:

 The `keepalived`
instance here is managed by systemd and a script is used to generate
the `keepalived` configuration before launching the service using
`podman`. 

Added repository links to pod and service configurations.

@jcpowermac jcpowermac force-pushed the vmw_m3_ops_networking branch 3 times, most recently from 4b41706 to 1aae954 Compare December 12, 2019 18:34
### Goals

Install an IPI OpenShift cluster on various on-premise non-cloud platforms that
provides internal DNS and load balancing that is minimally required for OpenShift
@abhinavdahiya (Contributor), Dec 12, 2019:

Can we add a section that briefly explains what these minimum requirements are?

## Proposal

The basis and significant portions of this proposal were taken from existing
[documentation](https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md).

Comment on lines 69 to 71
In both cases, the installation process expects these ports to be
reachable on the bootstrap instance at first and then later on the
newly-deployed control plane machines.
Contributor:

That's technically not correct; the expectation is that these are load-balanced to any one of them as long as it's healthy.

There is no "first bootstrap and then control plane"; it's all healthy endpoints.

@jcpowermac (Contributor, author):

@abhinavdahiya I removed this section from the original doc:

On other platforms (for example, see the AWS UPI instructions), an external load-balancer is required to be configured in advance in order to provide this access.

For bootstrap, I am unable to find haproxy in use. From my review, the expectation is that the VIP moves to a control plane node once the bootstrap node is destroyed.

cc: @russellb


In cluster network infrastructure, a VIP (Virtual IP) is used to provide
failover of the API server across the control plane machines
(including the bootstrap instance). This "API VIP" is provided by the user
Contributor:

(including the bootstrap instance)

This should probably be reworded along the lines of: the bootstrap host is part of this backend during bootstrapping.


In cluster network infrastructure, a VIP (Virtual IP) is used to provide
failover of the API server across the control plane machines
(including the bootstrap instance). This "API VIP" is provided by the user
Contributor:

as an install-config.yaml [parameter]

I think we should just mention that it's provided by the user to the installer, because it can be prompted for via the terminal prompts and not just install-config.

(including the bootstrap instance). This "API VIP" is provided by the user
as an `install-config.yaml` [parameter](https://github.com/openshift/installer/blob/master/pkg/types/baremetal/platform.go#L57-L63)
and the installation process configures `keepalived` to manage this VIP.

Contributor:

I would like to see some restrictions on the API VIP included. For example: is it globally addressable, or only on the private network? Is it part of the cluster's private network or some other private network outside the cluster, etc.?

Member:

It seems that, since we only have one API VIP, whether or not it's RFC1918 matters less than the requirement that the VIP is accessible both by the cluster and by external clients.

failover of the API server across the control plane machines
(including the bootstrap instance). This "API VIP" is provided by the user
as an `install-config.yaml` [parameter](https://github.com/openshift/installer/blob/master/pkg/types/baremetal/platform.go#L57-L63)
and the installation process configures `keepalived` to manage this VIP.
Contributor:

The installation process doesn't configure it; rather, the cluster configures and maintains it. So I think we should move away from treating it as install-time-only configuration.

as an `install-config.yaml` [parameter](https://github.com/openshift/installer/blob/master/pkg/types/baremetal/platform.go#L57-L63)
and the installation process configures `keepalived` to manage this VIP.

The API VIP first resides on the bootstrap instance.
@abhinavdahiya (Contributor), Dec 12, 2019:

This probably needs to be separated out into sections covering the lifecycle of the API VIP during bootstrapping and afterwards in the cluster, and the keepalived setup.

@abhinavdahiya (Contributor), Dec 12, 2019:

keepalived setup

It will be useful to describe:

* the init container's input and role
* the actual keepalived inputs and configured behavior, briefly.

Contributor:

Also, there seems to be a difference in keepalived pod setup/configuration between bootstrap and cluster nodes; see https://github.com/openshift/machine-config-operator/blob/fd8c53bfb8d97a5a7442a8810a0f2397f20b495d/templates/common/baremetal/files/baremetal-keepalived.yaml

So I think it is important to cover that.

are rendered by the Machine Config Operator.

The VIP will move to one of the control plane nodes, but only after the
bootstrap process has completed and the bootstrap instance is stopped. This happens
Contributor:

the bootstrap instance is stopped

This is technically incorrect; the user shouldn't have to shut down the bootstrap host to move the API VIP to the control plane. We should be able to communicate with the control plane as soon as it's up, right?

This seems like a bug?

@jcpowermac (Contributor, author), Dec 13, 2019:

@abhinavdahiya the bootstrap VRRP instance priority is set to 50 and the control plane's to 40. The only way for the VIP to move to the control plane is for the bootstrap instance of keepalived (or the entire machine) to be stopped. The VIP would then be under the control of a control plane node, and haproxy would then load balance between the other control plane nodes.
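
For illustration, a minimal keepalived sketch of the priority split described above. This is not the MCO-rendered template; the interface name, virtual router id, and VIP address are placeholders:

```
# Bootstrap node: higher priority, so it holds the API VIP while it is running.
vrrp_instance API_VIP {
    state MASTER
    interface ens192            # placeholder interface name
    virtual_router_id 51        # placeholder; must match on all peers
    priority 50                 # bootstrap wins the VRRP election while alive
    virtual_ipaddress {
        192.168.111.5           # placeholder API VIP from install-config.yaml
    }
}

# Control plane nodes: lower base priority, so one of them takes over the VIP
# only once the bootstrap keepalived (or the whole bootstrap machine) stops.
vrrp_instance API_VIP {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 40
    virtual_ipaddress {
        192.168.111.5
    }
}
```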

These [instances](https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy.yaml)
of `haproxy` are [configured](https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml)
to load balance the API traffic across all of the control plane nodes.

Contributor:

I would like to include information on how the backends are discovered for haproxy, and what the health-checking setup for the backends is.

Externally resolvable DNS records are required for:

* `api.$cluster_name.$base-domain` -
* `*.apps.$cluster_name.$base_domain` -
Contributor:

Internal services depend on this too.

In cluster networking infrastructure, the goal is to automate as much of the
DNS requirements internal to the cluster as possible, leaving only a
small amount of public DNS configuration to be implemented by the user
before starting the installation process.
Contributor:

leaving only a
small amount of public DNS configuration to be implemented by the user
before starting the installation process.

So the customer should be able to use the kubeconfig provided without any pre/post setup.

The only one that is kinda acceptable is the *.apps.$cluster_domain from outside the cluster.

@jcpowermac (Contributor, author):

@abhinavdahiya I would think that the A records for `api` and `*.apps` would need to be created before starting the install.

Contributor:

Can you add this to the limitations?


#### api-int hostname resolution

The CoreDNS server performing our internal DNS resolution includes
Contributor:

The CoreDNS server

We haven't described any CoreDNS server before this. I think we need a section on the CoreDNS setup/inputs/behavior for internal DNS.


The `mdns-publisher` is the component that runs on each host to make itself
discoverable by other hosts in the cluster. Control plane hosts currently
advertise both `etcd-NN` and `master-NN` names, and the worker nodes advertise
Contributor:

I would like to know how the publisher gets the index of the machine relative to the other nodes in the cluster.

Comment on lines 173 to 180
One of the virtual IP addresses required by in cluster networking infrastructure is used
for our self-hosted internal DNS - the “DNS VIP”. The location of the DNS VIP
is [managed](https://github.com/openshift/machine-config-operator/blob/master/manifests/baremetal/keepalived.conf.tmpl#L22)
by `keepalived`, similar to the management of the API VIP.

The control plane nodes are configured to use the DNS VIP as their primary DNS
server. The DNS VIP resides on the bootstrap host until the control plane
nodes come up, and then it will move to one of the control plane nodes.
Contributor:

This doesn't belong in this section.

Comment on lines 203 to 288
### Ingress High Availability

There is a third VIP used by in cluster networking infrastructure, and that is for Ingress.
The Ingress VIP will always reside on a node running an Ingress controller.
This ensures that we provide high availability for ingress by default.

The mechanism used to determine which nodes are running an ingress controller is
[configured](https://github.com/openshift/machine-config-operator/blob/master/templates/worker/00-worker/baremetal/files/baremetal-keepalived-keepalived.yaml)
so that `keepalived` will try to reach the local `haproxy` stats port
using `curl`.
Contributor:

This only provides the DNS and routing for the default IngressController. What if the user configures multiple controllers for sharding, etc.?

How would the user configure a VIP setup for a different one? Can the user even do that? We should capture the limitations.

cc @russellb
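
For illustration, a minimal keepalived sketch of the curl-based ingress check described in the quoted text. This is not the actual worker template; the stats/health port, interface, priority values, and VIP are placeholders:

```
# Pass only while the local ingress haproxy answers on its stats/health port,
# i.e. an ingress controller instance is actually running on this node.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"   # placeholder port/path
    interval 2
    weight 10       # nodes running an ingress controller outrank the rest
}

vrrp_instance INGRESS_VIP {
    state BACKUP
    interface ens192            # placeholder
    virtual_router_id 52        # placeholder
    priority 20                 # placeholder base priority
    virtual_ipaddress {
        192.168.111.4           # placeholder Ingress VIP
    }
    track_script {
        chk_ingress
    }
}
```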


- https://github.com/openshift/installer/pull/1873
- https://github.com/openshift/machine-config-operator/pull/795
- https://github.com/openshift/api/pull/348
Contributor:

It's strange that the only user that needs the VIPs is the machine-config-operator, but the values are stored in global configuration. For future platforms we should look at fixing this.

- https://github.com/openshift/api/pull/348

### Risks and Mitigations

Contributor:

Can we include that none of this setup has been verified to be resilient or performant, especially given the number of watches against the apiserver in large clusters?

@jcpowermac jcpowermac force-pushed the vmw_m3_ops_networking branch from 1918fbe to 99e3942 Compare December 16, 2019 15:25
Comment on lines 78 to 93
The bootstrap keepalived VRRP instances have a higher priority (`priority 50`) than the control plane's (`priority 40`).
Bootstrap will maintain its role as master until some intervention, which in our case is destroying the bootstrap node.
Contributor:

I would like to make sure we mention this in the Limitations section.

Comment on lines 88 to 108
The control plane keepalived configuration uses service checks to either add or remove points from the instance's effective priority.
The default `priority 40` together with these service checks determines which control plane node is the master VRRP instance.
Contributor:

The control plane keepalived configuration uses service checks

Can you provide details on the configured health checks (protocol, address, etc.)?
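
As a hedged illustration of how such a weighted service check can be expressed in keepalived (the script, port, and weight values below are placeholder assumptions, not taken from the actual MCO template):

```
# Health check: pass only while the local kube-apiserver answers on :6443.
vrrp_script chk_apiserver {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"   # placeholder check
    interval 2      # run every 2 seconds
    weight 5        # add 5 to the base priority while the check passes
    fall 3          # require 3 consecutive failures to mark the check down
    rise 2          # require 2 consecutive successes to mark it up again
}

vrrp_instance API_VIP {
    state BACKUP
    interface ens192            # placeholder
    virtual_router_id 51        # placeholder
    priority 40                 # base control plane priority
    virtual_ipaddress {
        192.168.111.5           # placeholder API VIP
    }
    track_script {
        chk_apiserver           # a healthy node runs at effective priority 45
    }
}
```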

The APIVIP is provided by the user
via the `install-config.yaml` [parameter](https://github.com/openshift/installer/blob/master/pkg/types/baremetal/platform.go#L57-L63)
or `openshift-installer` terminal prompts. The machine config operator does the initial
render of the pod spec and configuration template. The initContainer does the final
Contributor:

The machine config operator does the initial
render of the pod spec and configuration template. The initContainer does the final
templating of the configuration with baremetal-runtimecfg.

Let's keep a separate paragraph for baremetal-runtimecfg. Also, can we provide details about what inputs it takes and what it provides as an API?

@jcpowermac jcpowermac force-pushed the vmw_m3_ops_networking branch from 99e3942 to 3358e4b Compare December 18, 2019 21:32
@jcpowermac jcpowermac changed the title [wip] DNS and LB services in cluster for non-cloud IPI DNS and LB services in cluster for non-cloud IPI Dec 18, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 18, 2019
@jcpowermac jcpowermac force-pushed the vmw_m3_ops_networking branch from 3358e4b to 3ce8f16 Compare January 10, 2020 14:52
@yboaron yboaron force-pushed the vmw_m3_ops_networking branch from 3ce8f16 to 46b36e6 Compare June 15, 2020 06:39
@yboaron yboaron force-pushed the vmw_m3_ops_networking branch from 46b36e6 to c978f7e Compare June 15, 2020 13:29
The minimal requirements include:
* Internal DNS:
  - hostname resolution for master and worker nodes.
  - `api-int` hostname resolution.
Contributor:

Isn't it missing wildcard apps subdomain resolution?

@abhinavdahiya (Contributor):

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2020
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jcpowermac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2020
@openshift-merge-robot openshift-merge-robot merged commit 8987ab6 into openshift:master Jun 15, 2020