From 7887e55b0be0231d39f6b625a44fec3bc2b78583 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jaime=20Caama=C3=B1o=20Ruiz?=
Date: Sat, 20 Nov 2021 19:31:59 +0000
Subject: [PATCH] Proposal to allow MTU changes

---
 enhancements/network/allow-mtu-changes.md | 363 ++++++++++++++++++++++
 1 file changed, 363 insertions(+)
 create mode 100644 enhancements/network/allow-mtu-changes.md

diff --git a/enhancements/network/allow-mtu-changes.md b/enhancements/network/allow-mtu-changes.md
new file mode 100644
index 0000000000..5b1bc04e93
--- /dev/null
+++ b/enhancements/network/allow-mtu-changes.md
@@ -0,0 +1,363 @@
---
title: allow-mtu-changes
authors:
  - "@juanluisvaladas"
  - "@jcaamano"
reviewers:
  - "@danwinship"
  - "@dcbw"
  - "@knobunc"
  - "@msherif1234"
  - "@trozet"
  - "@yuqi-zhang"
approvers:
  - TBD
creation-date: 2021-10-07
last-updated: 2021-12-02
status: provisional
---

# Allow MTU changes

This enhancement adds to the Cluster Network Operator the capability of
changing the MTU post-installation.

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Customers may need to change the MTU post-installation. However, these changes
aren't trivial and may cause downtime, so the Cluster Network Operator
currently forbids them.

This enhancement proposes an automated procedure, launched on demand and
coordinated by the Cluster Network Operator. The procedure is based on
draining and rebooting the nodes sequentially after performing the required
configuration changes, in order to minimize service disruption and to ensure
that the nodes have a healthy configuration and that workloads preserve
connectivity during and after the procedure.

## Motivation

While cluster administrators usually set the MTU correctly during
installation, sometimes they need to change it afterwards, for reasons such as
changes in the underlay or because it was set incorrectly at install time.

### Goals

* Allow changing the MTU post-install on OVN-Kubernetes or OpenshiftSDN.

### Non-Goals

* Other safe or unsafe configuration changes.

## Proposal

Cluster Network Operator (CNO from now on) will drive the MTU change when it
detects the change in the operator network configuration. The procedure is
based on two rolling reboot sequences that will be coordinated through Machine
Config Operator (MCO from now on). On the first reboot, a route MTU
(`routable-mtu` from now on) is used to pin the effective MTU to the lower
value, allowing the MTU set on interfaces (`mtu` from now on) to change
without disruption. On the second reboot, once all the interfaces are set to
the final MTU value, the route MTU is unset to make the interface MTU the one
in effect. In more detail:

1. Set `mtu` to the higher and `routable-mtu` to the lower of the current and
   target MTU values.
2. Do a rolling reboot. As a node restarts, `routable-mtu` is set on routes
   while interfaces have `mtu` configured. The node will effectively use the
   lower `routable-mtu` for outgoing traffic, but be able to handle incoming
   traffic up to the higher `mtu`.
3. Set `mtu` to the target MTU value and unset `routable-mtu`.
4. Do a rolling reboot. As a node restarts, the target MTU value is set on the
   interfaces and the routes are reset to default MTU values. Since nodes that
   are yet to be restarted effectively use the lower MTU while being able to
   handle the higher one, this has no impact on traffic.
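
To illustrate the mechanism, below is a minimal sketch in Go, assuming the
`github.com/vishvananda/netlink` library, of the kind of configuration each
node ends up with after the first reboot: the interface keeps the higher `mtu`
while the route carries the lower `routable-mtu`. The interface name, gateway
and MTU values are illustrative only, not the actual implementation.

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Keep the interface at the higher MTU (e.g. 9000) so the node can still
	// receive large packets from peers that have not been reconfigured yet.
	link, err := netlink.LinkByName("br-ex") // illustrative interface name
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.LinkSetMTU(link, 9000); err != nil {
		log.Fatal(err)
	}

	// Pin outgoing traffic to the lower MTU (e.g. 1400) by setting the route
	// MTU metric, the "routable-mtu" of this proposal.
	_, dst, err := net.ParseCIDR("0.0.0.0/0")
	if err != nil {
		log.Fatal(err)
	}
	route := netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       dst,
		Gw:        net.ParseIP("192.0.2.1"), // illustrative gateway
		MTU:       1400,
	}
	if err := netlink.RouteReplace(&route); err != nil {
		log.Fatal(err)
	}
}
```

The iproute2 equivalent of the route part would be
`ip route replace default via 192.0.2.1 dev br-ex mtu 1400`.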
Because of the reboots, it is more accurate to consider this an MTU migration
rather than a simple MTU change. Given the impact that the reboots have on the
cluster, this operation will be safeguarded by a specific migration annotation
or field.

The procedure will handle both the MTU of the cluster pod network and the MTU
of the host itself, as it is critical to do so in a coordinated manner to
avoid service disruptions and more reboots than the minimum necessary.

The procedure will optionally leverage the Node Feature Discovery operator
(NFD from now on), through which information about the MTU capabilities of
the nodes, as well as information on the MTU migration status, will be
annotated on the nodes. This information can be used to validate and track
progress of the procedure.

Changes will be required in openshift-api, OpenshiftSDN, OVN-Kubernetes, CNO
and MCO.

### OVN-Kubernetes & OpenshiftSDN: new `routable-mtu` configuration parameter

A new `routable-mtu` setting for the CNI, in addition to the already existing
`mtu` one, is introduced to be set as the MTU on the following routes:
* the default route in pods, covering traffic from a pod to any destination
  that is not a pod on the same host.
* non-link-scoped host routes of interfaces managed by the CNI, covering
  traffic from the host network to pods hosted on different nodes.
* host routes of interfaces managed by the CNI, covering traffic to the
  service network.

On startup, the CNI should reset the MTU value on interfaces and routes to the
current setting of `mtu` and `routable-mtu` respectively, even if such routes
already exist. If `routable-mtu` is unset, the CNI should unset the MTU on the
routes.

The CNI should annotate each node with the `mtu` and `routable-mtu` values in
use, which will allow tracking the MTU migration procedure.

### MCO: ability to configure host MTU migration

A configuration file will be deployed through MCO with the MTU parameters that
need to be set on the nodes during MTU migration:
* `TARGET_MTU` should be persistently set in the network configuration of the
  appropriate host interfaces. This setting should persist even if
  `TARGET_MTU` is unset in a subsequent reboot.
* `MTU` should be dynamically set on the appropriate host interfaces. If `MTU`
  is not set but `TARGET_MTU` is set, use the highest of `TARGET_MTU` and the
  current interface MTU as `MTU`.
* `ROUTABLE_MTU` should be dynamically set for link-scoped routes of the
  appropriate host interfaces. If `ROUTABLE_MTU` is not set but `TARGET_MTU`
  is set, use the lowest of `TARGET_MTU` and the current interface MTU as
  `ROUTABLE_MTU`.

Host tooling that configures host networking for the CNI should source this
configuration file and act accordingly.

Host tooling should write an NFD source hook to set up annotations, on each
node, for the current, minimum and maximum supported MTU of the physical
interface, and for the route MTU if any.

### openshift-api: add MTU migration specific fields

New `MTU.Network.From`, `MTU.Network.To` and `MTU.Machine.To` fields will be
added to the existing `NetworkMigration` type, which is used in CNO's cluster
and operator network configurations, and will contain the CNI and host MTU
values to migrate from and to, respectively.
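
A sketch in Go of what these additions could look like; the exact field names
and JSON tags shown here are illustrative and would be subject to
openshift/api review:

```go
// Sketch of the proposed additions; NetworkMigration already exists in
// openshift/api and only the MTU field is new.
type NetworkMigration struct {
	// NetworkType is the existing field used for network type migrations.
	NetworkType string `json:"networkType,omitempty"`
	// MTU contains the MTU migration configuration (new).
	MTU *MTUMigration `json:"mtu,omitempty"`
}

// MTUMigration carries the MTU values to migrate from and to.
type MTUMigration struct {
	// Network holds the CNI (cluster pod network) MTU migration values.
	Network *MTUMigrationValues `json:"network,omitempty"`
	// Machine holds the host MTU migration values.
	Machine *MTUMigrationValues `json:"machine,omitempty"`
}

// MTUMigrationValues is a pair of MTU values used during a migration.
type MTUMigrationValues struct {
	// To is the MTU to migrate to.
	To *uint32 `json:"to,omitempty"`
	// From is the MTU to migrate from. Only MTU.Machine.To is proposed for
	// the host side, so From would go unused there.
	From *uint32 `json:"from,omitempty"`
}
```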

### MCO: automating MTU migration rolling reboots

MCO monitors `Network.Status.Migration` in the cluster network configuration.
This `Network.Status.Migration` is already being updated by CNO with the
validated contents of `Network.Spec.Migration` from the operator network
configuration. MCO should react to changes in `Migration.MTU` and, if
`MTU.Machine.To` is set, render the appropriate MachineConfig containing the
MTU configuration file with `TARGET_MTU` set to `MTU.Machine.To`, and a dummy
parameter (which will be ignored other than for causing a reboot) for
`MTU.Network.To`.

This will result in the required rolling reboots as `MTU.Machine.To` or
`MTU.Network.To` are set and then unset during the MTU migration procedure.

### CNO: allowing MTU migration

The CNO monitors changes on the operator network configuration. When it
detects a change in `Network.Spec.Migration`, it proceeds as follows (see the
validation sketch below):
1. Check that `MTU.Network.From`, `MTU.Network.To` and `MTU.Machine.To` are
   not set independently of each other. Otherwise report an unsupported or
   unsafe configuration.
2. Check that `MTU.Network.From` equals `Network.Status.ClusterNetworkMTU`.
   Otherwise report an unsupported or unsafe configuration.
3. Check that `MTU.Network.To`, if set, or the currently configured CNI `mtu`
   otherwise, is a valid MTU value with respect to `MTU.Machine.To`. If not,
   report an unsupported or unsafe configuration.
4. Render & apply the CNI configuration using the highest and lowest MTU
   values from `MTU.Network.From` and `MTU.Network.To` as the new CNI `mtu`
   and `routable-mtu` respectively, if different.
5. Render & apply the applied network configuration using `MTU.Network.To` as
   the `mtu`.

This allows a second step, requiring no further changes, where the
administrator can set the configured CNI `mtu` to `MTU.Network.To` at the same
time as removing the `Migration` information, without such a change being
declared unsafe, thereby completing the migration procedure.
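
The following is a minimal sketch of checks 1-3 in Go, assuming the
`MTUMigration` types sketched in the openshift-api section above; the
`encapOverhead` parameter and the error wording are assumptions for
illustration, not the actual CNO implementation:

```go
package validation

import "fmt"

// validateMTUMigration mirrors checks 1-3 above. clusterNetworkMTU comes from
// Network.Status.ClusterNetworkMTU and currentMTU is the currently configured
// CNI mtu. encapOverhead is the per-CNI overhead between host and cluster
// network MTU (e.g. 100 for OVN-Kubernetes Geneve); an assumed parameter.
func validateMTUMigration(m *MTUMigration, clusterNetworkMTU, currentMTU, encapOverhead uint32) error {
	// Check 1: MTU.Network.From and MTU.Network.To must be set together,
	// not independently of each other.
	if m.Network != nil && (m.Network.From == nil || m.Network.To == nil) {
		return fmt.Errorf("unsupported or unsafe configuration: network MTU values set independently")
	}
	// Check 2: MTU.Network.From must match the MTU currently deployed on the
	// cluster network.
	if m.Network != nil && *m.Network.From != clusterNetworkMTU {
		return fmt.Errorf("unsupported or unsafe configuration: migrating from MTU %d but cluster network MTU is %d",
			*m.Network.From, clusterNetworkMTU)
	}
	// Check 3: the target CNI MTU (or the current one if no change was
	// requested) must be valid with respect to the target machine MTU.
	networkTo := currentMTU
	if m.Network != nil {
		networkTo = *m.Network.To
	}
	if m.Machine != nil && m.Machine.To != nil && networkTo+encapOverhead > *m.Machine.To {
		return fmt.Errorf("unsupported or unsafe configuration: CNI MTU %d does not fit in machine MTU %d",
			networkTo, *m.Machine.To)
	}
	return nil
}
```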

#### MTU decrease example

Current MTU: 9000 (host 9100)
Target MTU: 1400 (host 1500)

* Administrator updates the operator network configuration with
  `MTU.Network.To=1400`, `MTU.Network.From=9000` and `MTU.Machine.To=1500`.
* CNO sets `mtu=1400` in the applied cluster configuration.
* MCO applies a MachineConfig with `TARGET_MTU=1500`.
* Administrator waits for the corresponding rolling reboot to be performed.
* Administrator checks through node annotations or directly on the nodes for
  the expected MTU values.
  * On unexpected MTU values, a rollback could be attempted by setting
    `MTU.Network.To=9000`, `MTU.Network.From=9000` and `MTU.Machine.To=9100`.
* Administrator updates the operator network configuration with `mtu=1400`,
  removing the `Migration.MTU` field at the same time.
* MCO removes `TARGET_MTU=1500` from the MachineConfig.
* Administrator waits for the corresponding rolling reboot to be performed.
* Administrator checks through node annotations or directly on the nodes for
  the expected MTU values.
  * On unexpected MTU values, a rollback could be achieved by attempting the
    opposite MTU migration.

#### MTU increase example

Current MTU: 1400 (host 1500)
Target MTU: 9000 (host 9100)

* Administrator updates the operator network configuration with
  `MTU.Network.To=9000`, `MTU.Network.From=1400` and `MTU.Machine.To=9100`.
* CNO sets `mtu=9000` in the applied cluster configuration.
* MCO applies a MachineConfig with `TARGET_MTU=9100`.
* Administrator waits for the corresponding rolling reboot to be performed.
* Administrator checks through node annotations or directly on the nodes for
  the expected MTU values.
  * On unexpected MTU values, a rollback could be attempted by setting
    `MTU.Network.To=1400`, `MTU.Network.From=1400` and `MTU.Machine.To=1500`.
* Administrator updates the operator network configuration with `mtu=9000`,
  removing the `Migration.MTU` field at the same time.
* MCO removes `TARGET_MTU=9100` from the MachineConfig.
* Administrator waits for the corresponding rolling reboot to be performed.
* Administrator checks through node annotations or directly on the nodes for
  the expected MTU values.
  * On unexpected MTU values, a rollback could be achieved by attempting the
    opposite MTU migration.
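
As an illustration of the verification steps above, here is a sketch using
client-go to dump the per-node MTU annotations; the annotation keys are
hypothetical placeholders, since the enhancement leaves their exact names to
the CNI implementation:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes.Items {
		// Hypothetical annotation keys; the enhancement only states that the
		// CNI annotates the mtu and routable-mtu values used on each node.
		fmt.Printf("%s: mtu=%s routable-mtu=%s\n",
			node.Name,
			node.Annotations["k8s.ovn.org/mtu"],
			node.Annotations["k8s.ovn.org/routable-mtu"],
		)
	}
}
```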

### User Stories

#### As an administrator, I want to change the cluster MTU after deployment

### API Extensions

New `MTU.Network.From`, `MTU.Network.To` and `MTU.Machine.To` fields in
`Network.Status.Migration` from `config.openshift.io/v1` and in
`Network.Spec.Migration` from `operator.openshift.io/v1`.

### Implementation Details/Notes/Constraints

## Design Details

### Operational Aspects of API Extensions

#### Failure Modes

#### Support Procedures

### Open Questions

* We might not be aware at any given time of host link-scoped routes if these
  are dynamically assigned through IPv6 RA. The NetworkManager dispatcher does
  not seem to trigger an event for route modifications resulting from IPv6 RA.
  We might have to require the administrator to set up static routes before
  applying the procedure, which may lead to additional reboots.

### Test Plan

Two new CI tests, one for IPv4 and another for IPv6, will run the upstream
network tests and a specific e2e test verifying that the MTU set is the one
effectively used in the different network paths, both after increasing and
after decreasing the MTU values. The tests will be triggered on demand.

### Graduation Criteria

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

### Version Skew Strategy

### Risks and Mitigations

* A cluster-wide MTU change cannot be carried out in an instant and incurs the
  risk of having different MTUs in use across the cluster at a given time.
  While the procedure described in this enhancement prevents this with the use
  of route MTUs, some consequences of using different MTUs in the cluster are
  analyzed in the following section for context.

#### Running the cluster with different MTUs

During a `live` change of the MTU, there are going to be traffic endpoints
temporarily using different MTU values. In general, if the path MTU to an
endpoint is known, fragmentation will occur or the application will be
informed that it is trying to send larger packets than possible, so that it
can adjust. Additionally, connection-oriented protocols, such as TCP, usually
negotiate their segment size based on the lower MTU of the endpoints on
connection.

So, generally, different MTUs on endpoints affect ongoing connection-oriented
traffic, or connection-less traffic, when the known destination MTU is not the
actual destination MTU. In this case, the most likely scenario is that traffic
is dropped on the receiving end by OVS if larger than the destination MTU.

There are circumstances that prevent an endpoint from being aware of the
actual MTU to a destination, which depends on path MTU discovery and specific
ICMP `FRAG_NEEDED` messages:
* A firewall is blocking these ICMP messages or the ICMP messages are not
  being relayed to the endpoint.
* There is no router between the endpoints generating these ICMP messages.

Let's look at the different scenarios.

##### Node to node

On the receiving end, a NIC driver might size its buffers in relation to the
configured MTU and drop larger packets before they are handed off to the
system.

Past that, OVN-K sets up flows in OVS br-ex for packets that are larger than
the pod MTU and sends them off to the network stack to generate ICMP
`FRAG_NEEDED` messages. If these packets exceed the MTU of br-ex, they will be
dropped by OVS and never reach the network stack. Otherwise they will reach
the network stack but not generate ICMP `FRAG_NEEDED` messages, as the network
stack only does so for traffic being forwarded and not for traffic with that
node as its final destination.

As there is generally no router in between two cluster nodes, more than likely
a node would not be aware of the path MTU to another node.

##### Node to pod

As explained before, the network stack receives packets sized between the pod
MTU and the host MTU, which might cause ICMP `FRAG_NEEDED` messages to be sent
to the originating node, such that a node might be aware of the proper path
MTU when reaching out to a pod. Otherwise, traffic larger than the pod MTU
will be dropped by OVS.

##### Pod to node

On this datapath, OVS at the destination node will drop the larger packets
without generating ICMP `FRAG_NEEDED` messages, as the node is the final
destination of the traffic. The originating pod is never aware of the actual
path MTU.

##### Pod to pod

This traffic is encapsulated with Geneve. The Geneve driver might drop packets
and generate ICMP `FRAG_NEEDED` messages back to the originating pod if it is
trying to send packets that would not fit in the originating node MTU once
encapsulated. But OVN is not prepared to relay these ICMP messages back to the
originating pod, so it would not be aware of an appropriate MTU to use.

On the receiving end, OVS will drop the packet silently if it is larger than
the destination MTU of the veth interface. Even if this were not the case, the
veth driver itself would drop the packet silently if it is over the MTU of the
pod's end of the veth interface.

## Implementation History

## Drawbacks

## Alternatives

* An alternative was considered to perform the MTU change "live", rolling out
  the ovnkube-node daemon set with a new MTU value and changing the MTU of all
  running pods. This approach, while quicker and more convenient, was
  discarded as less safe: in possible edge cases, the odd chance of leaving
  pods behind with wrong MTU values was difficult to avoid, and it offered
  fewer guarantees of minimizing service disruption.