New file: `enhancements/network/allow-mtu-changes.md` (+285 lines)
---
title: allow-mtu-changes
authors:
- "@juanluisvaladas"
- "@jcaamano"
reviewers:
- "@danwinship"
- "@dcbw"
- "@knobunc"
- "@msherif1234"
approvers:
- TBD
creation-date: 2021-10-07
last-updated: 2021-10-14
status: provisional
---

# Allow MTU changes

This covers adding to the Cluster Network Operator (CNO) the capability to
change the MTU post-installation.

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Customers may need to change the MTU post-installation. However, these changes
aren't trivial and may cause downtime, so the CNO currently forbids them.

We propose a procedure that is launched on demand. It runs pods on every node
of the cluster and makes the necessary changes in an ordered and coordinated
manner, keeping the service disruption as short as possible; if it stays under
a reasonable time of 10 minutes, it should be well below the typical TCP
timeout interval.

## Motivation

While cluster administrators usually set the MTU correctly during
installation, they sometimes need to change it afterwards, for example because
the underlay changed or because the MTU was set incorrectly at install time.

### Goals

* Allow changing the MTU post-install on OVN-Kubernetes.

  > **Reviewer:** I think we had wanted to do this for openshift-sdn too?
  >
  > **Author:** I think we do, but I understood that not at the same time and
  > not for 4.10 anyway, so I did not cover it in this enhancement, just
  > because of time constraints and the fact that I don't know anything about
  > it.

### Non-Goals

* Change the MTU without service disruption.

  > **Reviewer suggestion:** "Change the MTU with absolutely no service
  > disruption."

* Other safe or unsafe configuration changes.

## Proposal

The CNO monitors changes on the operator configuration. When it detects an MTU
change:
1. Set the `clusteroperator/network` conditions:
   - Progressing: true
   - Upgradeable: false
2. Check that the MTU value is valid, within theoretical min/max values.
3. Check that all the nodes are in Ready state.
4. Deploy pods on every node with `restartPolicy: Never` that are responsible
   for validating the preconditions. If the preconditions are met, the pod
   exits with code 0 (a sketch of such a check pod follows this list). Among
   the preconditions we will check:
   - The underlay network supports the intended MTU value.
5. Once all the previous pods finish successfully, deploy another set of pods
   with `restartPolicy: Never` on every node that will handle the actual
   change of the MTU (explained in more detail below). Wait for them to be
   ready and running.
6. Ensure that the `ovnkube-config` configmap is synchronized with the new MTU
   value.
7. If any of the previous steps (1-6) was unsuccessful, the CNO will set the
   `clusteroperator/network` conditions to:
   - Progressing: false
   - Degraded: true
   It will also update the operator configuration status with a description of
   the problem. At this point the process is interrupted and manual
   intervention is required.
8. Force a rollout of the ovnkube-node daemonset. This will ensure
   ovn-kubernetes uses the new MTU value for new pods and sets the new MTU on
   managed node interfaces like ovn-k8s-mp0, ovn-k8s-gw0 (local gateway mode)
   and related routes.

   > **Reviewer:** So this means the ovnkube upgrade is totally out of sync
   > with the pod-level upgrades, and every node needs to wait for every other
   > node to finish its pod-level upgrades before any of them can do the
   > ovnkube-level upgrade. A better approach might be: instead of having CNO
   > force a re-rollout of the DaemonSet, just have the step 5 pod kill the
   > local ovnkube-node process, forcing it to be restarted. And then it could
   > even choose to do that step before or after the pod-level fixes, depending
   > on which direction the MTU is changing in...
   >
   > **Author:** The thing is that if we restart ovn-kube after the step 5 pod
   > changes the MTUs, there is a time in between where new pods may allocate
   > with the old MTU. That's why we restart ovn-kube first: we'll do the
   > roll-out with max unavailability so that it is quick, and then we let the
   > step 5 pod proceed with the MTU changes. Yes, it is out of sync, but
   > hopefully quick enough.
9. Set the new MTU value in the applied-cluster configmap AND wait for the
   pods of step 5 to complete successfully.
10. If any of the previous steps (8, 9) failed, reboot the node and wait for
    the kubelet to report Ready again. If this step fails, set conditions to:
    - Progressing: false
    - Degraded: true
    Update the operator configuration status with a description of the
    problem.

    > **Reviewer suggestion:** If any of the previous steps (8, 9) failed on
    > any nodes, drain and reboot the failed nodes, one at a time, and wait
    > for each one to report Ready again.
11. Upon completion, set conditions to:
    - Progressing: false
    - Degraded: false

    > **Reviewer suggestion:** also set `Upgradeable: true`.
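For illustration, the per-node precondition check of step 4 could combine a
sanity check of the requested value against theoretical bounds (step 2) with a
don't-fragment probe of the underlay at the target size. The following Go
sketch is only an assumption of how such a pod might work; the environment
variables, the peer selection and the exact bounds are hypothetical and not
part of this proposal.

```go
// Hypothetical sketch of a per-node precondition check pod: validate the
// requested MTU against theoretical bounds and probe the underlay with a
// single non-fragmentable ping of the target size to a peer node.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

const (
	minMTU = 576   // IPv4 minimum; 1280 would be the floor for IPv6 clusters
	maxMTU = 65536 // theoretical upper bound
)

func validateMTU(mtu int) error {
	if mtu < minMTU || mtu > maxMTU {
		return fmt.Errorf("mtu %d outside theoretical range [%d, %d]", mtu, minMTU, maxMTU)
	}
	return nil
}

// probeUnderlay sends one ICMP echo with the don't-fragment bit set.
// Payload size = mtu - 20 (IPv4 header) - 8 (ICMP header).
func probeUnderlay(peer string, mtu int) error {
	payload := strconv.Itoa(mtu - 28)
	cmd := exec.Command("ping", "-c", "1", "-W", "2", "-M", "do", "-s", payload, peer)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("underlay does not carry %d-byte frames to %s: %v\n%s", mtu, peer, err, out)
	}
	return nil
}

func main() {
	mtu, _ := strconv.Atoi(os.Getenv("TARGET_MTU")) // hypothetically provided by the CNO
	peer := os.Getenv("PEER_NODE_IP")                // another node's underlay address
	if err := validateMTU(mtu); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := probeUnderlay(peer, mtu); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Exit code 0 signals the CNO that this node meets the preconditions.
}
```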

The steps to change the MTU performed by the pods of previous step 5 are:

> **Reviewer:** should this pod detect if there are MTU problems after
> migration is complete, and post an event or something to indicate if it was
> successful or not?
>
> **Author:** What do you mean with detect? If openshift has a verification
> procedure to health check deployments then we probably can suggest in the
> documentation to run it after this procedure.
>
> **Reviewer:** I mean that in your step 5 you are going to deploy another set
> of pods that do the configuration. I'm wondering if these pods will remain
> for some time after configuration and if they can run a healthcheck until
> all of the nodes are finished updating MTU. The check could be pinging from
> this pod to other "configuration pods" on other nodes with max MTU. If it
> doesn't come up after some time, then maybe an event can be posted or
> something to indicate to the user that the MTU change failed.
>
> **Author:** Usually when you do a new deployment you run some verification
> to check that the deployment has been done correctly and that the cluster is
> healthy. If this exists for openshift we can suggest in documentation to run
> it again. Otherwise, it would probably be better to have a different set of
> pods with a liveness probe or the like rather than adding to these specific
> pods.
>
> **Reviewer:** we have the network check target pods, but I was thinking
> something specifically scoped to the MTU change to give the user a signal
> that the MTU update worked as part of the MTU update process itself. Like
> the pods that you launch for doing the MTU upgrade exit successfully and log
> some message like "MTU upgrade complete", or if they check network
> connectivity and something is now broken, they either crash or post an event
> to their pod saying "MTU upgrade problem". If you think it's not necessary
> then that's fine to ignore.
>
> **Author (@jcaamano, Nov 4, 2021):** I would probably then use the network
> check target pods and enhance that for any specific MTU verification we
> think we need to do. Do you know where I can check them out? These MTU
> change pods only change the MTU of pods, which is an operation for which we
> should know definitively if it succeeded or not, and it is only one step of
> a 3-step process which also includes changing the host sdn interfaces MTU
> and the host external interfaces MTU, so I feel that a final verification of
> the MTU in these pods could be out of place.

1. So that we don't have nodes doing things at different times and we have
   everything synchronized, the pods will wait until the MTU value on the
   applied-cluster configmap changes.
2. Enter every network namespace. If an interface `eth0` exists in that
   namespace with an IP address within the pod subnet, change the MTU of the
   veth pair (a sketch of this per-namespace step follows the list).

   > **Reviewer suggestion:** change the MTU of that interface (rather than of
   > the veth pair).

3. If any of these steps (1-2) failed, the pod will exit with code 1; if all
   were successful, it will exit with code 0.
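
A minimal Go sketch of the per-namespace step above, assuming the pod network
namespaces are exposed on the host (for example under `/var/run/netns`) and
using the netlink/netns libraries; names, paths and values are illustrative,
not the actual implementation:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"runtime"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// setPodMTU switches into one network namespace and, if it looks like a pod
// namespace (an eth0 with an address inside the cluster pod subnet), sets the
// new MTU on the pod side of the veth pair.
func setPodMTU(nsPath string, podCIDR *net.IPNet, mtu int) error {
	runtime.LockOSThread() // namespace changes apply to the current OS thread
	defer runtime.UnlockOSThread()

	hostNS, err := netns.Get()
	if err != nil {
		return err
	}
	defer hostNS.Close()
	defer netns.Set(hostNS) // always switch back to the host namespace

	podNS, err := netns.GetFromPath(nsPath)
	if err != nil {
		return err
	}
	defer podNS.Close()
	if err := netns.Set(podNS); err != nil {
		return err
	}

	link, err := netlink.LinkByName("eth0")
	if err != nil {
		return nil // no eth0: not a pod namespace we manage, skip it
	}
	addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, addr := range addrs {
		if podCIDR.Contains(addr.IP) {
			// Pod side of the veth pair; the host side would be adjusted
			// similarly from the host namespace using the peer interface.
			return netlink.LinkSetMTU(link, mtu)
		}
	}
	return nil
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("10.128.0.0/14") // example cluster network
	nsPaths, _ := filepath.Glob("/var/run/netns/*") // assumed netns location
	for _, p := range nsPaths {
		if err := setPodMTU(p, podCIDR, 1400); err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", p, err)
			os.Exit(1) // non-zero exit lets the CNO see the failure
		}
	}
}
```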


An administrator should also be able to deploy a machine-config object to
change the node MTU. If the MTU is being increased, this should happen at the
beginning of the procedure; if it is being decreased, at the end.

### User Stories

#### As an administrator, I want to change the node MTU

An administrator should be able to deploy a machine-config object that
configures the node MTU permanently. Ideally this would be achieved through
the ability to run configure-ovs with an MTU parameter. configure-ovs should
change the MTU of br-ex and ovs-if-phys0 with the least impact on the existing
configuration, to avoid any unnecessary disruption. This change should persist
across reboots.
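
A hedged sketch of what such an MTU parameter might boil down to, assuming the
`mtu_request` column of the OVS Interface table is used for br-ex; the uplink
interface name and the exec-based form are illustrative assumptions
(configure-ovs itself is a shell script; Go is used here only to keep the
examples in one language):

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes a command and surfaces its combined output on failure.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v\n%s", name, args, err, out)
	}
	return nil
}

// setBrExMTU sets the MTU on the physical uplink and asks OVS to apply the
// same MTU on br-ex via mtu_request, leaving the rest of the configuration
// untouched.
func setBrExMTU(physIface string, mtu int) error {
	if err := run("ip", "link", "set", physIface, "mtu", fmt.Sprint(mtu)); err != nil {
		return err
	}
	return run("ovs-vsctl", "set", "Interface", "br-ex", fmt.Sprintf("mtu_request=%d", mtu))
}

func main() {
	// Example values: the uplink name and target MTU would come from the
	// machine-config / configure-ovs parameter.
	if err := setBrExMTU("ens3", 9000); err != nil {
		panic(err)
	}
}
```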

#### As an administrator, I want to change the cluster network MTU

An administrator should be able to change the cluster network MTU through a
CNO configuration change. This would encompass the following tasks:

##### Implement a pod that changes the actual MTU on running pods

> **Reviewer:** So again, the parts talking about implementation details don't
> belong in "User Stories". And they're redundant with what you've already
> said, so you can just remove them.
>
> **Author:** I will move them to a specific section. The thing is that there
> is no way to map (user) stories here to (non-user) stories in Jira.

Implement a pod that changes the actual MTU of both ends of the veth pair for
the pods hosted on the node where it runs, as described in the proposal, and
in the least possible time.

##### Add support in ovnkube-node to reset MTU on start

Make sure that upon restart, ovnkube-node resets the MTU on all the relevant
interfaces, like ovn-k8s-mp0, ovn-k8s-gw0 and br-int, as well as on related
routes that currently have an MTU set.
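
As an illustration only (not the ovn-kubernetes code), resetting the MTU on a
managed interface and on any of its routes that pin an explicit MTU could look
roughly like this with the netlink library:

```go
package main

import "github.com/vishvananda/netlink"

// resetMTU sets the MTU on a managed interface and rewrites any of its routes
// that carry an explicit MTU metric so they match the new value.
func resetMTU(ifname string, mtu int) error {
	link, err := netlink.LinkByName(ifname)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetMTU(link, mtu); err != nil {
		return err
	}
	routes, err := netlink.RouteList(link, netlink.FAMILY_ALL)
	if err != nil {
		return err
	}
	for _, r := range routes {
		if r.MTU != 0 && r.MTU != mtu { // only touch routes that pin an MTU
			r.MTU = mtu
			if err := netlink.RouteReplace(&r); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	for _, ifname := range []string{"ovn-k8s-mp0", "ovn-k8s-gw0"} {
		_ = resetMTU(ifname, 1400) // example MTU; errors ignored in this sketch
	}
}
```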

##### Add support in CNO for MTU change coordination

Add support in CNO to allow and coordinate the MTU change for OVN-Kubernetes
as described in the proposal.

### Implementation Details/Notes/Constraints

## Design Details

### Open Questions

* If changing the MTU on a node fails, do we have a guarantee that we can
  still reboot the node?

### Test Plan

We will create the following tests:
1. An HTTPS server with a very large certificate, and multiple clients in
   different nodes doing a single HTTPS request each. The acceptance criterion
   is that TLS negotiation succeeds and the HTTPS request returns 200 after
   every MTU change.

Packet loss, TCP retransmissions, increased latency, reduced bandwidth and
connectivity loss are considered acceptable while the change is happening.

While the previous test is running, we will decrease the MTU, and once it has
finished we will increase it back to its previous value.

This test will be two new jobs in CI, one for IPv4 and another for IPv6, that
will be launched on demand.
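
A hedged sketch of the client side of such a check; the target URL, port and
the use of a self-signed certificate are assumptions:

```go
// One client iteration of the test above: a single HTTPS request against a
// server with a very large certificate chain (so the TLS handshake spans
// several full-size packets), expecting a 200 response.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// The test server would use a self-signed, intentionally huge certificate.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://mtu-test-server.mtu-test.svc:8443/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "TLS negotiation or request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "unexpected status:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("MTU check passed: handshake and request succeeded")
}
```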

### Risks and Mitigations

* If unexpected problems occur during this procedure, the mitigation is an
  automated node reboot. The worst possible outcome is a full unplanned reboot
  of the cluster. Documentation should warn about these possible consequences.
  An alternate implementation with planned reboots is described in the
  Alternatives section.
* Even though the procedure takes place under the absolute TCP timeout
  interval, applications might have their own timeout implementation. Service
  disruption and how applications handle it is a risk that might need to be
  considered on a per-application basis, but that cannot be reasonably scoped
  in this enhancement.
* During the procedure, different MTUs will be used throughout the cluster.
  The next section analyzes the consequences in detail.

#### Running the cluster with different MTUs

In the process of a `live` change of the MTU, there are going to be traffic
endpoints temporarily using different MTU values. In general, if the path MTU
to an endpoint is known, fragmentation will occur or the application will be
informed that it is trying to send larger packets than possible so that it can
adjust. Additionally, connection-oriented protocols, such as TCP, usually
negotiate their segment size based on the lower MTU of the endpoints at
connection time.

So generally, different MTUs on endpoints affect ongoing connection-oriented
traffic or connection-less traffic, when the known destination MTU is not the
actual destination MTU. In this case, the most likely scenario is that traffic
is dropped on the receiving end by OVS if larger than the destination MTU.

There are circumstances that prevent an endpoint from being aware of the actual
MTU to a destination, which depends on Path MTU discovery and specific ICMP
`FRAG_NEEDED` messages:
* A firewall is blocking these ICMP messages or the ICMP messages are not
  being relayed to the endpoint.
* There is no router between the endpoints generating these ICMP messages.

> **Reviewer:** [Path MTU discovery and `FRAG_NEEDED` messages] in general
> seem to not work over OVS.

Let's look at different scenarios.

##### Node to node

On the receiving end, a NIC driver might size its buffers in relation to the
configured MTU and drop larger packets before they are handed off to the
system.

Past that, OVN-K sets up flows in OVS br-ex for packets that are larger than
the pod MTU and sends them off to the network stack to generate ICMP
`FRAG_NEEDED` messages. If these packets exceed the MTU of br-ex, they will be
dropped by OVS and never reach the network stack. Otherwise they will reach
the network stack but not generate ICMP `FRAG_NEEDED` messages, as the network
stack only does so for traffic being forwarded and not for traffic with that
node as final destination.

As there is generally no router in between two cluster nodes, more than likely a
node would not be aware of the path MTU to another node.

##### Node to pod

As explained before, the network stack receives packets sized between the pod
MTU and the host MTU, which might cause ICMP `FRAG_NEEDED` messages to be sent
to the originating node, so that node might be aware of the proper path MTU
when reaching out to a pod. Otherwise, traffic larger than the pod MTU will be
dropped by OVS.

##### Pod to Node

On this datapath, OVS at the destination node will drop the larger packets without
generating ICMP `FRAG_NEEDED` messages as the node is the final destination of the
traffic. The originating pod is never aware of the actual path MTU.

##### Pod to Pod

This traffic is encapsulated with geneve. The geneve driver might drop it and
generate ICMP `FRAG_NEEDED` messages back to the originating pod if it is
trying to send packets that would not fit in the originating node MTU once
encapsulated. But OVN is not prepared to relay these ICMP messages back to the
originating pod, so it would not be aware of an appropriate MTU to use.

On the receiving end, OVS would drop the packet silently if it is larger than
the destination MTU of the veth interface. Even if that were not the case, the
veth driver itself would drop the packet silently if it is over the MTU of the
pod's end of the veth pair.

## Alternatives

### New ovn-k setting: `routable-mtu`

> **Reviewer:** So this sounds better than the proposed solution... why aren't
> we doing it this way?
>
> **Author:** A double rolling reboot seemed unacceptable. But I don't know
> what the latest stance on it is. Perhaps @vpickard can comment on this.
>
> **Commenter:** Yes, I was concerned that 2 reboots would not be acceptable
> from a customer perspective. @mcurry-rh What are your thoughts on having to
> perform 2 reboots to change the mtu?
>
> **Author:** Replicating the feedback we got from @mcurry-rh: "not
> ideal...acceptable...MTU adjustment is a rare event, so 2 reboots, while
> painful, is not fatal and achieves the objective". So the second alternative
> is based on already available node maintenance knowledge, simpler to
> implement and a safer approach all around, while the main alternative is
> more efficient at the cost of that safety. We could prototype as well.
> @abhat @trozet @knobunc @dcbw we would need to make a call on this. Do you
> have any opinion?
>
> **Reviewer:** So, if the cluster is actually "broken" because of the bad
> MTU, then having to do two reboots isn't that bad since you're probably not
> running anything useful anyway. And if it's not broken, then the MTU change
> probably isn't urgent, and the procedure doesn't actually require that the
> two reboots happen back-to-back; they could happen 24 hours apart or
> something. (Right? The cluster is stable/consistent in the inter-reboot
> phase?) So we could even just make it so that the CNO doesn't initiate any
> rolling reboots itself, it just does:
>
> * CNO makes the initial change to mtu/routable-mtu.
> * CNO observes nodes until it sees that every node has rebooted (for
>   whatever reason) and is using the changed configuration.
> * CNO makes the second change to mtu/routable-mtu.
> * CNO observes nodes until it sees that every node has rebooted and is using
>   the changed configuration.
> * CNO updates the operator status accordingly.
>
> So then the admin could schedule two sets of rolling reboots on consecutive
> nights, or even just make the config change and then forget about it, and
> the first change would complete the next time they did a z-stream update and
> the second change would complete after the next update after that.
>
> **Author (@jcaamano, Nov 10, 2021):** Regarding "(Right? The cluster is
> stable/consistent in the inter-reboot phase?)": yes.
>
> **Commenter:** "So, if the cluster is actually 'broken' because of the bad
> MTU" -- That is not always a safe assumption. One case we had was where a
> customer had a large, running cluster and wanted to add new nodes. But the
> new nodes were on OpenShift and they needed to drop the MTU to allow for the
> VxLAN header in the OSP networking. I assume most cases will be like that,
> otherwise they could just reinstall...
>
> **Reviewer:** "That is not always a safe assumption" -- Hence the "if".


OVN-Kube Node, upon start, sets that `routable-mtu` on all the host routes and
on all created pod routes. This makes all node-wide traffic effectively use
that MTU value even though the interfaces might be configured with a higher
MTU. Then, with a double rolling reboot procedure, it should be possible to
change the MTU with no service disruption.
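
As a rough illustration of the mechanism (an assumption, not the actual
ovn-kubernetes change), pinning a `routable-mtu` on a route while the
interface keeps a higher MTU could look like this with the netlink library;
interface name, destination and values are examples only:

```go
package main

import (
	"net"

	"github.com/vishvananda/netlink"
)

// pinRouteMTU (re)creates a link-scoped route towards dst with an explicit,
// lower MTU metric, while the interface itself keeps its (higher) MTU, so
// outgoing traffic is sized to the lower value but larger incoming packets
// are still accepted.
func pinRouteMTU(ifname, dst string, routableMTU int) error {
	link, err := netlink.LinkByName(ifname)
	if err != nil {
		return err
	}
	_, dstNet, err := net.ParseCIDR(dst)
	if err != nil {
		return err
	}
	route := netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       dstNet,
		Scope:     netlink.SCOPE_LINK,
		MTU:       routableMTU, // lower than the interface MTU
	}
	return netlink.RouteReplace(&route)
}

func main() {
	// Example values only: a cluster-network route via the management port.
	_ = pinRouteMTU("ovn-k8s-mp0", "10.128.0.0/14", 1400)
}
```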

Decrease example (e.g. going from an `mtu` of 1500 to 1400):
* Set in ovn-config a `routable-mtu` setting (1400) lower than the `mtu`
  setting (1500).
* Do a rolling reboot. As nodes restart they will effectively use the lower
  MTU, but since the actual interface MTU did not change they will not drop
  traffic coming from other nodes.
* Set in ovn-config `mtu` equal to `routable-mtu` (1400), or replace `mtu`
  with the `routable-mtu` value and remove the latter.
* Do a rolling reboot. As nodes restart they will come up with their
  interfaces configured with the expected MTU. As other nodes are already
  effectively using this MTU, no traffic drop is expected.

Increase example (e.g. going from an `mtu` of 1400 to 8900):
* Set in ovn-config the current `mtu` value (1400) as `routable-mtu` and a new
  `mtu` setting (8900) higher than `routable-mtu`.
* Do a rolling reboot. Nodes will restart with the higher MTU setting
  configured on their interfaces but still be effectively using the lower MTU.
* Unset `routable-mtu` in ovn-config, leaving `mtu` at the target value
  (8900).
* Do a rolling reboot. As nodes restart they will use the higher MTU. As other
  nodes already have this MTU set on their interfaces, no drops are expected.

> **Reviewer:** This is confusing... I would suggest just using some numbers
> in your example.
>
> **Author (@jcaamano, Nov 17, 2021):** Prototyped it in
> ovn-kubernetes/ovn-kubernetes#2654, perhaps the description I gave there is
> easier to understand:
>
> The `routable-mtu` setting is introduced to facilitate a procedure allowing
> to change the MTU on a running cluster with minimum service disruption.
> Given current and target mtu values:
>
> 1. Set `mtu` to the higher MTU value and `routable-mtu` to the lower MTU
>    value.
> 2. Do a rolling reboot. As a node restarts, `routable-mtu` is set on all
>    appropriate routes while interfaces have `mtu` configured. The node will
>    effectively use the lower `routable-mtu` for outgoing traffic, but be
>    able to handle incoming traffic up to the higher `mtu`.
> 3. Change the MTU on all interfaces not handled by ovn-k to the target MTU
>    value. Since the MTU effectively used in the cluster is the lower one,
>    this has no impact on traffic.
> 4. Set `mtu` to the target MTU value and unset `routable-mtu`.
> 5. Do a rolling reboot. As a node restarts, the target MTU value is set on
>    the interfaces and the routes are reset to default MTU values. Since the
>    MTU effectively used in other nodes of the cluster is the lower one, but
>    they are able to handle the higher one, this has no impact on traffic.
>
> `routable-mtu` is set as the MTU for the following routes:
>
> * pod default route
> * non link scoped management port route
> * services route
> * link scoped node routes

This procedure should be coordinated with changing the MTU setting on br-ex
and its physical port.