The control plane is working, and we have a nice, highly available, load balanced Kubernetes API at our disposal. It's time for the worker nodes to join the party.
This chapter is a mix of deployment instructions and explanations of Kubernetes' inner workings. In particular, we're going to take this chapter as an opportunity to dive a bit deeper into how container runtimes work, and which underlying mechanisms are responsible for Kubernetes networking.
- Prerequisites
- Overview
- Turning control plane nodes into "pseudo-workers"
- Shell variables
- The Container Runtime
- CNI plugins
- kubelet
- Scheduling a first pod
- Routing pod traffic via the host machine
- Authorizing kube-apiserver to kubelet traffic
- kube-proxy
- Summary
Just like in the previous chapter, we'll be installing software on multiple nodes at once (both control and worker VMs). It is recommended to do this with tmux pane synchronization, as described before.
The most important Kubernetes component running on worker nodes is kubelet. It is responsible for announcing the worker node's presence in the cluster to kube-apiserver, and it is the top-level entity responsible for the lifecycle of all the pods/containers running on a worker node.
However, kubelet does not manage containers directly. This part of Kubernetes is highly abstracted, pluggable and extensible. Namely, there are (at least) two abstract specifications that kubelet integrates with:
- Container Runtime Interface (CRI)
- Container Network Interface (CNI)
The CRI is implemented by a container runtime, while the CNI is implemented by the so-called CNI plugins. We'll need to install them manually and configure kubelet properly to use them.
Finally, a worker node typically runs kube-proxy, a component responsible for handling and load balancing traffic to Kubernetes Services.
kubelet, container runtime and kube-proxy are typically necessary only on worker nodes, as these are the components needed to run actual cluster workloads, inside pods.
However, we'll install these components on control plane nodes as well. The reasons for that are technical, the most important of them being the fact that kube-apiserver occasionally needs to communicate with services running inside the cluster (e.g. admission webhooks).
This requires control plane nodes to participate in the cluster overlay network, so that service Cluster IPs are routable from them. This means that, at minimum, we need to run kube-proxy on control plane nodes. Unfortunately, kube-proxy refuses to run on a non-registered node, so we are forced to turn control plane nodes into fully-configured worker-like nodes with kubelet and container runtime.
Having said that, we want to avoid running any actual workloads on control plane nodes. Fortunately, Kubernetes has mechanisms for excluding nodes from regular pod scheduling, and we'll take advantage of that.
Let's define some reusable shell variables for this chapter. Run this in the SSH shell on all control & worker nodes:
arch=arm64
k8s_version=1.31.0
cri_version=1.31.1
runc_version=1.1.13
containerd_version=1.7.20
cni_plugins_version=1.5.1
cni_spec_version=1.0.0
All further instructions assume availability of these variables (make sure to run everything in the same shell).
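Since every later command depends on these variables, a quick guard can catch a shell in which they were never set (for example, a freshly opened tmux pane). The following is only a sketch; the helper name `require_vars` is made up for illustration and is not part of the original instructions.

```shell
# Hypothetical helper: report any of the named shell variables
# that are unset or empty, and fail if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps the function POSIX-compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `require_vars arch k8s_version cri_version runc_version containerd_version cni_plugins_version cni_spec_version` before starting any of the installation steps.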
Let's start worker setup with installation of the container runtime. We'll take this step as an opportunity to do a little introduction (or refresh) on what containerization fundamentally is and how it is realized in Linux. If you're not interested in this theoretical introduction, you can skip it.
A container, in practice, is a regular Linux process, but run in a special way, so that it has a different (i.e. limited) view of its environment, in comparison to a plain, non-containerized process. The goal of containerization is to provide a sufficient level of isolation between containerized processes, so that they cannot see or affect each other, or the host operating system. Despite their isolation, containerized processes still run in the same OS (kernel), which makes containerization a more lightweight alternative to full virtualization.
The Linux kernel implements two core features that make this isolation possible: the namespaces and the cgroups.
Namespaces put containerized processes into "sandboxes" where a process cannot "see" the outside of its sandbox. There are multiple namespace types, each one controlling a different aspect of what a process can see. The most important ones include:
- The mount namespace
  Makes the containerized process see a completely different set of mount points than on the host operating system, effectively making it have its own, isolated filesystem tree.
- The PID namespace
  Assigns a new, virtual PID to the containerized process (usually equal to 1) and hides all other processes from it, unless they are running in the same namespace.
- The user namespace
  Creates an illusion for the containerized process of running as a different user (often the root user) than it is actually being run as. True system users are invisible to the containerized process.
- The network namespace
  Makes the containerized process see a completely different set of network interfaces than on the host operating system. Usually this involves creating some kind of virtual ethernet interface visible within the container. This virtual interface is then connected in some way (e.g. bridged) to host OS interfaces (invisibly to the container).
Cgroups are a mechanism for putting resource limits (CPU, memory, IO, etc.) on containerized processes.
A Linux system has a global cgroup hierarchy, represented by a special filesystem. In case of Ubuntu, the cgroup hierarchy is already managed by systemd. The container runtime must be aware of that in order to cooperate with systemd. You will see that reflected in various configuration options throughout this chapter.
So, if you're looking for a short, technical (Linux-specific) and concrete answer to the question "what is a container?", the answer would be:
A container is a process isolated from its host operating system and other processes using Linux namespaces and cgroups.
It is important to stress the flexibility of isolation provided by namespaces and cgroups. In particular, it is possible to run a process with partial isolation, e.g. using only a separate network namespace, while leaving all other aspects of the system non-isolated. This is used in practice by Kubernetes to run pods with special "privileges". These pods can be used for direct configuration or monitoring of the nodes they run on.
Namespaces are also designed to be shared by multiple processes. This is also a standard thing in Kubernetes, e.g. all containers in a pod share the same network namespace.
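The kernel exposes the namespaces of every process as symbolic links under /proc/<pid>/ns, and two processes share a namespace exactly when the corresponding links point to the same target. The sketch below uses that to compare processes; the function names `ns_id` and `same_ns` are illustrative only.

```shell
# Print the namespace identifier of a process, e.g. "net:[4026531840]".
# $1 = namespace type (mnt, pid, user, net, ...), $2 = PID or "self".
ns_id() {
  readlink "/proc/$2/ns/$1"
}

# Succeed if two processes share a namespace of the given type.
# Returns 2 if either identifier cannot be read (e.g. missing permissions).
# $1 = namespace type, $2 and $3 = PIDs to compare.
same_ns() {
  a=$(ns_id "$1" "$2") || return 2
  b=$(ns_id "$1" "$3") || return 2
  [ "$a" = "$b" ]
}
```

Once pods are running, comparing the PIDs of two containers from the same pod (as root) with `same_ns net <pid1> <pid2>` should succeed, while `same_ns mnt <pid1> <pid2>` would typically fail.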
The container runtime for our deployment consists of three elements:
- containerd, a system daemon that manages the lifecycle of containers; it contains an implementation of the CRI and is invoked by kubelet
- runc, a low-level utility for launching containerized processes; it is a reference implementation of the OCI runtime specification and is invoked by containerd
- crictl, a command line tool to inspect and manage containers, installed for use by humans for monitoring and troubleshooting purposes
Note
Note how Docker is not involved in the container runtime, even though we are going to be running Docker images. The relationship between Docker, containerd, CRI, OCI, etc. is complex, and has evolved repeatedly over time. Long story short, using containerd and runc is - for our purposes - equivalent to using Docker, because nowadays Docker is built on top of these lower-level utilities anyway. We are only getting rid of Docker's "frontend" - which is nice if you want to use it directly but not essential for Kubernetes.
Download and install the container runtime binaries on all control & worker nodes:
crictl_archive=crictl-v${cri_version}-linux-${arch}.tar.gz
containerd_archive=containerd-${containerd_version}-linux-${arch}.tar.gz
wget -q --show-progress --https-only --timestamping \
https://github.com/kubernetes-sigs/cri-tools/releases/download/v${cri_version}/${crictl_archive} \
https://github.com/opencontainers/runc/releases/download/v${runc_version}/runc.${arch} \
https://github.com/containerd/containerd/releases/download/v${containerd_version}/${containerd_archive}
mkdir -p containerd
tar -xvf $crictl_archive
tar -xvf $containerd_archive -C containerd
cp runc.${arch} runc
chmod +x crictl runc
sudo cp crictl runc /usr/local/bin/
sudo cp containerd/bin/* /bin/
Configure containerd:
sudo mkdir -p /etc/containerd/
cat << EOF | sudo tee /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
BinaryName = "/usr/local/bin/runc"
EOF
Create a systemd unit file for containerd:
cat <<EOF | sudo tee /etc/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target
[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/bin/containerd
Restart=always
RestartSec=5
Delegate=yes
KillMode=process
OOMScoreAdjust=-999
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
[Install]
WantedBy=multi-user.target
EOF
Enable and run it:
sudo systemctl daemon-reload
sudo systemctl enable containerd
sudo systemctl start containerd
As already mentioned, kubelet uses an abstraction layer called CNI (Container Network Interface) in order to set up pod networking. The CNI is implemented by a set of plugins.
A CNI plugin is an executable program responsible for configuring some aspect of pod networking. Every plugin is configured separately, and ultimately they are invoked in a chain, following a well-defined order. Collectively, CNI plugins are responsible for configuring the network namespace for each pod. This includes setting up virtual interfaces seen from within the pod, connecting them to the external world (the host system), and assigning IP addresses. This may also include putting in place various, more complex network traffic manipulation mechanisms based on lower-level Linux features such as iptables, IPVS or eBPF.
The primary goal is to satisfy the fundamental assumption of Kubernetes networking: all pods in the cluster (across all nodes) must be able to communicate with each other without any network address translation. Pods use a dedicated, cluster-internal IP range. When pod-to-pod traffic needs to be forwarded between worker nodes, it is the responsibility of the CNI layer to set up some form of forwarding, tunnelling, etc. that is invisible to individual pods.
During control plane setup, we have already decided that 10.0.0.0/12 is going to be the IP range for all pods in the cluster.
Now we also need to split this range between individual nodes. We'll use the second octet of the IP address to encode the VM id, and reduce the subnet size to /16.
Let's save this into some shell variables:
vmname=$(hostname -s)
case "$vmname" in
control*)
vmid=$((1 + ${vmname:7}));;
worker*)
vmid=$((4 + ${vmname:6}));;
*)
echo "expected control or worker VM, got $vmname"; return 1;;
esac
pod_cidr=10.${vmid}.0.0/16
Note how the pod CIDR is disjoint from the Service CIDR, which we have configured as 10.32.0.0/16.
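The relationship between these ranges can be checked with a bit of integer arithmetic: an address belongs to a CIDR block when both agree on the leading prefix bits. The following is a minimal sketch with hypothetical helper names (`ip_to_int`, `in_cidr`); it handles plain dotted-quad IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside CIDR block $2 (e.g. 10.0.0.0/12).
in_cidr() {
  addr=$(ip_to_int "$1")
  base=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Network mask with the top $bits bits set.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( base & mask )) ]
}
```

For example, `in_cidr 10.5.0.2 10.0.0.0/12` succeeds (a pod address on the node with VM id 5), while `in_cidr 10.32.0.10 10.0.0.0/12` fails, because Service addresses lie outside the pod range.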
In this guide, we'll use a very simple setup provided by reference implementations of CNI plugins.
First, let's download and install them into the system:
cni_plugins_archive=cni-plugins-linux-${arch}-v${cni_plugins_version}.tgz
wget -q --show-progress --https-only --timestamping \
https://github.com/containernetworking/plugins/releases/download/v${cni_plugins_version}/${cni_plugins_archive}
sudo mkdir -p /opt/cni/bin
sudo tar -xvf $cni_plugins_archive -C /opt/cni/bin/
Now, enable and configure the desired plugins. We'll use two of them: one to set up the loopback interface, and another to set up a virtual ethernet interface bridged to host network.
sudo mkdir -p /etc/cni/net.d
cat <<EOF | sudo tee /etc/cni/net.d/10-bridge.conf
{
"cniVersion": "${cni_spec_version}",
"name": "bridge",
"type": "bridge",
"bridge": "cnio0",
"isGateway": true,
"ipMasq": true,
"ipam": {
"type": "host-local",
"ranges": [
[{"subnet": "${pod_cidr}"}]
],
"routes": [{"dst": "0.0.0.0/0"}]
}
}
EOF
cat <<EOF | sudo tee /etc/cni/net.d/99-loopback.conf
{
"cniVersion": "${cni_spec_version}",
"name": "lo",
"type": "loopback"
}
EOF
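Because the heredoc expands ${pod_cidr} at the moment the file is written, an unset variable silently produces an empty "subnet" value. A small check like the one below can catch that early; the function name `cni_subnet_ok` is an illustrative assumption, and the pattern only verifies that the value looks like an IPv4 CIDR.

```shell
# Hypothetical check: succeed only if the given config file contains a
# "subnet" entry that looks like an IPv4 CIDR (e.g. 10.5.0.0/16).
cni_subnet_ok() {
  grep -Eq '"subnet": *"[0-9]+(\.[0-9]+){3}/[0-9]+"' "$1"
}
```

On a node, this could be run as `cni_subnet_ok /etc/cni/net.d/10-bridge.conf || echo "pod_cidr was not set"`.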
The CNI plugins are now ready to be invoked by kubelet.
Download and install the kubelet binary:
wget -q --show-progress --https-only --timestamping \
https://storage.googleapis.com/kubernetes-release/release/v${k8s_version}/bin/linux/${arch}/kubelet
chmod +x kubelet
sudo cp kubelet /usr/local/bin/
Copy all the necessary security-related files into place:
sudo mkdir -p /var/lib/kubelet/ /var/lib/kubernetes/
sudo cp ${vmname}-key.pem ${vmname}.pem /var/lib/kubelet/
sudo cp ${vmname}.kubeconfig /var/lib/kubelet/kubeconfig
sudo cp ca.pem /var/lib/kubernetes/
Configure kubelet:
cat <<EOF | sudo tee /var/lib/kubelet/kubelet-config.yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: "/var/lib/kubernetes/ca.pem"
authorization:
  mode: Webhook
clusterDomain: "cluster.local"
clusterDNS:
  - "10.32.0.10"
resolvConf: "/run/systemd/resolve/resolv.conf"
runtimeRequestTimeout: "15m"
tlsCertFile: "/var/lib/kubelet/${vmname}.pem"
tlsPrivateKeyFile: "/var/lib/kubelet/${vmname}-key.pem"
containerRuntimeEndpoint: "unix:///var/run/containerd/containerd.sock"
cgroupDriver: "systemd"
EOF
if [[ $vmname =~ ^control[0-9]+ ]]; then cat <<EOF | sudo tee -a /var/lib/kubelet/kubelet-config.yaml
registerWithTaints:
  - key: node-role.kubernetes.io/control-plane
    value: ""
    effect: NoSchedule
EOF
fi
Important
The registerWithTaints configuration option is appended only on control plane nodes, and it ensures that they are excluded from regular pod scheduling (unless very explicitly requested).
Note
10.32.0.10 is the (arbitrarily chosen) address of a cluster-internal DNS server. We will install it in the next chapter. kubelet must be explicitly aware of this address because it needs to be configured as the DNS server address on every pod's virtual network interface.
Create a systemd unit file:
cat <<EOF | sudo tee /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service
Requires=containerd.service
[Service]
ExecStart=/usr/local/bin/kubelet \\
--config=/var/lib/kubelet/kubelet-config.yaml \\
--kubeconfig=/var/lib/kubelet/kubeconfig \\
--v=2
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
Enable and run it:
sudo systemctl daemon-reload
sudo systemctl enable kubelet
sudo systemctl start kubelet
Warning
kubelet by default requires that swap is turned off. This seems to be the case for Ubuntu cloud images. However, just to be sure, you can run sudo swapoff -a on all nodes.
Upon launching kubelet, the nodes will join the cluster. To verify, run this on your host machine:
kubectl get nodes -o wide
You should see an output like this:
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control0   Ready    <none>   59s   v1.28.3   192.168.1.11   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
control1   Ready    <none>   59s   v1.28.3   192.168.1.12   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
control2   Ready    <none>   59s   v1.28.3   192.168.1.13   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
worker0    Ready    <none>   59s   v1.28.3   192.168.1.14   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
worker1    Ready    <none>   59s   v1.28.3   192.168.1.15   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
worker2    Ready    <none>   59s   v1.28.3   192.168.1.16   <none>        Ubuntu 22.04.3 LTS   5.15.0-83-generic   containerd://1.7.7
At this point our Kubernetes deployment is starting to become functional. We should already be able to schedule some pods. Let's try it out:
kubectl run busybox --image=busybox --command -- sleep 3600
Then, run kubectl get pods -o wide and you should see an output like this:
NAME      READY   STATUS    RESTARTS   AGE    IP         NODE      NOMINATED NODE   READINESS GATES
busybox   1/1     Running   0          6m4s   10.5.0.2   worker1   <none>           <none>
Just out of curiosity, let's see what the CNI layer actually does. Go to the SSH shell of the worker node running the pod (use Ctrl+b, z in tmux to zoom a single pane) and list network interfaces with sudo ip addr. Among the standard VM network interfaces, you should also see two new interfaces:
3: cnio0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 46:ec:cb:d5:8a:ab brd ff:ff:ff:ff:ff:ff
inet 10.5.0.1/16 brd 10.5.255.255 scope global cnio0
valid_lft forever preferred_lft forever
inet6 fe80::44ec:cbff:fed5:8aab/64 scope link
valid_lft forever preferred_lft forever
4: veth609000bb@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cnio0 state UP group default
link/ether 56:cc:ca:be:a7:51 brd ff:ff:ff:ff:ff:ff link-netns cni-408de7e4-b0c4-ad7c-51a3-cf76805b3289
inet6 fe80::54cc:caff:febe:a751/64 scope link
valid_lft forever preferred_lft forever
cnio0 is the bridge created by the bridge CNI plugin. We can see that it got an IP address from the pod IP range for this worker node. This way pods can communicate directly with the worker node, and it can serve as a default routing gateway for pods.
veth609000bb is a virtual ethernet interface. An interface like this is created for every pod. There are some interesting details to note about it:
- master cnio0 indicates that this interface is connected to the bridge
- link-netns cni-408de7e4-b0c4-ad7c-51a3-cf76805b3289 indicates that this interface is connected to another interface in the network namespace cni-408de7e4-b0c4-ad7c-51a3-cf76805b3289 (an emulated point-to-point connection). As we can guess, this is going to be the pod's namespace.
- The @if2 part indicates the corresponding interface in the target network namespace
The virtual interface of the pod is on the other side of the point-to-point connection starting at veth609000bb. We cannot see it now. In order to see it, we must break into the network namespace. Fortunately, this is easy to do with the ip netns command:
sudo ip netns exec cni-408de7e4-b0c4-ad7c-51a3-cf76805b3289 ip addr
This command executes an ip addr command from within a specified network namespace. The output should look like this:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 3a:9e:9b:36:b7:08 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.5.0.2/16 brd 10.5.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::389e:9bff:fe36:b708/64 scope link
valid_lft forever preferred_lft forever
And this is finally what the pod sees. We can see its IP address configured on the eth0 virtual interface. The @if4 and link-netnsid 0 confirm that this is the "other side" of veth609000bb.
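The two ends of a veth pair can be matched mechanically: each end reports its own interface index at the start of the line and its peer's index after @if. The sketch below extracts both numbers from the first line of an ip addr entry; the function name `veth_pair` is hypothetical, and the line is assumed to contain an @if suffix.

```shell
# Given the first line of an interface entry from `ip addr`, print
# "<own index> <peer index>", e.g. "4 2" for "4: veth609000bb@if2: ...".
# Assumes the interface name carries an "@if<N>" suffix.
veth_pair() {
  line=$1
  own=${line%%:*}      # text before the first colon
  rest=${line#*@if}    # text after "@if"
  peer=${rest%%:*}     # up to the next colon
  echo "$own $peer"
}
```

For the host-side line above this prints `4 2`, and for the pod-side eth0@if4 line it prints `2 4`, i.e. each interface names the other as its peer.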
In the rudimentary setup that we are using now, pod networking also involves some address translation via iptables. Let's see what's going on there:
sudo iptables-save
We can see a chain and some rules that got created specifically for this particular pod:
:CNI-c8c65bddd829b2f007c0887f - [0:0]
-A POSTROUTING -s 10.5.0.2/32 -m comment --comment "name: \"bridge\" id: \"1719d3feb472cc90d9694f539d72fe284ad49ac4dae226e91936e3f80a326828\"" -j CNI-c8c65bddd829b2f007c0887f
-A CNI-c8c65bddd829b2f007c0887f -d 10.5.0.0/16 -m comment --comment "name: \"bridge\" id: \"1719d3feb472cc90d9694f539d72fe284ad49ac4dae226e91936e3f80a326828\"" -j ACCEPT
-A CNI-c8c65bddd829b2f007c0887f ! -d 224.0.0.0/4 -m comment --comment "name: \"bridge\" id: \"1719d3feb472cc90d9694f539d72fe284ad49ac4dae226e91936e3f80a326828\"" -j MASQUERADE
These rules effectively enable source NAT (-j MASQUERADE) for traffic that this pod sends outside its node's pod subnet (10.5.0.0/16), e.g. when it communicates with another pod, scheduled on another node.
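To make the effect of the three rules shown above easier to follow, here is a toy model of the decision they encode for a packet leaving the pod 10.5.0.2. It is a sketch for illustration only: the function name `cni_nat_action` and the "fall-through" label are made up, and the matching is done on string prefixes rather than real netmasks.

```shell
# Toy model of the CNI chain for a pod on the 10.5.0.0/16 node:
# - destinations in the node's own pod subnet are accepted untouched,
# - multicast destinations (224.0.0.0/4, first octet 224-239) match no rule,
# - everything else is source-NATed.
cni_nat_action() {
  case "$1" in
    10.5.*)               echo ACCEPT ;;
    22[4-9].*|23[0-9].*)  echo fall-through ;;
    *)                    echo MASQUERADE ;;
  esac
}
```

For example, `cni_nat_action 10.5.0.3` prints ACCEPT (a pod on the same node), while `cni_nat_action 10.6.0.2` prints MASQUERADE (a pod on another node).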
The CNI configures a source NAT for communication between pods, but the destination address is not changed. This means that pod IP addresses must be routable within the local network where VMs live.
Unfortunately, this is a result of the fact that our network setup in this chapter is very rudimentary. It is regrettable that cluster-internal IP addresses show up outside the cluster, even on the host machine itself. We need to remedy this by adding appropriate routes on the host machine:
for vmid in $(seq 1 6); do
sudo route -n add -net 10.${vmid}.0.0/16 192.168.1.$((10 + $vmid))
done
Important
Make sure routes are added while at least one VM is running, so that the bridge interface exists. Unfortunately, if you stop all the VMs, the routes will be deleted.
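Before changing the host's routing table, it may help to print the intended mapping and compare it with the node addresses reported by kubectl get nodes -o wide. The helper below is a hypothetical dry-run sketch (the name `print_pod_routes` is not from the original guide); it only prints text and does not touch any routes.

```shell
# Print (rather than install) one route per node: the pod subnet
# 10.<vmid>.0.0/16 is reached via the node address 192.168.1.<10 + vmid>.
print_pod_routes() {
  for vmid in $(seq 1 6); do
    echo "10.${vmid}.0.0/16 via 192.168.1.$((10 + vmid))"
  done
}
```

The first line printed is `10.1.0.0/16 via 192.168.1.11` (control0) and the last is `10.6.0.0/16 via 192.168.1.16` (worker2).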
A better solution to this problem would be to use a CNI implementation that does not expose cluster-internal IP addresses to the nodes' network. We'll do that in an extra chapter where we'll replace the default CNI plugins with Cilium.
As mentioned in Bootstrapping Kubernetes Security, some cluster operations require kube-apiserver to call kubelet. Those operations include executing commands in pods, setting up port forwarding, fetching pod logs, etc.
kubelet needs to authorize these operations. It does that by... consulting kube-apiserver, so we end up with somewhat of a silly situation where kube-apiserver just authorizes itself. Regardless of that, the RBAC rules for this are not set up automatically. We need to put them in place manually.
On the host machine, invoke:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-apiserver-to-kubelet
rules:
  - apiGroups:
      - ""
    resources:
      - nodes/proxy
      - nodes/stats
      - nodes/log
      - nodes/spec
      - nodes/metrics
    verbs:
      - "*"
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-apiserver
  namespace: ""
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-apiserver-to-kubelet
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: kubernetes
EOF
Let's verify that it works by executing a command in the running busybox pod:
kubectl exec -it busybox -- sh
The final component we need for a fully configured node is kube-proxy, which is responsible for handling and load balancing traffic destined for Kubernetes Services.
Note
In an extra chapter, we'll replace kube-proxy with Cilium.
Download and install the binary:
wget -q --show-progress --https-only --timestamping \
https://storage.googleapis.com/kubernetes-release/release/v${k8s_version}/bin/linux/${arch}/kube-proxy
chmod +x kube-proxy
sudo cp kube-proxy /usr/local/bin/
Configure it:
sudo mkdir -p /var/lib/kube-proxy/
sudo cp kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig
cat <<EOF | sudo tee /var/lib/kube-proxy/kube-proxy-config.yaml
kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig"
mode: "iptables"
clusterCIDR: "10.0.0.0/12"
EOF
Create a systemd unit file:
cat <<EOF | sudo tee /etc/systemd/system/kube-proxy.service
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes
[Service]
ExecStart=/usr/local/bin/kube-proxy \\
--config=/var/lib/kube-proxy/kube-proxy-config.yaml
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
Launch it:
sudo systemctl daemon-reload
sudo systemctl enable kube-proxy
sudo systemctl start kube-proxy
There's one more technical hurdle to overcome with kube-proxy. By default, it uses iptables to set up Service IP handling and load balancing. Unfortunately, this does not always work well with our bridge-based CNI configuration and default Linux behaviour.
Here's a problematic scenario:
- Pod A (10.4.0.2), running on worker0, connects to Service S (10.32.0.2)
- kube-proxy load balancing chooses Pod B (10.4.0.3), also running on worker0, as the endpoint for this connection
- iptables rules translate the destination Service address (10.32.0.2) to Pod B address (10.4.0.3)
- Pod B receives the connection and responds. The returning packet has source 10.4.0.3 and destination 10.4.0.2
- At this point, iptables should translate the source address of the returning packet back to the Service address, 10.32.0.2. Unfortunately, this does not happen. As a result, Pod A receives a packet whose source address does not match its original destination address, and the packet is dropped.
Why doesn't iptables fire on the returning packet? The reason is that a packet from 10.4.0.3 to 10.4.0.2 is Layer 2 only traffic - it just needs to pass the bridge shared between pods. iptables, on the other hand, is a Layer 3 thing.
So, overall, this behavior makes sense 🤷. Unfortunately, it breaks our deployment and we have to do something about it. Luckily, there's a hack to force Linux to run iptables even for bridge-only traffic:
sudo modprobe br_netfilter
Run this on all control and worker nodes. In order to make it persistent, add it to cloud-init/user-data.control and cloud-init/user-data.worker:
write_files:
  - path: /etc/modules-load.d/cloud-init.conf
    content: |
      br_netfilter
runcmd:
  - modprobe br_netfilter
Note
We won't need this when we replace kube-proxy with a Cilium-based solution (or any other that doesn't use iptables).
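Whether bridged traffic currently passes through iptables can be read from a sysctl file that exists only while the br_netfilter module is loaded. The helper below is a small sketch for checking this on a node; the name `bridge_nf_status` and its three output labels are illustrative choices.

```shell
# Report whether bridged traffic is passed through iptables on this node.
# The sysctl file only exists while the br_netfilter module is loaded.
bridge_nf_status() {
  f=/proc/sys/net/bridge/bridge-nf-call-iptables
  if [ ! -r "$f" ]; then
    echo "not-loaded"
  elif [ "$(cat "$f")" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}
```

After running modprobe br_netfilter, the function would be expected to print `enabled` on a default Ubuntu configuration; `not-loaded` suggests the module is missing, e.g. after a reboot without the persistent configuration.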
Let's deploy a dummy Deployment with 3 replicas of an HTTP echo server, along with a Service on top of it:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  labels:
    app: echo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: hashicorp/http-echo
          ports:
            - containerPort: 5678
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  selector:
    app: echo
  ports:
    - protocol: TCP
      port: 5678
      targetPort: 5678
EOF
Let's test it out by making a request to this service. First we'll need the cluster IP of the service (we don't have a cluster-internal DNS server installed yet). You can easily find out this IP with kubectl get svc echo. In my case, it was 10.32.152.5.
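If you prefer not to copy the address by hand, the cluster IP and port can be extracted with kubectl's jsonpath output format. The wrapper below is a sketch meant for the host machine, where kubectl is configured; the function name `service_url` is an assumption for illustration.

```shell
# Print the URL of a Service's first port, built from its cluster IP.
# $1 = Service name (looked up in the current namespace).
service_url() {
  ip=$(kubectl get svc "$1" -o jsonpath='{.spec.clusterIP}') || return 1
  port=$(kubectl get svc "$1" -o jsonpath='{.spec.ports[0].port}') || return 1
  echo "http://${ip}:${port}"
}
```

Running `service_url echo` on the host machine prints a URL that can then be passed to curl on one of the nodes.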
Now, let's try to contact this service from within a node. Invoke this on any control or worker node:
$ curl http://10.32.152.5:5678
hello-world
You should see an output consisting of hello-world, which indicates that the service works and has returned an HTTP response.
Now, let's take a peek at what's really going on. The service IP is picked up by iptables rules and translated into the IP of one of the pods implementing the service (randomly). If we go through the output of iptables-save on any of the nodes, we can pick up the relevant parts:
*nat
...
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
...
-A KUBE-SERVICES -d 10.32.152.5/32 -p tcp -m comment --comment "default/echo cluster IP" -m tcp --dport 5678 -j KUBE-SVC-HV6DMF63W6MGLRDE
...
-A KUBE-SVC-HV6DMF63W6MGLRDE -m comment --comment "default/echo -> 10.4.0.14:5678" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-7G5D55VBK7L326G3
-A KUBE-SVC-HV6DMF63W6MGLRDE -m comment --comment "default/echo -> 10.5.0.25:5678" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-HS2AVEBF7XNLG3WC
-A KUBE-SVC-HV6DMF63W6MGLRDE -m comment --comment "default/echo -> 10.6.0.18:5678" -j KUBE-SEP-OQSOJ7ZUSSHWFS7Y
...
-A KUBE-SEP-7G5D55VBK7L326G3 -p tcp -m comment --comment "default/echo" -m tcp -j DNAT --to-destination 10.4.0.14:5678
-A KUBE-SEP-HS2AVEBF7XNLG3WC -p tcp -m comment --comment "default/echo" -m tcp -j DNAT --to-destination 10.5.0.25:5678
-A KUBE-SEP-OQSOJ7ZUSSHWFS7Y -p tcp -m comment --comment "default/echo" -m tcp -j DNAT --to-destination 10.6.0.18:5678
The interesting rules are the ones in the KUBE-SVC-HV6DMF63W6MGLRDE chain, which are set up so that only one of them fires, at random, with uniform probability. This is how kube-proxy leverages iptables to implement load balancing.
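The probabilities in the rules above (0.333..., 0.5, and an unconditional last rule) look uneven, but since each rule is only evaluated when all previous ones did not fire, the overall chance of each endpoint comes out equal. The sketch below reproduces this arithmetic for N endpoints; the function name `rule_probabilities` is made up, and the 1/(N-i) formula is inferred from the rule dump rather than taken from kube-proxy's source.

```shell
# For N endpoints, print one line per rule with two numbers:
# the probability configured on the rule, 1/(N-i), and the resulting
# overall chance that this rule is the one that fires.
rule_probabilities() {
  LC_ALL=C awk -v n="$1" 'BEGIN {
    remaining = 1.0                 # chance that no earlier rule fired
    for (i = 0; i < n; i++) {
      p = 1.0 / (n - i)             # probability configured on rule i
      printf "%.5f %.5f\n", p, remaining * p
      remaining *= (1 - p)
    }
  }'
}
```

For three endpoints, `rule_probabilities 3` prints 0.33333, 0.50000 and 1.00000 in the first column, with 0.33333 in the second column on every line.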
In this chapter, we have:
- learned about container runtimes and the foundations of Kubernetes networking
- learned about Linux namespaces and cgroups, core kernel features that make containers possible
- installed the container runtime, CNI plugins, kubelet and kube-proxy on control and worker nodes
- tested the cluster by deploying pods and services
- peeked into the inner workings of CNI plugins and kube-proxy by inspecting network interfaces and namespaces, as well as iptables rules that make up the Kubernetes overlay network