107 changes: 107 additions & 0 deletions docs/user/troubleshooting.md
@@ -144,4 +144,111 @@ Images:
...
```

### One or more nodes are never Ready (Network / CNI issues)

You might see that one or more nodes never become Ready, e.g.:

```console
$ kubectl get nodes
NAME                        STATUS     ROLES     AGE       VERSION
ip-10-0-27-9.ec2.internal   NotReady   master    29m       v1.11.0+d4cacc0
...
```

This usually means that, for whatever reason, networking is not available on the node. You can confirm this by looking at the detailed output of the node:

```console
$ kubectl describe node ip-10-0-27-9.ec2.internal
... (lots of output skipped)
'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized'
```

The first thing to determine is the status of the SDN. The SDN deploys three daemonsets:
- *sdn-controller*, a control-plane component
- *sdn*, the node-level networking daemon
- *ovs*, the Open vSwitch management daemon

All 3 must be healthy (though only a single `sdn-controller` needs to be running). `sdn` and `ovs` must be running on every node, and DESIRED should equal AVAILABLE. On a healthy 2-node cluster you would see:

```console
$ kubectl -n openshift-sdn get daemonsets
NAME             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
ovs              2         2         2         2            2           beta.kubernetes.io/os=linux       2h
sdn              2         2         2         2            2           beta.kubernetes.io/os=linux       2h
sdn-controller   1         1         1         1            1           node-role.kubernetes.io/master=   2h
```
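
If DESIRED and AVAILABLE do not match for `sdn` or `ovs`, describing the affected daemonset will usually show why pods are missing or unhealthy. For example:

```sh
kubectl -n openshift-sdn describe daemonset sdn
```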

If, instead, no daemonsets are found at all:

```console
$ kubectl -n openshift-sdn get daemonsets
No resources found.
```

This means the network-operator didn't run. Skip ahead [to that section](#debugging-the-cluster-network-operator). Otherwise, let's debug the SDN.

#### Debugging the openshift-sdn

On the NotReady node, you need to find out which pods, if any, are in a bad state. Be sure to substitute in the correct `spec.nodeName` (or just remove it).

```console
$ kubectl -n openshift-sdn get pod --field-selector "spec.nodeName=ip-10-0-27-9.ec2.internal"
NAME        READY     STATUS             RESTARTS   AGE
ovs-dk8bh   1/1       Running            1          52m
sdn-8nl47   0/1       CrashLoopBackOff   3          52m
```

Then, retrieve the logs for the SDN pod (and for the OVS pod, if it is also failing):

```sh
kubectl -n openshift-sdn logs sdn-8nl47
```
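
If the `ovs-*` pod on the same node is also unhealthy, its logs are retrieved the same way (the pod name here is the one from the example output above):

```sh
kubectl -n openshift-sdn logs ovs-dk8bh
```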

Some common error messages:
- `Cannot fetch default cluster network`: This means the `sdn-controller` has failed to run to completion. Retrieve its logs with `kubectl -n openshift-sdn logs -l app=sdn-controller`.
- `warning: Another process is currently listening on the CNI socket, waiting 15s`: Something has gone wrong, and multiple SDN processes are running. SSH to the node in question and capture the output of `ps -faux` (see the sketch after this list). If you just need the cluster up, reboot the node.
- Error messages about ovs or Open vSwitch: Check that the `ovs-*` pod on the same node is healthy. Retrieve its logs with `kubectl -n openshift-sdn logs ovs-<name>`. Rebooting the node should fix it.
- Any indication that the control plane is unavailable: Check to make sure the apiserver is reachable from the node. You may be able to find useful information via `journalctl -f -u kubelet`.
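
The node-level checks mentioned above can be run together once you are on the affected node. A minimal sketch, assuming the default `core` user on the host (the hostname is the one from the earlier examples):

```sh
# SSH to the NotReady node; user and hostname are examples.
ssh core@ip-10-0-27-9.ec2.internal

# Look for multiple SDN processes competing for the CNI socket.
ps -faux | grep sdn

# Watch the kubelet logs for apiserver connectivity or CNI errors.
journalctl -f -u kubelet
```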

If you think it's a misconfiguration, file a [network operator](https://github.com/openshift/cluster-network-operator) issue. RH employees can also try #forum-sdn.

#### Debugging the cluster-network-operator
The cluster network operator is responsible for deploying the networking components. It does this in response to a special object created by the installer.

From a deployment perspective, the network operator is often the "canary in the coal mine." It runs very early in the installation process, after the master nodes have come up but before the bootstrap control plane has been torn down. Failures here can be indicative of more subtle installer issues, such as long delays in bringing up master nodes or apiserver communication issues. That said, the operator can also have bugs of its own.

First, determine that the network configuration exists:

```console
$ kubectl get networkconfigs.networkoperator.openshift.io default -oyaml
...
spec:
  additionalNetworks: null
  clusterNetworks:
  - cidr: 10.2.0.0/16
    hostSubnetLength: 9
  defaultNetwork:
    openshiftSDNConfig:
      mode: Networkpolicy
      otherConfig: null
    type: OpenshiftSDN
  serviceNetwork: 10.3.0.0/16
```

If it doesn't exist, the installer didn't create it. You'll have to run `openshift-install create manifests` to determine why.
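
A minimal sketch of that check; the `--dir` path is just an example, and the exact manifest file names vary between installer versions:

```sh
openshift-install create manifests --dir ./mycluster
grep -ril networkconfig ./mycluster/manifests
```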

Next, check that the network-operator is running:

```sh
kubectl -n openshift-cluster-network-operator get pods
```

And retrieve the logs. Note that, on multi-master systems, the operator performs leader election and all other operator pods will sleep:

```sh
kubectl -n openshift-cluster-network-operator logs -l "k8s-app=cluster-network-operator"
```

If appropriate, file a [network operator](https://github.com/openshift/cluster-network-operator) issue. RH employees can also try #forum-sdn.

[kubernetes-debug]: https://kubernetes.io/docs/tasks/debug-application-cluster/