RFE-821 - Adding to the support section. #23316
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/troubleshooting-crio-issues.adoc | ||
|
|
||
| [id="about-crio_{context}"] | ||
| = About CRI-O container runtime engine | ||
|
|
||
| CRI-O is a Kubernetes-native container runtime implementation that integrates closely with the operating system to deliver an efficient and optimized Kubernetes experience. CRI-O provides facilities for running, stopping, and restarting containers. | ||
|
|
||
| The CRI-O container runtime engine is managed using a systemd service on each {product-title} cluster node. When container runtime issues occur, verify the status of the `crio` systemd service on each node. Gather CRI-O journald unit logs from nodes that manifest container runtime issues. |
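A minimal sketch of how these checks might look from a cluster host; the node name is a placeholder, and the commands assume a cluster where node debugging with `oc` is permitted:

----
# Check the status of the crio systemd service on a single node via a debug pod
$ oc debug node/<node_name> -- chroot /host systemctl status crio

# Gather CRI-O journald unit logs from all control plane (master) nodes
$ oc adm node-logs --role=master -u crio
----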
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/gathering-cluster-data.adoc | ||
|
|
||
| [id="about-sosreport_{context}"] | ||
| = About `sosreport` | ||
|
|
||
| `sosreport` is a tool that collects configuration details, system information, and diagnostic data from {op-system-base-full} and {op-system-first} systems. `sosreport` provides a standardized way to collect diagnostic information relating to a node, which can then be provided to Red Hat Support for issue diagnosis. | ||
|
|
||
| In some support interactions, Red Hat Support may ask you to collect a `sosreport` archive for a specific {product-title} node. For example, it might sometimes be necessary to review system logs or other node-specific data that is not included within the output of `oc adm must-gather`. |
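A minimal sketch of one way such an archive might be collected, assuming the target is an {op-system-first} node where the `toolbox` utility is available; the node name and plugin options are illustrative:

----
$ oc debug node/<node_name>                   # start a debug pod on the target node
# chroot /host                                # use the host's executable paths
# toolbox                                     # launch a support tools container
# sosreport -k crio.all=on -k crio.logs=on    # collect the archive, including CRI-O data
----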
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/investigating-pod-issues.adoc | ||
|
|
||
| [id="accessing-running-pods_{context}"] | ||
| = Accessing running Pods | ||
|
|
||
| You can review running Pods dynamically by opening a shell inside a Pod or by gaining network access through port forwarding. | ||
|
|
||
| .Prerequisites | ||
|
|
||
| * You have access to the cluster as a user with the `cluster-admin` role. | ||
| * Your API service is still functional. | ||
| * You have installed the OpenShift CLI (`oc`). | ||
|
|
||
| .Procedure | ||
|
|
||
| . Switch into the project that contains the Pod you would like to access. This is necessary because the `oc rsh` command does not accept the `-n` namespace option: | ||
| + | ||
| ---- | ||
| $ oc project <namespace> | ||
| ---- | ||
|
|
||
| . Start a remote shell into a Pod: | ||
| + | ||
| ---- | ||
| $ oc rsh <pod_name> <1> | ||
| ---- | ||
| <1> If a Pod has multiple containers, `oc rsh` defaults to the first container unless `-c <container_name>` is specified. | ||
|
|
||
| . Start a remote shell into a specific container within a Pod: | ||
| + | ||
| ---- | ||
| $ oc rsh -c <container_name> pod/<pod_name> | ||
| ---- | ||
|
|
||
| . Create a port forwarding session to a port on a Pod: | ||
| + | ||
| ---- | ||
| $ oc port-forward <pod_name> <host_port>:<pod_port> <1> | ||
| ---- | ||
| <1> Press `Ctrl+C` to cancel the port forwarding session. |
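For example, a hypothetical Pod serving on port `8080` could be reached from a workstation as follows; the Pod name, ports, and request path are placeholders:

----
$ oc port-forward my-app-1-akdlg 8080:8080

# In a second terminal, exercise the forwarded port locally
$ curl http://localhost:8080/healthz
----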
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/troubleshooting-installations.adoc | ||
|
|
||
| [id="checking-load-balancer-configuration_{context}"] | ||
| = Checking a load balancer configuration before {product-title} installation | ||
|
|
||
| Check your load balancer configuration prior to starting an {product-title} installation. | ||
|
|
||
| .Prerequisites | ||
|
|
||
| * You have configured an external load balancer of your choosing, in preparation for an {product-title} installation. The following example is based on a {op-system-base-full} host using HAProxy to provide load balancing services to a cluster. | ||
| * You have configured DNS in preparation for an {product-title} installation. | ||
| * You have SSH access to your load balancer. | ||
|
|
||
| .Procedure | ||
|
|
||
| . Check that the `haproxy` systemd service is active: | ||
| + | ||
| ---- | ||
| $ ssh <user_name>@<load_balancer> systemctl status haproxy | ||
| ---- | ||
|
|
||
| . Verify that the load balancer is listening on the required ports. The following example references ports `80`, `443`, `6443`, and `22623`. | ||
| + | ||
| * For HAProxy instances running on {op-system-base-full} 6, verify port status by using the `netstat` command: | ||
| + | ||
| ---- | ||
| $ ssh <user_name>@<load_balancer> netstat -nltupe | grep -E ':80|:443|:6443|:22623' | ||
| ---- | ||
| + | ||
| * For HAProxy instances running on {op-system-base-full} 7 or 8, verify port status by using the `ss` command: | ||
| + | ||
| ---- | ||
| $ ssh <user_name>@<load_balancer> ss -nltupe | grep -E ':80|:443|:6443|:22623' | ||
| ---- | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| Red Hat recommends the `ss` command instead of `netstat` in {op-system-base-full} 7 or later. `ss` is provided by the iproute package. For more information on the `ss` command, see the link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-ss[{op-system-base-full} 7 Performance Tuning Guide]. | ||
| ==== | ||
| + | ||
| . Check that the wildcard DNS record resolves to the load balancer: | ||
| + | ||
| ---- | ||
| $ dig <wildcard_fqdn> @<dns_server> | ||
| ---- | ||
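The ports checked above come from the load balancer configuration itself. As a point of reference only, a minimal HAProxy frontend/backend pair for the API port might look like the following sketch; the backend host names and IP addresses are placeholders and are not part of this procedure:

----
frontend api
    bind *:6443
    mode tcp
    default_backend control-plane-api

backend control-plane-api
    mode tcp
    balance roundrobin
    server bootstrap <bootstrap_ip>:6443 check
    server master-0 <master_0_ip>:6443 check
    server master-1 <master_1_ip>:6443 check
    server master-2 <master_2_ip>:6443 check
----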
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/investigating-pod-issues.adoc | ||
|
|
||
| [id="copying-files-pods-and-containers_{context}"] | ||
| = Copying files to and from Pods and containers | ||
|
|
||
| You can copy files to and from a Pod to test configuration changes or gather diagnostic information. | ||
|
|
||
| .Prerequisites | ||
|
|
||
| * You have access to the cluster as a user with the `cluster-admin` role. | ||
| * Your API service is still functional. | ||
| * You have installed the OpenShift CLI (`oc`). | ||
|
|
||
| .Procedure | ||
|
|
||
| . Copy a file to a Pod: | ||
| + | ||
| ---- | ||
| $ oc cp <local_path> <pod_name>:/<path> -c <container_name> <1> | ||
| ---- | ||
| <1> Note that a Pod's first container will be selected if the `-c` option is not specified. | ||
|
|
||
| . Copy a file from a Pod: | ||
| + | ||
| ---- | ||
| $ oc cp <pod_name>:/<path> -c <container_name> <local_path> <1> | ||
| ---- | ||
| <1> Note that a Pod's first container will be selected if the `-c` option is not specified. | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| For `oc cp` to function, the `tar` binary must be available within the container. | ||
| ==== | ||
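As an illustration, copying a local configuration file into a container and retrieving an application log back might look like the following; all Pod, container, and path names are hypothetical:

----
$ oc cp ./settings.conf my-app-1-akdlg:/tmp/settings.conf -c my-container

$ oc cp my-app-1-akdlg:/var/log/my-application.log ./my-application.log -c my-container
----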
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/troubleshooting-installations.adoc | ||
|
|
||
| [id="determining-where-installation-issues-occur_{context}"] | ||
| = Determining where installation issues occur | ||
|
|
||
| When troubleshooting {product-title} installation issues, you can monitor installation logs to determine at which stage issues occur. Then, retrieve diagnostic data relevant to that stage. | ||
|
|
||
| {product-title} installation proceeds through the following stages: | ||
|
|
||
| . Ignition configuration files are created. | ||
|
|
||
| . The bootstrap machine boots and starts hosting the remote resources required for the master machines to boot. | ||
|
|
||
| . The master machines fetch the remote resources from the bootstrap machine and finish booting. | ||
|
|
||
| . The master machines use the bootstrap machine to form an etcd cluster. | ||
|
|
||
| . The bootstrap machine starts a temporary Kubernetes control plane using the new etcd cluster. | ||
|
|
||
| . The temporary control plane schedules the production control plane to the master machines. | ||
|
|
||
| . The temporary control plane shuts down and passes control to the production control plane. | ||
|
|
||
| . The bootstrap machine adds {product-title} components into the production control plane. | ||
|
|
||
| . The installation program shuts down the bootstrap machine. | ||
|
|
||
| . The control plane sets up the worker nodes. | ||
|
|
||
| . The control plane installs additional services in the form of a set of Operators. | ||
|
|
||
| . The cluster downloads and configures the remaining components needed for day-to-day operation, including the creation of worker machines in supported environments. |
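One way to watch these stages as they progress is to follow the installation program's own logs; a minimal sketch, assuming the installation assets directory is at hand:

----
# Wait for the bootstrap stage to complete, with verbose logging
$ ./openshift-install wait-for bootstrap-complete --dir=<installation_directory> --log-level=debug

# Review the full installation log afterwards
$ tail -f <installation_directory>/.openshift_install.log
----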
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/troubleshooting-s2i.adoc | ||
|
|
||
| [id="gathering-application-diagnostic-data_{context}"] | ||
| = Gathering application diagnostic data to investigate application failures | ||
|
|
||
| Application failures can occur within running application Pods. In these situations, you can retrieve diagnostic information with these strategies: | ||
|
|
||
| * Review events relating to the application Pods. | ||
| * Review the logs from the application Pods, including application-specific log files that are not collected by the {product-title} logging framework. | ||
| * Test application functionality interactively and run diagnostic tools in an application container. | ||
|
|
||
| .Prerequisites | ||
|
|
||
| * You have access to the cluster as a user with the `cluster-admin` role. | ||
| * You have installed the OpenShift CLI (`oc`). | ||
|
|
||
| .Procedure | ||
|
|
||
| . List events relating to a specific application Pod. The following example retrieves events for an application Pod named `my-app-1-akdlg`: | ||
| + | ||
| ---- | ||
| $ oc describe pod/my-app-1-akdlg | ||
| ---- | ||
|
|
||
| . Review logs from an application Pod: | ||
| + | ||
| ---- | ||
| $ oc logs -f pod/my-app-1-akdlg | ||
| ---- | ||
|
|
||
| . Query specific logs within a running application Pod. Logs that are sent to stdout are collected by the {product-title} logging framework and are included in the output of the preceding command. The following query is only required for logs that are not sent to stdout. | ||
| + | ||
| .. If an application log can be accessed without root privileges within a Pod, concatenate the log file as follows: | ||
| + | ||
| ---- | ||
| $ oc exec my-app-1-akdlg -- cat /var/log/my-application.log | ||
| ---- | ||
| + | ||
| .. If root access is required to view an application log, you can start a debug container with root privileges and then view the log file from within the container. Start the debug container from the project's deployment configuration. Pod users typically run with non-root privileges, but running troubleshooting Pods with temporary root privileges can be useful during issue investigation: | ||
| + | ||
| ---- | ||
| $ oc debug dc/my-deployment-configuration --as-root -- cat /var/log/my-application.log | ||
| ---- | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| You can access an interactive shell with root access within the debug Pod if you run `oc debug dc/<deployment_configuration> --as-root` without appending `-- <command>`. | ||
| ==== | ||
|
|
||
| . Test application functionality interactively and run diagnostic tools in an application container with an interactive shell. | ||
| .. Start an interactive shell on the application container: | ||
| + | ||
| ---- | ||
| $ oc exec -it my-app-1-akdlg -- /bin/bash | ||
| ---- | ||
| + | ||
| .. Test application functionality interactively from within the shell. For example, you can run the container's entry point command and observe the results. Then, test changes from the command line directly, before updating the source code and rebuilding the application container through the S2I process. | ||
| + | ||
| .. Run diagnostic binaries available within the container. | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| Root privileges are required to run some diagnostic binaries. In these situations you can start a debug Pod with root access, based on a problematic Pod's deployment configuration, by running `oc debug dc/<deployment_configuration> --as-root`. Then, you can run diagnostic binaries as root from within the debug Pod. | ||
| ==== | ||
|
|
||
| . If diagnostic binaries are not available within a container, you can run a host's diagnostic binaries within a container's namespace by using `nsenter`. The following example runs `ip ad` within a container's namespace, using the host's `ip` binary. | ||
| .. Enter into a debug session on the target node. This step instantiates a debug Pod called `<node_name>-debug`: | ||
| + | ||
| ---- | ||
| $ oc debug node/my-cluster-node | ||
| ---- | ||
| + | ||
| .. Set `/host` as the root directory within the debug shell. The debug Pod mounts the host's root file system in `/host` within the Pod. By changing the root directory to `/host`, you can run binaries contained in the host's executable paths: | ||
| + | ||
| ---- | ||
| # chroot /host | ||
| ---- | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| {product-title} {product-version} cluster nodes running {op-system-first} are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes using SSH is not recommended and nodes will be tainted as _accessed_. However, if the {product-title} API is not available, or the kubelet is not properly functioning on the target node, `oc` operations will be impacted. In such situations, it is possible to access nodes using `ssh core@<node>.<cluster_name>.<base_domain>` instead. | ||
| ==== | ||
| + | ||
| .. Determine the target container ID: | ||
| + | ||
| ---- | ||
| # crictl ps | ||
| ---- | ||
| + | ||
| .. Determine the container's process ID. In this example, the target container ID is `a7fe32346b120`: | ||
| + | ||
| ---- | ||
| # crictl inspect a7fe32346b120 --output yaml | grep 'pid:' | awk '{print $2}' | ||
| ---- | ||
| + | ||
| .. Run `ip ad` within the container's namespace, using the host's `ip` binary. This example uses `31150` as the container's process ID. The `nsenter` command enters the namespace of a target process and runs a command in its namespace. Because the target process in this example is a container's process ID, the `ip ad` command is run in the container's namespace from the host: | ||
| + | ||
| ---- | ||
| # nsenter -n -t 31150 -- ip ad | ||
| ---- | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| Running a host's diagnostic binaries within a container's namespace is only possible if you are using a privileged container such as a debug node. | ||
| ==== |
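The process ID lookup and the `nsenter` invocation from the preceding steps can also be combined into a single line on the node, as a convenience sketch; the container ID is a placeholder:

----
# nsenter -n -t $(crictl inspect <container_id> --output yaml | grep 'pid:' | awk '{print $2}') -- ip ad
----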
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * support/troubleshooting/troubleshooting-installations.adoc | ||
|
|
||
| [id="gathering-bootstrap-diagnostic-data_{context}"] | ||
| = Gathering bootstrap node diagnostic data | ||
|
|
||
| When experiencing bootstrap-related issues, you can gather `bootkube.service` `journald` unit logs and container logs from the bootstrap node. | ||
|
|
||
| .Prerequisites | ||
|
|
||
| * You have SSH access to your bootstrap node. | ||
| * You have the fully qualified domain name of the bootstrap node. | ||
| * If you are hosting Ignition configuration files by using an HTTP server, you must have the HTTP server's fully qualified domain name and the port number. You must also have SSH access to the HTTP host. | ||
|
|
||
| .Procedure | ||
|
|
||
| . If you have access to the bootstrap node's console, monitor the console until the node reaches the login prompt. | ||
|
|
||
| . Verify the Ignition file configuration. | ||
| + | ||
| * If you are hosting Ignition configuration files by using an HTTP server. | ||
| + | ||
| .. Verify the bootstrap node Ignition file URL. Replace `<http_server_fqdn>` with the HTTP server's fully qualified domain name: | ||
| + | ||
| ---- | ||
| $ curl -I http://<http_server_fqdn>:<port>/bootstrap.ign <1> | ||
| ---- | ||
| <1> The `-I` option returns the header only. If the Ignition file is available at the specified URL, the command returns a `200 OK` status. If it is not available, the command returns a `404 file not found` status. | ||
| + | ||
| .. To verify that the Ignition file was received by the bootstrap node, query the HTTP server logs on the serving host. For example, if you are using an Apache web server to serve Ignition files, enter the following command: | ||
| + | ||
| ---- | ||
| $ grep -is 'bootstrap.ign' /var/log/httpd/access_log | ||
| ---- | ||
| + | ||
| If the bootstrap Ignition file is received, the associated `HTTP GET` log message will include a `200 OK` success status, indicating that the request succeeded. | ||
| + | ||
| .. If the Ignition file was not received, check directly on the serving host that the Ignition files exist and that they have the appropriate file and web server permissions. | ||
| + | ||
| * If you are using a cloud provider mechanism to inject Ignition configuration files into hosts as part of their initial deployment. | ||
| + | ||
| .. Review the bootstrap node's console to determine if the mechanism is injecting the bootstrap node Ignition file correctly. | ||
|
|
||
| . Verify the availability of the bootstrap node's assigned storage device. | ||
|
|
||
| . Verify that the bootstrap node has been assigned an IP address from the DHCP server. | ||
|
|
||
| . Collect `bootkube.service` journald unit logs from the bootstrap node. Replace `<bootstrap_fqdn>` with the bootstrap node's fully qualified domain name: | ||
| + | ||
| ---- | ||
| $ ssh core@<bootstrap_fqdn> journalctl -b -f -u bootkube.service | ||
| ---- | ||
| + | ||
| [NOTE] | ||
| ==== | ||
| The `bootkube.service` log on the bootstrap node outputs etcd `connection refused` errors, indicating that the bootstrap server is unable to connect to etcd on master nodes. After etcd has started on each master node and the nodes have joined the cluster, the errors should stop. | ||
| ==== | ||
| + | ||
| . Collect logs from the bootstrap node containers. | ||
| .. Collect the logs using `podman` on the bootstrap node. Replace `<bootstrap_fqdn>` with the bootstrap node's fully qualified domain name: | ||
| + | ||
| ---- | ||
| $ ssh core@<bootstrap_fqdn> 'for pod in $(sudo podman ps -a -q); do sudo podman logs $pod; done' | ||
| ---- | ||
|
|
||
| . If the bootstrap process fails, verify the following. | ||
| + | ||
| * You can resolve `api.<cluster_name>.<base_domain>` from the installation host. | ||
| * The load balancer proxies port 6443 connections to bootstrap and master nodes. Ensure that the proxy configuration meets {product-title} installation requirements. | ||
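A quick way to exercise both checks from the installation host is to query the API endpoint through the load balancer; a sketch, assuming the API server is already responding. Even an unauthorized response confirms that DNS resolves and that the load balancer forwards port 6443 connections to a responding API server:

----
$ curl -k https://api.<cluster_name>.<base_domain>:6443/version
----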