# IPv6 Disconnected Install Troubleshooting

In this document we analyze common error and failure scenarios that can cause the cluster installation to fail.

## Common failures: basic checks before installing

Below are some precautions we should follow before even attempting an OCP4 IPv6 deployment.

* [IPv4-BM-IPI](https://github.com/openshift-kni/baremetal-deploy/blob/master/install-steps.md#preparing-the-provision-node-for-openshift-install) has some prerequisites which apply to IPv6 deployments as well.

### The architecture for an IPv6 disconnected install (nodes can vary as per requirements)

* Jump host, which acts as the DNSmasq and registry server.
* Worker/provisioning node.
* 3 master nodes.

Before starting the installation, we need to set up a DNSmasq conf file which assigns IPv6 addresses to all the nodes on the `eno3` interface. A sample file is provided below.

* The jump host should have one interface, say `eno1`, on the IPv4 (public) network, and `eno3` on the **private VLAN**; this will be used to provide **IPv6 addresses on the isolated VLAN**.

```
domain-needed
bogus-priv
strict-order
#expand-hosts
bind-dynamic
dhcp-sequential-ip
dhcp-authoritative
dhcp-fqdn
log-queries
domain=<Your_Domain_Name>
resolv-file=/etc/resolv.conf.upstream
interface=eno3
enable-ra
server=<Dns_Server>

dhcp-range=<Ipv6-Range-1>,<IPv6-Range-2>,<Network-Subnet>
dhcp-option=option6:dns-server,[Jump-Hosts-IPv6-IP]
dhcp-option=option6:ntp-server,[::]

local=/<Your_FQDN>/

#static addresses
address=/.apps.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/api.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/ns1.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/registry.<Cluster-Name>.<Domain-Name>/<IPv6-IP-Jump-Host>

#reserved dynamic addresses
dhcp-host=<MAC-Address-eno3-Interface>,master-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-0]
dhcp-host=<MAC-Address-eno3-Interface>,master-1.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-1]
dhcp-host=<MAC-Address-eno3-Interface>,master-2.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-2]
dhcp-host=<MAC-Address-eno3-Interface>,worker-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-worker-0]
```
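
After editing the file, you can syntax-check it and restart the service. This is a small sketch assuming the snippet above lives in `/etc/dnsmasq.conf`:

```
# dnsmasq --test -C /etc/dnsmasq.conf
dnsmasq: syntax check OK.
# systemctl restart dnsmasq
```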

A sample `/etc/resolv.conf.upstream` file is shown below:

```
# Generated by NetworkManager
search <Cluster-Name>.<Domain-Name>
nameserver <IPv6-IP-Jump-Host>
```

A sample `/etc/hosts` file should be configured as shown below:

```
<IPv6-IP-Jump-Host> registry.<Cluster-Name>.<Domain-Name>
```

**Note:** After making all the above file changes, make sure you can resolve all the above hosts using the `nslookup` and `dig` commands.
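
For example, a quick check against the DNSmasq instance could look like this (placeholders as defined in the sample config above):

```
# dig AAAA api.<Cluster-Name>.<Domain-Name> @<IPv6-IP-Jump-Host> +short
# nslookup registry.<Cluster-Name>.<Domain-Name> <IPv6-IP-Jump-Host>
```

Both should return the IPv6 addresses configured above.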

## Create a Disconnected Registry

* [Create a Disconnected Registry](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#installing-restricted-networks-preparations)

**Note:** While mirroring the registry, make sure to add the additional registry.svc.io token to your pull-secret.txt file, otherwise the mirroring process fails. [Add-Token](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#Procedure)
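
As a rough sketch of the merged pull secret (not the authoritative format for your environment), the local registry gets its own entry alongside the existing Red Hat registries; the `auth` value is the base64 encoding of `<user>:<password>` and the email is a placeholder:

```
{
  "auths": {
    "registry.<Cluster-Name>.<Domain-Name>:5000": {
      "auth": "<base64-of-user:password>",
      "email": "<your-email>"
    }
  }
}
```

Keep the existing `quay.io`/`registry.redhat.io` entries from your original pull secret in the same `auths` map.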

After the registry is mirrored, confirm that you can access it in your disconnected environment:

```
# curl -k -6 -u kni:kni https://registry.<Cluster-Name>.<Domain-Name>:5000/v2/_catalog
{"repositories":["<Repo-Name>"]}
```
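
You can also list the tags of a mirrored repository through the registry v2 API (here `<Repo-Name>` is whatever showed up in the catalog above):

```
# curl -k -6 -u kni:kni https://registry.<Cluster-Name>.<Domain-Name>:5000/v2/<Repo-Name>/tags/list
```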

Now we need to verify that the images are served from the local HTTP service (either create a local nginx server or an httpd container service).
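
If you still need to stand up that service, a minimal sketch using Python's built-in HTTP server is shown below, assuming the RHCOS images sit in `/var/www/html`; an nginx or httpd container works just as well:

```
# sudo python3 -m http.server 80 --bind :: --directory /var/www/html
```

With that in place, the verification looks like this: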

```
# curl -I -s http://<Jump-Host>.<Domain-Name>
HTTP/1.1 200 OK

# export RHCOS_OPENSTACK_URI=rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_OPENSTACK_URI | grep HTTP
HTTP/1.1 200 OK

# export RHCOS_QEMU_URI=rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_QEMU_URI | grep HTTP
HTTP/1.1 200 OK
```

## Network Settings

* Since this is a disconnected install, we need two VLANs to be provisioned: `eno1` on the public IPv4 network and `eno3` on the private IPv6 VLAN.

* The worker/provisioning node should have the networking bridges set up as shown below:

```
baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 14:9e:cf:c5:38:19 brd ff:ff:ff:ff:ff:ff
inet6 fd35:919d:4042:2:c7ed:9a9f:a9ec:13/128 scope global dynamic noprefixroute
valid_lft 3494sec preferred_lft 3494sec
inet6 fe80::58bd:46f2:6adc:24b0/64 scope link noprefixroute
valid_lft forever preferred_lft forever
provisioning: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 14:9e:cf:c5:30:a3 brd ff:ff:ff:ff:ff:ff
inet 172.22.0.1/24 brd 172.22.0.255 scope global noprefixroute provisioning
valid_lft forever preferred_lft forever
inet6 fe80::99de:813c:2744:e1/64 scope link noprefixroute
valid_lft forever preferred_lft forever
```
Here `eno1` should be a slave to the `baremetal` bridge and `eno3` to the `provisioning` bridge.
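
If the bridges are not there yet, a minimal sketch with `nmcli` would look like the following, assuming NetworkManager manages the host networking and using the bridge and interface names above:

```
# nmcli con add type bridge ifname baremetal con-name baremetal
# nmcli con add type bridge-slave ifname eno1 master baremetal
# nmcli con add type bridge ifname provisioning con-name provisioning
# nmcli con add type bridge-slave ifname eno3 master provisioning
# nmcli con up baremetal && nmcli con up provisioning
```

Adjust the addressing on each bridge afterwards so it matches the output shown above.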

* Make sure you can reach the VIPs for API, NS, and APPS before starting the deployment.

* Create an install-config file as shown in [Install-config](https://gist.github.com/hardys/30809af4d2d6f89354cec60241b4883d).

* Verify the syntax of the `install-config.yaml` file, and confirm that you can reach the `bootstrapOSImage` and `clusterOSImage` URLs from the jump host.


**Important**: Make sure to power off the master nodes via iDRAC/iLOM before you begin the installation.

Feel free to use the script below:

```
#!/bin/bash

echo "***Power Off master-0,master1,master-2****"
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
```
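
You can confirm the state afterwards with `power status`; each node should report that chassis power is off:

```
# ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power status
Chassis Power is off
```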

## During the deployment

Create your `metal3-config.yaml` file as shown below:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metal3-config
  namespace: openshift-machine-api
data:
  cache_url: rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz
  deploy_kernel_url: http://172.22.0.1:6180/images/ironic-python-agent.kernel
  deploy_ramdisk_url: http://172.22.0.1:6180/images/ironic-python-agent.initramfs
  dhcp_range: 172.22.0.10,172.22.0.100
  http_port: "6180"
  ironic_endpoint: http://172.22.0.1:6385/v1/
  ironic_inspector_endpoint: http://172.22.0.3:5050/v1/
  provisioning_interface: eno1
  provisioning_ip: 172.22.0.1/24
  rhcos_image_url: http://[Ipv6-IP-Registry]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz
```
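
One common way to get this ConfigMap into the deployment, sketched below under the assumption that `~/clusterconfigs` is the install directory, `openshift-install` is your installer binary, and `99_metal3-config.yaml` is just a naming convention, is to generate the manifests first and copy the file into the generated `openshift/` directory before creating the cluster:

```
# openshift-install --dir ~/clusterconfigs create manifests
# cp metal3-config.yaml ~/clusterconfigs/openshift/99_metal3-config.yaml
# openshift-install --dir ~/clusterconfigs create cluster
```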

After the installation is triggered, check that the bootstrap VM is up and running on the worker/provisioning node:

```
[kni@provisioner]$ sudo virsh list
Id Name State
----------------------------------------------------
12 kni7-xf6fq-bootstrap running
```

If there is no sign of the bootstrap VM running on the provisioner node, check whether libvirtd is running on the provisioner node (if not, start the service).

```
systemctl status libvirtd
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-02-03 13:08:51 EST; 2 weeks 0 days ago
Docs: man:libvirtd(8)
```
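
If it is not running, starting and enabling it is a one-liner:

```
# systemctl enable --now libvirtd
```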

If re-running the deployment, please make sure to delete the older VM, including its volumes, as shown below:

```
[kni@provisioner]$ sudo virsh pool-list
Name State Autostart
-------------------------------------------
default active yes

[kni@provisioner]$ sudo virsh vol-list default
Name Path
------------------------------------------------------------------------------
kni7-xf6fq-bootstrap /var/lib/libvirt/images/kni7-xf6fq-bootstrap
kni7-xf6fq-bootstrap.ign /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap deleted

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign deleted
```

Once this is done, destroy and undefine the VM itself:

```
[kni@provisioner]$ sudo virsh destroy kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap destroyed

[kni@provisioner]$ sudo virsh undefine kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap has been undefined
```

You can get the IP of this VM by issuing `sudo virsh net-dhcp-leases`; the private IP of this VM is usually `172.22.0.2`. You can log in to it using the private SSH key, with `core` as the user:

```
# ssh -i ~/.ssh/id_rsa core@172.22.0.2
```

and then issue the `journalctl` command to inspect the logs:

```
[core@localhost ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2019-12-05 10:23:22 UTC. --
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-0.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-1.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: Error: unhealthy cluster
Feb 05 10:29:51 localhost podman[7743]: 2020-02-05 10:29:51.487999651 +0000 UTC m=+5.434743491 container died a62fd32f5b22e117e1fa5885ae956c5085f3ac5cc47fe88256b8282a9b45695d (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3f6e79c60f60a3a8b8422adf82182de632ba2594b34a5b2e56e3564186efe77, name=etcdctl)
```
Once logged in to this VM, check that all the pods (dnsmasq, mariadb, httpd, the ironic pods, etc.) are in a running state.
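
A quick way to see them at a glance (a minimal example, assuming sudo access on the bootstrap VM):

```
$ sudo podman ps --format "{{.Names}} {{.Status}}"
```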

To check the logs of the containers, use the commands below:

```
sudo podman logs ironic-api
```
Inspecting the ironic pods will show error messages if the master nodes are not powered on.

Once ironic powers on the master nodes, verify that the RHCOS images are pushed to the cluster; once again, inspect the logs of the httpd and ironic pods on the bootstrap VM.

## Bootstrap VM cannot download the RHCOS image

After the bootstrap VM comes up, there is a phase where podman starts two containers:

```
ipa-downloader
coreos-downloader
```

Both of these containers download resources from the jump/bastion host (in our case), since it is a disconnected install. This is the relevant `install-config.yaml` section where this step is configured; modify it so that the provisioner can access the resources:

```
bootstrapOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz?sha256=9d999f55ff1d44f7ed7c106508e5deecd04dc3c06095d34d36bf1cd127837e0c
clusterOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz?sha256=a1bda656fa0892f7b936fdc6b6a6086bddaed5dafacedcd7a1e811abb78fe3b0
```

Check the status of these containers using:

```
# podman logs -f ipa-downloader

# podman logs -f coreos-downloader

```
After the installation is complete, copy the kubeconfig file to `~/.kube/config` to interact with the cluster:

```
cp /install-dir/ocp/auth/kubeconfig ~/.kube/config
```
Or

```
export KUBECONFIG=/install-dir/ocp/auth/kubeconfig
```

Verify that all the control plane nodes are up and running:

```
oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
master-1.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
master-2.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
```

There might be a scenario where the nodes are stuck in the `NotReady` state. To check the exact errors, log in to the nodes and check the logs:

```
# ssh core@master-0
# crictl ps                        (look for errors in pod status, if any)
# crictl logs <pod-in-error-state>
```
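
The kubelet journal on the node is also worth a look when a node is stuck in `NotReady` (assuming the systemd-managed kubelet used on RHCOS):

```
# journalctl -u kubelet -b --no-pager | tail -n 50
```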
Check for pod errors, if any, in any namespace in the cluster:

```
oc get namespaces
# For example:
oc get pods -n openshift-kube-apiserver
```

## Bootstrap VM cannot boot up the nodes

This is a common problem: maybe there is a typo in the `install-config.yaml` file, or the privileges are not right for the IPMI user. In the worst case, you have a problem with your network prerequisites regarding routing and access to iDRAC from ironic.

To check for problems here, we have three basic containers, all related to ironic:

* `ironic-api`
* `ironic-conductor`
* `ironic-inspector`

The first to enter into action is the conductor, which tries to perform the login and execute the power actions over IPMI. Check the logs there to see if there is a problem on this side:

```
# podman logs -f ironic-conductor
```

The logs will show the errors, if there are any.


We can also follow the deployment process by watching `.openshift_install.log`, which is under the install directory (`kni7/` in this example):

```
# ls -latr kni7/
auth/ .openshift_install.log terraform.baremetal.auto.tfvars.json terraform.tfvars.json
metadata.json .openshift_install_state.json terraform.tfstate tls/
```
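
For example, to follow it live:

```
# tail -f kni7/.openshift_install.log
```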

**Note:** For more details, please refer to [Troubleshooting](https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md).