# IPv6 Disconnected Install Troubleshooting

In this document we analyze common error and failure scenarios that can cause the cluster installation to fail.

## Common failures: basic checks before installing

Below are some precautions we should follow before even attempting an OCP4 IPv6 deployment.

* [IPv4-BM-IPI](https://github.com/openshift-kni/baremetal-deploy/blob/master/install-steps.md#preparing-the-provision-node-for-openshift-install) has some prerequisites which apply to IPv6 deployments as well.

### The architecture for an IPv6 disconnected install (nodes can vary as per requirements)

* Jump host, which acts as the DNSmasq and registry server.
* Worker/provisioning node.
* 3 master nodes.

Before starting the installation, we need to set up a DNSmasq conf file which assigns IPv6 addresses to all the nodes on the `eno3` interface. A sample file is provided below.

* The jump host should have one interface, say `eno1`, on the IPv4 (public) network, and `eno3` on the **private VLAN**; this will be used to provide **IPv6 addresses on the isolated VLAN**.

```
domain-needed
bogus-priv
strict-order
#expand-hosts
bind-dynamic
dhcp-sequential-ip
dhcp-authoritative
dhcp-fqdn
log-queries
domain=<Your_Domain_Name>
resolv-file=/etc/resolv.conf.upstream
interface=eno3
enable-ra
server=<Dns_Server>

dhcp-range=<Ipv6-Range-1>,<IPv6-Range-2>,<Network-Subnet>
dhcp-option=option6:dns-server,[Jump-Hosts-IPv6-IP]
dhcp-option=option6:ntp-server,[::]

local=/<Your_FQDN>/

#static addresses
address=/.apps.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/api.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/ns1.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/registry.<Cluster-Name>.<Domain-Name>/<IPv6-IP-Jump-Host>

#reserved dynamic addresses
dhcp-host=<MAC-Address-eno3-Interface>,master-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-0]
dhcp-host=<MAC-Address-eno3-Interface>,master-1.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-1]
dhcp-host=<MAC-Address-eno3-Interface>,master-2.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-2]
dhcp-host=<MAC-Address-eno3-Interface>,worker-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-worker-0]
```
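
After editing the file, you can syntax-check it and restart the service. This is a small sketch assuming the snippet above lives in `/etc/dnsmasq.conf`:

```
# dnsmasq --test -C /etc/dnsmasq.conf
dnsmasq: syntax check OK.
# systemctl restart dnsmasq
```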

A sample `/etc/resolv.conf.upstream` file is shown below:

```
# Generated by NetworkManager
search <Cluster-Name>.<Domain-Name>
nameserver <IPv6-IP-Jump-Host>
```

A sample `/etc/hosts` file should be configured as shown below:

```
<IPv6-IP-Jump-Host> registry.<Cluster-Name>.<Domain-Name>
```

**Note:** After making all the above file changes, make sure you can resolve all the above hosts using the `nslookup` and `dig` commands.
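
For example, a quick check against the DNSmasq instance could look like this (placeholders as defined in the sample config above):

```
# dig AAAA api.<Cluster-Name>.<Domain-Name> @<IPv6-IP-Jump-Host> +short
# nslookup registry.<Cluster-Name>.<Domain-Name> <IPv6-IP-Jump-Host>
```

Both should return the IPv6 addresses configured above.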

## Create a Disconnected Registry

* [Create a Disconnected Registry](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#installing-restricted-networks-preparations)

**Note:** While mirroring the registry, make sure to add the additional registry.svc.io token to your pull-secret.txt file, otherwise the mirroring process fails. [Add-Token](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#Procedure)
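
As a rough sketch of the merged pull secret (not the authoritative format for your environment), the local registry gets its own entry alongside the existing Red Hat registries; the `auth` value is the base64 encoding of `<user>:<password>` and the email is a placeholder:

```
{
  "auths": {
    "registry.<Cluster-Name>.<Domain-Name>:5000": {
      "auth": "<base64-of-user:password>",
      "email": "<your-email>"
    }
  }
}
```

Keep the existing `quay.io`/`registry.redhat.io` entries from your original pull secret in the same `auths` map.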

After the registry is mirrored, confirm that you can access it in your disconnected environment:

```
# curl -k -6 -u kni:kni https://registry.<Cluster-Name>.<Domain-Name>:5000/v2/_catalog
{"repositories":["<Repo-Name>"]}
```
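
You can also list the tags of a mirrored repository through the registry v2 API (here `<Repo-Name>` is whatever showed up in the catalog above):

```
# curl -k -6 -u kni:kni https://registry.<Cluster-Name>.<Domain-Name>:5000/v2/<Repo-Name>/tags/list
```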

Now we need to verify that the images are served from the local HTTP service (either create a local nginx server or an httpd container service).
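
If you still need to stand up that service, a minimal sketch using Python's built-in HTTP server is shown below, assuming the RHCOS images sit in `/var/www/html`; an nginx or httpd container works just as well:

```
# sudo python3 -m http.server 80 --bind :: --directory /var/www/html
```

With that in place, the verification looks like this: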

```
# curl -I -s http://<Jump-Host>.<Domain-Name>
HTTP/1.1 200 OK

# export RHCOS_OPENSTACK_URI=rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_OPENSTACK_URI | grep HTTP
HTTP/1.1 200 OK

# export RHCOS_QEMU_URI=rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_QEMU_URI | grep HTTP
HTTP/1.1 200 OK
```

## Network Settings

* Since this is a disconnected install, we need two VLANs to be provisioned: `eno1` on the public IPv4 network and `eno3` on the private IPv6 VLAN.

* The worker/provisioning node should have the networking bridges set up as shown below:

```
baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 14:9e:cf:c5:38:19 brd ff:ff:ff:ff:ff:ff
inet6 fd35:919d:4042:2:c7ed:9a9f:a9ec:13/128 scope global dynamic noprefixroute
valid_lft 3494sec preferred_lft 3494sec
inet6 fe80::58bd:46f2:6adc:24b0/64 scope link noprefixroute
valid_lft forever preferred_lft forever
provisioning: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 14:9e:cf:c5:30:a3 brd ff:ff:ff:ff:ff:ff
inet 172.22.0.1/24 brd 172.22.0.255 scope global noprefixroute provisioning
valid_lft forever preferred_lft forever
inet6 fe80::99de:813c:2744:e1/64 scope link noprefixroute
valid_lft forever preferred_lft forever
```
Here `eno1` should be a slave to the `baremetal` bridge and `eno3` to the `provisioning` bridge.
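
If the bridges are not there yet, a minimal sketch with `nmcli` would look like the following, assuming NetworkManager manages the host networking and using the bridge and interface names above:

```
# nmcli con add type bridge ifname baremetal con-name baremetal
# nmcli con add type bridge-slave ifname eno1 master baremetal
# nmcli con add type bridge ifname provisioning con-name provisioning
# nmcli con add type bridge-slave ifname eno3 master provisioning
# nmcli con up baremetal && nmcli con up provisioning
```

Adjust the addressing on each bridge afterwards so it matches the output shown above.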

* Make sure you can reach the VIPs for API, NS, and APPS before starting the deployment.

* Create an install-config file as shown in [Install-config](https://gist.github.com/hardys/30809af4d2d6f89354cec60241b4883d).

* Verify the syntax of the `install-config.yaml` file, and confirm that you can reach the `bootstrapOSImage` and `clusterOSImage` URLs from the jump host.


**Important**: Make sure to power off the master nodes via iDRAC/iLOM before you begin the installation.

Feel free to use the script below:

```
#!/bin/bash

echo "***Power Off master-0,master1,master-2****"
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power off
```
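
You can confirm the state afterwards with `power status`; each node should report that chassis power is off:

```
# ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip> power status
Chassis Power is off
```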

## During the deployment

Create your `metal3-config.yaml` file as shown below:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metal3-config
  namespace: openshift-machine-api
data:
  cache_url: rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz
  deploy_kernel_url: http://172.22.0.1:6180/images/ironic-python-agent.kernel
  deploy_ramdisk_url: http://172.22.0.1:6180/images/ironic-python-agent.initramfs
  dhcp_range: 172.22.0.10,172.22.0.100
  http_port: "6180"
  ironic_endpoint: http://172.22.0.1:6385/v1/
  ironic_inspector_endpoint: http://172.22.0.3:5050/v1/
  provisioning_interface: eno1
  provisioning_ip: 172.22.0.1/24
  rhcos_image_url: http://[Ipv6-IP-Registry]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz
```
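
One common way to get this ConfigMap into the deployment, sketched below under the assumption that `~/clusterconfigs` is the install directory, `openshift-install` is your installer binary, and `99_metal3-config.yaml` is just a naming convention, is to generate the manifests first and copy the file into the generated `openshift/` directory before creating the cluster:

```
# openshift-install --dir ~/clusterconfigs create manifests
# cp metal3-config.yaml ~/clusterconfigs/openshift/99_metal3-config.yaml
# openshift-install --dir ~/clusterconfigs create cluster
```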

After the installation is triggered, check that the bootstrap VM is up and running on the worker/provisioning node:

```
[kni@provisioner]$ sudo virsh list
Id Name State
----------------------------------------------------
12 kni7-xf6fq-bootstrap running
```

If there is no sign of the bootstrap VM running on the provisioner node, check whether libvirtd is running on the provisioner node (if not, start the service).

```
systemctl status libvirtd
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-02-03 13:08:51 EST; 2 weeks 0 days ago
Docs: man:libvirtd(8)
```
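
If it is not running, starting and enabling it is a one-liner:

```
# systemctl enable --now libvirtd
```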

If re-running the deployment, please make sure to delete the older VM, including its volumes, as shown below:

```
[kni@provisioner]$ sudo virsh pool-list
Name State Autostart
-------------------------------------------
default active yes

[kni@provisioner]$ sudo virsh vol-list default
Name Path
------------------------------------------------------------------------------
kni7-xf6fq-bootstrap /var/lib/libvirt/images/kni7-xf6fq-bootstrap
kni7-xf6fq-bootstrap.ign /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap deleted

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign deleted
```

Once this is done, destroy and undefine the VM itself:

```
[kni@provisioner]$ sudo virsh destroy kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap destroyed

[kni@provisioner]$ sudo virsh undefine kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap has been undefined
```

You can get the IP of this VM by issuing `sudo virsh net-dhcp-leases`; the private IP of this VM is usually `172.22.0.2`. You can log in to it using the private SSH key, with `core` as the user:

```
# ssh -i ~/.ssh/id_rsa core@172.22.0.2
```

and then issue the `journalctl` command to inspect the logs:

```
[core@localhost ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2019-12-05 10:23:22 UTC. --
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-0.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-1.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: Error: unhealthy cluster
Feb 05 10:29:51 localhost podman[7743]: 2020-02-05 10:29:51.487999651 +0000 UTC m=+5.434743491 container died a62fd32f5b22e117e1fa5885ae956c5085f3ac5cc47fe88256b8282a9b45695d (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3f6e79c60f60a3a8b8422adf82182de632ba2594b34a5b2e56e3564186efe77, name=etcdctl)
```
Once logged in to this VM, check that all the pods (dnsmasq, mariadb, httpd, the ironic pods, etc.) are in a running state.
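
A quick way to see them at a glance (a minimal example, assuming sudo access on the bootstrap VM):

```
$ sudo podman ps --format "{{.Names}} {{.Status}}"
```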

To check the logs of the containers, use the commands below:

```
sudo podman logs ironic-api
```
Inspecting the ironic pods will show error messages if the master nodes are not powered on.

Once ironic powers on the master nodes, verify that the RHCOS images are pushed to the cluster; once again, inspect the logs of the httpd and ironic pods on the bootstrap VM.

## Bootstrap VM cannot download the RHCOS image

After the bootstrap VM comes up, there is a phase where podman starts two containers:

```
ipa-downloader
coreos-downloader
```

Both of these containers download resources from the jump/bastion host (in our case), since it is a disconnected install. This is the relevant `install-config.yaml` section where this step is configured; modify it so that the provisioner can access the resources:

```
bootstrapOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz?sha256=9d999f55ff1d44f7ed7c106508e5deecd04dc3c06095d34d36bf1cd127837e0c
clusterOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz?sha256=a1bda656fa0892f7b936fdc6b6a6086bddaed5dafacedcd7a1e811abb78fe3b0
```

Check the status of these containers using:

```
# podman logs -f ipa-downloader

# podman logs -f coreos-downloader

```
After the installation is complete, copy the kubeconfig file to `~/.kube/config` to interact with the cluster:

```
cp /install-dir/ocp/auth/kubeconfig ~/.kube/config
```
Or

```
export KUBECONFIG=/install-dir/ocp/auth/kubeconfig
```

Verify that all the control plane nodes are up and running:

```
oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
master-1.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
master-2.kni7.cloud.lab.eng.bos.redhat.com Ready master,worker 4h v1.16.2
```

There might be a scenario where the nodes are stuck in the `NotReady` state. To check the exact errors, log in to the nodes and check the logs:

```
# ssh core@master-0
# crictl ps                        (look for errors in pod status, if any)
# crictl logs <pod-in-error-state>
```
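
The kubelet journal on the node is also worth a look when a node is stuck in `NotReady` (assuming the systemd-managed kubelet used on RHCOS):

```
# journalctl -u kubelet -b --no-pager | tail -n 50
```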
Check for pod errors, if any, in any namespace in the cluster:

```
oc get namespaces
# For example:
oc get pods -n openshift-kube-apiserver
```

## Bootstrap VM cannot boot up the nodes

This is a common problem: maybe there is a typo in the `install-config.yaml` file, or the privileges are not right for the IPMI user. In the worst case, you have a problem with your network prerequisites regarding routing and access to iDRAC from ironic.

To check for problems here, we have three basic containers, all related to ironic:

* `ironic-api`
* `ironic-conductor`
* `ironic-inspector`

The first to enter into action is the conductor, which tries to perform the login and execute the power actions over IPMI. Check the logs there to see if there is a problem on this side:

```
# podman logs -f ironic-conductor
```

The logs will show the errors, if there are any.


We can also follow the deployment process by watching `.openshift_install.log`, which is under the install directory (`kni7/` in this example):

```
# ls -latr kni7/
auth/ .openshift_install.log terraform.baremetal.auto.tfvars.json terraform.tfvars.json
metadata.json .openshift_install_state.json terraform.tfstate tls/
```
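
For example, to follow it live:

```
# tail -f kni7/.openshift_install.log
```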

**Note:** For more details, please refer to [Troubleshooting](https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md).