# Adding IPv6 Troubleshooting guide #3088
Status: Closed

Commits (DirectedSoul1):

* e7b62a9 Adding IPv6 Troubleshooting guide
* d8eb40f Update Troubleshooting_Ipv6.md
* 722d353 Rename Troubleshooting_Ipv6.md to docs/user/metal/Troubleshooting_Ipv…
* 092b9d6 Update and rename Troubleshooting_Ipv6.md to troubleshooting_Ipv6.md
* 8aefbb3 Update troubleshooting_Ipv6.md

# IPv6 Disconnected Install Troubleshooting

In this document we will analyze common scenarios of errors and failures that can cause the cluster installation to fail.

## Common failures: basic checks before installing

Below are some precautions to follow before attempting an OCP4 IPv6 deployment.

* [Ipv4-BM-IPI](https://github.com/openshift-kni/baremetal-deploy/blob/master/install-steps.md#preparing-the-provision-node-for-openshift-install) lists prerequisites that apply to IPv6 deployments as well.

### The architecture for an IPv6 disconnected install (nodes can vary as per requirements)

> **Reviewer note (Member):** These seem specific to a development/testing environment. I don't think I'd want these documented here.

* A jump host, which acts as the DNSmasq server and registry.
* A worker/provisioning node.
* Three master nodes.

Before starting the installation, we need to set up a DNSmasq conf file that assigns IPv6 addresses to all the nodes on the `eno3` interface. A sample file is provided below.

* The jump host should have one interface, say `eno1`, on the IPv4 (public) network, and `eno3` on the **private VLAN**; the latter is used to provide **IPv6 addresses on the isolated VLAN**.

```
domain-needed
bogus-priv
strict-order
#expand-hosts
bind-dynamic
dhcp-sequential-ip
dhcp-authoritative
dhcp-fqdn
log-queries
domain=<Your_Domain_Name>
resolv-file=/etc/resolv.conf.upstream
interface=eno3
enable-ra
server=<Dns_Server>

dhcp-range=<Ipv6-Range-1>,<IPv6-Range-2>,<Network-Subnet>
dhcp-option=option6:dns-server,[Jump-Hosts-IPv6-IP]
dhcp-option=option6:ntp-server,[::]

local=/<Your_FQDN>/

#static addresses
address=/.apps.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/api.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/ns1.<Cluster-Name>.<Domain-Name>/<Ipv6-IP>
address=/registry.<Cluster-Name>.<Domain-Name>/<IPv6-IP-Jump-Host>

#reserved dynamic addresses
dhcp-host=<MAC-Address-eno3-Interface>,master-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-0]
dhcp-host=<MAC-Address-eno3-Interface>,master-1.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-1]
dhcp-host=<MAC-Address-eno3-Interface>,master-2.<Cluster-Name>.<Domain-Name>,[IPv6-IP-master-2]
dhcp-host=<MAC-Address-eno3-Interface>,worker-0.<Cluster-Name>.<Domain-Name>,[IPv6-IP-worker-0]
```

A sample `/etc/resolv.conf.upstream` file is shown below:

```
# Generated by NetworkManager
search <Cluster-Name>.<Domain-Name>
nameserver <IPv6-IP-Jump-Host>
```

A sample `/etc/hosts` entry should be configured as shown:

```
<IPv6-IP-Jump-Host> registry.<Cluster-Name>.<Domain-Name>
```

**Note:** After making all of the above file changes, make sure you can resolve all of the above hosts using the `nslookup` and `dig` commands, as in the sketch below.

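For example, a quick sanity check (illustrative commands; the hostnames are the same placeholders used above):

```
# Both should return the IPv6 addresses configured in dnsmasq
nslookup api.<Cluster-Name>.<Domain-Name>
dig AAAA registry.<Cluster-Name>.<Domain-Name> +short
```
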
## Create a Disconnected Registry

* [Create a Disconnected Registry](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#installing-restricted-networks-preparations)

**Note:** While mirroring the registry, make sure to add the additional registry.svc.io token to your pull-secret.txt file, or the mirroring process will fail. [Add-Token](https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#Procedure)

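One possible way to merge such a token into the pull secret is with `jq`; this is only a sketch, and `<registry-host>` and `<base64-credentials>` are hypothetical placeholders, not values from this guide:

```
# Add an extra auth entry to the pull secret (placeholders are hypothetical)
jq '.auths["<registry-host>"] = {"auth": "<base64-credentials>"}' \
   pull-secret.txt > pull-secret-updated.txt
```
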
After the registry is mirrored, confirm that you can access it in your disconnected environment:

```
# curl -k -6 -u kni:kni https://registry.<Cluster-Name>.<Domain-Name>:5000/v2/_catalog
{"repositories":["<Repo-Name>"]}
```

Now, verify that the images are served from the local HTTP service (either a local nginx server or an httpd container service):

```
# curl -I -s http://<Jump-Host>.<Domain-Name>
HTTP/1.1 200 OK

# export RHCOS_OPENSTACK_URI=rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_OPENSTACK_URI | grep HTTP
HTTP/1.1 200 OK

# export RHCOS_QEMU_URI=rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz

# curl -I -s http://<Jump-Host>.<Domain-Name>/$RHCOS_QEMU_URI | grep HTTP
HTTP/1.1 200 OK
```

## Network Settings

* Since this is a disconnected install, we need two VLANs to be provisioned: `eno1` on the public IPv4 network and `eno3` on the private IPv6 VLAN.

* The worker/provisioning node should have networking bridges set up as below:

```
baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 14:9e:cf:c5:38:19 brd ff:ff:ff:ff:ff:ff
    inet6 fd35:919d:4042:2:c7ed:9a9f:a9ec:13/128 scope global dynamic noprefixroute
       valid_lft 3494sec preferred_lft 3494sec
    inet6 fe80::58bd:46f2:6adc:24b0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
provisioning: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 14:9e:cf:c5:30:a3 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.1/24 brd 172.22.0.255 scope global noprefixroute provisioning
       valid_lft forever preferred_lft forever
    inet6 fe80::99de:813c:2744:e1/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
Here `eno1` should be a slave to the `baremetal` bridge and `eno3` to the `provisioning` bridge.

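If the bridges do not yet exist, here is a minimal sketch for creating them with `nmcli`, assuming the interface and bridge names shown above:

```
# Create the baremetal bridge and enslave eno1 to it
nmcli con add type bridge ifname baremetal con-name baremetal
nmcli con add type bridge-slave ifname eno1 master baremetal

# Create the provisioning bridge and enslave eno3 to it
nmcli con add type bridge ifname provisioning con-name provisioning
nmcli con add type bridge-slave ifname eno3 master provisioning
```
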
* Make sure you can reach the VIPs of API, NS, and APPS before starting the deployment (see the reachability sketch after this list).

* Create an install-config file as in this [Install-config](https://gist.github.com/hardys/30809af4d2d6f89354cec60241b4883d) example.

* Verify the syntax of the install-config.yaml file, and confirm that you can reach the `bootstrapOSImage` and `clusterOSImage` URLs from the jump host.

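A hedged reachability sketch for the VIPs (the hostnames are placeholders; `test.apps` is an arbitrary name that should resolve via the wildcard `.apps` record):

```
# Each VIP should answer before the deployment is started
ping6 -c 3 api.<Cluster-Name>.<Domain-Name>
ping6 -c 3 ns1.<Cluster-Name>.<Domain-Name>
ping6 -c 3 test.apps.<Cluster-Name>.<Domain-Name>
```
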
**Important**: Make sure to power off the master nodes via iDRAC/iLOM before you begin the installation.

Feel free to use the script below:

```
#!/bin/bash

echo "*** Power off master-0, master-1, master-2 ***"
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip-master-0> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip-master-1> power off
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip-master-2> power off
```
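
You can confirm the result with the same tool (same placeholders as in the script above):

```
# Should report "Chassis Power is off" for each master
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip-master-0> power status
```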

## During the deployment

Create your metal3-config.yaml file as shown below:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metal3-config
  namespace: openshift-machine-api
data:
  cache_url: rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz
  deploy_kernel_url: http://172.22.0.1:6180/images/ironic-python-agent.kernel
  deploy_ramdisk_url: http://172.22.0.1:6180/images/ironic-python-agent.initramfs
  dhcp_range: 172.22.0.10,172.22.0.100
  http_port: "6180"
  ironic_endpoint: http://172.22.0.1:6385/v1/
  ironic_inspector_endpoint: http://172.22.0.3:5050/v1/
  provisioning_interface: eno1
  provisioning_ip: 172.22.0.1/24
  rhcos_image_url: http://[Ipv6-IP-Registry]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz
```
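
Later, a hedged way to confirm the ConfigMap is in place, once the cluster API responds:

```
# Inspect the metal3 ConfigMap as applied in the cluster
oc get configmap metal3-config -n openshift-machine-api -o yaml
```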

After the installation is triggered, check that the bootstrap VM is up and running on the provisioning node:

```
[kni@provisioner]$ sudo virsh list
 Id    Name                    State
----------------------------------------------------
 12    kni7-xf6fq-bootstrap    running
```

If there is no sign of the bootstrap VM running on the provisioner node, check whether libvirtd is running there (if not, start the service as shown below):

```
systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-02-03 13:08:51 EST; 2 weeks 0 days ago
     Docs: man:libvirtd(8)
```
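
If it is not running, it can be started (and enabled across reboots) with a standard systemd command:

```
sudo systemctl enable --now libvirtd
```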

If re-running the deployment, make sure to delete all of the older VMs, including their volumes, as shown below:

```
[kni@provisioner]$ sudo virsh pool-list
 Name       State    Autostart
-------------------------------------------
 default    active   yes

[kni@provisioner]$ sudo virsh vol-list default
 Name                       Path
------------------------------------------------------------------------------
 kni7-xf6fq-bootstrap       /var/lib/libvirt/images/kni7-xf6fq-bootstrap
 kni7-xf6fq-bootstrap.ign   /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap deleted

[kni@provisioner]$ sudo virsh vol-delete --vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign
Vol /var/lib/libvirt/images/kni7-xf6fq-bootstrap.ign deleted
```

Once this is done, destroy and undefine the bootstrap VM:

```
[kni@provisioner]$ sudo virsh destroy kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap destroyed

[kni@provisioner]$ sudo virsh undefine kni7-xf6fq-bootstrap
Domain kni7-xf6fq-bootstrap has been undefined
```

You can get the IP of this VM by issuing `sudo virsh net-dhcp-leases` (the command takes the libvirt network name as an argument); the private IP of this VM is usually `172.22.0.2`. You can log in to it using the private SSH key, with `core` as the user:

```
# ssh -i ~/.ssh/id_rsa core@172.22.0.2
```

and then issue the `journalctl` command to inspect the logs:

```
[core@localhost ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2019-12-05 10:23:22 UTC. --
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-0.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: https://etcd-1.kni7.cloud.lab.eng.bos.redhat.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Feb 05 10:29:51 localhost bootkube.sh[3119]: Error: unhealthy cluster
Feb 05 10:29:51 localhost podman[7743]: 2020-02-05 10:29:51.487999651 +0000 UTC m=+5.434743491 container died a62fd32f5b22e117e1fa5885ae956c5085f3ac5cc47fe88256b8282a9b45695d (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3f6e79c60f60a3a8b8422adf82182de632ba2594b34a5b2e56e3564186efe77, name=etcdctl)
```
Once logged in to this VM, check that all the pods (dnsmasq, mariadb, httpd, ironic, etc.) are in a running state.

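A quick hedged way to check this (the exact container names can vary by release):

```
# List running containers and their status on the bootstrap VM
sudo podman ps --format "{{.Names}} {{.Status}}"
```
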
To check the logs of any of these containers, use a command like the one below:

```
sudo podman logs ironic-api
```
Inspecting the ironic pods will show error messages if the master nodes are not powered on.

Once Ironic powers on the master nodes, verify that the RHCOS images are pushed to the cluster; once again, inspect the httpd and ironic pod logs on the bootstrap VM.

## Bootstrap VM cannot download the RHCOS image

After the bootstrap VM comes up, there is a phase where podman starts two containers:

```
ipa-downloader
coreos-downloader
```

Both of these containers download resources from the jump/bastion host (in our case), since this is a disconnected install. The relevant install-config section where this step is configured is shown below; modify it so the provisioner can access the resources:

```
bootstrapOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-qemu.x86_64.qcow2.gz?sha256=9d999f55ff1d44f7ed7c106508e5deecd04dc3c06095d34d36bf1cd127837e0c
clusterOSImage: http://[<IPv6-IP-Registry>]/rhcos-43.81.202001142154.0-openstack.x86_64.qcow2.gz?sha256=a1bda656fa0892f7b936fdc6b6a6086bddaed5dafacedcd7a1e811abb78fe3b0
```

Check the status of these containers using:

```
# podman logs -f ipa-downloader

# podman logs -f coreos-downloader
```
After the installation is complete, copy the kubeconfig file to `~/.kube/config` to interact with the cluster:

```
cp /install-dir/ocp/auth/kubeconfig ~/.kube/config
```

Or

```
export KUBECONFIG=/install-dir/ocp/auth/kubeconfig
```

Verify that all the control plane nodes are up and running:

```
oc get nodes
NAME                                         STATUS   ROLES           AGE   VERSION
master-0.kni7.cloud.lab.eng.bos.redhat.com   Ready    master,worker   4h    v1.16.2
master-1.kni7.cloud.lab.eng.bos.redhat.com   Ready    master,worker   4h    v1.16.2
master-2.kni7.cloud.lab.eng.bos.redhat.com   Ready    master,worker   4h    v1.16.2
```

There might be a scenario where the nodes are stuck in the `NotReady` state. To find the exact errors, log in to the nodes and check the logs:

```
# ssh core@master-0
# crictl ps            (look for errors in pod status, if any)
# crictl logs <pod-in-error-state>
```
Check for pod errors, if any, in any namespace in the cluster (a cluster-wide shortcut follows the example below):

```
oc get namespaces
oc get pods -n openshift-kube-apiserver    # for example
```
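
A hedged one-liner that surfaces problem pods across every namespace at once:

```
# List pods that are not Running or Completed in any namespace
oc get pods --all-namespaces | grep -Ev 'Running|Completed'
```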

## Bootstrap VM cannot boot up my nodes

This is a common problem: maybe there is a typo in the install-config.yaml file, or the privileges on your user are not right. In the worst case, you have a problem with your network prerequisites regarding routing and access to iDRAC from Ironic.
To check the problems, there are three basic containers, all related to Ironic:
* ironic-api
* ironic-conductor
* ironic-inspector

The first to enter into action is the conductor, which tries to perform the login and execute the action over IPMI; check the logs there to see if there is a problem on this side.

```
# podman logs -f ironic-conductor
```

The logs will show any errors, if there are any. You can also reproduce Ironic's IPMI access by hand, as in the sketch below.

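A hedged manual check from the bootstrap VM, reusing the credential placeholders from the power-off script earlier:

```
# Verify that the BMC answers over IPMI from here
ipmitool -I lanplus -U root -P <Password> -H <iDRAC-ipv6-ip-master-0> chassis status
```
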
We can also follow the deployment process by watching `.openshift_install.log`, which will be under the install-dir folder:

```
# ls -latr kni7/
auth/           .openshift_install.log          terraform.baremetal.auto.tfvars.json   terraform.tfvars.json
metadata.json   .openshift_install_state.json   terraform.tfstate                      tls/
```
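
To follow it live, something like this (using the example install dir listed above):

```
tail -f kni7/.openshift_install.log
```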

**Note:** For more details, please refer to the installer [Troubleshooting](https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md) guide.

> **Reviewer comment:** Link is stale.