Help Request: PXE freezes during node deployment #7461

Open
speedymiata opened this issue Aug 1, 2024 · 8 comments

@speedymiata

I'm trying to use xcat to deploy rhel 8.9 onto a compute node, but the compute node fails to finish booting at this point:

Configuring (net0 ac:1f:6b:bc:db:ec)...... ok
net0: 192.168.32.12/255.255.240.0 gw 192.168.47.245
net0: fe80::ae1f:6bff:febc:dbec/64
Next server: 192.168.47.245
Filename: http://192.168.47.245:80/tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi
http://192.168.47.245:80/tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi... ok
192.168.32.0_20.uefi : 304 bytes [script]
http://192.168.47.245:80/tftpboot/xcat/genesis.kernel.x86_64... ok
http://192.168.47.245:80/tftpboot/xcat/genesis.fs.x86_64.gz... ok
  • The PXE process starts with the NIC getting an IP address.
  • The node retrieves genesis.kernel and genesis.fs.
  • At this point, the node freezes, and does not produce any more output.
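The console output above reports a netmask of 255.255.240.0, which is a /20 prefix and is where the `192.168.32.0_20` in the script name comes from. As a minimal sketch of that arithmetic (the helper name `network_of` is made up for illustration and is not part of xCAT), the network address is the bitwise AND of each address octet with the matching mask octet:

```shell
#!/bin/sh
# network_of IP MASK
# Hypothetical helper: prints the network address for a dotted-quad IP and mask.
network_of() {
    ip=$1 mask=$2
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $ip $mask
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_of 192.168.32.12 255.255.240.0    # prints 192.168.32.0
network_of 192.168.84.248 255.255.240.0   # prints 192.168.80.0
```

The second call assumes the same mask applies to the address currently stored in the node's `ip` attribute; under that assumption it lands in a different network from the one the PXE lease came from.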

During this process, I see this on the xcat head node:

[root@xcat_adm ~]# xcatprobe osdeploy -n cn01
The install NIC in current server is ib0                                                                       [INFO]
All nodes to be deployed are valid                                                                             [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[cn01] 13:56:23 Receive DHCPDISCOVER via ens2f0
[cn01] 13:56:24 Send DHCPOFFER on 192.168.32.72 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 DHCPREQUEST for 192.168.32.72 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 Send DHCPACK on 192.168.32.72 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 Via TFTP download xcat/xnba.efi
[cn01] 13:56:27 Via TFTP download xcat/xnba.efi
[cn01] 13:56:30 Receive DHCPDISCOVER via ens2f0
[cn01] 13:56:31 Send DHCPOFFER on 192.168.32.12 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:31 DHCPREQUEST for 192.168.32.12 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:31 Send DHCPACK on 192.168.32.12 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/genesis.kernel.x86_64
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/genesis.fs.x86_64.gz
[cn01] 13:57:23 Receive DHCPDISCOVER via ens2f0
[cn01] 13:57:24 Send DHCPOFFER on 192.168.32.28 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:57:24 DHCPREQUEST for 192.168.32.28 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:57:24 Send DHCPACK on 192.168.32.28 back to ac:1f:6b:bc:db:ec via ens2f0

I still have a lot to learn about xcat, so I'll be extremely grateful for any and all help that's offered.

Additional information:

[root@xcat_adm ~]# lsdef -t node cn01
Object name: cn01
    arch=x86_64
    bmc=192.168.36.48
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install rhels8.9.0-x86_64-compute
    getmac=ipmi
    hostnames=cn01
    ip=192.168.84.248
    mac=ac:1f:6b:bc:db:ec
    mgt=ipmi
    netboot=xnba
    nicips.ib0=192.168.84.248
    nicips.ipmi=192.168.36.48
    nicips.eno1=192.168.36.248
    nicnetworks.eno1=ipmi-net
    nicnetworks.ib0=ib-net
    nictypes.eno1=Ethernet
    nictypes.ib0=InfiniBand
    os=rhels8.9.0
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    profile=compute
    provmethod=rhels8.9.0-x86_64-install-compute
    serialport=1
    serialspeed=115200
    status=powering-on
    statustime=08-01-2024 13:54:21
[root@xcat_adm ~]# lsdef -t osimage rhels8.9.0-x86_64-install-compute
Object name: rhels8.9.0-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osdistroname=rhels8.9.0-x86_64
    osname=Linux
    osvers=rhels896.0
    partitionfile=s:/install/custom/partitionfile/rhels8.9.0-x86_64-install-compute_partitions.sh
    pkgdir=/install/rhels8.9.0/x86_64
    pkglist=/install/custom/pkglist/rhel8-pkglist-compute.pkglist
    postscripts=custom/rhel-8.9-postscript-compute.sh
    profile=compute
    provmethod=install
    template=/install/custom/template/rhels8.9.0-x86_64-install-compute.tmpl
@speedymiata
Author

It's been a few days since I posted this help request. Is there another forum I should repost the request to?

@Obihoernchen
Member

Obihoernchen commented Aug 4, 2024

Your config seems to be a little bit weird.

Do you really want to deploy the image via IPoIB?

    ip=192.168.84.248
    mac=ac:1f:6b:bc:db:ec
    nicips.ib0=192.168.84.248

The MAC address appears to belong to the Ethernet device, but you specify an IPoIB IP for the node.
ip needs to be your eno1 IP, not the ib0 one, so that it belongs to the same interface as the MAC address you specify.

What does makedhcp -q cn01 show?
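The mismatch described above can be checked mechanically. The following is only a sketch: `check_install_ip` is a hypothetical helper, not an xCAT command, and it assumes the `attribute=value` layout that `lsdef` prints elsewhere in this thread. It reads that output on stdin and compares the node's `ip` attribute with the `nicips.<nic>` entry for the interface meant for provisioning:

```shell
#!/bin/sh
# check_install_ip NIC   (reads lsdef-style "attr=value" lines on stdin)
# Hypothetical helper: succeeds only when ip equals nicips.NIC.
check_install_ip() {
    nic=$1
    awk -F= -v nic="$nic" '
        { gsub(/^[ \t]+/, "", $1) }          # drop the indentation lsdef adds
        $1 == "ip"          { ip = $2 }
        $1 == "nicips." nic { nicip = $2 }
        END {
            if (ip != "" && ip == nicip) {
                print "ok: ip=" ip " matches nicips." nic
                exit 0
            }
            print "mismatch: ip=" ip " but nicips." nic "=" nicip
            exit 1
        }'
}

# Possible usage on the management node:
#   lsdef cn01 | check_install_ip eno1
```

With the definition posted in the first comment, this would report a mismatch, because `ip` carries the ib0 address.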

@speedymiata
Author

I do not want to deploy the image via IPoIB. That's the joy of an inherited system, right there - I want to use the Ethernet network for image deployment.

makedhcp -q cn01 shows:

[root@xcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec

I also went ahead and executed chdef cn01 ip=192.168.36.248 to try to get the system to deploy the image over the Ethernet network, but this didn't have the desired effect. The node still hangs at the same point in the boot process. What else do I need to do, to switch to Ethernet?

@Obihoernchen
Member

Did you run nodeset/rinstall cn01 osimage=rhels8.9.0-x86_64-install-compute afterwards?
Furthermore, you should make sure the node boots via Ethernet first, or disable the IB PXE ROM.
You may also want to disable DHCP for your IPoIB network by setting site.dhcpinterfaces to your management node's Ethernet interface.
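As a rough sketch of how those suggestions might be strung together on the management node: the wrapper names `run` and `reprovision` are invented here, and `ens2f0` is an assumption taken from the interface named in the xcatprobe output earlier in the thread. The script only prints the commands unless `DRY_RUN=0` is set, so the sequence can be reviewed before anything is changed.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a real xCAT management node to run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reprovision NODE OSIMAGE MGMT_IFACE   (hypothetical wrapper)
reprovision() {
    node=$1 image=$2 iface=$3
    run chdef -t site clustersite "dhcpinterfaces=$iface"  # serve DHCP on this interface only
    run makedhcp -n                                        # regenerate dhcpd.conf
    run makedhcp "$node"                                   # add the node's DHCP entry again
    run makedhcp -q "$node"                                # show what was registered
    run rinstall "$node" "osimage=$image"                  # start the OS deployment
}

# Example (prints the five commands):
#   reprovision cn01 rhels8.9.0-x86_64-install-compute ens2f0
```

Whether restricting `dhcpinterfaces` is appropriate depends on what else relies on DHCP on the IB network, so treat the first command as optional.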

@speedymiata
Author

I used rinstall, yes, and I followed it up with xcatprobe osdeploy -n cn01. I'm also using rcons to manually select the Eth interface as the boot device - it is most certainly starting with it first.

After running rinstall, the makedhcp command's output still hasn't changed. It still shows the IB interface's IP of .84.248.

@Obihoernchen
Member

Oh sorry, yes, you need to run makedhcp cn01 first.
Then makedhcp -q cn01 should show the correct IP.

@speedymiata
Author

This seems odd. After running makedhcp cn01, re-running makedhcp -q cn01 does not show any change. The IB address is still present.

But according to my lsdef for this node, I've set ip to the Ethernet interface's address. Are there any other items I should check?

[root@xcat_adm ~]# lsdef  cn01
Object name: cn01
    arch=x86_64
    bmc=192.168.36.48
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install rhels8.6.0-x86_64-compute
    getmac=ipmi
    hostnames=cn01
    installnic=ac:1f:6b:bc:db:ec
    ip=192.168.36.248
    mac=ac:1f:6b:bc:db:ec
    mgt=ipmi
    netboot=xnba
    nicips.ib0=192.168.84.248
    nicips.ipmi=192.168.36.48
    nicips.eno1=192.168.36.248
    nicnetworks.eno1=ipmi-net
    nicnetworks.ib0=ib-net
    nictypes.eno1=Ethernet
    nictypes.ib0=InfiniBand
    os=rhels8.6.0
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    profile=compute
    provmethod=rhels8.6.0-x86_64-install-compute-
    serialport=1
    serialspeed=115200
    status=powering-on

@speedymiata
Author

Here's what happened while "playing" with the makedhcp command after referencing the man page:

[root@wxcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec
[root@wxcat_adm ~]# makedhcp -d  cn01
[root@wxcat_adm ~]# makedhcp -q  cn01
[root@wxcat_adm ~]# makedhcp -n  cn01
Renamed existing dhcp configuration file to  /etc/dhcp/dhcpd.conf.xcatbak

Warning: [wxcat_adm]: No dynamic range specified for 192.168.80.0. If hardware discovery is being used, a dynamic range is required.
[root@wxcat_adm ~]# makedhcp -q  cn01
[root@wxcat_adm ~]# makedhcp   cn01
[root@wxcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec

I'll admit that I still have a lot to learn about xcat, but it still seems quite strange that it's not "picking up" the IP address I've specified in the node definition. Is there something I have to refresh? Apply?
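To make the before/after comparison in the transcript above less error-prone, here is a small sketch that pulls the address out of a `makedhcp -q` line and checks it against the expected one. The helper names `dhcp_ip_of` and `lease_matches` are hypothetical, and the parsing assumes the exact `ip-address = ..., hardware-address = ...` layout shown in this thread.

```shell
#!/bin/sh
# dhcp_ip_of: reads `makedhcp -q` style lines on stdin, prints the ip-address field.
dhcp_ip_of() {
    sed -n 's/^[^:]*: ip-address = \([0-9.]*\),.*/\1/p'
}

# lease_matches EXPECTED_IP: succeeds only if the entry on stdin carries that address.
lease_matches() {
    [ "$(dhcp_ip_of)" = "$1" ]
}

# Possible usage on the management node:
#   makedhcp -q cn01 | lease_matches 192.168.36.248 || echo "DHCP entry still differs"
#   getent hosts cn01    # what the node name currently resolves to
```

The `getent hosts cn01` line is included on the assumption, not confirmed anywhere in this thread, that the DHCP entry may be built from how the node name resolves on the management node rather than directly from the `ip` attribute. If the name still resolves to the ib0 address, that would be consistent with the unchanged output.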
