'Failed to parse network config' error when using net.toml with VMware Variant #3685

Open
justdan96 opened this issue Dec 28, 2023 · 13 comments
Labels
type/bug Something isn't working type/support User support related issues.

Comments

@justdan96

justdan96 commented Dec 28, 2023

Image I'm using:
vmware-k8s-1.24-v1.17.0

What I expected to happen:
Provide a net.toml file with the below contents, allowing the image to use static IP addressing to come online. I have also used eno1 instead of the MAC address:

version = 3

["00:50:56:a6:8f:91".static4]
primary = true
enabled = true
addresses = ["10.0.0.156/22"]

[["00:50:56:a6:8f:91".route]]
to = "default"
via = "10.0.0.1"
route-metric = 100

What actually happened:
The OS fails to start up correctly, with the following error message:

netdog[2212]: Unable to read/parse network config from '/var/lib/bottlerocket/net.toml': Failed to parse network config: data did not match any variant of untagged enum NetworkDeviceV1

How to reproduce the problem:
Spin up a VMware VM from the OVA. Use a live CD to add a net.toml file to the 12th partition with the contents above, assuming they are correct. Remove the CD and boot the VM.

It initially appears as if Bottlerocket can't find the network interface correctly, but as it happens so early in the boot process I'm not even sure how to troubleshoot this further. I tried addressing by interface name and MAC address and neither works.

@justdan96 justdan96 added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Dec 28, 2023
@yeazelm
Contributor

yeazelm commented Dec 28, 2023

Hello @justdan96, it looks like you might be missing the version = 3 at the top of the file. This cryptic message is a known issue: #2657. What this error basically means is that netdog, which parses net.toml, can't figure out what type of device you are specifying. I think it might just be the version setting at the top. The rest of your configuration looks correct.
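
For readers unfamiliar with the "untagged enum" wording, here is a minimal Python model of why this style of parsing tends to produce such a generic message. It is not netdog's code: the variant names and key sets are hypothetical and chosen only for illustration. The point is that each candidate shape is tried in turn, and once all of them fail the individual reasons are no longer available.

```python
# Hypothetical device shapes: each variant lists the keys it accepts.
# These names and key sets are illustrative only, not netdog's real schema.
VARIANTS = {
    "Dhcp": {"dhcp4", "dhcp6"},
    "Static": {"static4", "route"},
}


def match_variant(device: dict) -> str:
    """Return the first variant whose allowed keys cover the device table."""
    for name, allowed in VARIANTS.items():
        if set(device) <= allowed:
            return name
    # Every variant was tried; the individual mismatch reasons are lost here.
    raise ValueError("data did not match any variant of untagged enum")


def explain_mismatch(device: dict) -> dict:
    """Report, per variant, which keys prevented a match."""
    return {
        name: sorted(set(device) - allowed)
        for name, allowed in VARIANTS.items()
    }
```

In this model a single unexpected key is enough to reject every variant, and only a helper like `explain_mismatch` (also hypothetical) would say which key was responsible.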

@yeazelm yeazelm added type/support User support related issues. and removed status/needs-triage Pending triage or re-evaluation labels Dec 28, 2023
@justdan96
Author

I might have just missed the version string when I copied to GitHub; I'll check that tomorrow morning.

@justdan96
Author

justdan96 commented Dec 29, 2023

@yeazelm I've edited my top comment, the file does indeed specify version = 3 at the top.

@yeazelm
Contributor

yeazelm commented Dec 29, 2023

I believe the issue is the enabled = true. enabled is only valid for DHCP; a static configuration is assumed to be enabled when it is present. Removing that line should resolve the issue for your configuration.

@justdan96
Author

I think maybe I have got a little further by removing enabled = true - now the error is:

thar-be-settings[2084]: Restart command failed - 'netdog write-resolv-conf': Unable to read '/var/lib/netdog/primary_mac_address': No such file or directory (os error 2)

@yeazelm
Contributor

yeazelm commented Dec 30, 2023

The file that it is saying doesn't exist is written here:

write_primary_interface(&primary_interface)?;
but the error is coming from https://github.com/bottlerocket-os/bottlerocket/blob/develop/sources/api/netdog/src/cli/mod.rs#L84 where it can't find the primary interface file (either interface name or MAC address). This wouldn't normally be possible if the network configuration was written correctly. Can you post more output? I'm curious to know if generate-network-config actually still failed but made it far enough to get to this point. Do you get the same error if you use eno1?

@justdan96
Author

justdan96 commented Jan 2, 2024

The error above was with eno1, I can try again with a MAC address. The output scrolls pretty fast so I'm not sure how much more I can transcribe but I'll give it a go.

Is there any way to enable debug logging for the early boot process? Or any way to get the list of possible network interfaces printed out during boot?

@justdan96
Author

By recording the screen and stepping through the footage I can see the following sequence of messages; I've removed some that I don't think are relevant:

[ OK ] Started ACPI event daemon
[ 2.367888] netdog[1940]: Unable to read/parse network config: data did not match any variant of untagged enum NetworkDeviceV1
[FAILED] Failed to start Generate network configuration
See 'systemctl status generate-network-config.service' for details
[DEPEND] Dependency failed for Preparation for Network
 Starting wicked DHCPv4 supplicant service...
 Starting wicked DHCPv4 supplicant service...
...
[ OK ] Started wicked DHCPv4 supplicant service
[ OK ] Started wicked DHCPv4 supplicant service
...
[ OK ] Finished Disable kexec load syscalls.
[ 3.203628] thar-be-settings[2048]: Restart command failed - 'netdog write-resolv-conf': Unable to read '/var/lib/netdog/primary_mac_address': No such file or directory (os error 2)
[FAILED] Failed to start Applies settings to create config files.
See 'systemctl status settings-applier.service' for details.
[DEPEND] Dependency failed for Bottlerocket initial configuration complete.
[DEPEND] Dependency failed for Isolates configured.target
[DEPEND] Dependency failed for Sets the hostname.

Just in case it is my user data I'll paste the redacted version of that below:

[settings]
motd = "This is a Bottlerocket VM."

[settings.network]
hostname = "server-lab"

[settings.kubernetes]
pod-infra-container-image = "rancher/pause:3.6"
cloud-provider = "external"
allowed-unsafe-sysctls = ["net.ipv4.tcp_tw_reuse"]
api-server = "https://rancher-lab.example.com/k8s/clusters/c-lllll"
authentication-mode = "tls"
cluster-dns-ip = "10.0.0.10"
server-tls-bootstrap = false
cluster-certificate = "..."
bootstrap-token = "..."
node-ip = "10.0.0.156"

[settings.host-containers.admin]
enabled = true
user-data = "..."

[settings.host-containers.control]
enabled = false

[settings.dns]
name-servers = ["10.0.0.71", "10.0.0.72"]
search-list = ["example.com"]

[settings.kernel.sysctl]
"net.ipv4.tcp_tw_reuse" = "1"

[settings.ntp]
time-servers = ["ntp-1.example.com", "ntp-2.example.com", "ntp-3.example.com"]

[settings.metrics]
send-metrics = false
metrics-url = "https://metrics.bottlerocket.aws/v1/metrics"
service-checks = ["apiserver", "chronyd", "containerd", "host-containerd", "kubelet", "vmtoolsd"]
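
One quick consistency check between the two files can be sketched with Python's `ipaddress` module. It assumes that `node-ip` is meant to equal the static address from net.toml and that the default gateway should sit inside the same subnet; both are assumptions about intent, not requirements stated in this thread, and the function name is made up.

```python
import ipaddress


def check_addressing(static_cidr: str, gateway: str, node_ip: str) -> list[str]:
    """Return human-readable inconsistencies between net.toml and user data."""
    interface = ipaddress.ip_interface(static_cidr)
    issues = []
    if ipaddress.ip_address(gateway) not in interface.network:
        issues.append(f"gateway {gateway} is outside {interface.network}")
    if ipaddress.ip_address(node_ip) != interface.ip:
        issues.append(
            f"node-ip {node_ip} differs from static address {interface.ip}"
        )
    return issues


# Values taken from the net.toml and user data posted in this issue.
print(check_addressing("10.0.0.156/22", "10.0.0.1", "10.0.0.156"))
```

For the values posted here the function returns an empty list, so the addressing itself looks self-consistent.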

@justdan96
Author

I noticed that net.ifnames=0 is set for the VMware variants, so the interface name wouldn't be eno1. Unfortunately, eth0 and the MAC address aren't working either.
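
On a running Linux system, whether that parameter is active can be confirmed from the kernel command line. A minimal Python sketch, assuming an environment where Python is available; the function names are made up for illustration:

```python
def predictable_names_disabled(cmdline: str) -> bool:
    """True when the kernel command line carries net.ifnames=0."""
    return "net.ifnames=0" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `predictable_names_disabled(read_cmdline())` on the booted system then reports whether the classic eth0-style naming is expected.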

@yeazelm
Contributor

yeazelm commented Jan 2, 2024

Thanks @justdan96 for that link. It reminded me of the critical piece here which is the next line configuring netdog. That should be forcing on DHCP for the first interface. I don't believe a net.toml will work in this case. I was caught off guard that it looks to be attempting to read it but the logic should be forcing DHCP.

@justdan96
Author

I guess we can say that static IP addressing for the VMware variant is just not possible. That's very disappointing for us, since we won't be in a position to have a DHCP server in place for our VMware environments for maybe another year.

@yeazelm
Contributor

yeazelm commented Jan 2, 2024

I took another look at https://github.com/bottlerocket-os/bottlerocket/blob/develop/sources/api/netdog/src/cli/mod.rs#L182, which parses this configuration. Theoretically it should be able to use that net.toml file, but I don't know if we have tested using a net.toml in VMware. From reading the code, I'm hopeful it might work, but we would need it to be able to parse that file correctly.

> Is there any way to enable debug logging for the early boot process? Or any way to get the list of possible network interfaces printed out during boot?

This has been a pain point with debugging Bottlerocket on new installs, and the issue I linked above (#2657) only solves half the problem. Getting that first boot can be challenging. One method that might work for you is to build the vmware-dev variant, which has a shell on the console. This allows you to look around a bit while troubleshooting what is going on. You would probably need to enable DHCP on that first interface, if you can, in a test VM, which would let you boot without issues and get to the console. You'd also want to strip out as much user-data as you can to avoid that tripping things up. I usually just provide a motd:

[settings]
motd = "This is a Bottlerocket VM."

Once it boots, you could see what interfaces are present and what they are named.
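
As an illustration of what to look for, the sketch below lists interface names and MAC addresses from sysfs. It assumes a Linux environment where Python is available (for example the live CD mentioned earlier); on the console itself the same information can be read directly from the files under /sys/class/net. The function name is made up. Interface names depend on the naming settings of whichever system is booted, while MAC addresses do not.

```python
from pathlib import Path


def list_interfaces(sys_net: str = "/sys/class/net") -> dict[str, str]:
    """Map each interface name to its MAC address as exposed by sysfs."""
    result = {}
    for entry in sorted(Path(sys_net).iterdir()):
        address_file = entry / "address"
        if address_file.is_file():
            result[entry.name] = address_file.read_text().strip()
    return result
```

Printing `list_interfaces()` gives the candidate keys for net.toml, either by name or by MAC address.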

> I guess we can say that static IP addressing for the VMware variant is just not possible.

I'm not super familiar with VMware configurations, but I believe you can also set up the configuration to provide statically assigned IP addresses via DHCP to each VM individually, so you could "trick" the VM into getting DHCP when it's actually static. This is dependent on a lot of the VMware configuration but might be a workaround to consider.

For what it is worth:

[ 2.367888] netdog[1940]: Unable to read/parse network config: data did not match any variant of untagged enum NetworkDeviceV1
[FAILED] Failed to start Generate network configuration

This is stating that the file is invalid, so it's still not parsing correctly, which is back to the original error. One trick I've tried is getting this error to go away with something simple for eth0 and then "stepping" through a few interface names to see if that helps. Nonetheless, wicked may fight us here if it's reading DHCP from the kernel command line but being told something different from netdog. I'd have to dive into the code to see if that would even work.

We still might not get you to a working state even with these workarounds but hopefully this helps.

@justdan96
Author

I removed the netdog line from the kernel parameters in the Cargo.toml, built a new variant and then tried to deploy from that new OVA with the net.toml file. The untagged enum error appears using eth0 as the interface, but I am pretty sure the interface name is correct as in the startup log I can see:

[ 1.814755] vmxnet3 0000:03:00.0 eth0: NIC Link is Up 10000 Mbps

For now I'll have to park this bit of work. I guess until something like https://docs.rs/serde-untagged/latest/serde_untagged/ is integrated into netdog we won't know why it isn't recognising the interface, and I don't want to continue down a rabbit hole of investigation if this is a setup that isn't supported.
