VNet route conflict diag: Handle changes to interface index #56936

Draft
ravicious wants to merge 2 commits into master from r7s/route-conflict-idx
@ravicious
Member

The problem

A customer shared a video showing a problem with VNet. Throughout the video, VPN software running on the device seems to be constantly reconnecting. At one point in the video, the VNet diag report shows this:

[Image: VNet diag report]

It means that the diag check classified 4.0.0.0/6 and 8.0.0.0/5 as route destinations belonging to VNet's utun5 interface. However, the report also clearly shows that VNet uses 100.64.0.0/10 as its IPv4 CIDR range. This means that it'd never set up routes like 4.0.0.0/6 (the routes are set up in lib/vnet/osconfig_darwin.go).

The diag check collects route destinations as follows:

  1. Receive the network interface name as an argument (utun5 in this case).
  2. Get the interface index that corresponds to this interface name.
  3. Fetch all routes currently set up on the system. Group them into two buckets: ones that have interface index equal to that from step 2 (thus belong to VNet) and ones that don't.
  4. Check if there are any destinations in the other group that overlap with VNet destinations.
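The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual diag check code: the `route` and `conflict` types and the `interfaceIndex`, `mustRoute`, and `findConflicts` helpers are hypothetical names, and the routing table is passed in as a plain slice rather than read from the system. Only `net.InterfaceByName` and the `net/netip` prefix helpers come from the Go standard library.

```go
package main

import (
	"net"
	"net/netip"
)

// route is a hypothetical, simplified view of one routing-table entry.
type route struct {
	dest    netip.Prefix
	ifIndex int
}

// conflict pairs a destination attributed to VNet with an overlapping
// destination that belongs to some other interface.
type conflict struct {
	vnetDest, otherDest netip.Prefix
}

// interfaceIndex covers step 2: resolve an interface name such as "utun5"
// to its current index.
func interfaceIndex(name string) (int, error) {
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return 0, err
	}
	return iface.Index, nil
}

// mustRoute builds a route from a CIDR string and panics on invalid input.
// It exists only to keep the examples short.
func mustRoute(cidr string, ifIndex int) route {
	return route{dest: netip.MustParsePrefix(cidr), ifIndex: ifIndex}
}

// findConflicts covers steps 3 and 4: split routes into those whose
// interface index equals vnetIfIndex and all the others, then report every
// overlapping pair of destinations across the two groups.
func findConflicts(routes []route, vnetIfIndex int) []conflict {
	var vnetRoutes, otherRoutes []route
	for _, r := range routes {
		if r.ifIndex == vnetIfIndex {
			vnetRoutes = append(vnetRoutes, r)
		} else {
			otherRoutes = append(otherRoutes, r)
		}
	}

	var conflicts []conflict
	for _, v := range vnetRoutes {
		for _, o := range otherRoutes {
			if v.dest.Overlaps(o.dest) {
				conflicts = append(conflicts, conflict{vnetDest: v.dest, otherDest: o.dest})
			}
		}
	}
	return conflicts
}
```

Note that the grouping trusts the index completely. If the index passed to `findConflicts` is stale and now belongs to another interface, that interface's routes (for example 4.0.0.0/6) end up in the VNet bucket, which would produce a report like the one described above.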

The only explanation for the behavior from the video that we can think of is the index of VNet's interface suddenly changing. I don't know how this could happen; all I could find on the internet was a single SO question where someone mentions a similar problem. We suspect it might have to do with the VPN software constantly reconnecting and rebuilding its network interface, but I wasn't able to reproduce this with Tailscale, for example. The VPN software the customer uses must be doing something extra.

The fix

In the current version of VNet, there are two situations where this mismatch could theoretically happen:

  • If the diag check doesn't find any destinations that belong to VNet, it sleeps for 500ms and then fetches destinations again, up to two times. This is because the diag check runs soon after VNet starts and it might take a moment for the admin process of VNet to set up those routes. However, between those retries, the diag check doesn't refetch the interface index, assuming it won't change.
    • This can be fixed by always refetching the interface index.
  • Even if the diag check never retries fetching destinations, the interface index could potentially change between steps 2 and 3 of the logic described above.
    • To fix this, we can check whether the VNet routes identified by the diag check indeed belong to the IPv4 CIDR ranges used by VNet, which come from the vnet_config resource of each root cluster.
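A rough sketch of how the two fixes could fit together is shown below. It is an illustration under stated assumptions, not the code in this PR: `route`, `mustRoute`, `mustPrefixes`, `withinAny`, `vnetDests`, and `errIndexMismatch` are hypothetical names, the index lookup and route fetch are injected as functions so the logic can be exercised without touching the system, and returning an error on a mismatch is just one possible way to react. Skipping non-IPv4 destinations is also an assumption, made because the description above only mentions IPv4 CIDR ranges.

```go
package main

import (
	"errors"
	"net/netip"
	"time"
)

// route is a hypothetical, simplified view of one routing-table entry.
type route struct {
	dest    netip.Prefix
	ifIndex int
}

// errIndexMismatch reports that routes attributed to VNet lie outside
// VNet's configured CIDR ranges, which suggests the index was stale.
var errIndexMismatch = errors.New("routes attributed to VNet are outside its CIDR ranges")

// mustRoute and mustPrefixes only keep the examples short.
func mustRoute(cidr string, ifIndex int) route {
	return route{dest: netip.MustParsePrefix(cidr), ifIndex: ifIndex}
}

func mustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// withinAny reports whether dest is fully contained in one of the ranges.
func withinAny(dest netip.Prefix, ranges []netip.Prefix) bool {
	for _, r := range ranges {
		if r.Bits() <= dest.Bits() && r.Contains(dest.Addr()) {
			return true
		}
	}
	return false
}

// vnetDests returns the destinations attributed to VNet's interface.
func vnetDests(
	lookupIndex func() (int, error),
	fetchRoutes func() ([]route, error),
	cidrRanges []netip.Prefix,
	attempts int,
	delay time.Duration,
) ([]netip.Prefix, error) {
	for attempt := 0; attempt < attempts; attempt++ {
		if attempt > 0 {
			time.Sleep(delay)
		}
		// First fix: refetch the interface index on every attempt
		// instead of assuming it stays the same between retries.
		idx, err := lookupIndex()
		if err != nil {
			return nil, err
		}
		routes, err := fetchRoutes()
		if err != nil {
			return nil, err
		}
		var dests []netip.Prefix
		for _, r := range routes {
			if r.ifIndex == idx {
				dests = append(dests, r.dest)
			}
		}
		if len(dests) == 0 {
			continue // Routes may not be set up yet; try again.
		}
		// Second fix: verify that the IPv4 destinations really fall
		// within VNet's CIDR ranges (assumption: others are skipped).
		for _, d := range dests {
			if d.Addr().Is4() && !withinAny(d, cidrRanges) {
				return nil, errIndexMismatch
			}
		}
		return dests, nil
	}
	return nil, nil
}
```

With the values from the report, a call using `mustPrefixes("100.64.0.0/10")` as the CIDR ranges would reject a bucket containing 4.0.0.0/6 or 8.0.0.0/5 instead of reporting those destinations as VNet routes.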

@ravicious added the no-changelog, backport/branch/v17, and backport/branch/v18 labels on Jul 18, 2025
Comment thread lib/vnet/osconfig_windows.go
@ravicious force-pushed the r7s/route-conflict-idx branch from df665c1 to 7df18bd on July 22, 2025 14:10