Possible configd problem: Sections of config.xml related to NAT are not evaluated anymore #7562

Closed
noseshimself opened this issue Jun 27, 2024 · 14 comments
Labels: support Community support
@noseshimself

noseshimself commented Jun 27, 2024


Describe the bug
After an involuntary update of the master node of an HA cluster from OPNsense 24.1.5_3-amd64 to OPNsense 24.1.9_4-amd64, all 1:1 NAT settings were gone from the running system (gone as in: the 1:1 NAT overview in the GUI is empty), although the relevant parts of the configuration file are still in the correct positions and completely intact:

  <onetoone>
     <external>217.7.50.198</external>
     <category/>
     <descr>nextcloud.gerstel.com</descr>
     <interface>lan</interface>
     <type>binat</type>
     <source>
       <address>192.168.111.8</address>
     </source>
     <destination>
       <any>1</any>
     </destination>
   </onetoone>

The slave system, still running OPNsense 24.1.5_3-amd64, is still working after pushing the configuration over, so I'm assuming the syntax was still sufficiently correct even after the upgrade.

Tip: to validate your setup was working with the previous version, use opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert)

As the bug is stopping the production of a large number of internet-facing servers we had to demote the master to slave and are now using the backup system as master via CARP.

With python being upgraded from 3.9 to 3.11 reverting seems to be an extremely impractical solution, too.

To Reproduce

Steps to reproduce the behavior:

  1. Set up (1:1?) NAT rules
  2. Verify them to be working
  3. Upgrade
  4. See error

Expected behavior

NAT rules being applied to the running system upon booting.

Describe alternatives you considered

Crying loudly. Seeing the rules still being available on the backup system, stopping to cry and failing over.

Relevant log files

The log file got considerably larger after the update but I can't seem to find anything relevant.

-rw-------   1 root  wheel   510405 Jun 27 01:33 configd_20240627.log
-rw-------   1 root  wheel  5386277 Jun 26 23:59 configd_20240626.log
-rw-------   1 root  wheel   264174 Jun 25 23:59 configd_20240625.log
-rw-------   1 root  wheel   263266 Jun 24 23:59 configd_20240624.log
-rw-------   1 root  wheel   264333 Jun 23 23:59 configd_20240623.log

Environment

User-Agent Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
FreeBSD 13.2-RELEASE-p11 stable/24.1-n255023-99a14409566 SMP amd64
OPNsense 24.1.9_4 908aac0
Plugins os-OPNProxy-1.0.5_1 os-OPNWAF-1.1 os-OPNcentral-1.7 os-dmidecode-1.1_1 os-dyndns-1.27_3 os-lldpd-1.1_2 os-maltrail-1.10 os-nextcloud-backup-1.0_1 os-redis-1.1_2 os-rfc2136-1.8_2 os-shadowsocks-1.1 os-squid-1.0_2 os-sunnyvalley-1.4_3 os-theme-cicada-1.35 os-theme-rebellion-1.8.10 os-theme-tukan-1.27_1 os-theme-vicuna-1.45_1 os-vnstat-1.3_1
Time Thu, 27 Jun 2024 01:08:27 +0000
OpenSSL 3.0.14
Python 3.11.9
PHP 8.2.20

@fichtner
Member

You need to upgrade the slave as well.

@fichtner fichtner added the support Community support label Jun 27, 2024
@noseshimself
Author

Sorry, no.

If you were right, turning off the slave and only running the updated master node should make the problem disappear. It does not:

  1. Turn off both routers.
  2. Turn on gw-ext-1.
  3. No 1:1 NAT.

@fichtner
Member

I don’t have your setup nor a way to support you through community support to assess your current config.xml state.

@noseshimself
Author

> I don’t have your setup nor a way to support you through community support to assess your current config.xml state.

I could of course send the current config.xml to you; the problem is testing it, as I do not have an identical device I can take offline and load with the current configuration, or I would already have done that for verification.

All I really need is someone loading that config.xml into a current OPNsense at factory defaults and checking whether the 1:1 NAT mappings are there, to verify whether this is a config problem or a firmware problem.

Or tell me how I can opnsense-revert down to OPNsense 24.1.5_3-amd64 without the Python downgrade killing me on the way...

@AdSchellevis AdSchellevis self-assigned this Jun 28, 2024
@AdSchellevis AdSchellevis added bug Production bug and removed support Community support labels Jun 28, 2024
@noseshimself
Author

Thank you for getting me several steps ahead...

The issue title probably needs changing; it is a migration problem.

One of them is negligible: the Shadowsocks migration is failing; I'll split that one off, see #7578.

The problem referred to here is logged as:

<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="14"] [OPNsense\Firewall\Filter:npt.rule.d8addb07-6908-4e73-84c2-3ff93be2af91.destination_net] Please specify a valid network segment or IP address.{2003:4e:6010::b:217.7.50.193/128}
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="15"] Model OPNsense\Firewall\Filter can't be saved, skip ( OPNsense\Base\ValidationException: [OPNsense\Firewall\Filter:npt.rule.d8addb07-6908-4e73-84c2-3ff93be2af91.destination_net] Please specify a valid network segment or IP address.{2003:4e:6010::b:217.7.50.193/128}
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="16"]  in /usr/local/opnsense/mvc/app/models/OPNsense/Base/BaseModel.php:649
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="17"] Stack trace:
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="18"] #0 /usr/local/opnsense/mvc/app/models/OPNsense/Base/BaseModel.php(774): OPNsense\Base\BaseModel->serializeToConfig()
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="19"] #1 /usr/local/opnsense/mvc/script/run_migrations.php(54): OPNsense\Base\BaseModel->runMigrations()
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="20"] #2 {main} )

The config.xml has been cleaned; all lines with passwords are gone, so HA/CARP migration might fail. Don't worry, that part is working.
config.txt
(Sorry for naming it .txt; I needed to pass it through the security checkpoint and forgot to change it back.)

@AdSchellevis
Member

@noseshimself if you remove the entry with 2003:4e:6010::b:217.7.50.193/128 the issue might be solved, it's indeed bad input data. At first glance I don't expect a bug yet.

If you fetch the old overview page using:

curl -o /usr/local/www/firewall_nat_1to1.php https://raw.githubusercontent.com/opnsense/core/stable/23.7/src/www/firewall_nat_1to1.php

you should be able to remove the item via the then-available firewall_nat_1to1.php page on the box, then trigger the migration again.

Don't forget to remove the old file when you're done.

@AdSchellevis AdSchellevis added support Community support and removed bug Production bug labels Jun 28, 2024
@noseshimself
Author

The only occurrences of "2003:4e:6010" in config.xml are

      <gateway_item uuid="e7d69e13-3aac-4503-8f28-4a99bf68838e">
        <disabled>0</disabled>
        <name>GW_TBusinessConnect_IPv6</name>
        <descr>Default-Router im T-BusinessConnect (IPv6)</descr>
        <interface>lan</interface>
        <ipprotocol>inet6</ipprotocol>
        <gateway>2003:4e:6010::1</gateway>
        <defaultgw>1</defaultgw>
        <fargw>0</fargw>
        <monitor_disable>1</monitor_disable>
        <monitor_noroute>0</monitor_noroute>
        <monitor/>
        <force_down>0</force_down>
        <priority>255</priority>
        <weight>1</weight>
        <latencylow/>
        <latencyhigh/>
        <losslow/>
        <losshigh/>
        <interval/>
        <time_period/>
        <loss_interval/>
        <data_length/>
      </gateway_item>
    <LAN>
      <if>igb3</if>
      <descr>SYNC</descr>
      <enable>1</enable>
      <lock>1</lock>
      <spoofmac/>
      <ipaddr>192.168.2.1</ipaddr>
      <subnet>30</subnet>
    </LAN>
    <lan>
      <if>igb1_vlan11</if>
      <descr>TBusinessConnect</descr>
      <enable>1</enable>
      <lock>1</lock>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>217.7.50.229</ipaddr>
      <subnet>29</subnet>
      <ipaddrv6>2003:4E:6010::B:217.7.50.229</ipaddrv6>
      <subnetv6>48</subnetv6>
      <gatewayv6>GW_TBusinessConnect_IPv6</gatewayv6>
    </lan>

And to be honest, this is already irritating me, because I can't find any entries from the system administrators in the ticket system telling me where the static IPv6 address came from, or why the interfaces are tagged as "LAN" and "lan", or why the sync connection is "LAN"...

The slave router doesn't have this in its configuration.

There is nothing referring to it in the NAT section at all:

    <onetoone>
      <external>217.7.50.193</external>
      <descr>proxy.gerstel.com</descr>
      <interface>lan</interface>
      <type>binat</type>
      <source>
        <address>192.168.111.30</address>
      </source>
      <destination>
        <any>1</any>
      </destination>
    </onetoone>

and using the old version of the page does not show anything IPv6-y that can be fixed at all.

@Monviech
Member

Monviech commented Jun 28, 2024

Sorry for butting in, but theoretically this is a valid IPv6 address.

It is called an IPv6 address with an embedded IPv4 address:
2003:4e:6010::b:217.7.50.193

The last 32 bits are allowed to be written in this dotted-quad form.

https://datatracker.ietf.org/doc/html/rfc4291#section-2.2 check the third example.
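
For what it's worth, generic parsers accept this notation too; a minimal check with Python's ipaddress module (using the value flagged in the configd log above) shows that it parses as a syntactically valid IPv6 host address, whatever OPNsense's own model validation makes of it:

import ipaddress

# Value flagged by the migration, taken from the configd log above.
addr = ipaddress.ip_address("2003:4e:6010::b:217.7.50.193")
print(addr.exploded)   # 2003:004e:6010:0000:0000:000b:d907:32c1

# The same value with its /128 suffix also parses as a host network.
net = ipaddress.ip_network("2003:4e:6010::b:217.7.50.193/128")
print(net)             # 2003:4e:6010::b:d907:32c1/128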

Edit: Oops this is about 1:1 NAT, sorry xD. Just realized.

@noseshimself
Author

You did not read the problem description: nobody set a static IPv6 address on that interface (I can't find any documentation related to it), and even if somebody had done so, there should not be an implicit NAT rule that cannot be migrated in a section that did not contain any rules before the migration.

Besides that: this is the outward-facing interface, and it does not seem to be a good idea to add an IPv6 address there that was not assigned by the provider (and I know which IPv6 block is assigned there).

@noseshimself
Author

noseshimself commented Jun 28, 2024

> @noseshimself if you remove the entry with 2003:4e:6010::b:217.7.50.193/128 the issue might be solved, it's indeed bad input data. At first glance I don't expect a bug yet.

I was looking in the wrong place, but I guess this has to be added to #7578 and I should have read the error message.

If I run the migration script on the configuration of the (still working) slave, something (the Shadowsocks migration?) is adding this:

    <npt>
      <category/>
      <descr>proxy.gerstel.com</descr>
      <interface>lan</interface>
      <source>
        <address>FC47:5253:544C::6F:192.168.111.30</address>
      </source>
      <destination>
        <address>2003:4E:6010::B:217.7.50.193</address>
      </destination>
    </npt>

and as we did not do anything with IPv6 there, I never expected rules to show up in that section. I just checked all configuration backups back to 2022 and found the first daily change where the IPv6 entries started showing up, but I can't ask the responsible admin anymore -- he left.

After removing the npt-related section from config.xml and rerunning the migration:

root@gw-ext-1:/conf # /usr/local/opnsense/mvc/script/run_migrations.php
*** OPNsense\Shadowsocks\Local Migration failed, check log for details
Migrated OPNsense\Firewall\Filter from 0.0.0 to 1.0.4

If I reapply run_migrations.php to the config file from before the upgrade, I get the same error messages again, so the root cause of the problem is the npt entry, which had never caused any problems before (not even annoying messages like "hey, I was told to set up an IPv6 NAT rule for an address that is nowhere to be found").
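
For anyone hitting the same migration failure, a minimal sketch to list the npt entries and their addresses before removing the offending one by hand. It assumes config.xml has first been copied to /tmp; the element names are taken from the snippets quoted above, not from the OPNsense model:

import xml.etree.ElementTree as ET

# Work on a copy of the configuration, never the live /conf/config.xml.
tree = ET.parse("/tmp/config.xml")
root = tree.getroot()

# Print every <npt> entry with its description and addresses so the
# offending rule can be identified before editing the real file.
for npt in root.iter("npt"):
    descr = npt.findtext("descr", default="")
    src = npt.findtext("source/address", default="")
    dst = npt.findtext("destination/address", default="")
    print(f"npt rule {descr!r}: source={src} destination={dst}")

After editing the real config.xml, rerunning /usr/local/opnsense/mvc/script/run_migrations.php as above should show whether the validation error is gone.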

@AdSchellevis
Member

@noseshimself case closed then?

@noseshimself
Author

I would say so, but someone has to find out where the npt entry came from. The routers in question were not doing anything with the IPv6 addresses I found.

@AdSchellevis
Member

Let's close this then; tracking the origins of local configuration changes is not something we can assist with in community time here.

@noseshimself
Author

(I'm digging through four years of nightly backups to find out how this npt mapping was created and why it is not on the slave -- should I find proof that it was automatically created after installing the exit part of shadowsocks on the gateway I'll open a new issue.)
