BGP, teamd, dhcp-relay starts with a delay after an upgrade #13117

Closed
stepanblyschak opened this issue Dec 20, 2022 · 2 comments · Fixed by #13283
Labels: Request for 202211 Branch; Triaged (this issue has been triaged)

Comments

@stepanblyschak
Collaborator

Description

Steps to reproduce the issue:

  1. Upgrade from 202205 to master image
  2. Observe that BGP, teamd, dhcp_relay start with a significant delay:
admin@anc-ali-t10:~$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED         STATUS         PORTS     NAMES
35ef5938740e   b72e99b9f3ee                         "/usr/bin/docker_ini…"   6 seconds ago   Up 4 seconds             dhcp_relay
b531d87c11ae   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   8 seconds ago   Up 7 seconds             bgp
d5750996e280   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   2 minutes ago   Up 2 minutes             telemetry
bd98f9e3b2db   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   2 minutes ago   Up 2 minutes             mgmt-framework
464123660ea0   docker-snmp:latest                   "/usr/local/bin/supe…"   2 minutes ago   Up 2 minutes             snmp
1c96288afca9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   4 minutes ago   Up 4 minutes             pmon
9251d8a262f4   docker-lldp:latest                   "/usr/bin/docker-lld…"   4 minutes ago   Up 4 minutes             lldp
13d9f779720b   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   5 minutes ago   Up 5 minutes             radv
452b305d83d5   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             syncd
98068d56673f   docker-orchagent:latest              "/usr/bin/docker-ini…"   5 minutes ago   Up 5 minutes             swss
046eb3ab63a0   docker-eventd:latest                 "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             eventd
08628d47da2f   docker-database:latest               "/usr/local/bin/dock…"   5 minutes ago   Up 5 minutes             database

PR #11796 appears to be the change that introduced the additional delay.

Describe the results you received:

BGP and teamd start about 5 minutes after swss.

Describe the results you expected:

BGP, teamd start earlier on boot

Output of show version:

Version based on 14a5ec7

@stephenxs
Collaborator

This delay also caused warm reboot to fail.

@stephenxs
Collaborator

stephenxs commented Dec 29, 2022

The PR changed how a feature's state is set in the FEATURE table of “init_cfg.json”: the state now depends on additional tables such as “DEVICE_RUNTIME_METADATA”, which determine whether bgp and teamd should be enabled.
The motivation of the PR, I think, is to decide whether these features should be enabled when the image is installed from scratch.
However, there is build-time logic in build_scripts/mask_disabled_services.py that masks features according to the FEATURE table in “init_cfg.json”. That script is called by sonic_debian_extension.j2, which in turn is called by the build-time script build_debian.sh.
As a result, the bgp and teamd services are masked in the image.

stephens@arc-build-server-2:/builds2/stephens/wip/202205/sonic-buildimage/fsroot-mellanox/etc/systemd/system$ ls {bgp*,teamd*} -l
lrwxrwxrwx 1 root root 9 Dec 15 16:41 bgp.service -> /dev/null
lrwxrwxrwx 1 root root 9 Dec 15 16:41 teamd.service -> /dev/null
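
A masked unit shows up exactly as in the listing above: the unit file is a symlink to /dev/null. As a minimal sketch (not part of SONiC; the function name and the temporary directory layout are assumptions for illustration), the masked units in a systemd directory of a built filesystem root could be listed like this:

```python
import os
import tempfile


def masked_units(systemd_dir):
    """Return the unit file names in systemd_dir that are symlinks to /dev/null."""
    masked = []
    for name in sorted(os.listdir(systemd_dir)):
        path = os.path.join(systemd_dir, name)
        # systemd treats a unit file that links to /dev/null as masked.
        if os.path.islink(path) and os.readlink(path) == "/dev/null":
            masked.append(name)
    return masked


if __name__ == "__main__":
    # Hypothetical stand-in for fsroot-.../etc/systemd/system.
    with tempfile.TemporaryDirectory() as root:
        os.symlink("/dev/null", os.path.join(root, "bgp.service"))
        os.symlink("/dev/null", os.path.join(root, "teamd.service"))
        open(os.path.join(root, "swss.service"), "w").close()
        print(masked_units(root))  # ['bgp.service', 'teamd.service']
```

Running a check like this against the image's etc/systemd/system directory before installing would show whether bgp and teamd were masked at build time.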

When the DUT boots into the new image, nothing unmasks the services, so they stay masked until about 5 minutes later, when sonic-host-services handles the FEATURE table and unmasks, enables, and starts them.

Relevant log messages:

Dec 10 22:25:00.523680 qa-eth-vt02-4-3800 NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 22:25:00.524738 qa-eth-vt02-4-3800 NOTICE systemd[1]: hostcfgd.service: Unit entered failed state.
Dec 10 22:25:00.525183 qa-eth-vt02-4-3800 WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Dec 10 22:25:12 sonic systemd[1]: Started Delays hostcfgd daemon until SONiC has started.
Dec 10 22:25:12 sonic systemd[1]: Started Delays hostcfgd daemon until SONiC has started.
Dec 10 22:26:38.748517 sonic INFO hostcfgd: ConfigDB connect success
Dec 10 22:26:38.765768 sonic INFO hostcfgd: Waiting for systemctl to finish initialization
Dec 10 22:30:25.818530 qa-eth-vt02-4-3800 NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: bgp ...
Dec 10 22:30:26.268771 qa-eth-vt02-4-3800 INFO hostcfgd: systemctl has finished initialization -- proceeding ...
Dec 10 22:30:26.279078 qa-eth-vt02-4-3800 INFO hostcfgd: Updating feature 'bgp' systemd config file related to auto-restart ...
Dec 10 22:30:26.279511 qa-eth-vt02-4-3800 INFO hostcfgd: Feautre 'bgp' systemd config file related to auto-restart is updated!
Dec 10 22:30:26.279557 qa-eth-vt02-4-3800 INFO hostcfgd: Reloading systemd configuration files ...
Dec 10 22:30:26.705274 qa-eth-vt02-4-3800 INFO hostcfgd: Systemd configuration files are reloaded!
Dec 10 22:30:26.724447 qa-eth-vt02-4-3800 INFO hostcfgd: Running cmd: 'sudo systemctl unmask bgp.service'
Dec 10 22:30:26.743052 qa-eth-vt02-4-3800 INFO hostcfgd[13895]: Removed /etc/systemd/system/bgp.service.
Dec 10 22:30:27.105740 qa-eth-vt02-4-3800 INFO hostcfgd: Running cmd: 'sudo systemctl enable bgp.service'
Dec 10 22:30:27.124633 qa-eth-vt02-4-3800 INFO hostcfgd[13913]: Created symlink /etc/systemd/system/sonic.target.wants/bgp.service → /lib/systemd/system/bgp.service.
Dec 10 22:30:27.502304 qa-eth-vt02-4-3800 INFO hostcfgd: Running cmd: 'sudo systemctl start bgp.service'
Dec 10 22:30:28.535576 qa-eth-vt02-4-3800 INFO hostcfgd: Feature bgp is enabled and started
Dec 10 22:30:28.547782 qa-eth-vt02-4-3800 INFO hostcfgd: Updating feature 'database' systemd config file related to auto-restart ...
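
The time hostcfgd spent waiting can be read directly off the timestamps in the log above. A minimal sketch, assuming the year is 2022 (the log lines carry no year) and using a hypothetical helper name `parse_ts`:

```python
from datetime import datetime


def parse_ts(line, year=2022):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' timestamp of a log line.

    The year is an assumption; it is not present in the log itself.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")


# Three lines copied from the log excerpt above.
wait_start = "Dec 10 22:26:38.765768 sonic INFO hostcfgd: Waiting for systemctl to finish initialization"
wait_end = "Dec 10 22:30:26.268771 qa-eth-vt02-4-3800 INFO hostcfgd: systemctl has finished initialization -- proceeding ..."
bgp_started = "Dec 10 22:30:28.535576 qa-eth-vt02-4-3800 INFO hostcfgd: Feature bgp is enabled and started"

waited = (parse_ts(wait_end) - parse_ts(wait_start)).total_seconds()
until_bgp = (parse_ts(bgp_started) - parse_ts(wait_start)).total_seconds()

print(f"hostcfgd waited {waited:.1f} s for systemctl")     # 227.5 s
print(f"bgp started {until_bgp:.1f} s after the wait began")  # 229.8 s
```

Once the wait ends, unmasking, enabling, and starting bgp takes only about two seconds, so nearly all of the delay in this excerpt is the wait itself.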

The same symptom can be observed in the warm reboot flow, where it causes a traffic drop of about 5 minutes.
SPC2-test_sonic_dump_qa-eth-vt02-4-3800_2022-12-11T003242.tar.gz

@gechiang gechiang added the Triaged this issue has been triaged label Jan 4, 2023
abdosi added a commit to abdosi/sonic-buildimage that referenced this issue Jan 6, 2023
sonic-net#13117

Signed-off-by: Abhishek Dosi <[email protected]>
yxieca pushed a commit that referenced this issue Jan 7, 2023
…excplicitly (#13283)

What I did:
Fix : #13117

How I did:
At build time, mask only those features/services that are explicitly disabled. The state of some features (e.g. teamd, bgp, dhcp-relay, mux) is determined at run time, so for those features the service is up and running by default, and hostcfgd masks it later if needed.

So the default behavior will be:

If init_cfg.json.j2 sets the state to disabled at build time, the service is masked.
If init_cfg.json.j2 sets the state to a jinja2 template string that is rendered later, the service is not masked.
If init_cfg.json.j2 sets the state to enabled at build time, the service is not masked.
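
The three cases above reduce to one rule: mask only on an exact match with "disabled". The following is a minimal sketch of that rule, not the actual code of mask_disabled_services.py; the function name, the sample table, and the template text are hypothetical.

```python
def features_to_mask(init_cfg):
    """Return the features whose build-time state is exactly "disabled".

    Any other value, including "enabled" or an unrendered jinja2 template
    string, leaves the service unmasked so it can start at boot.
    """
    feature_table = init_cfg.get("FEATURE", {})
    return sorted(
        name
        for name, attrs in feature_table.items()
        if attrs.get("state") == "disabled"
    )


# Hypothetical FEATURE table; the entries and the template text are made up.
SAMPLE_INIT_CFG = {
    "FEATURE": {
        "swss": {"state": "enabled"},
        "bgp": {"state": "{% if ... %}disabled{% else %}enabled{% endif %}"},
        "example_feature": {"state": "disabled"},
    }
}

print(features_to_mask(SAMPLE_INIT_CFG))  # ['example_feature']
```

The comparison is an exact string match on purpose: a template string may contain the word "disabled" as one of its branches, and a substring check would wrongly mask that service.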

How I verify:
Manual Verification.

Signed-off-by: Abhishek Dosi <[email protected]>
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Jan 7, 2023
…excplicitly (sonic-net#13283)

yxieca pushed a commit that referenced this issue Jan 9, 2023
…excplicitly (#13283) (#13296)

Signed-off-by: Abhishek Dosi <[email protected]>
Co-authored-by: abdosi <[email protected]>
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Feb 3, 2023
…excplicitly (sonic-net#13283)
