Zebra crash is observed when management VRF is disabled. Crash seen with simple linux scripts as well #3798
Comments
Fixed as part of #3763 |
@prsunny : Before raising this issue, we had integrated this PR into our code and found that the crash still occurs even with the fix. Once a successful Jenkins image is available, we will retest with it and update you. |
Zebra crash is seen in #136 as well.
|
When a VRF is deleted, the kernel sends one netlink message for the VRF deletion and another for the link deletion (the link whose name is the same as the VRF name). It looks like zebra does not handle the VRF deletion followed by the link deletion. |
@tylerlinp , can you please check this? |
We reproduced it.
|
Yes, the crash is caused by issue frr#5369 (bug 1). That is different from #3763, which only fixed the delroute failure (bug 2). There appears to be a bug 3 in frr7.2: when an interface is brought down, the routes pointing to it are not cleared (I think frr7.1 may handle this correctly), so the default route here remains until vrf mgmt is removed. I found 2 other issues when reading the scripts. a) in interfaces.j2, |
@ryan44guo , @prsunny : Regarding temporary workaround/solution # 1, removing this route from the kernel manually, I am not sure what is missing. In the current sequence of operations, the following things already happen.
|
@tylerlinp : Regarding the crash, can we assume that it will be fixed irrespective of the 2 other issues that you mentioned? Regarding those 2 issues, please see my inline replies. b) in interfaces-config.sh, down eth0 and lo. I think it is useless, since that is already done by systemctl restart networking. |
@kannankvs Yes, the crash will be fixed irrespective of the 2 other issues that I mentioned. To avoid the crash, we should make sure to delete the routes first and only then bring eth0 down. In interfaces.j2, change
Others: a) At least set the master before adding the IP; otherwise lo-m gets a global IP and is then moved to the VRF.
I think lo-m should be created along with mgmt rather than on lo up; lo-m and lo have no relation. b) You are right. It may be useful to ifdown eth0 first. |
c) What is the purpose of adding a directly connected route in interfaces.j2? |
@tylerlinp : Couple of doubts.
NOTE: Now that the 201911 branch has been pulled out, let us try to do the changes that are mandatory for the release. If you think that any of the above changes are mandatory, request you to provide more reasons for the change which can be produced while raising PR. |
@tylerlinp : regarding the "connected" routes, we followed the code that already existed before mvrf implementation. |
|
|
@kannankvs logically, both pre-down and down should work, but that is only in theory. |
@ryan44guo : Got it. As mentioned earlier, we will go ahead and make the changes. Just to understand it better, if we have route delete command in "down", is it not getting executed? |
@kannankvs : I was not sure whether the "down" hook is called before or after the down command, so I tested it: no DELROUTE messages were sent out. So I think the "down" hook is called after the "down" command, but before post-down. |
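For illustration, the ordering discussed in this thread (delete the routes while the link is still up, then bring eth0 down) can be sketched as a small shell helper. This is only a sketch under assumptions: the function name `teardown_mgmt_iface`, the `DRY_RUN` switch, and the choice of deleting just the default route in table 5000 are hypothetical and are not taken from interfaces.j2 or from PR 3853. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only, not the actual interfaces.j2 change.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
TABLE="${TABLE:-5000}"   # VRF table id used in this issue
IFACE="${IFACE:-eth0}"   # management interface

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: remove the default route first, while the link is
# still up, and only afterwards take the link down.
teardown_mgmt_iface() {
    run ip route del default dev "$IFACE" table "$TABLE"
    run ip link set dev "$IFACE" down
}

# Only act when explicitly asked to, e.g. "sh teardown.sh run".
case "${1:-}" in
    run) teardown_mgmt_iface ;;
esac
```

Running it as `sh teardown.sh run` prints the two commands in order; setting `DRY_RUN=0` as root would execute them.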
@ryan44guo , @tylerlinp, @prsunny : I changed the interfaces.j2 and raised the PR3853. Request you to kindly review and approve. |
|
Closing as #3853 is merged |
@wendani , is it a new issue? if so, can you raise an issue |
I'll make this open to track the proper fix of the issue. |
@prsunny will sync up with @pavel-shirshov |
was mentioned as seen in latest 201911 |
I checked this. I found the issue with zebra on master. I see that zebra crashes when we issue /sbin/reboot. I asked @tahmed-dev to investigate further. |
@kannankvs, is this issue still being seen? I just used the tip of master on my DUT and could not reproduce it, either by method 1 mentioned in the issue details or by issuing /sbin/reboot:
|
Description
Zebra crashes when mvrf is deleted. From the GDB traces, the route node does not hold valid data, and the zebra process crashes while trying to access a NULL/invalid rnode->table pointer.
Several methods were tried to recreate the issue using linux commands in a script (without the management vrf commands) in order to isolate the problem. They show that the crash also occurs with simple linux scripts.
** Create VRF - Create cgroup, eth0, mgmt vrf link **
Linux commands (run in a script):
cgcreate -g l3mdev:mgmt
cgset -r l3mdev.master-device=mgmt mgmt
ip link add dev mgmt type vrf table 5000
ip link set mgmt up
ip link set dev eth0 master mgmt
** Delete VRF - Delete eth0, mgmt vrf link **
ip link set dev eth0 nomaster
ip link delete dev mgmt
Steps to reproduce the issue:
** Method1 **
config vrf add mgmt
config vrf del mgmt
root@sonic-z9264f-03:/var/core# ls -l
-rw-rw-rw- 1 root root 3509004 Nov 3 12:41 zebra.1572784872.36.core.gz
** Method2 **
As explained earlier, run those linux commands from two scripts: Script1 creates mvrf and Script2 deletes it.
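A minimal sketch of how Script1 and Script2 could be combined into one file is shown here. The commands are the ones listed in this issue (including the additional `cgdelete` from the section on additional commands); the function names `create_mvrf` and `delete_mvrf`, the `run` wrapper, and the `DRY_RUN` switch are assumptions added for illustration. With the default `DRY_RUN=1` the script only prints the commands, so it can be inspected safely before being run as root.

```shell
#!/bin/sh
# Hypothetical wrapper around the create/delete commands from this issue.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
VRF="${VRF:-mgmt}"
TABLE="${TABLE:-5000}"
IFACE="${IFACE:-eth0}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Script1: create the cgroup, the VRF link, and enslave the interface.
create_mvrf() {
    run cgcreate -g "l3mdev:$VRF"
    run cgset -r "l3mdev.master-device=$VRF" "$VRF"
    run ip link add dev "$VRF" type vrf table "$TABLE"
    run ip link set "$VRF" up
    run ip link set dev "$IFACE" master "$VRF"
}

# Script2: release the interface, delete the VRF link, remove the cgroup.
delete_mvrf() {
    run ip link set dev "$IFACE" nomaster
    run ip link delete dev "$VRF"
    run cgdelete -g "l3mdev:$VRF"
}

# Usage: sh mvrf.sh create | sh mvrf.sh delete
case "${1:-}" in
    create) create_mvrf ;;
    delete) delete_mvrf ;;
esac
```

To actually attempt the reproduction, run `DRY_RUN=0 sh mvrf.sh create` followed by `DRY_RUN=0 sh mvrf.sh delete` as root, then check /var/core for a zebra core file.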
Describe the results you received:
Debugging gave the following backtrace in zebra, where rn->table is NULL/invalid.
GDB back trace:
#0 0x00007f5f7062bfff in raise () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f5f71a7ed40 (LWP 43))]
(gdb) bt
#0 0x00007f5f7062bfff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5f7062d42a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f5f7167939f in core_handler (signo=11, siginfo=0x7ffd2896ceb0,
context=) at lib/sigevent.c:228
#3
#4 srcdest_rnode_table (rn=) at ./lib/srcdest_table.h:90
#5 rib_dest_table (dest=0x557be2369110) at zebra/rib.h:471
#6 rib_dest_vrf (dest=0x557be2369110) at zebra/rib.h:479
#7 netlink_route_info_fill (re=0x0, dest=0x557be2369110, cmd=25,
ri=0x7ffd2896d230) at zebra/zebra_fpm_netlink.c:294
#8 zfpm_netlink_encode_route (cmd=25, dest=dest@entry=0x557be2369110,
re=re@entry=0x0, in_buf=in_buf@entry=0x557be2307c64 "0", in_buf_len=8188)
at zebra/zebra_fpm_netlink.c:572
#9 0x00007f5f6e5ad8ed in zfpm_encode_route (msg_type=,
in_buf_len=, in_buf=0x557be2307c64 "0", re=,
dest=0x557be2369110) at zebra/zebra_fpm.c:887
#10 zfpm_build_route_updates () at zebra/zebra_fpm.c:990
#11 zfpm_build_updates () at zebra/zebra_fpm.c:1149
#12 zfpm_write_cb (thread=0x7ffd289714c0) at zebra/zebra_fpm.c:1187
#13 0x00007f5f71686a50 in thread_call (thread=thread@entry=0x7ffd289714c0)
at lib/thread.c:1531
#14 0x00007f5f71656fa8 in frr_run (master=0x557be212c9c0) at lib/libfrr.c:1054
#15 0x0000557be038c9b3 in main (argc=9, argv=0x7ffd289718c8)
Describe the results you expected:
Expected zebra not to crash.
Additional information you deem important (e.g. issue happens only occasionally):
When the linux script does not trigger the crash, a few additional linux commands in the create and delete scripts sometimes help.
Additional Commands in Create script:
cgcreate -g l3mdev:mgmt
cgset -r l3mdev.master-device=mgmt mgmt
Additional Commands in Delete script:
cgdelete -g l3mdev:mgmt
Any master image.