Skip to content

Interface validation failures when issuing "interface VlanXYZ" command #5718

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
santoshdoke opened this issue Jan 22, 2020 · 5 comments
Closed
Assignees
Labels
triage Needs further investigation

Comments

@santoshdoke
Copy link

When issuing interface command, a validation error is seen intermittently. This is on FRR 7.2 stable version + fix from #5513.

interface Vlan1301

% Configuration failed: validation error.

#zebra[40]: lib_interface_destroy: only inactive interfaces can be deleted
#zebra[40]: [EC 100663331] nb_callback_configuration: error processing configuration change: error [validation error] event [validate] operation [destroy] xpath [/frr-interface:lib/interface[name='Vlan1301'][vrf='Vrf_RED']]

(put "x" in "[ ]" if you already tried following)
[x ] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?

To Reproduce
Steps to reproduce the behavior:

  1. Create VRF in Linux using ip link command.
  2. Bind interface to VRF. Say Vlan1301.
  3. Launch VTYSH.
  4. Enter interface config mode using 'interface Vlan1301'.
  5. Validation fails.

Expected behavior
Interface command should succeed.

Versions

  • OS Kernel: Debian 4.9.189-3+deb9u2
  • FRR Version - 7.2 stable

Additional context
Validation failure is from this back trace. Since the interface is active, the lib_interface_destroy() returns failure. Not sure if this the interface destroy is expected to be called for "interface Vlan1301" command. Suspect there may be some timing issue with VRF update (fix in 5513) and the running config diffs.

(gdb) bt

#0 lib_interface_destroy (event=NB_EV_VALIDATE, dnode=0x555dcacf2cb0) at lib/if.c:1444

#1 0x00007f488f190ad0 in nb_callback_configuration (event=event@entry=NB_EV_VALIDATE,

change=change@entry=0x555dcacf3440) at lib/northbound.c:817                       

#2 0x00007f488f190c1d in nb_candidate_validate_changes (changes=changes@entry=0x7fff878ae1b0,

candidate=0x555dcaac33d0) at lib/northbound.c:595                                         

#3 0x00007f488f190e74 in nb_candidate_commit_prepare (candidate=0x555dcaac33d0,

client=NB_CLIENT_CLI, user=0x555dcacf3df0, comment=0x0, transaction=0x7fff878ae200)       

at lib/northbound.c:646                                                                   

#4 0x00007f488f191166 in nb_candidate_commit (candidate=,

client=client@entry=NB_CLIENT_CLI, user=user@entry=0x555dcacf3df0,                        

save_transaction=save_transaction@entry=false, comment=comment@entry=0x0,                 

transaction_id=transaction_id@entry=0x0) at lib/northbound.c:708                          

#5 0x00007f488f193d7c in nb_cli_apply_changes (vty=vty@entry=0x555dcacf3df0,

xpath_base_fmt=xpath_base_fmt@entry=0x7fff878ae5b0 "/frr-interface:lib/interface[name='Vlan1301'][vrf='VrfRed']") at lib/northbound_cli.c:187                                                

#6 0x00007f488f17e650 in interface_magic (self=, argc=,

argv=<optimized out>, vrf_name=0x555dcac59900 "VrfRed", ifname=0x555dcacf3c70 "Vlan1301",   

vty=0x555dcacf3df0) at lib/if.c:1259                                                        

#7 interface (self=, vty=0x555dcacf3df0, argc=,

argv=<optimized out>) at ./lib/if_clippy.c:51                                               

#8 0x00007f488f167e8d in cmd_execute_command_real (vline=vline@entry=0x555dcab2e5a0,

vty=vty@entry=0x555dcacf3df0, cmd=cmd@entry=0x0, filter=FILTER_RELAXED)                     

at lib/command.c:1070                                                                       

#9 0x00007f488f16a0cf in cmd_execute_command (vline=vline@entry=0x555dcab2e5a0,

vty=vty@entry=0x555dcacf3df0, cmd=0x0, vtysh=vtysh@entry=0) at lib/command.c:1129           

#10 0x00007f488f16a235 in cmd_execute (vty=vty@entry=0x555dcacf3df0,

cmd=cmd@entry=0x555dcacf0b20 "interface Vlan1301 vrf VrfRed ", matched=matched@entry=0x0,   

vtysh=vtysh@entry=0) at lib/command.c:1283                                                  

#11 0x00007f488f1ba912 in vty_command (vty=vty@entry=0x555dcacf3df0,

buf=0x555dcacf0b20 "interface Vlan1301 vrf VrfRed ") at lib/vty.c:516                       

#12 0x00007f488f1bab96 in vty_execute (vty=vty@entry=0x555dcacf3df0) at lib/vty.c:1285

---Type to continue, or q to quit---

#13 0x00007f488f1bd4ec in vtysh_read (thread=) at lib/vty.c:2119

#14 0x00007f488f1b5550 in thread_call (thread=thread@entry=0x7fff878b0de0) at lib/thread.c:1540

#15 0x00007f488f1858a8 in frr_run (master=0x555dcaa4b9c0) at lib/libfrr.c:1054

#16 0x0000555dca19accb in main (argc=10, argv=0x7fff878b11e8) at zebra/main.c:480

@santoshdoke santoshdoke added the triage Needs further investigation label Jan 22, 2020
@santoshdoke
Copy link
Author

After removing the running-config changes in if_update_to_new_vrf() function, lib/if.c, this issue is no longer seen (the below code under HACK comment was removed). I see that the running-config does show correct VRF binding, even after removing this code. But, not sure what else would break by removing this change. Any inputs/suggestions?

/*
* HACK: Change the interface VRF in the running configuration directly,
* bypassing the northbound layer. This is necessary to avoid deleting
* the interface and readding it in the new VRF, which would have
* several implications.
*/
if (yang_module_find("frr-interface")) {
struct lyd_node *if_dnode;

	if_dnode = yang_dnode_get(
		running_config->dnode,
		"/frr-interface:lib/interface[name='%s'][vrf='%s']/vrf",
		ifp->name, old_vrf->name);
	if (if_dnode) {
		nb_running_unset_entry(if_dnode->parent);
		yang_dnode_change_leaf(if_dnode, vrf->name);
		nb_running_set_entry(if_dnode->parent, ifp);
		running_config->version++;
	}
}

@rwestphal
Copy link
Member

@santoshdoke thanks for reporting this issue.

I have a pending optimization commit that completely ditches the running_config_entries hash table in favor of storing YANG user pointers directly in the lyd_node structure. I tend to believe this should fix this inconsistency problem that happens when we have config changes coming from both the CLI and kernel at the same time. I should open a PR for this next week after the Carnaval holidays.

@idryzhov
Copy link
Contributor

Hi @rwestphal, any updates on this?

@idryzhov
Copy link
Contributor

Hi @santoshdoke, can you, please, check if my PR #5999 fixes your issue?

rwestphal added a commit to opensourcerouting/frr that referenced this issue Mar 24, 2020
Commit ccd43ad introduced a hash table in the northbound
layer to optimize the management of users pointers associated to
YANG configuration nodes (like containers and lists). While that
optimization was enormously effective, it introduced the following
two problems:
* Increased memory usage of 264 bytes per configuration node. This
  might not seem like much, but it can account for hundreds of
  megabytes for large configurations (e.g. thousands of route-maps
  and ACLs).
* Inconsistencies in the hash table when configuration changes come
  from the kernel (bypassing the normal transactional process). This
  in turns leads to problems like the northbound rejecting valid
  configuration changes.

The solution consists of reverting parts of commit ccd43ad
(the hash table bits) and introducing the yang_dnode_replace()
function. This function was designed to replace one YANG data node by
another one, doing the least amount of work possible. The idea is to
optimize the common use case of reading one configuration command at
time when parsing the startup configuration. Normally replacing one
configuration with 1000 commands by another one with 1001 commands
meant freeing the first one and duplicating the second, both expensive
operations. Now we compare config A with config B and change only
the parts of A that need to be changed. The nice side effect of this
optimization is that the existing user pointers in the original
configuration are not lost during the replace operation. In other
words, we can use the 'priv' pointer of the 'lyd_node' structure again
since they are persistent now. This means the 'running_config_entries'
hash table isn't necessary anymore and can be removed.

Fixes FRRouting#5718.

Signed-off-by: Renato Westphal <[email protected]>
rwestphal added a commit to opensourcerouting/frr that referenced this issue Mar 24, 2020
Commit ccd43ad introduced a hash table in the northbound
layer to optimize the management of users pointers associated to
YANG configuration nodes (like containers and lists). While that
optimization was enormously effective, it introduced the following
two problems:
* Increased memory usage of 264 bytes per configuration node. This
  might not seem like much, but it can account for hundreds of
  megabytes for large configurations (e.g. thousands of route-maps
  and ACLs).
* Inconsistencies in the hash table when configuration changes come
  from the kernel (bypassing the normal transactional process). This
  in turns leads to problems like the northbound rejecting valid
  configuration changes.

The solution consists of reverting parts of commit ccd43ad
(the hash table bits) and introducing the yang_dnode_replace()
function. This function was designed to replace one YANG data node by
another one, doing the least amount of work possible. The idea is to
optimize the common use case of reading one configuration command at
time when parsing the startup configuration. Normally replacing one
configuration with 1000 commands by another one with 1001 commands
meant freeing the first one and duplicating the second, both expensive
operations. Now we compare config A with config B and change only
the parts of A that need to be changed. The nice side effect of this
optimization is that the existing user pointers in the original
configuration are not lost during the replace operation. In other
words, we can use the 'priv' pointer of the 'lyd_node' structure again
since they are persistent now. This means the 'running_config_entries'
hash table isn't necessary anymore and can be removed.

Fixes FRRouting#5718.

Signed-off-by: Renato Westphal <[email protected]>
rwestphal added a commit to opensourcerouting/frr that referenced this issue Mar 25, 2020
Commit ccd43ad introduced a hash table in the northbound
layer to optimize the management of users pointers associated to
YANG configuration nodes (like containers and lists). While that
optimization was enormously effective, it introduced the following
two problems:
* Increased memory usage of 264 bytes per configuration node. This
  might not seem like much, but it can account for hundreds of
  megabytes for large configurations (e.g. thousands of route-maps
  and ACLs).
* Inconsistencies in the hash table when configuration changes come
  from the kernel (bypassing the normal transactional process). This
  in turns leads to problems like the northbound rejecting valid
  configuration changes.

The solution consists of reverting parts of commit ccd43ad
(the hash table bits) and introducing the yang_dnode_replace()
function. This function was designed to replace one YANG data node by
another one, doing the least amount of work possible. The idea is to
optimize the common use case of reading one configuration command at
time when parsing the startup configuration. Normally replacing one
configuration with 1000 commands by another one with 1001 commands
meant freeing the first one and duplicating the second, both expensive
operations. Now we compare config A with config B and change only
the parts of A that need to be changed. The nice side effect of this
optimization is that the existing user pointers in the original
configuration are not lost during the replace operation. In other
words, we can use the 'priv' pointer of the 'lyd_node' structure again
since they are persistent now. This means the 'running_config_entries'
hash table isn't necessary anymore and can be removed.

Fixes FRRouting#5718.

Signed-off-by: Renato Westphal <[email protected]>
@rwestphal
Copy link
Member

As discussed on the #yang channel, it turns out this problem was caused by a bug in the libyang cache mechanism (please see CESNET/libyang#1048).

The solution is to update to the latest libyang from the devel branch (or build older libyang versions with -DENABLE_CACHE=OFF). For the stable branches we'll try to backport the libyang fix and update all libyang 0.16 packages.

Thanks @santoshdoke and @idryzhov for reporting and helping to fix this problem. I'll close this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage Needs further investigation
Projects
None yet
3 participants