-
Notifications
You must be signed in to change notification settings - Fork 882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure purging neighbor cache for stale deletes #1433
Conversation
When stale delete notifications are received, we still need to make sure to purge sandbox neighbor cache because these stale deletes are most typically out of order delete notifications and if an add for the peermac was received before the delete of the old peermac,vtep pair then we process that and replace the kernel state but the old neighbor state in the sandbox cache remains. That needs to be purged when we finally get the out of order delete notification. Signed-off-by: Jana Radhakrishnan <[email protected]>
// leftover sandbox neighbor cache and not actually delete the | ||
// kernel state. | ||
if (eid == pEntry.eid && vtep.Equal(pEntry.vtep)) || | ||
(eid != pEntry.eid && !vtep.Equal(pEntry.vtep)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a curiosity, this could be compacted to if (eid == pEntry.eid) == vtep.Equal(pEntry.vtep) { }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will be clever but it will be awfully lot difficult to decipher the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently peerDbDelete
returns false
in the error cases (ie: d.peerDb.mp[nid]
not present or peerKey not present). Both are still valid error cases. I think we should retain the boolean return and add peerEntry
as additional return type. So the existing !d.peerDbDelete
check can be retained in peerDelete
and the if check added above is not necessary. Its hard to follow the logic with the current check. We just need the eid == pEntry.eid && vtep.Equal(pEntry.vtep)
in the DeleteNeighbor
call. wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the logic in the PR may be a little hard to follow it is absolutely necessary because even if endpoint Ids did not match we need to remove state in sandbox neighbor case (look for eid != pEntry.eid && !vtep.Equal(pEntry.vtep)
) which might have existed from a previous neighbor add that did not get removed because we got another add before the delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. yeah, I see why we can't bail out on a false
return from peerDbDelete
. But I think the if check should also also include eid != pEntry.eid && vtep.Equal(pEntry.vtep)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it should not. If we do we will delete the vtep for the current valid entry which is specifically why we are doing the eid match to avoid a stale(out of order) delete from actually deleting anything. About the only thing that a stale delete should do is remove a stale sandbox neighbor cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was thinking of the case where the task gets rescheduled and placed on the same node. But in that case the neighbor entry is actually correct and not stale.
LGTM. Good that you caught this in the testing. It would have been impossible to debug in live setups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
LGTM |
@mrjana Change in #1403 was to avoid sending the stale delete events from networkdb. Shouldn't that avoid overlay driver getting stale deletes ? Is this change specifically to handle the case of a out of order delete when the same IP/mac has been used for a different endpoint after a task reschedule ? |
Yeah this is specifically for the same IP/mac being used in a different key and that happens more often than I would like for thanks to the (correc) default IPAM behavior of giving out the last release IP. |
Its probably many seconds between a task going down and that event gets processed by the manager, rescheduling/dispatching happens, the tasks starts running on the new node and a new table event gets generated for that. Its a bit surprising that an out of order UDP packet can be received after such a delay. I am wondering may be this symptom was seen because of gossip retransmission of the delete event and won't be an issue after #1403 fix ? |
This problem happens when someone creates and removes services in a loop at rapid click. It doesn't happen normally and yes I saw this issue while doing the back-to-back service create/rm test on top of #1403 fix. |
peerDbWg.Wait() | ||
|
||
d.peerDb.Lock() | ||
pMap, ok := d.peerDb.mp[nid] | ||
if !ok { | ||
d.peerDb.Unlock() | ||
return false | ||
return peerEntry{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrjana The key for peerEntry
in peerNetworkMap.mp is just the peer IP & mac
pKey := peerKey{
peerIP: peerIP,
peerMac: peerMac,
}
In the case of out of order delete we will end up deleting the entry for the newly created peer entry. I think we need eid as part of the key because it will change in this case where we get an add before the delete of the old ep ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't delete it incorrectly. We do check for eid match below and that is needed in the caller as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, sorry... its right below.
When stale delete notifications are received, we still need to make sure
to purge sandbox neighbor cache because these stale deletes are most
typically out of order delete notifications and if an add for the
peermac was received before the delete of the old peermac,vtep pair then
we process that and replace the kernel state but the old neighbor state
in the sandbox cache remains. That needs to be purged when we finally
get the out of order delete notification.
Signed-off-by: Jana Radhakrishnan [email protected]