[SDN-2482] API enhancement to add support for Admin Policy Based External Route CRs#3462
Conversation
a5fba20 to
aca21a9
Compare
398f323 to
85a0dd4
Compare
4f9a2f2 to
ae5b479
Compare
d6ed36e to
9069b27
Compare
|
/retest-failed |
9069b27 to
e9cb2e6
Compare
|
/retest-failed |
|
/restest-failed |
|
/retest |
|
@jordigilh : Came back to this for reviews this morning, its still not passing the unit test job; so I'll wait till we have a green CI, but thanks for combining the commits here and meanwhile I will review: #3491 -> thanks for breaking that into a new PR. |
e9cb2e6 to
1800c12
Compare
|
/retest-failed |
|
Oops, something went wrong: |
|
/retest-failed |
1800c12 to
6dcada1
Compare
There was a problem hiding this comment.
Thanks @jordigilh for consolidating the commits together, much appreciated!
Started by reviewing the API changes (so commit1 only; when I am happy with commit1 will move to commit2) -> these take the longest to be implemented properly.
We are also missing the actual yaml generated? See update-codegen file and dist/templates stuff. I'd like the yaml in the next iteration so that I can test it.
|
|
||
| // AdminPolicyBasedExternalRouteSpec defines the desired state of AdminPolicyBasedExternalRoute | ||
| type AdminPolicyBasedExternalRouteSpec struct { | ||
| Policies []*ExternalPolicy `json:"policies"` |
There was a problem hiding this comment.
move these to first commit? why are these changes here? makes review hard, I go commit by commit.. (so the 1st commit is no longer relevant when you change things over on top of it)
There was a problem hiding this comment.
Ack. I will rebase and split the changes that are specific to the crd api and those implementing the controllers
| // +kubebuilder:object:root=true | ||
| // +kubebuilder:subresource:status | ||
|
|
||
| // AdminPolicyBasedExternalRoute is the Schema for the AdminPolicyBasedExternalRoutes API |
There was a problem hiding this comment.
Please add meaningful field descriptions -> these get rendered into user docs on the yaml when folks do oc describe.
| type AdminPolicyBasedExternalRoute struct { | ||
| metav1.TypeMeta `json:",inline"` | ||
| metav1.ObjectMeta `json:"metadata,omitempty"` | ||
| Spec AdminPolicyBasedExternalRouteSpec `json:"spec,omitempty"` |
There was a problem hiding this comment.
add description to Spec and Status fields
|
@jordigilh : also looks like you need to rebase. |
|
|
||
| // delAllHybridRoutePolicies deletes all the 501 hybrid-route-policies that | ||
| // force pod egress traffic to be rerouted to a gateway router for local gateway mode. | ||
| // Called when migrating to SGW from LGW. |
There was a problem hiding this comment.
I copied this from the legacy code.... 😄 not sure if it was required or not... but happy to remove code 😄
9636d47 to
8ec1cb3
Compare
| m.routePolicySyncCache.LoadOrStore(policy.Name, policy) | ||
| func (m *externalPolicyManager) storeRoutePolicyInCache(policyInfo *adminpolicybasedrouteapi.AdminPolicyBasedExternalRoute) error { | ||
| return m.routePolicySyncCache.DoWithLock(policyInfo.Name, func(policyName string) error { | ||
| m.routePolicySyncCache.Store(policyName, policyInfo) |
There was a problem hiding this comment.
@trozet I had to enhance the syncMap an add a Store function to make sure I was able to overwrite the stored value. The available LoadOrStore() function only allowed me to write once the value and any subsequent calls would return the existing value.
Since the route policy object can change over time, I am forced to update the value in the cache as changes are detected, hence the Store() function.
Let me know what you think.
There was a problem hiding this comment.
@npinaeva FYI
How come you you need this capability? Looks like because on update event you find the "currentPolicy" in the cache, check the differences, then store the updated one?
There was a problem hiding this comment.
Correct. Because I update the stored object. If I had a struct that wrapped the object, I would not need this as I would only be updating a field within the already contained struct. But since I'm changing the reference to the object, I have to rewrite what's there.
Note that I don't have this problem with the namespace info cache, as I only update the fields of the struct.
There was a problem hiding this comment.
well good news: I don't need this anymore now that I'm wrapping the manifest object into a struct that contains the route policy manifest and a bool to flag the route policy as being deleted. I removed the Store() function and reverted to use LoadOrStore() instead.
9fcdf10 to
748fbe1
Compare
|
/retest-failed |
1 similar comment
|
/retest-failed |
trozet
left a comment
There was a problem hiding this comment.
Lets discuss the conntrack stuff with @tssurya tmrw. I think there are still some comments that were not addressed with your fix up:
https://github.com/ovn-org/ovn-kubernetes/pull/3462/files#r1181695369
https://github.com/ovn-org/ovn-kubernetes/pull/3462/files#r1181701804
https://github.com/ovn-org/ovn-kubernetes/pull/3462/files#r1181709584
https://github.com/ovn-org/ovn-kubernetes/pull/3462/files#r1181710245
https://github.com/ovn-org/ovn-kubernetes/pull/3462/files#r1181890894
| m.routePolicySyncCache.LoadOrStore(policy.Name, policy) | ||
| func (m *externalPolicyManager) storeRoutePolicyInCache(policyInfo *adminpolicybasedrouteapi.AdminPolicyBasedExternalRoute) error { | ||
| return m.routePolicySyncCache.DoWithLock(policyInfo.Name, func(policyName string) error { | ||
| m.routePolicySyncCache.Store(policyName, policyInfo) |
There was a problem hiding this comment.
@npinaeva FYI
How come you you need this capability? Looks like because on update event you find the "currentPolicy" in the cache, check the differences, then store the updated one?
| klog.Warningf("Attempting to delete policy %s from a namespace that does not exist %s", routePolicy.Name, ns.Name) | ||
| continue | ||
| } | ||
| err = m.removePolicyFromNamespace(ns.Name, routePolicy, cacheInfo) |
There was a problem hiding this comment.
but here you are removing the policy from namespaces without locking the policy first. I would expect this kind of logic:
- Lock policy
- remove it from every namespace
- now remove it from the cache
Without this, you could have another thread, lets say namespace watcher. I think this could potentially happen:
- Thread policy enters this function starts removing policy from all of the namespaces.
- Simultaneously a new namespace is added that is targeted by the policy. Thread namespace policy adds this policy to the namespace.
- Thread policy now removes it from the cache.
Now you have a deleted policy being applied to a namespace. I think in the past we had a similar problem in network policy code. iirc we added a deleted bool to the policy itself, and set that first with a lock before we start deleting policies. Then the other threads can check this value to see if it should add or not.
@npinaeva may have other ideas
|
|
||
| // delAllHybridRoutePolicies deletes all the 501 hybrid-route-policies that | ||
| // force pod egress traffic to be rerouted to a gateway router for local gateway mode. | ||
| // Called when migrating to SGW from LGW. |
f3708bc to
17064c5
Compare
92d5146 to
ee7d693
Compare
|
/retest-failed |
1 similar comment
|
/retest-failed |
trozet
left a comment
There was a problem hiding this comment.
few comments, otherwise i think it is good to go
trozet
left a comment
There was a problem hiding this comment.
few comments, otherwise i think it is good to go
a4bf549 to
1cddf9c
Compare
…(informer,lister,api) Signed-off-by: jordigilh <jgil@redhat.com>
1cddf9c to
9190200
Compare
* Implements controllers for Admin Policy Based External Route to handle changes to namespaces, pods and admin policy based external route CRs. * Initialize in master node to handle interactions with the north bound DB. Initialize in worker nodes to handle changes to the conntrack (delete ECMP entries when a gateway IP is no longer a valid external gateway IP) * Implements repair() function for the master node. * Integrates with the annotation logic to avoid duplications in cache by sharing the externalGWCache and EXGWCacheMutex objects between the annotation and controller logic. * Updates the annotation logic to ensure the namespace anontation k8s.ovn.org/external-gw-pod-ips is updated when changes occur in a CR instance that coexists in the same namespace and that can impact the list of dynamic gateway IPs. * The implementation no longer relies on namespace annotations, including "k8s.ovn.org/external-gw-pod-ips", instead it uses its own cache structure to identify the valid pod IPs for a given namespace. * Implement E2E tests for admin policy based external route. The tests are a duplication of the existing annotated based logic for external gateways using the CR instead. Signed-off-by: jordigilh <jgil@redhat.com>
9190200 to
91046e8
Compare
This PR aims to implement the external gateway part of the enhancement defined openshift/enhancements#1338.
Notes:
externalGWCacheandexGWCacheMutexobjects to avoid both logics from deleting common IPs by accident. Common IPs would be those that are defined in a policy CR and an annotation. By sharing it avoid duplicating the IPs and also deleting them when either an annotation or a policy removes it from a namespace. These objects reside in theExternalControllerbut they are exposed so that the annotation logic can reach out to them.repair()for external gateways is only performed by the controller. Since both logics share the same cache, there is no need to perform it twice and the controller is the long term actor that should be perform it.repair()looks up the GW IPs from both the policies and the namespace and pod annotations and performs the cleanup evaluating the route using both sources.Issues I found:
repair()where it did not split the slice to handle multiple namespaces, while it is being done here for instance.externalGWCacheas it processes the information. I'm not sure if that's the correct way to do it, but I felt that if the cache was not populated duringrepair()it would lead to duplicates in the north bound DB when processing the policies post restart.@trozet @tssurya PTAL.
PD: I early created PR #3448 with the intent to expose the early WIP. Since the review was going to happen once the code was completed, I decided to close the PR to try to avoid spamming notifications for each commit I made in the process.