[SDN-2482] API enhancement to add support for Admin Policy Based External Route CRs #3462

Merged
trozet merged 2 commits into ovn-kubernetes:master from jordigilh:sdn_2482_api_enhancement_external_gateway on May 25, 2023

jordigilh (Contributor) commented Mar 4, 2023

This PR aims to implement the external gateway part of the enhancement defined in openshift/enhancements#1338.

Notes:

  • Integration with the annotation-based logic resulted in these changes:
    • The externalGWCache and exGWCacheMutex objects are shared so that neither logic deletes common IPs by accident. Common IPs are those defined in both a policy CR and an annotation. Sharing the cache avoids duplicating the IPs and also avoids deleting them when either an annotation or a policy removes one from a namespace. These objects reside in the ExternalController, but they are exposed so that the annotation logic can reach them.
    • repair() for external gateways is only performed by the controller. Since both logics share the same cache, there is no need to perform it twice, and the controller is the long-term actor that should perform it.
    • repair() looks up the GW IPs from the policies as well as from the namespace and pod annotations, and performs the cleanup by evaluating each route against both sources.
    • When deleting a gateway IP, the controller and the annotation logic each check whether the other one is also managing the same IP, to avoid deleting it from the shared cache by mistake. Adding the same IP is not an issue, since there was already a check in place to avoid duplicating the gateway IP in a route.
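
To make the shared-cache idea in the notes above concrete, here is a minimal sketch of one way two code paths could share a single gateway cache without deleting each other's IPs. It is not the code in this PR: the sharedGWCache type, the owner constants, and the add/remove helpers are all hypothetical names chosen for illustration, and tracking a set of owners per IP is an assumption about how "check if the other one is also managing the same IP" might be done.

```go
package main

import (
	"fmt"
	"sync"
)

// owner identifies which logic registered a gateway IP (hypothetical names).
type owner string

const (
	ownerAnnotation owner = "annotation"
	ownerPolicy     owner = "policy"
)

// sharedGWCache is a hypothetical stand-in for a cache plus mutex shared by
// both code paths: namespace -> gateway IP -> set of owners.
type sharedGWCache struct {
	mu      sync.Mutex
	entries map[string]map[string]map[owner]struct{}
}

func newSharedGWCache() *sharedGWCache {
	return &sharedGWCache{entries: map[string]map[string]map[owner]struct{}{}}
}

// add records that o manages ip in namespace ns. Adding the same IP twice
// for the same owner is a no-op, so no duplicate entry is created.
func (c *sharedGWCache) add(ns, ip string, o owner) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries[ns] == nil {
		c.entries[ns] = map[string]map[owner]struct{}{}
	}
	if c.entries[ns][ip] == nil {
		c.entries[ns][ip] = map[owner]struct{}{}
	}
	c.entries[ns][ip][o] = struct{}{}
}

// remove drops o's claim on ip and reports whether no owner is left,
// i.e. whether the caller may go on to delete the route itself.
func (c *sharedGWCache) remove(ns, ip string, o owner) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	owners := c.entries[ns][ip]
	delete(owners, o)
	if len(owners) > 0 {
		return false // the other logic still manages this IP
	}
	delete(c.entries[ns], ip)
	return true
}

func main() {
	c := newSharedGWCache()
	c.add("ns1", "10.0.0.1", ownerAnnotation)
	c.add("ns1", "10.0.0.1", ownerPolicy)
	fmt.Println(c.remove("ns1", "10.0.0.1", ownerPolicy))     // false
	fmt.Println(c.remove("ns1", "10.0.0.1", ownerAnnotation)) // true
}
```

In this sketch the route is only torn down when remove returns true, which is the property the notes describe: an IP defined by both an annotation and a policy CR survives until both have released it.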

Issues I found:

  • I noticed a potential bug when parsing the pod annotations in the repair() where it did not split the slice to handle multiple namespaces, while it is being done here for instance.
  • The repair logic now populates the externalGWCache as it processes the information. I'm not sure if that's the correct way to do it, but I felt that if the cache was not populated during repair() it would lead to duplicates in the north bound DB when processing the policies post restart.

@trozet @tssurya PTAL.

PS: I created PR #3448 earlier with the intent of exposing the early WIP. Since the review was going to happen once the code was completed, I decided to close that PR to avoid spamming notifications for each commit I made in the process.

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 2 times, most recently from a5fba20 to aca21a9 Compare March 4, 2023 03:32
coveralls commented Mar 4, 2023

Coverage: 52.331% (-0.6%) from 52.944% when pulling a0921ef on jordigilh:sdn_2482_api_enhancement_external_gateway into a053d08 on ovn-org:master.

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 5 times, most recently from 398f323 to 85a0dd4 Compare March 6, 2023 15:49
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 2 times, most recently from 4f9a2f2 to ae5b479 Compare March 13, 2023 20:52
@jordigilh jordigilh changed the title from "[WIP] [SDN-2482] API enhancement to add support for Admin Based Policy External Route CRs" to "[SDN-2482] API enhancement to add support for Admin Based Policy External Route CRs" Mar 17, 2023
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 3 times, most recently from d6ed36e to 9069b27 Compare March 17, 2023 12:07
@jordigilh (Contributor Author)

/retest-failed

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch from 9069b27 to e9cb2e6 Compare March 17, 2023 15:56
@jordigilh (Contributor Author)

/retest-failed

@jordigilh (Contributor Author)

/restest-failed

@jordigilh (Contributor Author)

/retest

tssurya (Contributor) commented Mar 20, 2023

@jordigilh : Came back to this for reviews this morning; it's still not passing the unit test job, so I'll wait till we have a green CI. Thanks for combining the commits here. Meanwhile I will review #3491; thanks for breaking that into a new PR.

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch from e9cb2e6 to 1800c12 Compare March 20, 2023 22:00
@jordigilh (Contributor Author)

/retest-failed

@ovn-robot (Collaborator)

Oops, something went wrong:

This workflow is already running

@jordigilh (Contributor Author)

/retest-failed

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch from 1800c12 to 6dcada1 Compare March 21, 2023 17:56
@tssurya tssurya self-requested a review March 24, 2023 09:22
tssurya (Contributor) left a comment

Thanks @jordigilh for consolidating the commits together, much appreciated!

Started by reviewing the API changes (so commit1 only; when I am happy with commit1 will move to commit2) -> these take the longest to be implemented properly.
We are also missing the actual yaml generated? See update-codegen file and dist/templates stuff. I'd like the yaml in the next iteration so that I can test it.

Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go

// AdminPolicyBasedExternalRouteSpec defines the desired state of AdminPolicyBasedExternalRoute
type AdminPolicyBasedExternalRouteSpec struct {
Policies []*ExternalPolicy `json:"policies"`
Contributor:

move these to first commit? why are these changes here? makes review hard, I go commit by commit.. (so the 1st commit is no longer relevant when you change things over on top of it)

Contributor Author:

Ack. I will rebase and split the changes that are specific to the crd api and those implementing the controllers

Contributor Author:

Done

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// AdminPolicyBasedExternalRoute is the Schema for the AdminPolicyBasedExternalRoutes API
Contributor:

Please add meaningful field descriptions -> these get rendered into user docs on the yaml when folks do oc describe.

Contributor Author:

ack

type AdminPolicyBasedExternalRoute struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec AdminPolicyBasedExternalRouteSpec `json:"spec,omitempty"`
Contributor:

add description to Spec and Status fields

Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go Outdated
Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go
Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go
Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go Outdated
Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go Outdated
tssurya (Contributor) commented Mar 24, 2023

@jordigilh : also looks like you need to rebase.

@jordigilh jordigilh requested a review from jcaamano as a code owner May 4, 2023 20:28
Comment thread go-controller/pkg/ovn/controller/apbroute/master_controller.go
Comment thread go-controller/pkg/ovn/controller/apbroute/master_controller.go

// delAllHybridRoutePolicies deletes all the 501 hybrid-route-policies that
// force pod egress traffic to be rerouted to a gateway router for local gateway mode.
// Called when migrating to SGW from LGW.
Contributor Author:

I copied this from the legacy code.... 😄 not sure if it was required or not... but happy to remove code 😄

Comment thread go-controller/pkg/ovn/controller/apbroute/network_client.go
Comment thread go-controller/pkg/ovn/controller/apbroute/network_client.go
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_pod.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_pod.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/network_client.go Outdated
Comment thread go-controller/pkg/crd/adminpolicybasedroute/v1/types.go Outdated
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 3 times, most recently from 9636d47 to 8ec1cb3 Compare May 4, 2023 20:44
m.routePolicySyncCache.LoadOrStore(policy.Name, policy)
func (m *externalPolicyManager) storeRoutePolicyInCache(policyInfo *adminpolicybasedrouteapi.AdminPolicyBasedExternalRoute) error {
return m.routePolicySyncCache.DoWithLock(policyInfo.Name, func(policyName string) error {
m.routePolicySyncCache.Store(policyName, policyInfo)
jordigilh (Contributor Author) commented May 4, 2023

@trozet I had to enhance the syncMap and add a Store function so that I could overwrite the stored value. The available LoadOrStore() function only allowed me to write the value once; any subsequent call would return the existing value.
Since the route policy object can change over time, I am forced to update the value in the cache as changes are detected, hence the Store() function.
Let me know what you think.

Contributor:

@npinaeva FYI

How come you need this capability? Looks like it's because on an update event you find the "currentPolicy" in the cache, check the differences, then store the updated one?

Contributor Author:

Correct. Because I update the stored object. If I had a struct that wrapped the object, I would not need this as I would only be updating a field within the already contained struct. But since I'm changing the reference to the object, I have to rewrite what's there.
Note that I don't have this problem with the namespace info cache, as I only update the fields of the struct.

Contributor Author:

well good news: I don't need this anymore now that I'm wrapping the manifest object into a struct that contains the route policy manifest and a bool to flag the route policy as being deleted. I removed the Store() function and reverted to use LoadOrStore() instead.

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 2 times, most recently from 9fcdf10 to 748fbe1 Compare May 5, 2023 00:25
@jordigilh (Contributor Author)

/retest-failed

@jordigilh (Contributor Author)

/retest-failed

trozet (Contributor) left a comment

Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller.go
klog.Warningf("Attempting to delete policy %s from a namespace that does not exist %s", routePolicy.Name, ns.Name)
continue
}
err = m.removePolicyFromNamespace(ns.Name, routePolicy, cacheInfo)
Contributor:

But here you are removing the policy from namespaces without locking the policy first. I would expect this kind of logic:

  1. Lock the policy.
  2. Remove it from every namespace.
  3. Now remove it from the cache.

Without this, you could have another thread, let's say the namespace watcher, interleave with the deletion. I think this could potentially happen:

  1. The policy thread enters this function and starts removing the policy from all of the namespaces.
  2. Simultaneously, a new namespace is added that is targeted by the policy. The namespace thread adds this policy to the namespace.
  3. The policy thread now removes the policy from the cache.

Now you have a deleted policy being applied to a namespace. I think in the past we had a similar problem in the network policy code. IIRC we added a deleted bool to the policy itself, and set that first, under a lock, before we start deleting policies. Then the other threads can check this value to see whether they should add or not.

@npinaeva may have other ideas
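
The ordering suggested in this comment can be sketched as below. This is a simplified model under stated assumptions, not the controller's code: policyEntry, policyCache, tryApply, and deletePolicy are hypothetical names, and a plain mutex per entry stands in for whatever locking the real cache provides. The deleting thread sets the deleted flag under the entry lock before touching any namespace, so a namespace handler that already holds a reference to the entry refuses to apply it afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// policyEntry is a hypothetical cache entry. The deleted flag and the set of
// namespaces the policy is applied to are both guarded by mu.
type policyEntry struct {
	mu         sync.Mutex
	deleted    bool
	namespaces map[string]struct{}
}

// tryApply is what a namespace handler would call. It refuses to apply a
// policy that has already been marked as deleted.
func (e *policyEntry) tryApply(ns string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.deleted {
		return false
	}
	e.namespaces[ns] = struct{}{}
	return true
}

// policyCache maps policy names to entries (hypothetical).
type policyCache struct {
	mu       sync.Mutex
	policies map[string]*policyEntry
}

func newPolicyCache() *policyCache {
	return &policyCache{policies: map[string]*policyEntry{}}
}

func (c *policyCache) add(name string) *policyEntry {
	c.mu.Lock()
	defer c.mu.Unlock()
	e := &policyEntry{namespaces: map[string]struct{}{}}
	c.policies[name] = e
	return e
}

// deletePolicy follows the suggested order: lock the policy and mark it
// deleted, remove it from every namespace, then drop it from the cache.
func (c *policyCache) deletePolicy(name string) {
	c.mu.Lock()
	e, ok := c.policies[name]
	c.mu.Unlock()
	if !ok {
		return
	}

	e.mu.Lock()
	e.deleted = true
	for ns := range e.namespaces {
		delete(e.namespaces, ns)
	}
	e.mu.Unlock()

	c.mu.Lock()
	delete(c.policies, name)
	c.mu.Unlock()
}

func main() {
	c := newPolicyCache()
	e := c.add("p1")
	fmt.Println(e.tryApply("ns1")) // true: policy is live
	c.deletePolicy("p1")
	// A late namespace event that still holds the old entry is rejected.
	fmt.Println(e.tryApply("ns2")) // false
}
```

The flag matters because removing the entry from the cache alone does not help a thread that looked the entry up just before the deletion started; the check inside tryApply closes that window.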

Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go Outdated

// delAllHybridRoutePolicies deletes all the 501 hybrid-route-policies that
// force pod egress traffic to be rerouted to a gateway router for local gateway mode.
// Called when migrating to SGW from LGW.
Contributor:

@tssurya fyi

Comment thread go-controller/pkg/ovn/egressgw.go Outdated
Comment thread test/e2e/external_gateways.go
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 2 times, most recently from f3708bc to 17064c5 Compare May 12, 2023 11:15
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_pod.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_pod.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go Outdated
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 3 times, most recently from 92d5146 to ee7d693 Compare May 17, 2023 13:43
@jordigilh (Contributor Author)

/retest-failed

@jordigilh (Contributor Author)

/retest-failed

trozet (Contributor) left a comment

few comments, otherwise i think it is good to go

Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go Outdated
Comment thread go-controller/pkg/ovn/controller/apbroute/external_controller.go Outdated

@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch 2 times, most recently from a4bf549 to 1cddf9c Compare May 19, 2023 22:20
…(informer,lister,api)

Signed-off-by: jordigilh <jgil@redhat.com>
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch from 1cddf9c to 9190200 Compare May 24, 2023 20:03
* Implements controllers for Admin Policy Based External Route to handle changes to namespaces, pods and admin policy based external route CRs.
* Initialize in master node to handle interactions with the north bound DB. Initialize in worker nodes to handle changes to the conntrack (delete ECMP entries when a gateway IP is no longer a valid external gateway IP)
* Implements repair() function for the master node.
* Integrates with the annotation logic to avoid duplications in cache by sharing the externalGWCache and EXGWCacheMutex objects between the annotation and controller logic.
* Updates the annotation logic to ensure the namespace annotation k8s.ovn.org/external-gw-pod-ips is updated when changes occur in a CR instance that coexists in the same namespace and that can impact the list of dynamic gateway IPs.
* The implementation no longer relies on namespace annotations, including "k8s.ovn.org/external-gw-pod-ips"; instead, it uses its own cache structure to identify the valid pod IPs for a given namespace.
* Implements E2E tests for admin policy based external route. The tests duplicate the existing annotation-based tests for external gateways, using the CR instead.

Signed-off-by: jordigilh <jgil@redhat.com>
@jordigilh jordigilh force-pushed the sdn_2482_api_enhancement_external_gateway branch from 9190200 to 91046e8 Compare May 24, 2023 20:27
@trozet trozet merged commit 1ac592f into ovn-kubernetes:master May 25, 2023
7 participants