Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor endpoint management #14745

Merged
merged 9 commits into from
Feb 3, 2021

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Jan 27, 2021

Review commit-by-commit.

This PR aims to address some technical debt relating to the EndpointManager. Despite the naming, up until this PR the management of endpoints has been handled from pkg/endpoint, with the call flow being Daemon -> Endpoint.Expose() / Endpoint.Delete() -> EndpointManager.UpdateReferences() / EndpointManager.RemoveReferences() and friends. This required defining an EndpointManager type inside the Endpoint, and passing it through as a parameter just so that Endpoint could call EndpointManager at the right time. This relationship is the opposite of what one might expect, and leads to subsequent difficulties when trying to allow other subsystems to interact with the set of available endpoints (because they cannot rely on the EndpointManager to manage endpoints, they must pull in complexity from granular dependencies on the Endpoint package.) Many of the details of these interfaces were previously exposed to the callers, increasing complexity across the codebase. This PR aims to reduce such abstraction leakage to make the code simpler and more reliable.

This PR moves the endpoint lifecycle more towards this:

  • Endpoint constructor
  • EndpointManager.AddEndpoint() to make the endpoint findable from other subsystems.
    • Indirectly calls Endpoint.Start() to handle events from other subsystems.
  • If shutting down the agent, Endpoint.Stop() gracefully stops handling events without initiating the delete process.
  • EndpointManager.RemoveEndpoint() to prevent other subsystems from finding the endpoint.
    • As part of this, first hide the endpoint from other subsystems by removing it from the EndpointManager, then call endpoint.Delete() which does Endpoint.Stop() to stop handling ongoing events.

There are no functional changes intended by this PR, but the ordering of some operations has been shuffled around slightly, in particular the locking management between Endpoint vs. EndpointManager during endpoint deletion and the order of removing the endpoint from the EndpointManager vs. stopping the goroutines that handle events from other subsystems.

This was initiated while trying to figure out a clean way to port #14541 to the master branch.

@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Jan 27, 2021
@@ -763,7 +763,7 @@ func (d *Daemon) EndpointUpdate(id string, cfg *models.EndpointConfigurationSpec
return api.Error(PatchEndpointIDConfigFailedCode, err)
}
}
if err := ep.UpdateReferences(d.endpointManager); err != nil {
if err := d.endpointManager.UpdateReferences(ep); err != nil {
Copy link
Member Author

@joestringer joestringer Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally, I noticed that this path is actually broken. If we intend to change any of the Identifiers from this path, then this code is not actually removing the previous identifiers, it's only creating new references in the endpointmanager for the endpoint. This could lead to stale references using old identifiers.

I don't think this is actually a viable use of the config API today so I didn't bother to fix it, and also I was trying to keep the scope of this PR relatively small so I can move on to the PR that I want to build on top of these changes.

@joestringer
Copy link
Member Author

test-me-please

@joestringer joestringer force-pushed the submit/ep-manager-refactor branch 2 times, most recently from 4baa97b to 4356668 Compare January 29, 2021 05:00
@joestringer
Copy link
Member Author

test-gke

1 similar comment
@joestringer
Copy link
Member Author

test-gke

@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 4356668 to 865cfb3 Compare January 29, 2021 05:31
@joestringer
Copy link
Member Author

test-me-please

@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 865cfb3 to 2669e26 Compare January 29, 2021 22:40
@joestringer joestringer added the kind/cleanup This includes no functional changes. label Jan 29, 2021
@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 2669e26 to 0c63de5 Compare January 29, 2021 22:43
@joestringer
Copy link
Member Author

test-me-please

@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 0c63de5 to 84d58e0 Compare January 30, 2021 01:24
@joestringer
Copy link
Member Author

test-me-please

@joestringer joestringer marked this pull request as ready for review January 30, 2021 04:35
@joestringer joestringer requested a review from a team January 30, 2021 04:35
@joestringer joestringer requested a review from a team as a code owner January 30, 2021 04:35
@joestringer
Copy link
Member Author

Hit flakes #12511 (net-next), #13011 (kernel-4.9). Otherwise green on CI.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 This is great! One of the more satisfying PRs to see. LGTM overall, very clean commits and well-documented history. A few minor questions.

pkg/endpoint/manager.go Outdated Show resolved Hide resolved
pkg/endpoint/manager.go Outdated Show resolved Hide resolved
pkg/endpoint/endpoint.go Outdated Show resolved Hide resolved
To assist the sharing of code and appropriate locking access, move this
code around such that there is a dedicated endpointid.Identifiers type
to collect Identifiers for endpoint, such as the Cilium Endpoint ID,
container ID, k8s namespace/name, etc. Furthermore, move the access and
handling of these types all to a single file so they're in one place
rather than spread across multiple files.

Signed-off-by: Joe Stringer <[email protected]>
The Unexpose() function that this is being factored out of is only
called from two places: Endpoint.Delete() which already calls
Endpoint.Stop(), and EndpointManager.WaitEndpointRemoved() which waits
for this goroutine to complete.

Endpoint.Stop() itself performs the two steps being refactored here,
plus one extra step: closeBPFProgramChannel() which idempotently
preempts ongoing builds.

Due to the idempotent nature of Stop() and the fact that
EndpointManager.WaitEndpointRemoved() is waiting on the results of the
Endpoint.Stop() anyway, we are safe to rearrange the ordering of these
steps here. Furthermore, EndpointManager.WaitEndpointRemoved() is only
used from unit tests so the runtime impact of this should be zero.

Signed-off-by: Joe Stringer <[email protected]>
The EndpointManager, as its name suggests, is intended to handle the
management of endpoints: Addition, deletion, etc. Due to ~historic
reasons~, up until now the code to handle exposing the endpoint to other
subsystems in Cilium was encoded inside the endpoint package, and the
endpoint manager was passed down to the endpoint to facilitate this.

This commit flips this around, such that the EndpointManager is now in
charge, and it calls enough logic in the Endpoint to prepare it to
handle events from the rest of the system, then subsequently exposes
the endpoint in the manager to allow other subsystems to find it.

Signed-off-by: Joe Stringer <[email protected]>
This commit moves the logic for unexposing an endpoint from the manager
into the endpointmanager package without changing any of the locking
logic, so that the next commit will have fewer changes to better
emphasize the changes in locking behaviour.

Signed-off-by: Joe Stringer <[email protected]>
Shift the core entrypoint into the package code for endpoint deletion
away from the Endpoint package into the EndpointManager. No functional
changes.

Signed-off-by: Joe Stringer <[email protected]>
@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 84d58e0 to e639065 Compare February 1, 2021 20:41
@joestringer
Copy link
Member Author

test-me-please

@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from e639065 to 50f4184 Compare February 1, 2021 22:47
@joestringer
Copy link
Member Author

test-me-please

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is nicely structured and with the details in the commit messages made it really convenient to review also for someone not (yet) deeply familiar with the code 🚀

A few questions and nit-pick suggestions for follow-up PRs inline, nothing blocking.

pkg/endpointmanager/manager.go Outdated Show resolved Hide resolved
pkg/endpoint/endpoint.go Outdated Show resolved Hide resolved
pkg/endpoint/endpoint_test.go Outdated Show resolved Hide resolved
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last question for me is do we need the Introduce Disconnect() commit any more? For review, it served a nice purpose to ease inspection, but now endpoint: Rearrange endpoint Unexpose vs Stop essentially removes it. It is a helpful for history purposes, so maybe a middle-ground would be to include the context in the latter commit also.

@joestringer
Copy link
Member Author

Yeah I debated whether to fold that commit back into the earlier commits to remove the churn in the middle. Now that I had a fully green CI run I feel a bit more confident that I won't need to throw away the extra patch on top, so maybe I should touch that up. I can roll in a couple of minor bits of feedback from @tklauser at the same time.

Prior to this commit, endpoint deletion was organized like this:
1. Grab the Endpoint lock,
2. Stop all event-processing goroutines,
3. Deregister the endpoint from the EndpointManager, thereby hiding the
   endpoint from other subsystems,
4. Clean up the rest of the endpoint state on the filesystem etc.,
5. Then notifies the monitor of the endpoint deletion.
6. Release lock

This commit aims to flip this the other way so that first, endpoint
deletion is triggered through EndpointManager.RemoveEndpoint() which:
1. Grabs the EndpointManager lock,
2. Deregisters the endpoint from the EndpointManager,
3. Stops all event-processing goroutines,
4. Clean up the rest of the endpoint state on the filesystem etc.,
5. Then notifies the monitor of the endpoint deletion.
6. Release lock

e.aliveCancel() is a CancelFunc and hence is idempotent, so safe to call
outside of holding the Endpoint lock (multiple times doesn't matter).

Signed-off-by: Joe Stringer <[email protected]>
Previously, this code did two things in a separate goroutine:
* Release the endpoint ID, a numeric identifier for the endpoint which
  is allocated in-memory by the Cilium agent
* Get the state of the endpoint to decide whether failures to deallocate
  are worthy of a log message.

Given that the endpoint ID allocation is handled in-memory, modulo any
potential locking issues, it should be possible to handle deallocation
of the endpoint ID within a reasonable period inside the daemon.

Until recently, this code would be called while holding the endpoint
lock; this would prevent the GetState() call here from grabbing the
state to determine whether to log; if this was run in the same goroutine
as the outer logic, it would have caused a deadlock.

However, now that the locking is changed (and also there is now a
SetState() call earlier in this same function which invalidates the
check), we can simplify the logging here.

Signed-off-by: Joe Stringer <[email protected]>
These dummy structures are no longer used after recent refactoring.

Signed-off-by: Joe Stringer <[email protected]>
This is a pure code move, no functional changes.

Signed-off-by: Joe Stringer <[email protected]>
@joestringer joestringer force-pushed the submit/ep-manager-refactor branch from 50f4184 to d64fe6a Compare February 2, 2021 23:04
@joestringer
Copy link
Member Author

joestringer commented Feb 2, 2021

I did some cleanup / shuffling around based on the above feedback. Here's everything except for the revert of the WaitEndpointRemoved() function rename patch which I changed my mind about, WaitEndpointRemoved is back for use in unit tests:

diff --git a/pkg/endpoint/endpoint.go b/pkg/endpoint/endpoint.go
index f4986e21cfab..2c544a6c618a 100644
--- a/pkg/endpoint/endpoint.go
+++ b/pkg/endpoint/endpoint.go
@@ -2076,6 +2076,9 @@ type IPReleaser interface {

 // Delete cleans up all resources associated with this endpoint, including the
 // following:
+// * all goroutines managed by this Endpoint (EventQueue, Controllers)
+// * removal from the endpointmanager, resulting in new events not taking effect
+// on this endpoint
 // * cleanup of datapath state (BPF maps, proxy configuration, directories)
 // * releasing IP addresses allocated for the endpoint
 // * releasing of the reference to its allocated security identity
@@ -2089,9 +2092,8 @@ func (e *Endpoint) Delete(ipam IPReleaser, conf DeleteConfig) []error {
        // return here. Ignore the request if the endpoint is already
        // disconnected.
        if err := e.lockAlive(); err != nil {
-               return errs
+               return []error{}
        }
-
        e.setState(StateDisconnecting, "Deleting endpoint")

        // If dry mode is enabled, no changes to BPF maps are performed
diff --git a/pkg/endpoint/endpoint_test.go b/pkg/endpoint/endpoint_test.go
index bec949f149f4..45a9661b84f9 100644
--- a/pkg/endpoint/endpoint_test.go
+++ b/pkg/endpoint/endpoint_test.go
@@ -18,7 +18,6 @@ package endpoint

 import (
        "context"
-       "net"
        "reflect"
        "testing"
        "time"
@@ -720,9 +719,3 @@ func BenchmarkEndpointGetModel(b *testing.B) {
                e.GetModel()
        }
 }
-
-type ipReleaserDummy struct{}
-
-func (i *ipReleaserDummy) ReleaseIP(ip net.IP) error {
-       return nil
-}
diff --git a/pkg/endpointmanager/manager.go b/pkg/endpointmanager/manager.go
index 811e64e30083..adcddfee76ea 100644
--- a/pkg/endpointmanager/manager.go
+++ b/pkg/endpointmanager/manager.go
@@ -305,9 +305,6 @@ func (mgr *EndpointManager) ReleaseID(ep *endpoint.Endpoint) error {
 // unexpose removes the endpoint from the endpointmanager, so subsequent
 // lookups will no longer find the endpoint.
 func (mgr *EndpointManager) unexpose(ep *endpoint.Endpoint) {
-       mgr.mutex.Lock()
-       defer mgr.mutex.Unlock()
-
        // Fetch the identifiers; this will only fail if the endpoint is
        // already disconnected, in which case we don't need to proceed with
        // the rest of cleaning up the endpoint.
@@ -318,6 +315,9 @@ func (mgr *EndpointManager) unexpose(ep *endpoint.Endpoint) {
        }
        previousState := ep.GetState()

+       mgr.mutex.Lock()
+       defer mgr.mutex.Unlock()
+
        // This must be done before the ID is released for the endpoint!
        mgr.removeIDLocked(ep.ID)
        mgr.RemoveIPv6Address(ep.IPv6)
@@ -346,6 +346,14 @@ func (mgr *EndpointManager) RemoveEndpoint(monitor regeneration.Owner, ipam endp
        return ep.Delete(ipam, conf)
 }

+// WaitEndpointRemoved waits until all operations associated with Remove of
+// the endpoint have been completed.
+// Note: only used for unit tests, to avoid ep.Delete()
+func (mgr *EndpointManager) WaitEndpointRemoved(ep *endpoint.Endpoint) {
+       mgr.unexpose(ep)
+       ep.Stop()
+}
+
 // RemoveAll removes all endpoints from the global maps.
 func (mgr *EndpointManager) RemoveAll() {
        mgr.mutex.Lock()

EDIT: Ah, I was a little overzealous: this bit doesn't make sense. I can fix it up before merge or send a simple follow-up to remove these lines:

+// * removal from the endpointmanager, resulting in new events not taking effect
+// on this endpoint

@joestringer
Copy link
Member Author

test-me-please

@tklauser tklauser merged commit 26b33f3 into cilium:master Feb 3, 2021
@tklauser tklauser mentioned this pull request Feb 3, 2021
@joestringer joestringer deleted the submit/ep-manager-refactor branch February 3, 2021 16:40
tklauser added a commit that referenced this pull request Aug 10, 2023
This method is no longer used outside of the package since the
refactoring of endpoint management (namely the removal of the
of the endpoint.endpointManager interface) in PR #14745.

Signed-off-by: Tobias Klauser <[email protected]>
aditighag pushed a commit that referenced this pull request Aug 10, 2023
This method is no longer used outside of the package since the
refactoring of endpoint management (namely the removal of the
of the endpoint.endpointManager interface) in PR #14745.

Signed-off-by: Tobias Klauser <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup This includes no functional changes. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants