-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor endpoint management #14745
Refactor endpoint management #14745
Conversation
@@ -763,7 +763,7 @@ func (d *Daemon) EndpointUpdate(id string, cfg *models.EndpointConfigurationSpec | |||
return api.Error(PatchEndpointIDConfigFailedCode, err) | |||
} | |||
} | |||
if err := ep.UpdateReferences(d.endpointManager); err != nil { | |||
if err := d.endpointManager.UpdateReferences(ep); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incidentally, I noticed that this path is actually broken. If we intend to change any of the Identifiers from this path, then this code is not actually removing the previous identifiers, it's only creating new references in the endpointmanager for the endpoint. This could lead to stale references using old identifiers.
I don't think this is actually a viable use of the config API today so I didn't bother to fix it, and also I was trying to keep the scope of this PR relatively small so I can move on to the PR that I want to build on top of these changes.
test-me-please |
4baa97b
to
4356668
Compare
test-gke |
1 similar comment
test-gke |
4356668
to
865cfb3
Compare
test-me-please |
865cfb3
to
2669e26
Compare
2669e26
to
0c63de5
Compare
test-me-please |
0c63de5
to
84d58e0
Compare
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 This is great! One of the more satisfying PRs to see. LGTM overall, very clean commits and well-documented history. A few minor questions.
To assist the sharing of code and appropriate locking access, move this code around such that there is a dedicated endpointid.Identifiers type to collect Identifiers for endpoint, such as the Cilium Endpoint ID, container ID, k8s namespace/name, etc. Furthermore, move the access and handling of these types all to a single file so they're in one place rather than spread across multiple files. Signed-off-by: Joe Stringer <[email protected]>
The Unexpose() function that this is being factored out of is only called from two places: Endpoint.Delete() which already calls Endpoint.Stop(), and EndpointManager.WaitEndpointRemoved() which waits for this goroutine to complete. Endpoint.Stop() itself performs the two steps being refactored here, plus one extra step: closeBPFProgramChannel() which idempotently preempts ongoing builds. Due to the idempotent nature of Stop() and the fact that EndpointManager.WaitEndpointRemoved() is waiting on the results of the Endpoint.Stop() anyway, we are safe to rearrange the ordering of these steps here. Furthermore, EndpointManager.WaitEndpointRemoved() is only used from unit tests so the runtime impact of this should be zero. Signed-off-by: Joe Stringer <[email protected]>
The EndpointManager, as its name suggests, is intended to handle the management of endpoints: Addition, deletion, etc. Due to ~historic reasons~, up until now the code to handle exposing the endpoint to other subsystems in Cilium was encoded inside the endpoint package, and the endpoint manager was passed down to the endpoint to facilitate this. This commit flips this around, such that the EndpointManager is now in charge, and it calls enough logic in the Endpoint to prepare it to handle events from the rest of the system, then subsequently exposes the endpoint in the manager to allow other subsystems to find it. Signed-off-by: Joe Stringer <[email protected]>
This commit moves the logic for unexposing an endpoint from the manager into the endpointmanager package without changing any of the locking logic, so that the next commit will have fewer changes to better emphasize the changes in locking behaviour. Signed-off-by: Joe Stringer <[email protected]>
Shift the core entrypoint into the package code for endpoint deletion away from the Endpoint package into the EndpointManager. No functional changes. Signed-off-by: Joe Stringer <[email protected]>
84d58e0
to
e639065
Compare
test-me-please |
e639065
to
50f4184
Compare
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is nicely structured and with the details in the commit messages made it really convenient to review also for someone not (yet) deeply familiar with the code 🚀
A few questions and nit-pick suggestions for follow-up PRs inline, nothing blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last question for me is do we need the Introduce Disconnect()
commit any more? For review, it served a nice purpose to ease inspection, but now endpoint: Rearrange endpoint Unexpose vs Stop
essentially removes it. It is a helpful for history purposes, so maybe a middle-ground would be to include the context in the latter commit also.
Yeah I debated whether to fold that commit back into the earlier commits to remove the churn in the middle. Now that I had a fully green CI run I feel a bit more confident that I won't need to throw away the extra patch on top, so maybe I should touch that up. I can roll in a couple of minor bits of feedback from @tklauser at the same time. |
Prior to this commit, endpoint deletion was organized like this: 1. Grab the Endpoint lock, 2. Stop all event-processing goroutines, 3. Deregister the endpoint from the EndpointManager, thereby hiding the endpoint from other subsystems, 4. Clean up the rest of the endpoint state on the filesystem etc., 5. Then notifies the monitor of the endpoint deletion. 6. Release lock This commit aims to flip this the other way so that first, endpoint deletion is triggered through EndpointManager.RemoveEndpoint() which: 1. Grabs the EndpointManager lock, 2. Deregisters the endpoint from the EndpointManager, 3. Stops all event-processing goroutines, 4. Clean up the rest of the endpoint state on the filesystem etc., 5. Then notifies the monitor of the endpoint deletion. 6. Release lock e.aliveCancel() is a CancelFunc and hence is idempotent, so safe to call outside of holding the Endpoint lock (multiple times doesn't matter). Signed-off-by: Joe Stringer <[email protected]>
Previously, this code did two things in a separate goroutine: * Release the endpoint ID, a numeric identifier for the endpoint which is allocated in-memory by the Cilium agent * Get the state of the endpoint to decide whether failures to deallocate are worthy of a log message. Given that the endpoint ID allocation is handled in-memory, modulo any potential locking issues, it should be possible to handle deallocation of the endpoint ID within a reasonable period inside the daemon. Until recently, this code would be called while holding the endpoint lock; this would prevent the GetState() call here from grabbing the state to determine whether to log; if this was run in the same goroutine as the outer logic, it would have caused a deadlock. However, now that the locking is changed (and also there is now a SetState() call earlier in this same function which invalidates the check), we can simplify the logging here. Signed-off-by: Joe Stringer <[email protected]>
These dummy structures are no longer used after recent refactoring. Signed-off-by: Joe Stringer <[email protected]>
This is a pure code move, no functional changes. Signed-off-by: Joe Stringer <[email protected]>
50f4184
to
d64fe6a
Compare
I did some cleanup / shuffling around based on the above feedback. Here's everything except for the revert of the
EDIT: Ah, I was a little overzealous: this bit doesn't make sense. I can fix it up before merge or send a simple follow-up to remove these lines:
|
test-me-please |
This method is no longer used outside of the package since the refactoring of endpoint management (namely the removal of the of the endpoint.endpointManager interface) in PR #14745. Signed-off-by: Tobias Klauser <[email protected]>
This method is no longer used outside of the package since the refactoring of endpoint management (namely the removal of the of the endpoint.endpointManager interface) in PR #14745. Signed-off-by: Tobias Klauser <[email protected]>
Review commit-by-commit.
This PR aims to address some technical debt relating to the
EndpointManager
. Despite the naming, up until this PR the management of endpoints has been handled frompkg/endpoint
, with the call flow beingDaemon
->Endpoint.Expose()
/Endpoint.Delete()
->EndpointManager.UpdateReferences()
/EndpointManager.RemoveReferences()
and friends. This required defining anEndpointManager
type inside theEndpoint
, and passing it through as a parameter just so thatEndpoint
could callEndpointManager
at the right time. This relationship is the opposite of what one might expect, and leads to subsequent difficulties when trying to allow other subsystems to interact with the set of available endpoints (because they cannot rely on theEndpointManager
to manage endpoints, they must pull in complexity from granular dependencies on theEndpoint
package.) Many of the details of these interfaces were previously exposed to the callers, increasing complexity across the codebase. This PR aims to reduce such abstraction leakage to make the code simpler and more reliable.This PR moves the endpoint lifecycle more towards this:
Endpoint
constructorEndpointManager.AddEndpoint()
to make the endpoint findable from other subsystems.Endpoint.Start()
to handle events from other subsystems.Endpoint.Stop()
gracefully stops handling events without initiating the delete process.EndpointManager.RemoveEndpoint()
to prevent other subsystems from finding the endpoint.EndpointManager
, then callendpoint.Delete()
which doesEndpoint.Stop()
to stop handling ongoing events.There are no functional changes intended by this PR, but the ordering of some operations has been shuffled around slightly, in particular the locking management between
Endpoint
vs.EndpointManager
during endpoint deletion and the order of removing the endpoint from theEndpointManager
vs. stopping the goroutines that handle events from other subsystems.This was initiated while trying to figure out a clean way to port #14541 to the
master
branch.