[ENHANCEMENT] add dry-run mode for ANP/BANP #230
This is definitely an important scenario for user tooling. In March at KubeCon (recording, slides), we proposed that upstream tooling would solve this scenario, and there was good excitement and feedback, which was the catalyst for putting #221 (same scenario) on the roadmap for Policy Assistant. I think we should solve this scenario in upstream tooling (Policy Assistant) as opposed to a dry-run mode downstream. A few reasons:
There were two concerns I recall from yesterday’s discussion around Policy Assistant’s solution to this scenario:
Taking a Step Back
I think there is a larger problem we are trying to solve. A user wants to add a new policy to a cluster. There are a few considerations:
The dry-run mode is meant to address only the last question. Policy Assistant can provide a complete, easy-to-parse answer to that question, and it can also answer the other questions. If dry-run mode aims to ease concerns that an implementation might have unexpected behavior, then I’d say that implementation is not conformant? We should strongly discourage users from creating policies which have “undefined” behavior delegated to implementations (discourage via Policy Assistant)?
Great summary, @huntergregory!
Aren't theoretical connections (i.e. a summary of all possible connections) more informative? For instance, what if a denied connection will happen in the future but the downstream tooling does not observe it?
I think these two tools just give different kinds of information. For example, I am convinced that I allowed everything my pod needs, but forgot to allow ingress from monitoring. In the static analysis, everything will look fine, because it will just tell me the same thing as my ANP, but if I run it on a real cluster I will find that denied connection from the monitoring namespace and figure out I forgot it. Another example is using FQDNs: statically you don't know which IP an FQDN will resolve to, so you can't actually say that a higher-priority deny rule will drop traffic that should be allowed by a lower-priority FQDN rule. In a real cluster with real traffic I will be able to see that.
I'd like to understand the argument for downstream tooling a bit more in Tuesday's meeting. I believe Policy Assistant's dry-run is by nature more informative than a CNI's dry-run, and it'd be great to get your opinion and others' opinions on how to best display that information. The first example is not quite accurate:
Policy Assistant could account for / resolve FQDNs?
Discussed extensively today (6/4/24) in the SIG meeting.
I see now what Nadia means: while Policy Assistant explains the connections which are now allowed/denied by a policy, the CNI dry-run mode would clearly highlight real connections being dropped. So CNI dry-run mode would hold the admin's hand a bit more.
We'll want to prevent folks from shooting themselves in the foot with a CNI dry-run mode. A few pitfalls discussed today:
I agree with much of @huntergregory's assessment, and even more implementation hurdles could be discussed (such as distributed logging in large clusters not being scalable, etc.), which I am happy to do if this comment is unsuccessful. However, I want to focus on one crucial point: this feature goes beyond proposing policy-enforcement syntax that requires CNIs to enforce network connection rules within a proverbial "black box". It requires CNIs to audit policies for users in a standardized way. This feature considerably expands the scope of what this SIG defines for CNIs to implement from a security perspective. While a standardized logging mechanism from the CNI is only alluded to in this proposal, something like it is necessary for this feature. If that is the case, then this feature proposal violates SIG Network's charter against specifying the CNI (which is outside of the Kubernetes project). This proposal is beyond this SIG's traditional ambit and its authority.
These are valid points that we should definitely outline if we ever implement this feature, but I think there is a little misunderstanding of how the dry-run mode should be used.
Getting back to the mentioned pitfalls:
This feature is definitely most useful when logging is possible, but one of the reasons I think it can be considered for the API is that this field (or a well-known label?) would change the behaviour of the network policy to "log" (or do nothing if the plugin doesn't support logging). This functionality can't be implemented purely downstream on top of the upstream API, because it would break the semantics of a network policy (the spec says allow, but nothing actually happens). So the main effect of this feature in the API is that a policy in dry-run mode won't affect traffic. That sounds useless by itself, but it allows implementations to enable logging on top of it and provide a useful tool without the need to create downstream CRDs.
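To make that layering concrete, here is a minimal sketch in Go with entirely made-up names (this is not any real plugin's API): the only upstream-required behaviour assumed is that a dry-run policy never affects traffic; the logging is an optional downstream side effect.

```go
// Hypothetical sketch of how an implementation might layer logging on top of a
// dry-run mode; all function and type names here are assumptions.
package plugin

import "log"

// Verdict is what a policy's rules would do to a given connection.
type Verdict string

const (
	VerdictAllow Verdict = "Allow"
	VerdictDeny  Verdict = "Deny"
)

// applyVerdict reports whether the computed verdict should actually be applied
// to traffic. Assumed upstream semantics: a dry-run policy must never affect
// traffic. Downstream choice: the implementation may additionally log what
// would have happened, or do nothing at all if it has no logging support.
func applyVerdict(policyName string, dryRun bool, v Verdict, src, dst string) bool {
	if !dryRun {
		return true // normal enforcement
	}
	// Side effect only; the connection is handled as if this policy did not exist.
	log.Printf("dry-run policy %q would have returned %s for %s -> %s", policyName, v, src, dst)
	return false
}
```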
You do not get to decide how users see a feature. Many of them will see it as authoritative (which is not a problem for any other feature in SIG-Network-API) and shoot themselves in the foot with this thing.
I see two problems with this thinking:
This feature is most useful in larger clusters with numerous disparate policies (where admins might be confused about the effect of adding a policy). That means that every node affected by the policy (and maybe all of them, implementation-dependent) will have to send logs to some central place (the kube-apiserver?). I think there are legitimate scalability questions for such a solution, but we cannot really review them until we have a full proposal. @npinaeva I think our path forward is this:
We talked about this in the sync today. I think, as a next step, it could only help to share the context with some CNI folks to get their thoughts and considerations.
That is the only requirement for this field, and the only reason to make it an API field: it will be required to be considered conformant. I don't think ignoring a netpol is a difficult-to-implement feature for any plugin. There is no intention for this feature upstream to specify anything beyond the fact that the netpol rules should not be applied in dry-run mode. If we don't think this feature is useful without specifying logging/observability side effects, then we can just reject it.
That is incorrect. The charter says (1) SIG Network does not own the CNI specification, and (2) SIG Network does not own any specific network plugin. (Furthermore, these are just clarifications, not "rules".) SIG Network absolutely owns the definition of what functionality is required in a conforming Kubernetes pod network (and that already includes functionality which is not part of CNI; CNI does not have anything to say about node-to-node traffic, for example). (Not saying that this feature is necessarily a good idea, or that any of your other objections to it are wrong, just that this particular objection is incorrect.)
It seems really, really weird to define this without already having (standardized) support for logging. Also, from an API perspective, objects describe state, not actions, so "dry-run mode" isn't quite the right model. The linked Calico feature with "staged" policies is more semantically correct; it's not that you're instructing the plugin to audit a particular policy, it's that you're creating a policy whose semantics are inherently "audit these rules" rather than "enforce these rules".
Alternatively, the feature could be implemented by updating the policy's status field.
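For illustration only, a very rough sketch of what such status reporting might look like, with entirely hypothetical type and field names (nothing like this exists in the network-policy-api today):

```go
// Hypothetical status-based sketch; these types and fields do not exist in the API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DryRunConnection records one observed connection the dry-run policy would have affected.
type DryRunConnection struct {
	// Source and Destination as informal "namespace/pod" or IP strings.
	Source      string `json:"source"`
	Destination string `json:"destination"`
	// Verdict is what the policy would have done, e.g. "Allow" or "Deny".
	Verdict  string      `json:"verdict"`
	LastSeen metav1.Time `json:"lastSeen"`
}

// DryRunStatus is a hypothetical addition to the policy status; only a bounded
// sample is kept, since reporting every flow from every node would raise the
// scalability concerns discussed below.
type DryRunStatus struct {
	// +optional
	DryRunSample []DryRunConnection `json:"dryRunSample,omitempty"`
}
```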
It's not difficult but it is a breaking change in the API.
This is what I'm talking about. Specifying action rather than state, especially a specific type of output (logging, in this case) that is unrelated to the API this group is meant to work on, seems out of scope and off the charter. CNI logs are within the purview of the CNI specification and are even mentioned in the project spec.
I find this dubious. In a large cluster, with potentially thousands of endpoints affected by the new dry-run policy, how is the CNI meant to determine which of the logged packets/flows to apply to the status field of one policy? Especially with ANPs, the affected policy might be downstream of the dry-run policy.
It is related to the API this group is working on though. It's an ANP feature.
This feature is not talking about CNI at all. Yes, people keep saying "CNI", but they don't mean it; people just misuse the term "CNI" to refer to the sum of all of the pod-network-related components in a cluster, including, in particular in this case, the NP/ANP implementation. The ANP spec currently requires an ANP implementation to have certain powers, like "being able to drop packets traveling between pods" and "being able to drop packets going from a pod to an external IP". There are other powers that ANP implementations are not currently required to have, but which we have discussed requiring, like "being able to drop packets traveling from one node to another node" or "being able to distinguish packets from different hostNetwork pods on the same node". This NPEP suggests adding the requirement that ANP implementations have the power of "being able to log messages in a certain way in response to certain packets traveling between pods". That falls entirely within the ANP spec, and is within the scope of the NP API WG.

(It may happen that some implementations of Kubernetes pod networking are written in such a way that adding this feature would require interactions between some component which might be called "a NetworkPolicy implementation" and some component which might be called "a CNI plugin", but the fact that a particular implementation splits networking features up in a particular way does not mean that the upstream APIs must reflect that same split.)

(And again, I am not saying we should add this feature; I'm just saying that we have to actually decide on the merits of the feature, because it is in scope as a possible ANP feature.)

(And "the merits of the feature" absolutely includes the extent to which people are able/willing to implement it, since we can't get to GA without multiple compatible implementations of the feature.)
Right, I had not updated the description for a while; it's fixed now. I think the easiest solution for upstream is to say that the dry-run mode action is "ignore" instead of "enforce", which allows converting it to "audit" downstream, since audit doesn't change netpol behaviour but adds a side effect on top.
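A small sketch of that split, again with assumed names rather than the actual API: upstream would only define enforce vs. ignore, and an implementation may choose to translate ignore into an audit action.

```go
// Hypothetical sketch; mode and action names are assumptions, not the actual API.
package plugin

// PolicyMode is what the upstream API would define.
type PolicyMode string

const (
	PolicyModeEnforce PolicyMode = "Enforce" // rules applied to traffic (default)
	PolicyModeIgnore  PolicyMode = "Ignore"  // rules never applied to traffic
)

// Action is a downstream, implementation-specific notion.
type Action string

const (
	ActionEnforce Action = "Enforce" // apply the rules
	ActionAudit   Action = "Audit"   // don't apply the rules, but log matches
	ActionNone    Action = "None"    // don't apply the rules, no side effects
)

// downstreamAction shows how an implementation could layer "audit" on top of
// the upstream "ignore" semantics without changing what upstream requires.
func downstreamAction(mode PolicyMode, supportsAuditLogging bool) Action {
	if mode == PolicyModeIgnore {
		if supportsAuditLogging {
			return ActionAudit
		}
		return ActionNone
	}
	return ActionEnforce
}
```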
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Is your enhancement request related to a problem? Please describe.
Users may want to ensure no unexpected connections will be denied by a new (B)ANP.
A dry-run mode should not affect any connections, while allowing the networking plugin to provide feedback (e.g. via logging or observability tools) showing which connections would be dropped/allowed once the (B)ANP is enforced.
Describe the solution you'd like
A potential solution is to have a dry-run flag that would make sure the (B)ANP is not actually enforced.
It could allow network plugins to add logging/observability on top of this flag, while also making sure that the behaviour (not enforcing the (B)ANP) is the same for all plugins.
While this flag is most useful in combination with plugin-specific logging/observability, network plugins don't have to provide anything on top of it.
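Purely as an illustration, one way such a flag might look as an API field; the field name and placement are assumptions, and the exact shape (boolean flag vs. a mode/action enum) is what the discussion above debates.

```go
// Hypothetical sketch of the flag; this field does not exist in the current API.
package v1alpha1

// AdminNetworkPolicySpec shows only the assumed new field; existing fields
// (priority, subject, ingress, egress) are unchanged and omitted here.
type AdminNetworkPolicySpec struct {
	// DryRun, when true, would mean the policy's rules must not affect any traffic.
	// Implementations could optionally log or report which connections would have
	// been allowed/denied if the policy were enforced.
	// +optional
	DryRun bool `json:"dryRun,omitempty"`
}
```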
The workflow should go something like this:
Similar features:
Describe alternatives you've considered
Additional context
First discussed on May 21, 2024 in the SIG meeting.
Don't forget to provide recommendations for existing connections that may not be affected by the new netpol.