Kubernetes network policies specify how groups of pods are allowed to
communicate with each other and with other network endpoints. Each policy is
represented as an instance of the K8s resource `NetworkPolicy`. A policy uses
labels to select a group of pods and specifies a list of rules that determine
which traffic is allowed to and from the selected pods. Contiv/VPP implements
the Kubernetes Network Policy API, including the latest features, such as
egress policies and IP blocks.
For a packet forwarding engine, such as VPP, this is an overly abstract
definition for access control between endpoints. K8s policies with their rules
need to be mapped onto a semantically equivalent set of basic 6-tuple rules:
(source IP, source port, destination IP, destination port, protocol, action),
where `protocol` is one of {`TCP`, `UDP`, `ANY`} and `action` is either `Deny`
or `Permit`. This mapping is performed by the policy plugin and the resulting
6-tuple rules are installed into VPP either as L2/L3 ACLs by the
[VPP/ACL plugin][acl-plugin] (a component of the Ligato VPP Agent), or as L4
session rules in the VPPTCP network stack, directly over GoVPP.
The mapping of Kubernetes network policies to ACLs and VPPTCP session rules is implemented by the policy plugin using a data-flow based approach. The plugin consists of multiple components stacked on top of each other, with data moving from the top layer to the bottom layer. Each layer gets policy-related data from the layer above it and performs a transformation that yields a less abstract data representation for the layer below. In this way the abstraction level decreases with each layer until it reaches the format of policy rules used by the target network stack. The top layer of the stack accepts K8s state data from Etcd, where it is reflected from the K8s API by the KSR. The bottom layer of the stack outputs rendered policies to one or more network stacks (e.g. a vswitch such as VPP, the host stack, etc.) in a format consumable by the respective stack. This layer typically contains a separate Renderer for each network stack. The layers in-between perform policy processing with the assistance of in-memory caches.
Every layer is described here in detail with an extra focus on data transformations, starting from the top and including references to the actual source code.
The Policy Plugin Skeleton implements the Ligato plugin API, which makes it pluggable with the Ligato CN-Infra framework.
Inside the Policy Plugin's `Init()` method all the layers are initialized and
dependency injection is performed - at the very minimum, every layer must
depend on at least the layer below so that it can pass transformed data further
down the stack.
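As a rough illustration of this wiring (a minimal sketch only; all type, field
and method names below are simplified assumptions, not the actual Contiv/VPP
code):

```go
// Minimal sketch of the top-down wiring performed in Init(); names are
// simplified assumptions, not the actual Contiv/VPP policy plugin code.
package policy

// Renderer is the bottom layer: it installs rules into one network stack.
type Renderer interface {
	RenderRules(rules interface{}) error
}

// Configurator fans the translated rules out to all registered renderers.
type Configurator struct {
	renderers []Renderer
}

func (c *Configurator) RegisterRenderer(r Renderer) {
	c.renderers = append(c.renderers, r)
}

// Processor reacts to cache notifications and feeds the Configurator.
type Processor struct {
	configurator *Configurator
}

// Plugin owns all the layers and wires them together in Init().
type Plugin struct {
	configurator *Configurator
	processor    *Processor
}

func (p *Plugin) Init() error {
	// Each layer depends (at least) on the layer below it.
	p.configurator = &Configurator{}
	// Renderers for the individual network stacks would be registered here,
	// e.g. p.configurator.RegisterRenderer(aclRenderer).
	p.processor = &Processor{configurator: p.configurator}
	// The processor would also be registered as a watcher of the policy Cache.
	return nil
}
```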
Additionally, the plugin itself is an event handler, registered into the main event loop of the Contiv agent after the ipnet plugin. This ensures that connectivity between pods and the VPP is established before any rules are installed.
The Policy plugin reads the state of three Kubernetes resources reflected into
etcd by KSR (network policies, pods and namespaces) and builds the set of
policy rules for rendering from them. Apart from policies, the pod and
namespace state data must also be watched to learn the labels currently
attached to them, since these labels may be referenced by policies. Snapshots
of Kubernetes state data received with resync events, as well as changes
delegated by the `KubeStateChange` event, are propagated further into the
policy Cache without any processing.
The policy Cache unpacks update and resync events and stores the current full
snapshot of policy-related K8s state data in-memory, using Index Maps
(`idxmap`) from the Ligato CN-Infra framework. The cache provides an API to be
notified whenever policy, pod or namespace state data change. The notifications
are propagated via callbacks, one resource instance at a time. A watcher must
implement the `PolicyCacheWatcher` interface with methods such as
`AddPolicy()`, `UpdatePolicy()`, etc. This interface is implemented by the
Policy Processor - the first layer in the data-flow stack that performs actual
data transformations. Additionally, the cache exposes various lookup methods
(e.g. get pods by label selector), which are used by all the layers, but mostly
by the Processor.
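A hedged sketch of the `PolicyCacheWatcher` contract is shown below; only
`AddPolicy()` and `UpdatePolicy()` are named in the text above, the remaining
methods and the data types are assumptions added for illustration:

```go
// Sketch of a cache-watcher contract; only AddPolicy()/UpdatePolicy() are
// named in the text, the other methods and the types are assumptions.
package cache

type Policy struct{ Namespace, Name string /* rules, selectors, ... */ }
type Pod struct{ Namespace, Name, IPAddress string /* labels, ... */ }
type Namespace struct{ Name string /* labels, ... */ }

// PolicyCacheWatcher is implemented by the Policy Processor so that the
// cache can notify it about changes, one resource instance at a time.
type PolicyCacheWatcher interface {
	AddPolicy(policy *Policy) error
	DelPolicy(policy *Policy) error
	UpdatePolicy(oldPolicy, newPolicy *Policy) error

	AddPod(pod *Pod) error
	DelPod(pod *Pod) error
	UpdatePod(oldPod, newPod *Pod) error

	AddNamespace(ns *Namespace) error
	DelNamespace(ns *Namespace) error
	UpdateNamespace(oldNs, newNs *Namespace) error

	// Resync delivers the full snapshot of policy-related K8s state data.
	Resync(policies []*Policy, pods []*Pod, namespaces []*Namespace) error
}
```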
The policy processor is notified by the Cache whenever a change related to policy configuration occurs. Additionally, it receives a full snapshot from the cache during the Resync event.
For each change, the processor decides if the re-configuration needs to be postponed until more data is available. Typically, policies cannot be installed for a pod until it has been assigned an IP address.
If a change carries enough information, the processor determines the list of pods with a possibly outdated policy configuration (all pods for RESYNC):
- For a changed policy this includes all the pods that the policy had assigned before and after the change.
- For a changed pod (labels, IP address), this results in re-configuration of all pods with a policy referencing the changed pod before or after the change.
- For a changed namespace, all pods with a policy referencing the changed namespace before or after the change need to be re-configured.
Note: re-configuration triggered by the processor for a given pod does not necessarily cause the rules to be re-written in the network stacks. The layers below, most notably the renderers, ensure that only the minimal set of changes - if any - are applied to get the configuration in-sync with the K8s state.
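To make the first of the cases above concrete, the selection of pods affected
by a changed policy could be sketched as follows (illustrative only; the types
and the simplified label matching are assumptions, not the actual Processor
code):

```go
// Illustrative sketch: which pods are affected by a changed policy.
// All types and the simplified label matching are assumptions.
package processor

type Policy struct {
	Namespace   string
	PodSelector map[string]string // simplified label selector
}

type Pod struct {
	Namespace, Name string
	Labels          map[string]string
}

func matches(selector, labels map[string]string) bool {
	for key, value := range selector {
		if labels[key] != value {
			return false
		}
	}
	return true
}

// affectedPods returns the pods selected by the policy before or after
// the change (oldPolicy or newPolicy may be nil on add/delete).
func affectedPods(allPods []Pod, oldPolicy, newPolicy *Policy) map[string]struct{} {
	affected := make(map[string]struct{})
	for _, policy := range []*Policy{oldPolicy, newPolicy} {
		if policy == nil {
			continue
		}
		for _, pod := range allPods {
			if pod.Namespace == policy.Namespace && matches(policy.PodSelector, pod.Labels) {
				affected[pod.Namespace+"/"+pod.Name] = struct{}{}
			}
		}
	}
	return affected
}
```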
For each pod with a possibly outdated policy configuration, the Processor
calculates the set of policies currently assigned to it. The policies are then
converted to a less abstract `ContivPolicy` type, defined in the Configurator
API. `ContivPolicy` is simpler because it contains:
- Matching lists of pods evaluated from Label Selectors
- Port numbers translated from port names
- Namespaces referenced by the policy, represented as lists of the pods they contain
Pod data with the assigned set of Contiv policies are then passed further down to the Configurator for re-configuration.
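As a rough illustration of the `ContivPolicy` shape (the field names and types
below are assumptions based on the description above, not the exact
Configurator API):

```go
// Hedged sketch of the ContivPolicy shape; field names and types are
// illustrative assumptions, not the exact Configurator API.
package configurator

type ProtocolType int

const (
	TCP ProtocolType = iota
	UDP
	ANY
)

type Port struct {
	Protocol ProtocolType
	Number   uint16 // already translated from a port name, if any
}

type IPBlock struct {
	Subnet string   // CIDR, e.g. "10.0.0.0/16"
	Except []string // excluded sub-ranges
}

// Match is one rule of the policy: the set of peers and ports it applies to.
// Pods referenced via label/namespace selectors are already evaluated into
// concrete pod IDs.
type Match struct {
	Pods     []string // empty together with IPBlocks = all endpoints
	IPBlocks []IPBlock
	Ports    []Port // empty = all ports
}

// ContivPolicy is the less abstract policy representation handed over
// to the Configurator.
type ContivPolicy struct {
	ID        string
	AppliesTo string // "INGRESS", "EGRESS" or "BOTH"
	Matches   []Match
}
```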
The main task of the Configurator is to translate a `ContivPolicy` into
a semantically equivalent set of basic 6-tuple rules, split into the ingress
and egress side from the vswitch point of view. A 6-tuple is defined as the
type `ContivRule` in the Renderer API.
The rules are installed into a network stack (e.g. a vswitch) by the layer below - the Renderer(s). To support multiple underlying network stacks, the Configurator allows multiple renderers to be registered, each receiving the same data and being responsible for the management of access control in its own network stack.
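A minimal sketch of the 6-tuple as a Go type follows (the actual `ContivRule`
definition in the Renderer API may differ in details such as field names and
the IP representation):

```go
// Hedged sketch of the 6-tuple rule; the actual ContivRule in the Renderer
// API may differ in field names and representations.
package renderer

import "net"

type ProtocolType int

const (
	TCP ProtocolType = iota
	UDP
	ANY
)

type ActionType int

const (
	ActionDeny ActionType = iota
	ActionPermit
)

// ContivRule is a single 6-tuple rule:
// (source IP, source port, destination IP, destination port, protocol, action).
// A nil network or a zero port means "match anything" (ANY).
type ContivRule struct {
	SrcNetwork  *net.IPNet
	SrcPort     uint16
	DestNetwork *net.IPNet
	DestPort    uint16
	Protocol    ProtocolType
	Action      ActionType
}
```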
The procedure of translating a set of Contiv policies into ingress/egress rules can be described by the following pseudo-code:
```
GenerateRules:
    input: direction (ingress or egress - pod point of view), set of ContivPolicies
    output: list of ContivRules - opposite direction, but from the vswitch point of view

    for every policy:
        if the policy doesn't apply to this direction:
            skip
        for every match: // match = all endpoints OR set of pods + set of IPBlocks referenced by the policy
            // Cache is used to get the IP address of the pod.
            get IP address for every matched pod
            // Since 6-tuples do not define port ranges, we cannot efficiently
            // implement IPBlocks with excluded sub-ranges by overlapping PERMIT+DENY
            // rules. Instead we perform subtraction over IP subnets
            // (function subtractSubnet() from configurator_impl.go) which results
            // in multiple PERMIT-only rules (with one deny-the-rest at the end).
            subtract excluded subnets from the CIDR of every matched IPBlock
            // Generate 6-tuples (src-IP, src-port, dst-IP, dst-port, protocol, action).
            if match is all endpoints:
                if match includes all ports:
                    add rule (ANY, ANY, ANY, ANY, ANY, PERMIT)
                else:
                    for every matched port, protocol:
                        add rule (ANY, ANY, ANY, port, protocol, PERMIT)
            else:
                for every matched pod's IP address:
                    if match includes all ports:
                        if direction is ingress:
                            add rule (pod-IP, ANY, ANY, ANY, ANY, PERMIT)
                        else:
                            add rule (ANY, ANY, pod-IP, ANY, ANY, PERMIT)
                    else:
                        for every matched port, protocol:
                            if direction is ingress:
                                add rule (pod-IP, ANY, ANY, port, protocol, PERMIT)
                            else:
                                add rule (ANY, ANY, pod-IP, port, protocol, PERMIT)
                for every matched IPBlock (after subtraction of excluded sub-ranges):
                    if match includes all ports:
                        if direction is ingress:
                            add rule (IPBlock, ANY, ANY, ANY, ANY, PERMIT)
                        else:
                            add rule (ANY, ANY, IPBlock, ANY, ANY, PERMIT)
                    else:
                        for every matched port, protocol:
                            if direction is ingress:
                                add rule (IPBlock, ANY, ANY, port, protocol, PERMIT)
                            else:
                                add rule (ANY, ANY, IPBlock, port, protocol, PERMIT)

    // Deny the rest of the traffic.
    if not all was PERMITed:
        add rule (ANY, ANY, ANY, ANY, ANY, DENY)
```
`GenerateRules` is implemented by `PolicyConfiguratorTxn.generateRules()` and
is executed for both directions to obtain separate lists of ingress and egress
Contiv rules.
Since the pod for which the rules are generated is given, the ingress rules have the source IP unset, i.e. 0.0.0.0/0 (match all). Conversely, egress rules have their destination IP unset. The ingress rules are supposed to be applied to all the traffic entering VPP from the given pod, whereas egress rules should be confronted with all the traffic leaving VPP towards the pod.
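For example, using the hedged `ContivRule` sketch from above, the Configurator
output for a hypothetical pod whose K8s policies allow only incoming TCP/8080
and outgoing UDP/53 could look roughly like this:

```go
// Illustrative example only, building on the hedged ContivRule sketch above.
// Hypothetical pod: its K8s ingress policy allows only TCP/8080 in, and its
// K8s egress policy allows only UDP/53 out.
func exampleRulesForPod() (ingress, egress []*ContivRule) {
	// Ingress side (vswitch point of view): applied to traffic entering VPP
	// *from* the pod; the source IP is left unset because the pod is given.
	ingress = []*ContivRule{
		{Protocol: UDP, DestPort: 53, Action: ActionPermit},
		{Protocol: ANY, Action: ActionDeny}, // deny the rest
	}
	// Egress side (vswitch point of view): applied to traffic leaving VPP
	// *towards* the pod; the destination IP is left unset.
	egress = []*ContivRule{
		{Protocol: TCP, DestPort: 8080, Action: ActionPermit},
		{Protocol: ANY, Action: ActionDeny}, // deny the rest
	}
	return ingress, egress
}
```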
The order in which the rules are applied for a given pod is important as well. The renderer that applies the rules in the destination network stack has three valid ordering options:
- Apply the rules in the exact same order as passed by the Configurator
- Apply PERMIT rules before DENY rules: this is possible because there is always only one DENY rule that blocks traffic not matched by any PERMIT rule.
- Apply more specific rules, i.e. those covering less traffic, before less specific ones.
`ContivRule`s have a total order defined on them using the method
`ContivRule.Compare(other)`. It holds that if `cr1` matches a subset of the
traffic matched by `cr2`, then `cr1 < cr2`. This ordering may be helpful if the
destination network stack uses the longest-prefix-match algorithm for
logarithmic rule lookup, as opposed to list-based linear lookup.
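A renderer choosing the third option could, for instance, simply sort the rules
with this ordering before installing them (a sketch that assumes `Compare`
returns -1, 0 or 1; its exact signature is an assumption):

```go
// Sketch only: order rules so that more specific ones come first, assuming
// ContivRule.Compare(other) returns -1, 0 or 1 and cr1 < cr2 holds whenever
// cr1 matches a subset of the traffic matched by cr2.
// (Requires "sort" from the standard library.)
func sortMostSpecificFirst(rules []*ContivRule) {
	sort.Slice(rules, func(i, j int) bool {
		return rules[i].Compare(rules[j]) < 0
	})
}
```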
Not every network stack supports access control in both directions, however. Additionally, services allow a group of pods to be referenced by a VIP, while the rules only consider real pod IP addresses. This means that translation and load-balancing have to happen before the ingress rules are applied, which is not possible in VPP. The renderers therefore have to further transform and combine ingress and egress rules into a single direction, as described in [rule transformations][rule-transformations].
A policy Renderer implements rendering (= installation) of Contiv rules into a specific network stack. What exactly the rules get mapped into and how the rendering operates may differ for each network stack. The renderer only has to implement the `PolicyRendererAPI` interface and then it can be registered with the Configurator. Another requirement, which obviously cannot be enforced by the interface, is that the rendered access control configuration in the underlying stack semantically reflects the last received ingress & egress rules for every pod. The semantics of the rules are described in the Renderer API and also in this document in the section ContivRule semantics.
The 6-tuple `ContivRule` has been designed to be as simple as possible while still being expressive enough to describe Kubernetes network policies. This should make it possible to write renderers even for the most basic access control (AC) implementations. Still, not every network stack provides AC separately for the ingress and egress directions. Furthermore, the rules should apply to traffic after service VIPs have been translated to the pods selected by the load-balancer. As is the case with the VPP/ACL plugin, this requirement often cannot be satisfied using ingress rules. Another potential limitation may be that the target AC does not operate per-interface, but instead applies rules installed in a single global rule table. Further rule transformation may therefore be necessary, even at the renderer layer. We provide the Renderer Cache, used by both the ACL and VPPTCP renderers, which not only maintains a snapshot of the currently rendered rules, but also allows the renderers to work around the aforementioned limitations by combining ingress with egress rules as described in the next section.
Both VPP/ACL and VPPTCP have limitations that prevent ingress and egress rules received from the configurator from being installed as is, without any changes.
For VPP/ACL the ingress ACLs cannot be used with interfaces that connect pods to the vswitch. This is because traffic flows through these ACLs before it reaches the nat44* graph nodes, meaning that the translation of service VIPs executes later. However, K8s network policies run below services in the sense that they are meant to be applied against real Pod IP addresses, not against virtual service IP addresses.
VPPTCP, on the other hand, does not even provide per-interface egress AC.
Every namespace (connection with a pod) provides its own local table of session
rules, which is only applied against traffic entering VPP from the namespace
but not confronted with the connections initiated in the egress direction. The
egress side is matched by a single per-node global table. This table is
bypassed, however, if communicating pods are deployed on the same node
(fall-through optimization).
The rules for the ingress and egress direction are therefore combined into a single selected direction: egress for ACL and ingress for VPPTCP. For simplicity, we will now describe the algorithm specifically for the egress side used by ACLs. The same algorithm is used by the VPPTCP renderer (the implementation is parametrized); only the source and destination IPs are swapped and the resulting direction is ingress.
To calculate a pod's egress rules that include the restrictions imposed by the ingress rules of other pods, the following algorithm, expressed in pseudo-code, is used:
```
CombineRules:
    input: target pod - denoted as pod1, pod1's egress rules, every other *known* pod's ingress rules
    output: pod1's egress rules intersected with ingress rules of other pods

    for every other known pod - denoted as pod2:
        get the set of TCP, UDP ports that pod2 can access on pod1 by pod2's ingress rules
          - denote ingressTCP & ingressUDP
        get the set of TCP, UDP ports that pod1 has opened for pod2 by pod1's egress rules
          - denote egressTCP & egressUDP
        if egressTCP is not a subset of ingressTCP OR egressUDP is not a subset of ingressUDP:
            from pod1's egress table remove all rules with source IP == pod2-IP
            intersect ingressTCP with egressTCP - denote allowedTCP
            intersect ingressUDP with egressUDP - denote allowedUDP
            // Generate 6-tuples (src-IP, src-port, dst-IP, dst-port, protocol, action):
            for every port from allowedTCP:
                insert into pod1's egress table rule (pod2-IP, ANY, ANY, port, TCP, PERMIT)
            for every port from allowedUDP:
                insert into pod1's egress table rule (pod2-IP, ANY, ANY, port, UDP, PERMIT)
            // Deny the rest of the traffic from pod2 to pod1:
            insert into pod1's egress table rule (pod2-IP, ANY, ANY, ANY, ANY, DENY)
```
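The port-set operations at the heart of `CombineRules` can be sketched like
this (the types and helpers below are illustrative, not the actual cache
implementation):

```go
// Illustrative sketch of the set operations used by CombineRules; not the
// actual cache code. Ports are kept in a simple set where the key 0 stands
// for "all ports" (ANY).
package cache

type portSet map[uint16]struct{}

const anyPort uint16 = 0

// isSubsetOf returns true if every port allowed by s is also allowed by other.
func (s portSet) isSubsetOf(other portSet) bool {
	if _, all := other[anyPort]; all {
		return true
	}
	if _, all := s[anyPort]; all {
		return false // s allows everything, other does not
	}
	for port := range s {
		if _, ok := other[port]; !ok {
			return false
		}
	}
	return true
}

// intersect returns the ports allowed by both s and other.
func (s portSet) intersect(other portSet) portSet {
	if _, all := s[anyPort]; all {
		return other
	}
	if _, all := other[anyPort]; all {
		return s
	}
	result := portSet{}
	for port := range s {
		if _, ok := other[port]; ok {
			result[port] = struct{}{}
		}
	}
	return result
}
```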
Notice that a pod's egress rules are only combined with the ingress rules of
other *known* pods. The renderer is not supplied with the policy configuration
of pods without any policy assigned, or of pods deployed on other nodes. A pod
without any policy is open to all traffic, therefore there are no ingress rules
to combine with. Pods deployed on other nodes, however, may have restrictions
imposed on the ingress side. Therefore, `CombineRules` alone is not sufficient
to ensure that ingress rules are reflected in egress ACLs. It is thus necessary
to filter traffic leaving the node based on the ingress rules of all pods
deployed on it. The ingress rules of all local pods are unified into a single
global table assigned to the egress side of the interfaces connecting the node
with the rest of the cluster.
The global table is built using the following algorithm:
```
BuildGlobalTable:
    input: ingress rules of locally deployed pods
    output: single egress "global" table applying all ingress rules for traffic leaving the node

    create empty global table
    for every locally deployed pod:
        for every pod's ingress rule:
            change the rule's source IP from ANY to the pod's IP address
            add the rule into the global table
    // Add rule to allow the rest of the traffic.
    add into global table rule (ANY, ANY, ANY, ANY, ANY, PERMIT)
```
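A condensed sketch of the same idea in Go, building on the hedged `ContivRule`
type from above (the real implementation in
`RendererCacheTxn.rebuildGlobalTable()` is more involved):

```go
// Sketch only; builds the egress global table from the ingress rules of
// locally deployed pods, using the hedged ContivRule type sketched earlier.

// localPod pairs a locally deployed pod's IP with its ingress rules.
type localPod struct {
	IP      *net.IPNet
	Ingress []*ContivRule
}

// buildGlobalTable unifies the ingress rules of all local pods into a single
// egress table applied to traffic leaving the node.
func buildGlobalTable(pods []localPod) []*ContivRule {
	var global []*ContivRule
	for _, pod := range pods {
		for _, rule := range pod.Ingress {
			combined := *rule // copy the rule
			// The ingress rule has the source IP unset; pin it to the pod so
			// that the rule only matches traffic sent by this pod.
			combined.SrcNetwork = pod.IP
			global = append(global, &combined)
		}
	}
	// Allow the rest of the traffic leaving the node.
	return append(global, &ContivRule{Protocol: ANY, Action: ActionPermit})
}
```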
Again, the actual implementation is parametrized; the direction for which the global table is built can be chosen:
- ingress rules of locally deployed pods -> single egress global table: used by ACL
- egress rules of locally deployed pods -> single ingress global table: used by VPPTCP
With these transformations, the order in which the rules can be applied is stricter than when they originally arrived from the Configurator: the order between PERMIT and DENY rules now matters. The renderer that applies the rules in the destination network stack now has only two valid ordering options:
- Apply the rules in the exact same order as returned by the Cache for each table. Used by the ACL Renderer.
- Apply more specific rules before less specific ones, i.e. the longest-prefix-match algorithm. Used by the VPPTCP Renderer.
Both the VPPTCP and the ACL renderer create their own instance of the same
Renderer Cache. The cache maintains a snapshot of all currently rendered rules
and makes it easy to calculate the minimal set of changes that need to be
applied in a given transaction. The rules are inserted into the cache as
received from the Configurator - unprocessed and split into ingress and egress
sides. Internally, the cache performs the transformations described in the
section above. The algorithm `CombineRules` is implemented by
`RendererCacheTxn.buildLocalTable()`. The implementation is parametrized; the
direction into which the rules are combined is selected during the cache
initialization (egress for ACL, ingress for VPPTCP).
The rules are grouped into tables represented by the type `ContivRuleTable`,
defined in the Cache API. The full configuration is represented as a list of
local tables, applied on the ingress or the egress side of pods, and a single
global table - generated using the `BuildGlobalTable` algorithm implemented by
`RendererCacheTxn.rebuildGlobalTable()` - applied onto the interfaces
connecting the node with the rest of the cluster.
The list of local tables is minimalistic in the sense that pods with the same set of rules share the same local table. Whether shared tables are installed as one instance or as separate copies for each associated pod is up to the renderer. Usually this is determined by the capabilities of the destination network stack, e.g. the VPP/ACL plugin allows a single ACL to be assigned to multiple interfaces, whereas VPPTCP requires the session rule table to be built individually for each application namespace.
The ACL Renderer installs `ContivRule`s into VPP as ACLs of the VPP/ACL plugin.
The renderer uses the cache to convert ingress and egress rules into per-pod
egress ACLs (local tables), each assigned to the TAP interface connecting the
VPP with the corresponding pod, and a single egress ACL (global table) assigned
to the interfaces connecting the node with the rest of the cluster: the GigE
interfaces, the loop interface in the BD with VXLANs and the TAP interface
connecting the VPP with the host. Pods with the same policy configuration share
the same ACL.
The key method is `RendererTxn.renderACL()`, implementing the conversion of an
instance of `ContivRuleTable` into the protobuf-based representation of an ACL
used in the northbound API of the ligato/vpp-agent. Every ContivRule is mapped
into a single `Acl.Rule`. `Match.IpRule` is filled with values from the
6-tuple - port ranges always include either all ports or a single one (the
rules are not compacted together). Generated ACLs are appended to the
transaction prepared for the given event by the Controller plugin. The
Controller then commits the transaction with the ACLs (and potentially also
with some more changes from other plugins) into the ligato/vpp-agent via the
local client.
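As a rough illustration of the per-rule conversion (the struct below is a
simplified stand-in for the vpp-agent's protobuf ACL model, not its real
definition, and the function is not the actual `renderACL()` code):

```go
// Simplified stand-in for a vpp-agent ACL IP rule, for illustration only;
// the real renderACL() fills the protobuf-based northbound API types.
type aclIPRule struct {
	Permit        bool
	SrcNetwork    string // CIDR; empty = any
	DestNetwork   string // CIDR; empty = any
	Protocol      string // "TCP", "UDP" or "" = any
	DestPortFirst uint16 // port range: all ports or a single port
	DestPortLast  uint16
}

// contivRuleToACLRule maps one 6-tuple onto one ACL rule (hedged sketch,
// building on the ContivRule type sketched earlier).
func contivRuleToACLRule(rule *ContivRule) aclIPRule {
	out := aclIPRule{Permit: rule.Action == ActionPermit}
	if rule.SrcNetwork != nil {
		out.SrcNetwork = rule.SrcNetwork.String()
	}
	if rule.DestNetwork != nil {
		out.DestNetwork = rule.DestNetwork.String()
	}
	switch rule.Protocol {
	case TCP:
		out.Protocol = "TCP"
	case UDP:
		out.Protocol = "UDP"
	}
	if rule.DestPort != 0 {
		// A single port is expressed as a range of length one.
		out.DestPortFirst, out.DestPortLast = rule.DestPort, rule.DestPort
	} else {
		out.DestPortFirst, out.DestPortLast = 0, 65535 // all ports
	}
	return out
}
```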
At the end of the pipeline is the ACL plugin of the vpp-agent, which applies
ACL changes in VPP through binary APIs. For each transaction, the cache is used
to determine the minimal set of ACLs that need to be sent to the vpp-agent to
be added, updated or deleted.
By splitting the rules into ingress and egress, K8s network policies make it
possible to block a connection with certain parameters in one direction, while
the same connection can be allowed if it is initiated in the opposite
direction. For ACLs this means that if the egress ACL of the destination pod
allows the connection-initiating SYN packet, the egress ACL of the source pod
should not block the replied SYN-ACK or any other packet of that connection.
This behaviour is achieved by attaching a so-called Reflective ACL - allowing
and reflecting all the traffic - onto the ingress side of every pod with a
non-empty egress ACL. The effect is that a SYN packet coming to the VPP from a
pod automatically creates a free pass for replies returning to the pod. The
restrictions imposed by policies are therefore always applied only by the
destination pod's egress ACL or by the global (egress) ACL, not by the source
pod's egress ACL. It is important to note that a connection is marked for
reflection before it goes through the NAT, i.e. possibly with a VIP as the
destination. This is OK because the replies have their source SNAT-ed back to
the VIP before the packet travels through the egress ACL of the source pod,
matching the entry for reflection.
The VPPTCP Renderer installs `ContivRule`s into VPP as session rules for the
VPPTCP network stack. The renderer uses the cache to convert ingress and egress
rules into per-application-namespace (= pod) ingress local tables and a single
ingress global table.
VPPTCP uses a slightly different representation of the policy rule, denoted
`SessionRule` (a tuple with more entries). When put into the context of the
target table, a ContivRule can be easily mapped to SessionRule(s) - this is
implemented by `convertContivRule()` from session_rule.go.
Session rules are installed into VPP directly through GoVPP (i.e. not using the ligato/vpp-agent). The cache is used to calculate the minimal set of changes that need to be applied to get the session rules in-sync with the configuration of K8s policies.
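The shape of such a rule can be illustrated with the following sketch (the
`SessionRule` fields below are a simplified approximation of the VPPTCP session
rule tuple, not the exact binary API or the definition from session_rule.go):

```go
// Simplified approximation of a VPPTCP session rule, for illustration only;
// the real SessionRule in session_rule.go mirrors the VPP binary API and
// contains more fields.
type SessionRule struct {
	TransportProto string // "TCP" or "UDP"
	LclNetwork     string // local IP prefix; empty = any
	LclPort        uint16 // local port; 0 = any
	RmtNetwork     string // remote IP prefix; empty = any
	RmtPort        uint16 // remote port; 0 = any
	Action         string // "ALLOW" or "DENY"
	Scope          string // "LOCAL" (per-namespace table) or "GLOBAL"
	Tag            string // identifies the policy/pod the rule originates from
}
```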