DAG for policy reconciliation #29

guicassolato · 2023-10-16T11:31:03Z

Problem statement

Kuadrant's current policy reconciliation process is centered too heavily on the policy objects and is barely (if at all) aware of the underlying topology, which it learns about only by repeatedly querying the cluster API.

This has resulted in:

- Occasional cyclic triggering of the reconciliation loop
- Many requests to the kube API server (see also Fanout status update problems), which:
  - slow down the overall reconciliation loop
  - risk occasional rate-limiting by the kube API server
- Limited visibility for policy controller implementers into the different kinds of resource events that need to be watched
- Heavy reliance on annotations to track the back-refs
- A lot of work required to implement each new policy kind

Example-driven explanation

                                   ┌───────────┐
                 ┌──EnvoyFilter-1  │ Limitador │    ┌──EnvoyFilter-2
     rlp-1────┐  │                 └───────────┘    │
              │  ├──WasmPlugin-1                    ├──WasmPlugin-2
              ▼  │                                  │
           ┌─────┴┐                           ┌─────┴┐
     ┌────►│ gw-1 │◄────┬────────────┐  ┌────►│ gw-2 │◄────┐
     │     └──────┘     │            │  │     └──────┘     │
     │                  │            │  │                  │
     │                  │            │  │                  │
┌────┴────┐       ┌─────┴───┐      ┌─┴──┴────┐       ┌─────┴───┐
│ route-1 │       │ route-2 │      │ route-3 │       │ route-4 │
└─────────┘       └─────────┘      └─────────┘       └─────────┘
     ▲                                  ▲                  ▲
     │                                  │                  │
     │                                  │                  │
   rlp-2                              rlp-3              rlp-4

1. Reconciliation of rlp-2 (created after rlp-1) requires triggering the reconciliation of rlp-1 again to recalculate the scope of rlp-1, i.e. to update WasmPlugin-1 and Limitador, which in turn have just been updated because of rlp-2 itself.
2. Similarly, rlp-3 requires recalculating WasmPlugin-1 and Limitador, in addition to creating EnvoyFilter-2 and WasmPlugin-2.
3. Getting to the affected gateways involves:
   a. inspecting the specs of the targeted routes for parentRefs;
   b. listing all RLPs for gateway-targeting ones;
   c. trusting the state of the back-ref annotations.
4. Reconciliation of any policy event involves trying to detect what kind of event triggered it, i.e. policy created/updated/deleted, route created/updated/deleted, gateway created/updated/deleted.
5. Other events need to be watched so that state can be reconciled back from the source of truth (policies + network topology), e.g. wasmplugin/envoyfilter/limitador modified/deleted.

Possible solution

- Keep a version of the topology in memory as a DAG (Directed Acyclic Graph)
- Rely more on the informers pattern to replace or complement controller-runtime, possibly replacing the “traditional” reconciliation loops as we know them today
- Recompute the effective policies top-down, from the affected gateways down to the leaves
- Distinguish between events that affect the topology, events that just require recomputing and reapplying effective policies, and events that just require reapplying previously computed states

Reasons to do it

- Significantly reduce the number of requests to the kube API, and therefore improve the speed of reconciliation
- Move away from annotations as the way to track back-refs to the policies, by relying on the DAG to navigate the topology instead
- Simplify the reconciliation loop with regard to detecting the kind of resource event
- Improve clarity regarding the different kinds of events that trigger reconciliation (by having to define each kind of event and its corresponding callback function) → improve coverage of scenarios (kinds of resource events)
- Make it possible to react more quickly and efficiently, by sometimes not having to trigger a “full” reconciliation but acting more directly according to each kind of event
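
The idea of defining each kind of event together with its callback could look roughly like the Go sketch below. Everything here is an assumption made for illustration: the `Action`, `Event` and `Dispatcher` names are hypothetical, and so is the mapping in `classification` (for example, treating every route update as topology-changing because parentRefs might have changed). Only the three classes of events and the resource/event kinds themselves come from this issue.

```go
package main

import "fmt"

// Action is the class of work that an event requires.
type Action string

const (
	RebuildTopology   Action = "rebuild-topology"             // the event changes the graph
	RecomputePolicies Action = "recompute-effective-policies" // graph unchanged, policies must be recomputed
	ReapplyState      Action = "reapply-computed-state"       // restore previously computed state
)

// Event is a hypothetical resource event.
type Event struct {
	Kind string // "gateway", "route", "policy", "wasmplugin", ...
	Op   string // "created", "updated" or "deleted"
	Name string
}

// classification is an assumed, illustrative mapping from (kind, op) to Action.
var classification = map[string]map[string]Action{
	"gateway":     {"created": RebuildTopology, "updated": RecomputePolicies, "deleted": RebuildTopology},
	"route":       {"created": RebuildTopology, "updated": RebuildTopology, "deleted": RebuildTopology},
	"policy":      {"created": RebuildTopology, "updated": RecomputePolicies, "deleted": RebuildTopology},
	"wasmplugin":  {"updated": ReapplyState, "deleted": ReapplyState},
	"envoyfilter": {"updated": ReapplyState, "deleted": ReapplyState},
	"limitador":   {"updated": ReapplyState, "deleted": ReapplyState},
}

// Classify reports which Action an event requires, or false if the event is
// not one that the controller watches.
func Classify(e Event) (Action, bool) {
	a, ok := classification[e.Kind][e.Op]
	return a, ok
}

// Dispatcher routes each event to the callback registered for its Action.
type Dispatcher struct {
	callbacks map[Action]func(Event)
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{callbacks: map[Action]func(Event){}}
}

// On registers the callback for an Action, replacing any previous one.
func (d *Dispatcher) On(a Action, fn func(Event)) { d.callbacks[a] = fn }

// Dispatch runs the matching callback and reports whether one was found.
func (d *Dispatcher) Dispatch(e Event) bool {
	a, ok := Classify(e)
	if !ok {
		return false
	}
	fn, ok := d.callbacks[a]
	if !ok {
		return false
	}
	fn(e)
	return true
}

func main() {
	d := NewDispatcher()
	for _, a := range []Action{RebuildTopology, RecomputePolicies, ReapplyState} {
		a := a // capture the current value for the closure
		d.On(a, func(e Event) { fmt.Printf("%s %s %s -> %s\n", e.Kind, e.Name, e.Op, a) })
	}
	d.Dispatch(Event{Kind: "policy", Op: "created", Name: "rlp-2"})
	d.Dispatch(Event{Kind: "policy", Op: "updated", Name: "rlp-1"})
	d.Dispatch(Event{Kind: "wasmplugin", Op: "deleted", Name: "WasmPlugin-1"})
}
```

Keeping the table explicit makes unhandled combinations visible: an event that is missing from `classification` is simply not dispatched, which is easier to audit than a branch buried in a reconcile function.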

Reasons NOT to do it

- Involves rewriting the operators
- The policy controller may require more resources (CPU, memory)

Challenges

- Bootstrapping the tree of pre-existing resources in memory may take a non-negligible amount of time, so the impact on the readiness state of the controller needs to be considered
- Achieve a sufficient level of abstraction so it works for all policy implementers (i.e. not only for Kuadrant)
- Avoid re-inventing the wheel, and watch out for awkward combinations of the informers pattern and straightforward reconcilers
- Re-educate devs on the new pattern, which is no longer “textbook” controller-runtime

guicassolato · 2024-07-03T16:25:41Z

kuadrant/policy-machinery can be employed for this.

guicassolato · 2024-08-16T13:52:25Z

Splitting this into 2 parts:

Part 1: GatewayAPI topology (DAG 1.0) kuadrant-operator#530
Part 2: rfc: Policy Machinery for reconciliation #95

guicassolato added the RFC Request For Comments label Oct 16, 2023

guicassolato mentioned this issue Oct 16, 2023

[authpolicy-v2] route selectors Kuadrant/kuadrant-operator#256

Merged

5 tasks

eguzki mentioned this issue Nov 13, 2023

RateLimitPolicy controller reconcile logic hardening Kuadrant/kuadrant-operator#74

Closed

4 tasks

guicassolato assigned guicassolato and eguzki Nov 13, 2023

guicassolato added this to Kuadrant Nov 27, 2023

guicassolato moved this to Todo in Kuadrant Nov 27, 2023

guicassolato moved this from Todo to In Progress in Kuadrant Nov 27, 2023

guicassolato added the target/next label Nov 27, 2023

guicassolato added this to Kuadrant Service Protection Nov 27, 2023

guicassolato moved this to Needs refinement in Kuadrant Service Protection Nov 27, 2023

guicassolato mentioned this issue Dec 12, 2023

Wasmplugin controller: new approach using Kuadrant Topology (DAG) Kuadrant/kuadrant-operator#317

Closed

7 tasks

alexsnaps mentioned this issue Dec 19, 2023

Envoy Gateway Support Kuadrant/kuadrant-operator#325

Closed

11 tasks

guicassolato mentioned this issue Jul 3, 2024

RFC: Policy Machinery for reconciliation #94

Closed

guicassolato mentioned this issue Jul 3, 2024

rfc: Policy Machinery for reconciliation #95

Merged

eguzki mentioned this issue Sep 10, 2024

GatewayAPI topology (DAG 1.0) Kuadrant/kuadrant-operator#530

Closed

12 tasks

eguzki removed their assignment Sep 18, 2024

eguzki moved this from In Progress to Done in Kuadrant Oct 18, 2024

eguzki closed this as completed Oct 18, 2024
