DAG for policy reconciliation #29

Closed
guicassolato opened this issue Oct 16, 2023 · 2 comments
Labels
RFC Request For Comments target/next

@guicassolato
Contributor

guicassolato commented Oct 16, 2023

Problem statement

Kuadrant's current policy reconciliation process is too centered on the policy objects and has little (if any) awareness of the underlying topology, which it learns about only by successively querying the cluster API.

This has resulted in:

  • Occasional cyclic triggering of the reconciliation loop
  • Many requests to the kube API server (see also Fanout status update problems), which
    • slow down the overall reconciliation loop
    • risk the controller being occasionally rate-limited by the kube API server
  • Poor visibility for policy controller implementers into the different kinds of resource events that need to be watched
  • Heavy reliance on annotations to track the back-refs
  • A lot of work to implement each new policy kind

Example-driven explanation

                                   ┌───────────┐
                 ┌──EnvoyFilter-1  │ Limitador │    ┌──EnvoyFilter-2
     rlp-1────┐  │                 └───────────┘    │
              │  ├──WasmPlugin-1                    ├──WasmPlugin-2
              ▼  │                                  │
           ┌─────┴┐                           ┌─────┴┐
     ┌────►│ gw-1 │◄────┬────────────┐  ┌────►│ gw-2 │◄────┐
     │     └──────┘     │            │  │     └──────┘     │
     │                  │            │  │                  │
     │                  │            │  │                  │
┌────┴────┐       ┌─────┴───┐      ┌─┴──┴────┐       ┌─────┴───┐
│ route-1 │       │ route-2 │      │ route-3 │       │ route-4 │
└─────────┘       └─────────┘      └─────────┘       └─────────┘
     ▲                                  ▲                  ▲
     │                                  │                  │
     │                                  │                  │
   rlp-2                              rlp-3              rlp-4

  1. Reconciliation of rlp-2 (created after rlp-1) requires triggering the reconciliation of rlp-1 again, to recalculate the scope of rlp-1 – i.e. to update WasmPlugin-1 and Limitador, which in turn have just been updated because of rlp-2 itself
  2. Similarly, rlp-3 requires recalculating WasmPlugin-1 and Limitador, apart from creating EnvoyFilter-2 and WasmPlugin-2
  3. Getting to the affected gateways involves:
    a. inspecting the specs of the targeted routes for parentRefs;
    b. listing all RLPs for gateway-targeting ones;
    c. trusting the state of the back-ref annotations.
  4. Reconciliation of any policy event involves trying to detect what kind of event triggered it – i.e. policy created/updated/deleted, route created/updated/deleted, gateway created/updated/deleted
  5. Other events need to be watched so that state can be reconciled back from the source of truth (policies + network topology) – e.g. wasmplugin/envoyfilter/limitador modified/deleted
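
To make steps 1 to 3 concrete, here is a minimal Python sketch of the diagram above written as plain dictionaries. The names `ROUTE_PARENTS`, `POLICY_TARGETS`, `affected_gateways` and `policies_sharing_gateway` are hypothetical and not part of Kuadrant. The sketch also assumes that two policies interfere whenever they reach the same gateway, which is a simplification of the real scoping rules.

```python
# Topology from the diagram, as plain data (illustrative only, not Kuadrant code).
ROUTE_PARENTS = {            # route -> gateways listed in its parentRefs
    "route-1": ["gw-1"],
    "route-2": ["gw-1"],
    "route-3": ["gw-1", "gw-2"],
    "route-4": ["gw-2"],
}
POLICY_TARGETS = {           # policy -> the single resource it targets
    "rlp-1": "gw-1",
    "rlp-2": "route-1",
    "rlp-3": "route-3",
    "rlp-4": "route-4",
}


def affected_gateways(policy):
    """Gateways reached by a policy, via its target (step 3a in the list above)."""
    target = POLICY_TARGETS[policy]
    if target in ROUTE_PARENTS:          # route-targeting policy: follow parentRefs
        return sorted(ROUTE_PARENTS[target])
    return [target]                      # gateway-targeting policy


def policies_sharing_gateway(policy):
    """Other policies that reach at least one of the same gateways (assumed overlap rule)."""
    gateways = set(affected_gateways(policy))
    return sorted(
        other for other in POLICY_TARGETS
        if other != policy and gateways & set(affected_gateways(other))
    )


print(affected_gateways("rlp-3"))          # ['gw-1', 'gw-2']
print(policies_sharing_gateway("rlp-2"))   # ['rlp-1', 'rlp-3']
```

With the data held in memory these lookups are cheap. In the current process each of them corresponds to reading specs, listing policies, or trusting annotations through the cluster API.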

Possible solution

  • Keep a version of the topology in-memory as a DAG (Directed Acyclic Graph)
  • Rely more on the informers pattern, to replace/complement controller-runtime, possibly replacing the “traditional” reconciliation loops as we know them today
  • Recompute the effective policies top-down, from the affected gateways down to the leaves
  • Distinguish between events that affect the topology, events that just require recomputing and reapplying effective policies, and events that just require reapplying previously computed states.
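
The first and third bullets could be sketched roughly as follows. This is a minimal Python illustration, not a proposed implementation: the `Topology` class, its method names, and the rule that the nearest attached policy wins are all assumptions made for the example, and the real merge semantics of effective policies are not defined in this issue.

```python
from collections import defaultdict


class Topology:
    """In-memory DAG: edges point from a parent (gateway) to its children (routes)."""

    def __init__(self):
        self.children = defaultdict(list)  # parent -> list of children
        self.attached = {}                 # node -> policy attached directly to it

    def add_edge(self, parent, child):
        # Reject an edge that would close a cycle, so the graph stays acyclic.
        if self._reaches(child, parent):
            raise ValueError(f"edge {parent} -> {child} would create a cycle")
        self.children[parent].append(child)

    def _reaches(self, start, goal):
        # Iterative depth-first search from start, looking for goal.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.children.get(node, []))
        return False

    def effective_policies(self, gateway):
        """Walk top-down from one gateway; the nearest attached policy wins (assumed rule)."""
        result = {}
        stack = [(gateway, None)]
        while stack:
            node, inherited = stack.pop()
            policy = self.attached.get(node, inherited)
            result[node] = policy
            stack.extend((child, policy) for child in self.children.get(node, []))
        return result


# The example topology from the diagram in this issue.
topology = Topology()
for gw, route in [("gw-1", "route-1"), ("gw-1", "route-2"), ("gw-1", "route-3"),
                  ("gw-2", "route-3"), ("gw-2", "route-4")]:
    topology.add_edge(gw, route)
topology.attached.update({"gw-1": "rlp-1", "route-1": "rlp-2",
                          "route-3": "rlp-3", "route-4": "rlp-4"})

print(topology.effective_policies("gw-1"))
```

Under these assumptions, route-2 inherits rlp-1 from gw-1 while route-1 and route-3 keep their own policies, and the whole computation for a gateway needs no API requests once the graph is populated.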

Reasons to do it

  1. Significantly reduce the number of requests to the kube API, and thereby improve the speed of reconciliation
  2. Move away from annotations as the way to track back-refs to the policies, by relying on the DAG to navigate the topology instead
  3. Simplify the reconciliation loop with regard to detecting the kind of resource event
  4. Improve clarity regarding the different kinds of events that trigger reconciliation (by having to define each kind of event and corresponding callback function) → improve coverage of scenarios (kinds of resource events)
  5. React faster and more efficiently, by sometimes not having to trigger a “full” reconciliation but acting more directly according to each kind of event
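
Points 3 to 5 could look something like the sketch below, which maps each watched kind to one of the three categories of events listed under "Possible solution" and dispatches to a callback per category. Everything here is hypothetical: the `Action` enum, the `Dispatcher` class, and in particular the `ACTION_BY_KIND` mapping are assumptions made for illustration, and a real classification would probably also depend on the operation (created/updated/deleted).

```python
from enum import Enum


class Action(Enum):
    REBUILD_TOPOLOGY = "rebuild topology"
    RECOMPUTE_POLICIES = "recompute effective policies"
    REAPPLY_STATE = "reapply previously computed state"


# Assumed classification, for illustration only.
ACTION_BY_KIND = {
    "gateway": Action.REBUILD_TOPOLOGY,
    "route": Action.REBUILD_TOPOLOGY,
    "policy": Action.RECOMPUTE_POLICIES,
    "wasmplugin": Action.REAPPLY_STATE,
    "envoyfilter": Action.REAPPLY_STATE,
    "limitador": Action.REAPPLY_STATE,
}


class Dispatcher:
    """Routes a resource event to the callback registered for its category."""

    def __init__(self):
        self._callbacks = {}

    def on(self, action, callback):
        self._callbacks[action] = callback

    def handle(self, kind, name):
        action = ACTION_BY_KIND.get(kind)
        if action is None:
            return None                      # not a watched kind: ignore the event
        self._callbacks[action](kind, name)  # KeyError if no callback was registered
        return action


dispatcher = Dispatcher()
dispatcher.on(Action.REBUILD_TOPOLOGY, lambda kind, name: print("rebuild for", kind, name))
dispatcher.on(Action.RECOMPUTE_POLICIES, lambda kind, name: print("recompute for", kind, name))
dispatcher.on(Action.REAPPLY_STATE, lambda kind, name: print("reapply for", kind, name))

dispatcher.handle("wasmplugin", "WasmPlugin-1")
```

Having to fill in a table like `ACTION_BY_KIND` is what forces each kind of event to be defined explicitly, rather than detected inside a single reconcile function.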

Reasons NOT to do it

  1. Involves rewriting the operators
  2. Possibly more resources (CPU, Mem) required by the policy controller

Challenges

  1. Bootstrapping the tree of pre-existing resources in-memory may take some non-negligible time – i.e. consider the impact for the readiness state of the controller
  2. Achieve a sufficient level of abstraction so that it works for all policy implementers (i.e. not only for Kuadrant)
  3. Avoid re-inventing the wheel – watch out for awkward combinations of the informers pattern and straightforward reconcilers
  4. Re-educate devs on the new pattern – no longer “textbook” controller-runtime
@guicassolato
Contributor Author

kuadrant/policy-machinery can be employed for this.

eguzki removed their assignment Sep 18, 2024
eguzki moved this from In Progress to Done in Kuadrant Oct 18, 2024
eguzki closed this as completed Oct 18, 2024