[RFC] Create Autonomous Lattice Controller #177
Comments
YES! We identified that we would need this if we wanted to run a wasmcloud operator in each kubernetes cluster. I was hoping that this would be included. 😁
I haven't quite wrapped my head around overlapping resources yet (I hadn't even considered it as a possibility when thinking through the requirements). As a worked example, what happens here? Deployment A.1 => 10 providers plz => 10 providers claimed by A
Link definitions are the thing that killed us repeatedly when building https://github.com/redbadger/wasmcloud-k8s-demo. I recognise that things are improved in -otp, but it feels like we will still need a declarative way to specify links. Could the resource tagging strategy that's being proposed for capability providers also be applied to links? I also have an impression in my head that links can be constrained to capability providers with a given set of tags, but I suspect I might have dreamed that one. I think you want to be able to specify these two types of link:
I also realise that it's the things you don't build that make a product successful, so I'm happy for this bit to be dumped into the river of dreams if it helps to get the rest out of the door more quickly.
I think your comment implies that we want to enhance the …
The OTP version has …
I am considering using the Open Application Model as a means of declaring application deployments to the lattice controller. Would love some feedback on whether folks think this is a suitable use case for OAM or not. OAM is supported by Microsoft and Alibaba Cloud.
This is implemented by https://github.com/wasmCloud/wadm and further iteration will be tracked there.
Create Autonomous Lattice Controller
This RFC submits for comment the proposal that we create an autonomous lattice controller responsible for managing and reconciling declarative, lattice-wide deployments.
Summary
The lattice control interface is an imperative interface. Manifest files, as they were originally introduced in the pre-0.5/OTP versions of wasmCloud, are also imperative. Manifest file components and lattice control interface commands are all imperative--they instruct a single host to perform a single action. With this API, we can tell hosts to start actors and to stop actors, to start and stop providers, and we can even perform an auction where we ask the entire lattice for a list of hosts that satisfy a set of constraints. The purpose of this RFC is to propose a layer of autonomous operation above the imperative control interface.
Design Detail
The following are preliminary design details for the autonomous lattice controller, which will hereafter simply be called the lattice controller. The proposed design includes the creation of an application that is deployed into a single lattice. This application will be responsible for monitoring and interrogating the observed state of a lattice and, through the use of imperative APIs and other controls at its disposal, issuing the appropriate commands to reconcile the gap between observed state and desired state.
In the case of the lattice controller, the desired state is declared through a set of deployments. As each new deployment is submitted to the lattice controller, it will validate the deployment declaration and then begin managing that deployment through a control loop that consists of comparing the observed state against the desired state and issuing the appropriate commands to reconcile.
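As a rough illustration of this control loop, here is a minimal Python sketch. The event names, field names, and command shapes below are assumptions for illustration only, not the actual lattice control interface:

```python
# Illustrative sketch of the controller's control loop (hypothetical event
# names and command shapes, not the real lattice control interface).

def observed_state(events: list[dict]) -> dict[str, int]:
    """Fold lattice events into an aggregate: entity -> running instance count."""
    state: dict[str, int] = {}
    for evt in events:
        if evt["type"] == "actor_started":
            state[evt["actor"]] = state.get(evt["actor"], 0) + 1
        elif evt["type"] == "actor_stopped":
            state[evt["actor"]] = state.get(evt["actor"], 0) - 1
    return state

def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[tuple[str, str, int]]:
    """Diff desired vs. observed counts and emit imperative start/stop commands."""
    commands = []
    for entity in sorted(set(desired) | set(observed)):
        gap = desired.get(entity, 0) - observed.get(entity, 0)
        if gap > 0:
            commands.append(("start", entity, gap))
        elif gap < 0:
            commands.append(("stop", entity, -gap))
    return commands

events = [{"type": "actor_started", "actor": "echo"},
          {"type": "actor_started", "actor": "echo"},
          {"type": "actor_stopped", "actor": "echo"}]
print(reconcile({"echo": 3}, observed_state(events)))  # [('start', 'echo', 2)]
```

The key property is that the loop is idempotent: running it again after the commands take effect produces an empty command list, so the controller can simply re-run it on every state change.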
Observing the Lattice
Generating a cohesive view of observed state involves a number of resources, including:

- Events published on the `wasmbus.evt.{prefix}` subject, which, when applied to aggregates, can produce state. For example, events like `actor_started` and `actor_stopped` can be applied to an actor aggregate to update the actor's current state.

Deployment Definitions (Desired State)
A deployment definition contains the following key pieces of information:
Deployments explicitly do not contain link definitions. Link definitions are entities created by operations for runtime actor configuration, and that configuration persists regardless of the number of instances of entities like actors and providers.
Deployment Spreads
A deployment spread is a definition for how a deployment should spread a given entity across the available hosts. Spreads are defined by ratios and a set of label-value pairs that define the constraint for each ratio. For example, if you wanted to ensure that your lattice always had 3 copies of the HTTP Server Provider running, that 66% of those instances must be running on hosts tagged with `zone1`, and that 33% of them must be running on hosts tagged with `zone2` (a weighted failover scenario), you might define your spread as follows:

- 66%: `zone` == `zone1`
- 33%: `zone` == `zone2`

With such a spread definition, and an instance requirement of 3, the lattice controller would always attempt to ensure that you have 1 instance of the provider running in `zone2` and 2 instances of the provider running in `zone1`. If the lattice controller determined that the available resources in the lattice don't support the desired spread (e.g. there's only one host with the `zone:zone1` label), then a fallback policy would be used to either choose a random host (which could also fail if that host is already running that provider) or give up and only partially satisfy the deployment, which would leave the deployment in a failed state.

The various ratios within a spread must always sum to 1. We may come up with more complex ways of defining spreads in the future, but the core definition of a spread is a ratio applied to a set of constraints. Spreads can be applied to the deployment definitions for actors or providers, though actor spreads are more easily satisfied since more than one instance of the same actor can run within a single host.
Scope of a Deployment
Multiple deployments will co-exist within a single lattice, and the resources used by those deployments can co-exist with "unmanaged" resources. Actors and providers that are started manually will be left alone by the lattice controller (with some potential exceptions, discussed next).
As mentioned, a deployment describes a set of actors and providers. Actors and providers that are deployed by a lattice controller will be tagged as such, and the controller will manage only those resources that are part of a deployment.
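The tagging rule above can be sketched as a simple filter. The annotation key `wasmcloud.dev/deployment` below is a made-up name for illustration, not a real tag used by the controller:

```python
# Hypothetical annotation key; the real controller may tag resources differently.
DEPLOYMENT_TAG = "wasmcloud.dev/deployment"

def managed_by(resources: list[dict], deployment: str) -> list[dict]:
    """Return only the resources the controller should reconcile for this
    deployment. Unmanaged resources (no tag) and resources tagged for other
    deployments are left alone."""
    return [r for r in resources
            if r.get("annotations", {}).get(DEPLOYMENT_TAG) == deployment]

resources = [
    {"id": "actor-1", "annotations": {DEPLOYMENT_TAG: "webapp"}},
    {"id": "actor-2", "annotations": {}},                        # unmanaged
    {"id": "actor-3", "annotations": {DEPLOYMENT_TAG: "other"}}, # other deployment
]
print([r["id"] for r in managed_by(resources, "webapp")])  # ['actor-1']
```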
Dealing with Overlapping Resources
As a general rule, deployments will never interfere with resources that either belong to different deployments or are unmanaged. However, there are a few exceptions that stem from the fact that capability providers are designed to be reused by many actors and a single host cannot contain multiple instances of the same provider (at least not until we get WASI-based providers). These exceptions are:
Controllers will never claim or reuse unmanaged actors or actors tagged with a different deployment.
Managing Scale (Large Lattices/Many Deployments)
If the lattice controller is ever deployed into a lattice where the rate of change events becomes so high that it cannot process the state change computations and desired imperative command outputs fast enough, then one of two solutions is recommended:
Interacting with the Lattice Controller
The lattice controller will expose an API through which deployment definitions can be submitted. Deployments are immutable, and as such, every time you publish a new deployment with a given name, that deployment is given a monotonically increasing revision number.
You can also use the lattice controller's API to query the observed state of the lattice, which can be very handy when building third-party tooling or simply performing routine troubleshooting tasks while doing development and testing.
Deployment Rollbacks
The lattice controller API will also support the ability to roll back a deployment, which is done by specifying the revision number to roll back to.
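Because deployments are immutable and revisions increase monotonically, a rollback can be modeled as re-publishing an old revision's spec as a new revision. The store below is a hypothetical sketch of that behavior, not the controller's actual API:

```python
# Sketch of immutable, revisioned deployments (hypothetical store, not the
# real lattice controller API).

class DeploymentStore:
    def __init__(self):
        self.revisions: dict[str, list[dict]] = {}

    def publish(self, name: str, spec: dict) -> int:
        """Store a new immutable revision; revision numbers start at 1."""
        revs = self.revisions.setdefault(name, [])
        revs.append(spec)
        return len(revs)

    def rollback(self, name: str, revision: int) -> int:
        """Rolling back is just publishing the old spec as a new revision."""
        old = self.revisions[name][revision - 1]
        return self.publish(name, old)

store = DeploymentStore()
store.publish("webapp", {"actors": 1})   # revision 1
store.publish("webapp", {"actors": 5})   # revision 2
print(store.rollback("webapp", 1))       # 3: the old spec becomes revision 3
```

Modeling rollback as a forward publish keeps the revision history append-only, so the audit trail of what was deployed when is never rewritten.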
Drawbacks
The following is a list of identified drawbacks to implementing this RFC as described herein.
Is this just k8s for wasm?
Very early on, wasmCloud drew a line in the sand saying that it would not be responsible for scheduling the host process. It is entirely up to the wasmCloud consumer to figure out how and where and when they would like to start the host process. Once the process is running, we do indeed provide a way to manipulate the contents of the host (through the lattice control interface API).
One possible drawback that comes to mind: are we just re-inventing Kubernetes, but for wasm? While Kubernetes does allow the scheduling of far more resource types than wasmCloud deployments (which only schedule 2), a devil's advocate could suggest that we're reinventing a wheel here.
Rationale and Alternatives
The main rationale for this approach is that we want to combine our desire for declarative, simple, "self-managing" deployments with our desire to remain compatible with, rather than compete against, an entire ecosystem of tooling, and to stay available to the entire CNCF ecosystem without lock-in. The following is an itemization of some alternative approaches that were considered.
Tight Coupling with Kubernetes
One alternative to building our own lattice controller would be to simply bundle up all of the state observance, reconciliation, and deployment storage logic and stuff it inside a Kubernetes operator. While this approach might leverage more of the existing functionality of Kubernetes as a platform, it also prevents this capability from being used by anything other than Kubernetes. Does our desire to enable first- and third-party tooling to manage declarative deployments offset the "isn't this just k8s for wasm" argument? We feel that we can accomplish much more by creating the lattice controller and then exposing the controller's API to thin veneer tooling like a Kubernetes operator than by embedding all of the functionality inside the black box of a "fat operator".