# Problem statement

We currently have 2 callers of compute.CreateInstance: the machine controller for OpenStackMachine, and the cluster controller for the bastion[^mapo].

[^mapo]: Incidentally, although it is not a primary concern of this project there is also a third consumer in OpenShift's [Machine API Provider OpenStack](https://github.com/openshift/machine-api-provider-openstack).

Of these, the bastion has been the most problematic. The principal current problem with the bastion is that, unlike OpenStackMachine, it is mutable, because the bastion is not required to have the same lifecycle as the cluster which contains it. The current API allows the user to modify the bastion, and the controller is expected to delete the current one and create a new one. This causes problems in edge cases where the spec is needed to determine whether resources were previously created and therefore need to be cleaned up. That is a fundamental design problem; a less fundamental but still very real problem is that integrating the bastion into both the cluster and (effectively) machine creation and deletion flows produces fragile code and has resulted in numerous bugs.

# Potential solutions

## Do nothing

The current solution has sufficed in practice for quite a long time. Additionally, we now have much better test coverage of bastion operations than we have had in the past, so we should hopefully catch issues sooner.

The primary issue with this is the drag it places on bugfixes and new development, because the code flow of bastion deletion in particular can be hard to reason about.

## Create an OpenStackMachine for the bastion

This is an attractive solution because it better models the independent lifecycle of the bastion. When a new bastion is created we create a new OpenStackMachine. If the bastion spec is updated we delete the old one and create a new one, with the old and new objects having their own, independent state.

The problem with this approach is that OpenStackMachine has other contracts. It cannot be created independently, but requires a CAPI Machine object. Other controllers have expectations of Machine objects, for example that they will eventually have a NodeRef.

Creating a Machine/OpenStackMachine object for the bastion could negatively impact other CAPI users.

## Create a new controller for the bastion

This is essentially the same as using OpenStackMachine, except using a new CRD which can be independent of other CAPI objects.

The obvious disadvantage is that it requires a new CRD and controller.

# Proposal: An OpenStackServer CRD and controller

## Philosophy

We will create a new controller which is capable, in principle, of performing all of the OpenStack resource creation tasks common to bastion and OpenStackMachine creation. That is:

* Resolve OpenStack resource parameters to specific resources (e.g. image, networks, subnets, security groups, server groups)
* Create and delete ports
* Create and delete volumes
* Create and delete the server

It will have no dependency on Machine or OpenStackCluster, so things it will explicitly not do include:

* Attaching the Bastion floating IP
* Creating a loadbalancer member
* Adding a port for the default cluster network if no ports are provided
* Referencing an AvailabilityZone from the Machine object
* Adding default tags from the OpenStackCluster object

## API

Based on OpenStackMachineSpec, with modifications to accommodate the above.


```go
type OpenStackServerSpec struct {
    AvailabilityZone       string
    Flavor                 string
    Image                  ImageParam
    SSHKeyName             string
    Ports                  []PortOpts
    SecurityGroups         []SecurityGroupParam
    Trunk                  bool
    Tags                   []string
    ServerMetadata         []ServerMetadata
    ConfigDrive            *bool
    RootVolume             *RootVolume
    AdditionalBlockDevices []AdditionalBlockDevice
    ServerGroup            *ServerGroupParam
    IdentityRef            *OpenStackIdentityReference
    FloatingIPPoolRef      *corev1.TypedLocalObjectReference
    UserDataRef            *corev1.TypedLocalObjectReference
}

type OpenStackServerStatus struct {
    InstanceID    optional.String
    InstanceState *InstanceState
    Addresses     []corev1.NodeAddress
    Resolved      *ResolvedMachineSpec
    Resources     *MachineResources
    Conditions    clusterv1.Conditions
}
```

As the new API is non-trivial we should initially create it in v1alpha1.
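
For reference, a minimal sketch of how the full v1alpha1 object might wrap the spec and status above, using the usual Kubernetes API machinery (the exact shape is an assumption, not a finalised API):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// OpenStackServer is the Schema for the openstackservers API.
type OpenStackServer struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   OpenStackServerSpec   `json:"spec,omitempty"`
    Status OpenStackServerStatus `json:"status,omitempty"`
}

// OpenStackServerList contains a list of OpenStackServer.
type OpenStackServerList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []OpenStackServer `json:"items"`
}
```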

### Upgrading an existing deployment

OpenStackServer is a new API and does not affect any existing API.

In the same way that OpenStackMachine currently has an 'adoption' phase where it will adopt existing resources, OpenStackServer should adopt matching resources which it would have created. On upgrade to a new version of CAPO which manages the bastion with an OpenStackServer object, I would expect the flow to be:

* Cluster controller creates OpenStackServer and waits for it to report Ready
* OpenStackServer controller looks for existing resources matching its given spec
* OpenStackServer adopts existing resources
* OpenStackServer reports Ready

The end-user should not be required to take any action, and the upgrade process should not involve the deletion and recreation of existing OpenStack resources.
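
A rough sketch of the adoption step, assuming it reuses the same adopt-by-name behaviour the OpenStackMachine controller has today; the reconciler type and the `getInstanceStatusByName`/`createInstance` calls below are stand-ins for the existing compute service, not an exact API:

```go
// Hypothetical sketch only: adopt an existing server by name before creating one.
func (r *openStackServerReconciler) reconcileServer(ctx context.Context, server *OpenStackServer) error {
    // Adoption is by name: if an instance with the name this controller would
    // have used already exists, record it rather than creating a new one.
    instance, err := r.getInstanceStatusByName(ctx, server.Name)
    if err != nil {
        return err
    }
    if instance == nil {
        instance, err = r.createInstance(ctx, server)
        if err != nil {
            return err
        }
    }
    server.Status.InstanceID = &instance.ID
    return nil
}
```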

### Notes

* ProviderID is not present: this is a Machine property
* UserDataRef is added. If specified, the referenced object must exist.
* AvailabilityZone is added, and refers explicitly to an OpenStack availability zone rather than being taken from `Machine.FailureDomain`.
* It is an error for Ports to be empty. Defaulting the cluster network must be done by the controller creating the object (see the sketch after this list).
* No managed security group will be added automatically. The managed security group must be added explicitly by the controller creating the object.
* A Volume specified with `from: Machine` will use the value of `AvailabilityZone` instead of `Machine.FailureDomain`.
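
For illustration, a caller such as the cluster controller would have to fill in `Ports` (and any default tags) itself before creating the object. The field names below come from the spec above; the values and the `clusterDefaultPorts` variable are purely illustrative:

```go
// Sketch only: the creating controller supplies the default cluster network
// port and any default tags explicitly, because OpenStackServer will not
// default them. clusterDefaultPorts is hypothetical and would be built by the
// caller from the OpenStackCluster.
spec := OpenStackServerSpec{
    Flavor:     "m1.large",
    SSHKeyName: "default",
    Tags:       []string{"my-cluster"},
    Ports:      clusterDefaultPorts,
}
```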

It will set the following `Conditions`:
* `Ready`: `True` when the server is fully provisioned. Once set to `True` it will never subsequently be set to `False`.
* `Failed`: `True` means the OpenStackServer controller will not continue to reconcile this object. If set on a server which is not `Ready`, the server will never become `Ready`. If set on a server which is `Ready`, the server has suffered some terminal condition which cannot be resolved automatically. For example it may be in the `ERROR` state, or it may have been deleted.
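
Declared in code, the condition types might look like this (a sketch; the constant names are illustrative and not fixed by this document):

```go
import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

const (
    // ServerReadyCondition reports that the server is fully provisioned.
    ServerReadyCondition clusterv1.ConditionType = "Ready"
    // ServerFailedCondition reports a terminal error; the controller will not
    // continue to reconcile an object once this is True.
    ServerFailedCondition clusterv1.ConditionType = "Failed"
)
```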

## Changes to the cluster controller

Firstly, the OpenStackCluster will report `Ready` as soon as the cluster infrastructure is created, before creating the bastion. This is not strictly necessary for this change, but it simplifies the bastion creation logic and makes sense anyway.
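
Sketched in code (names are approximate, based loosely on the existing controller), the reordering is simply:

```go
// Report cluster infrastructure readiness first; bastion reconciliation
// happens afterwards and no longer gates cluster readiness.
openStackCluster.Status.Ready = true
if err := reconcileBastion(ctx, scope, openStackCluster); err != nil {
    return reconcile.Result{}, err
}
```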

The pseudocode of the cluster controller logic for the bastion becomes:
```
if spec.Bastion == nil {
    if OpenStackServer for bastion exists {
        delete OpenStackServer object
        return
    }
} else {
    serverObj := fetch OpenStackServer object
    if serverObj does not exist {
        substitute tags and default network into bastion spec if required
        create OpenStackServer object
        return
    }
    if server spec does not match bastion.Spec {
        deleteOpenStackServer(bastion)
        return
    }

    if server is Ready {
        attach bastion floating IP to server
        populate bastion status in OpenStackCluster
    }
}
```

The OpenStackServer for the bastion is owned by the OpenStackCluster, and is deleted automatically along with it.

The `Enabled` flag for the bastion in OpenStackClusterSpec is deprecated but not removed. Setting `Enabled` explicitly to false will continue to not create a bastion, although there is no longer any need to do this. Validation will be updated to permit changing the bastion spec without requiring the bastion to be deleted first.

`Resolved` and `Resources` from `BastionStatus` will not be removed, but will no longer be populated: this state is now stored in the `OpenStackServer`.

## Optional: Changes to the machine controller

The primary goal of this change is to fix the bastion, so it is not necessary to change the machine controller as well. However, once the bastion change is implemented it may make sense to update the machine controller to use the new controller too, i.e. the machine controller would:

* Create an OpenStackServer object corresponding to the machine spec plus resolved defaults from the OpenStackCluster.
* Wait for it to be created
* Do any API loadbalancer tasks required
* Copy required CAPI fields from OpenStackServer to OpenStackMachine
* Set ProviderID
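
For illustration, the final two steps might look roughly like this, assuming the instance ID is exposed via `OpenStackServerStatus` as above (the surrounding reconcile plumbing is omitted):

```go
// Sketch only: copy status from the OpenStackServer and set the provider ID
// in the form CAPO already uses ("openstack:///<instance id>").
openStackMachine.Status.Addresses = server.Status.Addresses
if server.Status.InstanceID != nil {
    providerID := "openstack:///" + *server.Status.InstanceID
    openStackMachine.Spec.ProviderID = &providerID
}
```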

## Note (but out of scope of this document)

OpenShift's [Machine API Provider OpenStack](https://github.com/openshift/machine-api-provider-openstack) could create an `OpenStackServer` object instead of calling directly into `compute.CreateInstance`.