✨ Additional data volumes for machines#1668
k8s-ci-robot merged 1 commit into kubernetes-sigs:main
Conversation
✅ Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!
Hi @mkjpryor. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
mdbooth
left a comment
Looking good! Let's start the discussion.
```go
NameSuffix       string `json:"nameSuffix"`
Size             int    `json:"diskSize,omitempty"`
VolumeType       string `json:"volumeType,omitempty"`
AvailabilityZone string `json:"availabilityZone,omitempty"`
```
@stephenfin Very much interested in your input here.
I've proposed adding useMachineAZ (default false) to a future api version for the root volume. What do you think the semantics of this one should be?
Options:
- Empty AZ means no AZ
- Empty AZ means something machine related
- Anything else?
Is there any reason to have an option here to use the same AZ as the machine?
I was planning to make the semantics the same as the root volume for now. I think that is less confusing.
Then when we change the root volume we can change this as well. My plan is to have one "volume creating" function so hopefully the logic doesn't have to be duplicated.
The problem is we know the semantics of the rootVolume are wrong, and we're going to have a minor upgrade problem there. If we know what we're going to do with rootVolume I'd prefer to get this right in the first instance.
See #1662 for WIP.
What is left to do on the root volume WIP? Should we get that merged first?
It looks like a sensible change to me. Are you worried about it being a breaking change because the default behaviour changes from "machine AZ" to "no AZ"?
(Which should have been the default all along IMHO, but what's done is done...)
I guess if we want a non-breaking change, then the useServerAZ flag defaults to true.
> I guess if we want a non-breaking change, then the useServerAZ flag defaults to true.
That's what I believe is already CAPO's behaviour on rootVolumes:
cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go
Lines 439 to 442 in 9d183bd
Yes - I mirrored the current root behaviour for the additional volumes on purpose.
I think we should change the behaviour for all types of volume together in a separate PR.
I don't like this behaviour for the same reason we need to change it for the root volume, namely what @JohnGarbutt called the '80% case' is impossible. We are definitely going to have to change it.
However, I can also see the benefit of being consistent with the root volume, and making a breaking change to both APIs at the same time in the future.
We came to an agreement to bulk change it via another PR. I added it to the TODO.
mdbooth
left a comment
Please could you also add field comments of the form
`// FieldName does X. It defaults to Y. It is incompatible with Z.`
/ok-to-test
Force-pushed 8491f71 to 28501e5
```go
RootVolume *RootVolume `json:"rootVolume,omitempty"`

// DataVolumes is a list of specifications for volumes to attach to the server instance
DataVolumes []DataVolume `json:"dataVolumes,omitempty"`
```
Not a big deal, but just to mention that CAPZ used DataDisks.
I wonder if we want to follow that pattern, since our data might end up being on volumes and/or local disks at some point.
I used DataVolumes to match RootVolume. I don't really have a strong opinion.
I think we're likely to end up with:
- RootVolume
- DataVolumes
- LocalDataVolume
I'm with @mkjpryor on the lack of a strong opinion on the name. I agree that DataVolumes is consistent with RootVolume. Presumably they're also referenced in... KubeadmControlPlane? What are they called there?
The things you have to set in kubeadmcontrolplane and/or kubeadmbootstrapconfig correspond to cloud-init fields, so DiskSetup and Mounts.
But the OpenStack terminology for these things is volume, so my feeling is that we stick with that 🤷♂️
Could we settle on the API, considering that we'll support multiple ephemeral disks and volumes?
Option A. DataVolumes and LocalDataVolumes, where each of them has the options needed for its use-case.
Option B. One field (maybe named BlockDevices, to avoid confusion with volumes) with all the fields that we need for both use cases.
I have a preference for Option B (and the rename) for simplicity's sake.
My preference is also for option B with the name BlockDevices. I like using a name that mirrors the OpenStack API attribute that we are ultimately setting - I think it is probably less confusing for users.
Could we capture that these block devices explicitly don't include the root volume. I feel that DataDisks and DataVolumes already capture that in the word Data. If we really like BlockDevices, perhaps AdditionalBlockDevices?
```go
// This is then available in the instance metadata, allowing devices to be
// identified by tag inside the server.
Tag string `json:"tag,omitempty"`
}
```
This can be addressed in a separate PR but for supporting local disks, we'll need to add more fields.
Just one I think, to indicate whether it should be a local or Cinder disk? The rest can be reused in both cases.
If we accept my argument that there will be a single local disk then I would prefer it to have a separate field because it's more obvious to the user and doesn't require validation. I would only put a local flag here if we decide it's useful to have multiple local disks.
But as you say we don't need to consider this yet.
@mdbooth as I raised in Slack, I think I have a use case for multiple ephemeral devices carved out of the ephemeral volume, so I would like to see an option that permits that.
But it is something for a future PR anyway
@mkjpryor Please look at my recent refactor & patches. I think the way we're proposing now is the same behaviour as with rootVolumes, and indeed this can be bulk-changed later.
If we were to add support for local ephemeral drives, we would need the DestinationType field and that's it, I think.
👍 Excellent start! Besides a few comments and missing docs (and tests?) this is good to me. Note: I'm very interested in local disks instead of volumes, but with your PR I can easily see it extended via a separate PR. However I have no objection (besides making the PR too big) if you include it here.
I feel like it should be a separate PR, just in case there are unintended consequences that need addressing.
/test pull-cluster-api-provider-openstack-test
/test pull-cluster-api-provider-openstack-build
Force-pushed 93fc4f6 to fceff67
```go
DeleteOnTermination: true,
// TODO(mkjpryor) uncomment once gophercloud patch is merged
// https://github.com/gophercloud/gophercloud/pull/2777
// Tag: dataVolumeSpec.Tag,
```
Once gophercloud/gophercloud#2789 is merged and a release is produced, I'll update vendoring here.
TODO:
See above comment. I'd also like to make it clearer that
test/e2e/suites/e2e/e2e_test.go
Outdated
```go
bootVolume := firstVolume
additionalVolume := secondVolume
// The boot volume is the one with the "bootable" flag set.
if firstVolume.Bootable != "true" { // This is genuinely a string, not a bool
```
Curious as to why we need these gymnastics. Don't we have the guarantee that the first volume is always the boot volume?
I need to do more testing to confirm that listing volumes provides an output ordered by creation time. I'll look at that today because I agree that this is ugly 😄.
I tested it and indeed it seems that the volume used for boot is always first.
```yaml
- diskSize: 1
  nameSuffix: extravol
  volumeType: ${OPENSTACK_VOLUME_TYPE_ALT}
  availabilityZone: ${OPENSTACK_FAILURE_DOMAIN}
```
Perhaps it would be good to drop the availabilityZone there, to replicate the test we do for rootVolume?
We're testing setting an availabilityZone for the worker.
test/e2e/suites/e2e/e2e_test.go
Outdated
```go
Expect(bootVolume.VolumeType).To(Equal(volumeTypeAlt))

additionalVolume := additionalVolumes[machine.Name]
Expect(additionalVolume.AvailabilityZone).To(Equal(failureDomainAlt))
```
If we want to ensure the additionalVolumes behave the same as rootVolume, we should probably replicate the test, i.e. expect the value to be failureDomain.
```go
"block_device_mapping_v2": []map[string]interface{}{
	{
		"delete_on_termination": true,
		"destination_type":      "local",
```
IIUC, we don't support local yet, so it's weird to have it there.
It's the default block device mapping when no root volume is being used.
If you want to specify a non-bootable block device, you also have to specify a bootable one. That means that we need to specify the "from image to local" block device
```
might create the volume, but the instance create fails. This would leave us with a dangling
volume with no instance.
1. Create the volume
2. Create the instance with the created volume in iAdditionalBlockDevices with DeleteOnTermination=true
```
vim user spotted.
```suggestion
2. Create the instance with the created volume in AdditionalBlockDevices with DeleteOnTermination=true
```
Feels like we should mention that the root volume is added to AdditionalBlockDevices as it's not completely obvious, no?
Good point. I'm updating the doc.
```go
	BootIndex: 0,
	DeleteOnTermination: true,
})
} else {
```
Just to make sure I understand the reason to set an ephemeral drive from the image: is it because, if we don't set it and we have additional block devices, we don't know where to boot the instance from?
Yes, exactly. We need ephemeral storage to be created using the imageID and be bootable (boot index 0).
Basically, yeah. OpenStack won't let you set non-bootable block devices without setting a bootable one
```go
	return nil, err
}
if existingVolume != nil {
	if existingVolume.Size != opts.Size {
```
Don't we need to also check the availability zone and the type to ensure the existing volume conforms to the spec?
Alternatively, it might be simpler to error out when there is an existing volume with the name we want to use.
What if we do this:
- When we create a volume, we put the Kubernetes UUIDs of the OpenStackCluster and the OpenStackMachine in the volume metadata
- openstack.infrastructure.cluster.x-k8s.io/cluster-uuid
- openstack.infrastructure.cluster.x-k8s.io/machine-uuid
- When we search for a volume by name and find one, we then check if the UUIDs in the metadata match the cluster/machine being reconciled
- If they do, we own the volume and that is the volume we return
- If they don't, we don't own the volume and we error out
If that's a pattern people agree on, I'll be happy to do this via a separate PR if possible, for the sake of not adding too many design changes into one PR.
@mdbooth and I also discussed doing this for Neutron resources using tags
That looks reasonable. I would just change the logic to search for volume by name and by metadata in a single query if that is possible. If it returns a result, that's the one we own and we can use it, otherwise, we know we need to create a new one.
Not sure the single query is possible unfortunately, although I might have missed something. I would be surprised if the metadata items were indexed for searching.
We came to the agreement that we (I) will take care of it in a separate PR.
If we ever cut a release containing this feature which does not add the metadata described by @EmilienM then we have an upgrade issue here which would be a whole lot more complicated. If we don't think we can get it into this PR, let's not wait around to implement it.
Like I said on another comment, I'm working on it and this will be done before the next release is cut.
Force-pushed 9fca026 to de72667
/hold cancel

This is ready for review.
mdbooth
left a comment
This is really close, thanks!
Can you please also consider how soon we can add the volume metadata support? Ideally this will be within the next couple of months to avoid an upgrade issue.
test/e2e/suites/e2e/e2e_test.go
Outdated
```go
Expect(volumes).To(HaveLen(2))

bootVolume, err := shared.GetOpenStackVolume(e2eCtx, volumes[0].ID)
rootVolume, err := shared.GetOpenStackVolume(e2eCtx, volumes[0].ID)
```
Is the order of AttachedVolumes stable? I would guess strictly not, although possibly stable enough in practice for this test?
It's apparently stable. The volume attachments listed for a server seem to be indexed by their creation, and the rootVolume is always created first. I was surprised myself but I've tested it a few times and it was always the case 🤷.
I'll try to find the code in Nova to confirm what I saw.
The API output is generated here. The bdms parameter is populated here. I followed that call to nova.objects.BlockDeviceMappingList.bdms_by_instance_uuid all the way down to the DB layer and didn't spot any sorting being done. This leads me to believe we're probably relying on the order of the rows being returned by the DB. As you likely know, MySQL doesn't guarantee sorting without a specific ORDER BY so that doesn't seem wise since it won't be consistent.
You should double check this (!) but assuming this is correct, you probably want to explore device tagging. In general, nova doesn't guarantee anything wrt device ordering and asks that you use device tagging if you need this. I don't know if gophercloud supports this, but you simply want to add the tags field to the BDM for the root volume, e.g.
```json
{
    "block_device_mapping_v2": [
        {
            "uuid": "<image_id>",
            "boot_index": 0,
            "source_type": "image",
            "destination_type": "volume",
            "volume_size": "<volume_size>",
            "tags": ["foo", "bar"]
        }
    ]
}
```
(Note: this requires API microversion 2.42.)
You can then use the volume attachments API (GET /servers/{serverID}/os-volume-attachments) to list the attached volumes with tags, since this info is not returned in the os-extended-volumes:volumes_attached field in the GET /servers/{serverID} response.
Right, that was my thought as well. I'll investigate it.
For now, I've addressed this comment with:
cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go
Lines 579 to 592 in 6855f55
Until we find a better solution. With device tagging that I plan to work on next, we'll probably make this better.
I'm working on it now, I think the PR will be pushed before Tuesday next week.
Force-pushed 6855f55 to b76c92c
lentzi90
left a comment
Good job with this!
I agree with the decision to implement the AZ in the same way as for the root volume. It will be easier to understand it this way and then break both at the same time.
Could you please add some doc strings for at least the more complicated functions? I commented below on those I think are most important.
/test pull-cluster-api-provider-openstack-e2e-full-test
Force-pushed b76c92c to 26e9191
@lentzi90 comments addressed. Thanks for the review!
mdbooth
left a comment
Please could you respond to the 3 API requests:
- Rename `nameSuffix` to `name`
- Remove `tag` in favour of just using `name`
- Addition of strategic merge patch annotation to `additionalBlockDevices`

There's also an outstanding request from @lentzi90 for a missing docstring.
Apart from that, this looks good and has good test coverage. I don't think there's anything complicated in the above requests.
api/v1alpha7/types.go
Outdated
```go
type AdditionalBlockDevice struct {
	// NameSuffix is the suffix to be added to volume names.
	// It will be combined with the machine name to make the volume name.
	NameSuffix string `json:"nameSuffix"`
```
Could we call this name instead? The docstring can explain that the volume name will be the machine name with this appended.
I don't understand why you want to name it Name. This is not a name, this is a name suffix.
This would be a lie to the API user that their block device would be named like this, because this wouldn't be the case.
The actual block device name is chosen from within the code:
`volumeName(instanceSpec.Name, blockDevice.NameSuffix)`
See below. It would be the name of the volume in the context of this machine. It would also be the name of the volume in the additionalBlockDevices list.
api/v1alpha7/types.go
Outdated
```go
// Tag is a tag to add to the volume attachment.
// This is then available in the instance metadata, allowing devices to be
// identified by tag inside the server.
Tag string `json:"tag,omitempty"`
```
Is there any reason to have a separate tag element instead of just using the name?
To explain the above API requests, consider the case where a user wants to specify an additional volume to hold etcd.

Before:
```yaml
additionalBlockDevices:
- nameSuffix: etcd
  size: 10
  tag: etcd
```
After:
```yaml
additionalBlockDevices:
- name: etcd
  size: 10
```
I prefer the latter.
Force-pushed 26e9191 to 61afcac
We now have the ability for a machine to have additional block devices attached. Here is an example of how to use `additionalBlockDevices` to attach additional Cinder volumes to the server instance:
```yaml
additionalBlockDevices:
- nameSuffix: dbvol1
  size: 50
  volumeType: my-volume-type
  availabilityZone: az0
  tag: database
```
Co-authored-by: Matt Pryor <matt@stackhpc.com>
Co-authored-by: Emilien Macchi <emilien@redhat.com>
Force-pushed 61afcac to 94d9690
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, mkjpryor

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Pending tests, of course. @lentzi90 Can you lgtm if you're happy with the API changes?
/test pull-cluster-api-provider-openstack-e2e-full-test
What this PR does / why we need it:
This PR adds the ability for a machine to have additional data volumes attached, e.g. for etcd.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #1286
Special notes for your reviewer: