Adds proposal for AutoDetection on LSO #190
Conversation
jarrpa left a comment
I like where this is going. I hope people don't try to overload the initial API with too much functionality, but instead leave room to expand it later so we can get started on an MVP. I've left a couple of comments, mostly for language clarity.
We had a meeting regarding this on 20th February, and we concluded that we will be developing a separate UI for the LSO; users will be directed to complete the LSO configuration first.
Signed-off-by: Ashish Ranjan <[email protected]>
Signed-off-by: Rohan CJ <[email protected]>
- Add non-goal: coexisting with other owners of local-storage
- Move examples to their own section.

Signed-off-by: Rohan CJ <[email protected]>
Made some changes in a new commit that is to be squashed later:
gnufied left a comment
Looks closer to what we discussed. It is still missing the flow of the local-storage-operator when an AutoDetectVolume type is created by the user.
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
I thought we also discussed having a boolean "dry-run" value which just finds the devices and does not actually create them. We are also going to need to design a type that could hold the devices found on the node. We could use a ConfigMap or a new type...
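For illustration, a rough sketch of what such a per-node result type could look like — the names LocalVolumeDiscoveryResult and DiscoveredDevice, and all of the fields shown, are placeholders and not part of the proposal:

```go
// Hypothetical per-node discovery result type; all names and fields are
// illustrative placeholders, not API settled by this proposal.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DiscoveredDevice captures the properties of a block device found on a node.
type DiscoveredDevice struct {
	// Path is the device path, e.g. /dev/sdb.
	Path string `json:"path"`
	// DeviceID is the stable by-id path used to address the device.
	DeviceID string `json:"deviceID,omitempty"`
	// Size is the device capacity in bytes.
	Size int64 `json:"size"`
	// Rotational is true for HDDs and false for SSD/NVMe devices.
	Rotational bool `json:"rotational"`
	// FSType is the detected filesystem, empty if the device is unformatted.
	FSType string `json:"fsType,omitempty"`
}

// LocalVolumeDiscoveryResult holds the devices discovered on a single node.
type LocalVolumeDiscoveryResult struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// NodeName is the node on which the devices were discovered.
	NodeName string `json:"nodeName"`
	// DiscoveredDevices is the list of candidate devices on this node.
	DiscoveredDevices []DiscoveredDevice `json:"discoveredDevices,omitempty"`
}
```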
I disagree with the dry-run value approach. CRs already represent desired state, and using them for dry-run state just doesn't seem right. Discovery should store the available device list somewhere the UI can load it from, present the devices to the user, and then create CRs when the user is ready to commit the devices.
Okay, but then that still leaves us with an unsolved problem. We want the UI to be able to trigger a filtering of the set of all discovered devices without 1. committing to whatever the result is, or 2. forcing the UI to reimplement the filtering logic of the backend. What's your counter-proposal here?
The UI is given all the metadata it needs to present the available devices to the user, and it has to display a reasonable interface around those properties. The filtering itself seems like it would already need to be baked into the UI, such as knowing which nodes hold which devices and how much storage is on each node.
Where is the filtering complicated enough that we couldn't have reasonable confidence that the desired state generated in the CR would end up as the expected PVs?
The problem for me lies in the duplication of effort that needs to be maintained in sync by two separate teams that may not be in regular contact with each other. It's not that it's difficult, it's that it's inefficient and more of a long-term maintenance headache than the proposed solution.
@jarrpa and I had a discussion on this... Most importantly, the discussion shouldn't block this design PR. Let's keep pushing to get this merged soon. Jose will start a new thread with this topic.
Most importantly, the discussion shouldn't block this design PR
This discussion is the root of this PR. If the filtering is done in the UI, why do we need complicated rules in LSO? The UI, knowing which devices the user selected/filtered, can create LocalVolume CRs directly, without any DeviceDiscoveryPolicy.
Note that the output of LSO is Kubernetes PVs to be consumed by OCS (or anything else), not a hardware inventory of all the disks on a node.
From what I read above, some UI wants to present a hardware inventory, and once the user selects/filters the devices, the UI does something so PVs are created. We are discussing that something here in this PR; however, I haven't read anything about how the hardware inventory gets to the UI in the first place. Will there be another daemon running on all nodes collecting information about block devices, just like LSO, only presenting it to Kubernetes in a different way than LSO does?
@jsafrane The reason we discussed having a separate thread for this topic is that this design doc is independent from the UI. The UI will need to build on top of this new design, but fundamentally this AutoDetectVolume CR doesn't need any UI around it in order to configure the devices. Though if something in this design is blocking the UI design, then I would certainly want to resolve it here.
From what I read above, some UI wants to present hardware inventory and once user selects / filters the devices then UI does something so PVs are created. We are discussing the something here in this PR
That something for the UI to create PVs would be to create the AutoDetectVolume CR(s). This is the signal that the UI is ready to commit to the device selection properties that the user decided on. That something could also be for the UI to create individual CRs for each of the devices, and there would be no need for the AutoDetectVolume CR. IMHO, the AutoDetectVolume is a better approach for the UI, but if we settle on the other approach during UI design, then so be it.
however, I haven't read anything about how the HW inventory gets to UI in the first place. Will there be another daemon running on all nodes collecting information about block devices, just like LSO, only presenting them to Kubernetes in another way than LSO does???
The thought is that the LSO daemons that are already running on all the nodes would be able to discover the available devices and store their properties somewhere for the UI to discover. That somewhere could be a ConfigMap for each node, stored as JSON. Or maybe a new CRD type should be created. The CRD seems like overkill since there is nothing for an operator to reconcile. LSO just needs to communicate metadata to the UI.
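As a sketch of the ConfigMap route, assuming a hypothetical helper inside the discovery daemon; the ConfigMap name pattern, the label key, and the discoveredDevice fields are all illustrative, not part of the proposal:

```go
// Hypothetical daemon-side helper; names, the ConfigMap naming scheme, and the
// label key are illustrations only.
package discovery

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// discoveredDevice mirrors whatever per-device metadata the daemon collects
// (path, size, rotational flag, filesystem, ...).
type discoveredDevice struct {
	Path       string `json:"path"`
	Size       int64  `json:"size"`
	Rotational bool   `json:"rotational"`
	FSType     string `json:"fsType,omitempty"`
}

// deviceConfigMap serializes the devices discovered on one node into a
// ConfigMap that the UI (or any other consumer) can read without a new CRD.
func deviceConfigMap(namespace, nodeName string, devices []discoveredDevice) (*corev1.ConfigMap, error) {
	payload, err := json.Marshal(devices)
	if err != nil {
		return nil, err
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "local-devices-" + nodeName, // hypothetical naming scheme
			Namespace: namespace,
			Labels:    map[string]string{"local.storage.openshift.io/node": nodeName}, // illustrative label
		},
		Data: map[string]string{"devices": string(payload)},
	}, nil
}
```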
The thought is that the LSO daemons that are already running on all the nodes would be able to discover the available devices and store their properties somewhere for the UI to discover.
This was not part of this enhancement. Can we get all the requirements on LSO into this enhancement, with all API objects and a complete workflow from a "blank" OCP cluster to a cluster with OCS installed (or at least ready to be installed)? Otherwise we may end up with a half-designed solution that won't make anyone happy.
To me it looks like this PR is trying to reinvent OpenEBS disk manager
If I were to summarize all the meetings and design discussions...
1. OpenShift admins need a UI to select which devices will be exposed as local PVs.
2. The UI needs to display available storage for devices that don't yet have PVs. The UI (pending UI design) would not show individual devices, but would show some representation of available storage and allow the user to filter devices out of the selection based on SSD/HDD, size, or other properties.
3. The UI would drive LSO to create the local PVs.
4. Different types of devices should be available through different storage classes (SSD vs. NVMe vs. HDD).
5. The OCS UI needs to detect what local PVs are available and allow the user to configure OCS to use some subset of the available storage.
We don't see #1 as being under the OCS scope since the local PVs could also be used outside of OCS.
@gnufied Since you've been in most of the discussions, does this match your expectation as well?
@jsafrane There are at least several other solutions in the community for local PV configuration and LSO overlaps with them. Our discussions have been to take a dependency on LSO. Are you suggesting LSO isn't the right place for this? Or simply pointing out a similar design?
}

type AutoDetectVolumeStatus struct {
	Status string `json:"status"`
Please expand this type.
How about adding pod conditions instead of a single status string?
Not sure adding pod conditions is a good idea here. I think the status should reflect the discovery progress.
I think expanding this, whatever you had in mind, would be overkill for an MVP. We should be able to get by with just a simple string for the time being, unless you have something else specific in mind.
What will be the support status of this new feature? I proposed keeping this alpha for now, but even then, keeping this field as a simple string is not going to be enough. In LSO we typically try to stick to the specs provided to us by the API team - https://github.com/openshift/api/blob/master/operator/v1/types.go#L97
I am not sure all the fields in that type will be necessary, but even for an MVP/alpha feature we should pick and propose the fields we are going to use, so that they can be discussed during the design phase.
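For discussion, a minimal sketch of what an expanded status could look like if it borrows the conditions pattern from the openshift/api operator types linked above; the Phase values and the observedGeneration field are only illustrative, not settled API:

```go
// Hypothetical expanded status; field set and phase values are illustrative.
package v1alpha1

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// AutoDetectVolumeStatus reports discovery progress and overall health.
type AutoDetectVolumeStatus struct {
	// Phase is a coarse, human-readable summary, e.g. "Discovering", "Ready", "Failed".
	Phase string `json:"phase,omitempty"`
	// Conditions carry machine-readable detail (Available, Progressing, Degraded, ...),
	// following the openshift/api operator conventions.
	Conditions []operatorv1.OperatorCondition `json:"conditions,omitempty"`
	// ObservedGeneration is the last spec generation the operator acted upon.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
```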
- state (as outputted by `lsblk`) is not `suspended`
- Ensuring disks aren't re-detected as new or otherwise destroyed if their device path changes.
  - This is already ensured by the current LSO approach of consuming disks by their `UUID`
We need to document recovery from errors. What happens if a user has made a mistake and wants to undo the "creation" of an AutoDetectVolume object?
If they want to remove a volume that was automatically created, they would need to:
- Update the CR so it no longer automatically picks up the device(s) that aren't desired
- The admin would then delete/clean up the PV
@gnufied Is that along the lines of what you are asking?
And does this documentation need to come with this PR? @gnufied
That is fine, we just need to ensure that this is captured somewhere, as a limitation or something, so that it can be documented later.
This commit adds tolerations to the API spec of the AutoDetectVolume CR.

Signed-off-by: Ashish Ranjan <[email protected]>
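For context, a rough sketch of how the spec might read with the tolerations field added; storageClassName and nodeSelector are placeholders included for completeness, not fields confirmed by this proposal:

```go
// Hypothetical spec sketch; only volumeMode, fsType, and tolerations appear in
// the reviewed text, the other fields are illustrative.
package v1alpha1

import corev1 "k8s.io/api/core/v1"

type AutoDetectVolumeSpec struct {
	// StorageClassName to attach to the generated local PVs.
	StorageClassName string `json:"storageClassName"`
	// VolumeMode of the generated PVs: Block or Filesystem.
	VolumeMode corev1.PersistentVolumeMode `json:"volumeMode,omitempty"`
	// FSType to create when volumeMode is Filesystem.
	// +optional
	FSType string `json:"fsType,omitempty"`
	// NodeSelector restricts discovery to matching nodes.
	NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"`
	// Tolerations are applied to the discovery daemon pods so that tainted
	// nodes can still be scanned.
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```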
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
@jsafrane why is this the case? I've not faced any issues so far.
Because the admin has to find the PVs and delete them manually. Finding the few PVs that are wrong among thousands of other PVs with randomly generated names can be especially challenging, and any mistake can lead to the admin losing data.
I think it'll be possible to delete by StorageClass through either:
Yes, you can filter by storage class name; still, that can yield hundreds or thousands of PVs.
Another difficulty with removing local-volume PVs is that, as long as the symlinks exist in the path where the upstream provisioner searches for devices, it will just re-create the PVs even if you delete them. So it is somewhat harder to remove local PVs...
travisn left a comment
@ashishranjan738 Another section that would help with this design is the flow of events after an AutoDetectVolume CR is created. For example, does the operator communicate with each of the daemons, which then create the PVs?
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
Added in latest commit.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ashishranjan738, leseb, rohantmp. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing...
Opened a new PR #237 to avoid the GitHub unicorn.
This commit adds a design proposal for auto detection of disks/devices
in local-storage operator.
Signed-off-by: Ashish Ranjan [email protected]