Adds proposal for AutoDetection on LSO #190
Conversation
jarrpa left a comment
I like where this is going. I hope people don't try to overload the initial API with too much functionality, but instead leave room to expand it later so we can get started on an MVP. I've left a couple of comments, mostly for language clarity.
We had a meeting regarding this on 20th February, and we concluded that we will be developing a separate UI for the LSO; users will be directed to complete the LSO configuration first.
Signed-off-by: Ashish Ranjan <[email protected]>
Signed-off-by: Rohan CJ <[email protected]>
- Add non-goal: coexisting with other owners of local-storage
- Move examples to their own section.

Signed-off-by: Rohan CJ <[email protected]>
Made some changes in a new commit that is to be squashed later:
gnufied left a comment
Looks closer to what we discussed. It is still missing the flow of the local-storage-operator when an AutoDetectVolume type is created by the user.
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
I thought we also discussed having a boolean "dry-run" value which just finds the devices and does not actually create them. We are also going to need to design a type that could hold the devices found on the node. We could use a ConfigMap or a new type...
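For illustration, a rough sketch of what such a per-node result type could look like — the names LocalVolumeDiscoveryResult and DiscoveredDevice, and all of the fields shown, are placeholders and not part of the proposal:

```go
// Hypothetical per-node discovery result type; all names and fields are
// illustrative placeholders, not API settled by this proposal.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DiscoveredDevice captures the properties of a block device found on a node.
type DiscoveredDevice struct {
	// Path is the device path, e.g. /dev/sdb.
	Path string `json:"path"`
	// DeviceID is the stable by-id path used to address the device.
	DeviceID string `json:"deviceID,omitempty"`
	// Size is the device capacity in bytes.
	Size int64 `json:"size"`
	// Rotational is true for HDDs and false for SSD/NVMe devices.
	Rotational bool `json:"rotational"`
	// FSType is the detected filesystem, empty if the device is unformatted.
	FSType string `json:"fsType,omitempty"`
}

// LocalVolumeDiscoveryResult holds the devices discovered on a single node.
type LocalVolumeDiscoveryResult struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// NodeName is the node on which the devices were discovered.
	NodeName string `json:"nodeName"`
	// DiscoveredDevices is the list of candidate devices on this node.
	DiscoveredDevices []DiscoveredDevice `json:"discoveredDevices,omitempty"`
}
```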
I disagree with the dry-run value approach. CRs already represent desired state, and using them for dry-run state just doesn't seem right. Discovery should store the available device list somewhere the UI can load it from, present the devices to the user, and then create CRs when the user is ready to commit the devices.
Okay, but then that still leaves us with an unsolved problem. We want the UI to be able to trigger a filtering of the set of all discovered devices without 1. committing to whatever the result is, or 2. forcing the UI to reimplement the filtering logic of the backend. What's your counter-proposal here?
The UI is given all the metadata it needs to present the available devices to the user, and it has to display a reasonable interface around those properties. The filtering itself seems like it would already need to be baked into the UI, such as knowing which nodes hold which devices and how much storage is on each node.
Where is the filtering complicated enough that we couldn't have reasonable confidence that the desired state generated in the CR would end up as the expected PVs?
The problem for me lies in the duplication of effort that needs to be maintained in sync by two separate teams that may not be in regular contact with each other. It's not that it's difficult, it's that it's inefficient and more of a long-term maintenance headache than the proposed solution.
@jarrpa and I had a discussion on this... Most importantly, the discussion shouldn't block this design PR. Let's keep pushing to get this merged soon. Jose will start a new thread with this topic.
Most importantly, the discussion shouldn't block this design PR
This discussion is the root of this PR. If the filtering is done in the UI, why do we need complicated rules in LSO? The UI, knowing which devices the user selected/filtered, can create LocalVolume CRs directly, without any DeviceDiscoveryPolicy.
Note that the output of LSO is Kubernetes PVs to be consumed by OCS (or anything else), not a hardware inventory of all the disks on a node.
From what I read above, some UI wants to present a hardware inventory, and once the user selects/filters the devices, the UI does something so PVs are created. We are discussing that something here in this PR; however, I haven't read anything about how the hardware inventory gets to the UI in the first place. Will there be another daemon running on all nodes collecting information about block devices, just like LSO, only presenting it to Kubernetes in a different way than LSO does?
@jsafrane The reason we discussed having a separate thread for this topic is that this design doc is independent from the UI. The UI will need to build on top of this new design, but fundamentally this AutoDetectVolume CR doesn't need any UI around it in order to configure the devices. Though if something in this design is blocking the UI design, then I would certainly want to resolve it here.
From what I read above, some UI wants to present hardware inventory and once user selects / filters the devices then UI does something so PVs are created. We are discussing the something here in this PR
That something for the UI to create PVs would be to create the AutoDetectVolume CR(s). This is the signal that the UI is ready to commit to the device selection properties that the user decided on. That something could also be for the UI to create individual CRs for each of the devices, and there would be no need for the AutoDetectVolume CR. IMHO, the AutoDetectVolume is a better approach for the UI, but if we settle on the other approach during UI design, then so be it.
however, I haven't read anything about how the HW inventory gets to UI in the first place. Will there be another daemon running on all nodes collecting information about block devices, just like LSO, only presenting them to Kubernetes in another way than LSO does???
The thought is that the LSO daemons that are already running on all the nodes would be able to discover the available devices and store their properties somewhere for the UI to discover. That somewhere could be a ConfigMap for each node, stored as JSON. Or maybe a new CRD type should be created. The CRD seems like overkill since there is nothing for an operator to reconcile. LSO just needs to communicate metadata to the UI.
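As a sketch of the ConfigMap route, assuming a hypothetical helper inside the discovery daemon; the ConfigMap name pattern, the label key, and the discoveredDevice fields are all illustrative, not part of the proposal:

```go
// Hypothetical daemon-side helper; names, the ConfigMap naming scheme, and the
// label key are illustrations only.
package discovery

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// discoveredDevice mirrors whatever per-device metadata the daemon collects
// (path, size, rotational flag, filesystem, ...).
type discoveredDevice struct {
	Path       string `json:"path"`
	Size       int64  `json:"size"`
	Rotational bool   `json:"rotational"`
	FSType     string `json:"fsType,omitempty"`
}

// deviceConfigMap serializes the devices discovered on one node into a
// ConfigMap that the UI (or any other consumer) can read without a new CRD.
func deviceConfigMap(namespace, nodeName string, devices []discoveredDevice) (*corev1.ConfigMap, error) {
	payload, err := json.Marshal(devices)
	if err != nil {
		return nil, err
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "local-devices-" + nodeName, // hypothetical naming scheme
			Namespace: namespace,
			Labels:    map[string]string{"local.storage.openshift.io/node": nodeName}, // illustrative label
		},
		Data: map[string]string{"devices": string(payload)},
	}, nil
}
```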
The thought is that the LSO daemons that are already running on all the nodes would be able to discover the available devices and store their properties somewhere for the UI to discover.
This was not part of this enhancement. Can we get all the requirements on LSO into this enhancement, with all API objects and a complete workflow from a "blank" OCP cluster to a cluster with OCS installed (or at least ready to be installed)? Otherwise we may end up with a half-designed solution that won't make anyone happy.
To me it looks like this PR is trying to reinvent OpenEBS disk manager
If I were to summarize all the meetings and design discussions...
1. OpenShift admins need a UI to select which devices will be exposed as local PVs.
2. The UI needs to display available storage for devices that don't yet have PVs. The UI (pending UI design) would not show individual devices, but would show some representation of available storage and allow the user to filter devices out of the selection based on SSD/HDD, size, or other properties.
3. The UI would drive LSO to create the local PVs.
4. Different types of devices should be available through different storage classes (SSD vs. NVMe vs. HDD).
5. The OCS UI needs to detect what local PVs are available and allow the user to configure OCS to use some subset of the available storage.
We don't see #1 as being under the OCS scope since the local PVs could also be used outside of OCS.
@gnufied Since you've been in most of the discussions, does this match your expectation as well?
@jsafrane There are at least several other solutions in the community for local PV configuration and LSO overlaps with them. Our discussions have been to take a dependency on LSO. Are you suggesting LSO isn't the right place for this? Or simply pointing out a similar design?
}

type AutoDetectVolumeStatus struct {
	Status string `json:"status"`
Please expand this type.
How about adding pod conditions instead of a single status string?
Not sure adding pod conditions is a good idea here. I think the status should reflect the discovery progress.
I think expanding this, whatever you had in mind, would be overkill for an MVP. We should be able to get by with just a simple string for the time being, unless you have something else specific in mind.
What will be the support status of this new feature? I proposed keeping this alpha for now, but even then, keeping this field as a simple string is not going to be enough. In LSO we typically try to stick to the specs provided to us by the API team - https://github.com/openshift/api/blob/master/operator/v1/types.go#L97
I am not sure all the fields in that type will be necessary, but even for an MVP/alpha feature we should pick and propose the fields we are going to use, so that they can be discussed during the design phase.
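For discussion, a minimal sketch of what an expanded status could look like if it borrows the conditions pattern from the openshift/api operator types linked above; the Phase values and the observedGeneration field are only illustrative, not settled API:

```go
// Hypothetical expanded status; field set and phase values are illustrative.
package v1alpha1

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// AutoDetectVolumeStatus reports discovery progress and overall health.
type AutoDetectVolumeStatus struct {
	// Phase is a coarse, human-readable summary, e.g. "Discovering", "Ready", "Failed".
	Phase string `json:"phase,omitempty"`
	// Conditions carry machine-readable detail (Available, Progressing, Degraded, ...),
	// following the openshift/api operator conventions.
	Conditions []operatorv1.OperatorCondition `json:"conditions,omitempty"`
	// ObservedGeneration is the last spec generation the operator acted upon.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
```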
- state (as outputted by `lsblk`) is not `suspended`
- Ensuring disks aren't re-detected as new or otherwise destroyed if their device path changes.
  - This is already ensured by the current LSO approach of consuming disks by their `UUID`
We need to document recovery from errors. What happens if a user has made a mistake and wants to undo the "creation" of an AutoDetectVolume object?
If they want to remove a volume that was automatically created, they would need to:
- Update the CR so it no longer automatically picks up the device(s) that aren't desired
- The admin would then delete/clean up the PV
@gnufied Is that along the lines of what you are asking?
And does this documentation need to come with this PR? @gnufied
That is fine, we just need to ensure that this is captured somewhere, as a limitation or something, so that it can be documented later.
This commit adds tolerations to the API spec of the AutoDetectVolume CR.

Signed-off-by: Ashish Ranjan <[email protected]>
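For context, a rough sketch of how the spec might read with the tolerations field added; storageClassName and nodeSelector are placeholders included for completeness, not fields confirmed by this proposal:

```go
// Hypothetical spec sketch; only volumeMode, fsType, and tolerations appear in
// the reviewed text, the other fields are illustrative.
package v1alpha1

import corev1 "k8s.io/api/core/v1"

type AutoDetectVolumeSpec struct {
	// StorageClassName to attach to the generated local PVs.
	StorageClassName string `json:"storageClassName"`
	// VolumeMode of the generated PVs: Block or Filesystem.
	VolumeMode corev1.PersistentVolumeMode `json:"volumeMode,omitempty"`
	// FSType to create when volumeMode is Filesystem.
	// +optional
	FSType string `json:"fsType,omitempty"`
	// NodeSelector restricts discovery to matching nodes.
	NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"`
	// Tolerations are applied to the discovery daemon pods so that tainted
	// nodes can still be scanned.
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```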
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
@jsafrane why is this the case? I've not faced any issues so far.
Because the admin has to find the PVs and delete them manually. Finding the few PVs that are wrong among thousands of other PVs with randomly generated names can be especially challenging, and any mistake can lead to the admin losing data.
I think it'll be possible to delete by StorageClass through either:
Yes, you can filter by storage class name; still, that can yield hundreds or thousands of PVs.
Another difficulty with removing local-volume PVs is that, as long as the symlinks exist in the path where the upstream provisioner searches for devices, it will just re-create the PVs even if you delete them. So it is somewhat harder to remove local PVs...
travisn left a comment
@ashishranjan738 Another section that would help with this design is the flow of events after an AutoDetectVolume CR is created. For example, does the operator communicate with each of the daemons, which then create the PVs?
VolumeMode PersistentVolumeMode `json:"volumeMode,omitempty"`
// FSType type to create when volumeMode is Filesystem
// +optional
FSType string `json:"fsType,omitempty"`
Added in latest commit.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ashishranjan738, leseb, rohantmp. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing...
Opened a new PR #237 to avoid the GitHub unicorn.
This commit adds a design proposal for auto detection of disks/devices
in local-storage operator.
Signed-off-by: Ashish Ranjan [email protected]