Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: SMART support #19103

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

tomasmatus
Copy link
Member

@tomasmatus tomasmatus commented Jul 17, 2023

@tomasmatus
Copy link
Member Author

image

For now I added a new card to the disk detail which shows basic information and allows to run/abort self test.

Note that for now only (S)ATA devices are supported by Udisks, NVMe support will be added in the latest release https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.0

@garrett I believe you wanted to work on mockups for this, any chance you could look into proper design?

Copy link
Member

@jelly jelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating tests for this would be great, maybe udisks already has a way to mock it?

pkg/storaged/smart-details.jsx Show resolved Hide resolved
return (
<Card>
<CardHeader actions={{ actions: <SmartActions smartInfo={smartInfo} /> }}>
<CardTitle component="h2">{_("S.M.A.R.T")}</CardTitle>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's really something one can translate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm force of habit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in non-latin languages? Like chinese or russian or so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if in doubt, keep it translatable. I can't say I like the many dots, this looks a bit like a 90's superhero comic 😁 So please either "SMART", or at least make it consistent and add a period to the T as well.

@martinpitt martinpitt marked this pull request as draft July 24, 2023 08:55
@martinpitt
Copy link
Member

Creating tests for this would be great, maybe udisks already has a way to mock it?

Yes, it does. SmartUpdate() has an atasmart_blob attribute where you can load a previously skdump --save'd blob. Of course you can also edit this in between, to have a few scenarios (passing and failing).

@tomasmatus
Copy link
Member Author

tomasmatus commented Jul 24, 2023

I am not sure how to test this. In order to run SmartUpdate udisks needs to expose org.freedesktop.UDisks2.Drive.Ata which is only available when the drive is (S)ATA and is likely to support SMART.

@jelly
Copy link
Member

jelly commented Jul 24, 2023

I am not sure how to test this. In order to run SmartUpdate udisks needs to expose org.freedesktop.UDisks2.Drive.Ata which is only available when the drive is (S)ATA and is likely to support SMART.

Neither does udisks test it (only if they find SMART disks), maybe because of the same reason?

Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu
Hmm :)

-drive id=disk,file=IMAGE.img,if=none \
-device ahci,id=ahci \
-device ide-hd,drive=disk,bus=ahci.0

So my suggestion would be to yolo hack Virtmachine.add_disk in (bots), make sure your cockpit checkout uses the changed bots and try to get TEST_DISK_XML to use AHCI. If that even supports SMART faking, if not we can't do this I suppose :(.

@jelly
Copy link
Member

jelly commented Nov 17, 2023

FYI udiskctl has a smart simulate options:

[root@archlinux ~]# udisksctl --help
Unknown command `--help'
Usage:
  udisksctl COMMAND

Commands:
  help            Shows this information
  info            Shows information about an object
  dump            Shows information about all objects
  status          Shows high-level status
  monitor         Monitor changes to objects
  mount           Mount a filesystem
  unmount         Unmount a filesystem
  unlock          Unlock an encrypted device
  lock            Lock an encrypted device
  loop-setup      Set-up a loop device
  loop-delete     Delete a loop device
  power-off       Safely power off a drive
  smart-simulate  Set SMART data for a drive

@martinpitt
Copy link
Member

@tomasmatus Are you still interested in working on this? It's quite a nice feature.

@tomasmatus
Copy link
Member Author

Wow its been a year already... yes, this is something i should revive @martinpitt

@jelly
Copy link
Member

jelly commented Jul 5, 2024

If i recall correctly the issue was tests, so trying to experiment with udisksctl smart-simulate sounds like a good idea to move this forward.

@garrett
Copy link
Member

garrett commented Jul 8, 2024

So, I think having SMART info is good. I remember I suggested it a long time ago in some issue and/or mockup. I don't remember where or what though. Where would this go?

It's usually stylized as SMART, not S.M.A.R.T. (which is just awkward). At least in English, acronyms and initialisms usually do not include periods. https://en.wikipedia.org/wiki/Acronym — and acronyms that are proper names (including SMART) or commonly used are generally not supposed to even have periods... and they're usually (but not always) all in uppercase. It's kind of weird with all kinds of special-casing, however; Wikipedia's acronym page actually has a "nomenclature" section with some examples: https://en.wikipedia.org/wiki/Acronym#Nomenclature — and it even covers pronunciation a bit (when some are pronounced and some are spelled out... and even some that are a mixture, like "JPEG" and "MS-DOS").

As far as being detailed; I think we should show an overview and use a "progressive disclosure" concept. (That is, show the important, common stuff first and then click somewhere to see everything.)

Since it looks like you're picking this up, it would be great to talk about the design of it. 👍

I know we had the concept of SMART errors propagating all the way up to the health card on the overview too (which would then link to the details).

@SchoolGuy
Copy link

I am interested in this feature. Is there anything that can be done to progress the PR?

@garrett
Copy link
Member

garrett commented Jul 18, 2024

@SchoolGuy: A recent update related to this PR: @tomasmatus is back and is planning on picking this up (as you can see in an above comment). I will be working with him on the design. If you'd like to help out, that'd be fine as well.

In what ways are you interested in SMART? What data are you looking for, specifically? And when do you consult SMART data, usually? (From time to time or only when there are strange things going on as a diagnosis tool?)

@SchoolGuy
Copy link

@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...


I am interested in SMART in two ways:

  1. The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.
  2. The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.

The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.

I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.

I am a quite paranoid person (regarding my infrastructure) and as such I am checking my monitoring quite regularly but it is not a fixed schedule I have.

@hobbes1069
Copy link

I'm not a developer but I'm a Fedora packager and would be very interested in seeing this in Fedora/RHEL.

@garrett
Copy link
Member

garrett commented Jul 22, 2024

I can try to find some time to get back around to this to make some mockups. But the gist is that we need it in two places:

  1. Detailed view with just a few of the most important items shown by default, with a way to expand or view more to get to all of the stats.
  2. Current status in the health card on the overview page.

FWIW, here are the design ideas for disk issues (from a long time ago): #8787 (comment)

SMART is mentioned, with regard to showing it in the health card:

  • warning: SMART errors that don't stop functioning of disk
  • error: SMART found bad sectors (disk is dying; data loss may have happened

We probably want to have some kind of status API where the overview page can query the storage page to get the SMART info and show it on the health card if there is a problem (and only if there is a problem), similar to the updates being on the health card when there are updates. (The software updates page is queried for this.)

@SchoolGuy
Copy link

Perfect. This is already a great amount of detail.

@tomasmatus how to you want to organize so we don't run into conflicts working on the same files?

@jelly
Copy link
Member

jelly commented Jul 22, 2024

@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...

I am interested in SMART in two ways:

1. The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.

That fails into the scope of Cockpit!

2. The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.

What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.

The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.

I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.

Cockpit will only support showing SMART values for a single host, the host you connect too. We don't want to support anything outside of the UDisks API

@garrett
Copy link
Member

garrett commented Jul 23, 2024

For reference, udisks reports only this information from SMART:

$ udisksctl dump | grep -i smart
    SmartCriticalWarning:               
    SmartPowerOnHours:                  71
    SmartSelftestPercentRemaining:      -1
    SmartSelftestStatus:                success
    SmartTemperature:                   319
    SmartUpdated:                       1721742479

It's not much, but it's probably enough in most cases? If this is all, then we can't do the expand idea, unless we get more information, like from smartctl.

@hobbes1069
Copy link

Is it worth looking to see how gnome-disk-utility looks up the extra data?
https://gitlab.gnome.org/GNOME/gnome-disk-utility

@jelly
Copy link
Member

jelly commented Jul 24, 2024

Is it worth looking to see how gnome-disk-utility looks up the extra data? https://gitlab.gnome.org/GNOME/gnome-disk-utility

See my comment:

What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.

@jelly
Copy link
Member

jelly commented Jul 26, 2024

I've looked into testing this in a test virtual machine, the main issue is that even if I add a sata disk, udisks does not seem to identify this as a ATA drive:

On my nas

/org/freedesktop/UDisks2/drives/$ID:
  org.freedesktop.UDisks2.Drive:
    CanPowerOff:                true
  <snip>
  org.freedesktop.UDisks2.Drive.Ata:
    AamEnabled:                                 false

Attaching a SATA disk does allow me to use smartctl on it.
image

@hobbes1069
Copy link

I like the data that is presented here, including the ability to start a SMART test:
image

@garrett
Copy link
Member

garrett commented Jul 29, 2024

the data that is presented here

Right, that's the extended data that we're definitely not going to show by default. It's way too verbose and misleading.

We could, in a third stage, add all the details similar to that, possibly in a modal dialog similar to that. But, we should handle SMART in these steps:

  1. Basic details
    • filtered list of what udisks gives us by default (mentioned above)... filtered to show only relevant information (for example: don't show progress of scan unless a scan is happening; don't show errors unless there are errors, etc.)
    • the basic list udisks gives us by default seems to be reasonable
    • this would likely be a key + value pair in a description list with a heading
  2. Overview integration
    • only show with warnings and errors, with a link back to the basic details
  3. A way to get detailed information
    • a "progressive disclosure" concept, where we show the important first (already implemented in step 1) and a way to see more information
    • examples: this could be something like an expander widget at the bottom of the list or a modal (we'll need to figure this out still, but it can be deferred until after step 1 and 2 land)

This first PR should implement step # 1. Then we need to do a follow-up for step # 2 (overview). Then eventually add a step # 3 for more details.

We should not do this all in one go, nor should we just show all the data points SMART can give us. (Many are very misleading, scary, or not relevant depending on the circumstance. And it varies per manufacturer and storage medium (spinning disks of different types, SSD, NVMe, etc.). It's a mess, really.)

(I think @tomasmatus and I will talk about this more either later this week or next week.)

@jelly
Copy link
Member

jelly commented Jul 29, 2024

For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801

@tbzatek
Copy link

tbzatek commented Oct 3, 2024

For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801

Please check this release out: https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.90

Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu Hmm :)

-drive id=disk,file=IMAGE.img,if=none \
-device ahci,id=ahci \
-device ide-hd,drive=disk,bus=ahci.0

This works now in UDisks, however it's still an ide-hd device emulation and even though it's connected through an emulated AHCI controller, it still doesn't talk the SAT (SCSI / ATA Translation) protocol. This was the culprit and to some degree it still is as UDisks only supports SAT and nothing else (looking for an expert to make this happen!). Similarly, udev's ata_id also supports SAT only but has a bug in the code that makes this work.

However I've found this hidden workaround in libatasmart to make this work with qemu devices: https://github.com/storaged-project/libblockdev/blob/master/src/plugins/smart/smart-common.c#L131-L167

Let me know if it helps your testing use case.

@tomasmatus tomasmatus added the no-test For doc/workflow changes, or experiments which don't need a full CI run, label Jan 28, 2025
pkg/storaged/drive/drive.jsx Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Outdated Show resolved Hide resolved
pkg/storaged/smart-details.jsx Show resolved Hide resolved
@tomasmatus
Copy link
Member Author

I installed newest udisks from copr and played a bit with virsh attach-disk and different --targetbus x types. I found out that --targetbus sata does not support hotplugging so if I add the following to TEST_DOMAIN_XML in bots I am able to get a disk that shows as drive ata and has SMART info available

    <disk type='file'>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
      <source file='/home/tmatus/projects/storage-img.qcow2'/>
      <target dev='sda' bus='sata'/>
      <serial>SATADISK</serial>
      <boot order='3'/>
    </disk>

After running the VM I can simulate SMART info with udisksctl smart-simulate. What I can't do is start a SMART test. (I suppose that makes sense when it's all simulated). For other targetbus types I'm not seeing any SMART data.

-drive id=disk,file=IMAGE.img,if=none
-device ahci,id=ahci
-device ide-hd,drive=disk,bus=ahci.0

I tried to look into this as that is for a different tool, not virsh. I didn't find any mentions of ahci in virsh so can't say for sure if that is better or same solution.

@tbzatek
Copy link

tbzatek commented Jan 28, 2025

After running the VM I can simulate SMART info with udisksctl smart-simulate. What I can't do is start a SMART test. (I suppose that makes sense when it's all simulated).

Yeah, trigerring SMART tests wouldn't work on any emulated device as far as I know.

For other targetbus types I'm not seeing any SMART data.

You'd likely need to tweak the ID_ATA_SMART_ACCESS udev attribute to reflect the actual protocol for the qemu emulated device. It really depends what protocol qemu actually implements.

-drive id=disk,file=IMAGE.img,if=none
-device ahci,id=ahci
-device ide-hd,drive=disk,bus=ahci.0

I tried to look into this as that is for a different tool, not virsh. I didn't find any mentions of ahci in virsh so can't say for sure if that is better or same solution.

That's a qemu commandline, as libvirt is just a wrapper, it'll likely have corresponding features in a slightly different syntax (i.e. sata == ahci in most cases).

@tomasmatus
Copy link
Member Author

tomasmatus commented Feb 4, 2025

(notes to self, will update with more later)

Testing this on my trusty WD friend:
image

I am able to start the test and see "live" updates on the progress status:
image
It's not fully live progress bar. After talking to @tbzatek he mentioned this kind of depends on the disk how it reports it. In my case it seems to jump in +10% or +20% increments. It could also depend on how often cockpit receives updates from udisks.

I can abort the test while it's running:
image

I can see one bad sector on my SSD. The disk still reports itself as healthy.
image

@jelly
Copy link
Member

jelly commented Feb 10, 2025

We need udisks-nightly or well the udisks pre-release for testing this feature as it now uses udev instead of direct drive access for determing disk types etc. This apparently works better for testing emulating SMART data in a virtual machine.

We have two options to get this tested:

  • A new scenario, which like /nightly installs udisks from copr. The downside is that we test against a moving target in PR's
  • Downloading the udisks nightly packages on the Fedora image like we for example do for clevis packages in the RHEL image (the reason escapes me)
dnf download --downloaddir=/var/lib/package-sets/clevis $PACKAGE_SET_CLEVIS

In the test you would install those packages and kinda pray you have no dependency resolving issues 😄

Now for adding the SATA disk, that's a good question....

It is officially not supported.

We can't easily override the so an idea I had was:

class TestSmart(testlib.MachineCase):
    provision = {
        "0": { "drive_bus": "sata" },
    }

And then you'll need to hack on your own bots checkout:

class VirtMachine(Machine):
    networking: dict[str, Any]
    virt_connection: libvirt.virConnect | None
    _transient_image: IO[bytes] | None
    _domain: libvirt.virDomain | None

    network: int | None = None
    memory_mb: int | None = None
    cpus: int | None = None
    is_efi: bool = False
    image_file: str
    run_dir: str

    def __init__(
        self,
        image: str,
        networking: dict[str, Any] | None = None,
        maintain: bool = False,
        memory_mb: int | None = None,
        cpus: int | None = None,
        capture_console: bool = True,
        **kwargs: Any
    ):

   if kwargs.get("disk_bus"):
               something_smart()

Basically the domain XML to get a drive_bus format argument

test_domain_desc = TEST_DOMAIN_XML.format(**keys)

@martinpitt
Copy link
Member

Thanks @jelly. I agree that we should only throw so much effort on automating the test here -- if it's not possible with current udisks, there's that. But as far as I understand it, this is exclusively about testing with a mock smart device, right? i.e. current udisks can trigger smart in current versions, just not inject mock data?

Please test it carefully on actual hardware. The tests should then do a feature test (like udisks version) so that it can run in the /nightly scenario -- you can easily trigger that in PRs. Then we have at least one place where it's covered, and over time it will flow into the distros. IMHO adding a package set etc. is overkill.

@tomasmatus
Copy link
Member Author

But as far as I understand it, this is exclusively about testing with a mock smart device, right? i.e. current udisks can trigger smart in current versions, just not inject mock data?

Correct, current udisks works with the real hardware. The newer version (pre-release) is only needed only for virtual disks so I can mock something in the test. @martinpitt

Please test it carefully on actual hardware.

I am trying to do this as much as I can. I can do that on my laptop with fedora 41 and desktop with arch which covers all three cases of HDD, SATA SSD and NVMe device.

tomasmatus added a commit to tomasmatus/bots that referenced this pull request Feb 14, 2025
martinpitt pushed a commit to cockpit-project/bots that referenced this pull request Feb 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needsdesign no-test For doc/workflow changes, or experiments which don't need a full CI run,
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants