S.M.A.R.T support by geekifan · Pull Request #614 · henrygd/beszel

geekifan · 2025-02-22T09:07:25Z

I followed the pattern of GPUManager to add S.M.A.R.T. support to the agent. Since I am not a Go expert and do not have enough physical devices for testing, I hope someone can do a basic review of my code and test it on their own devices. Once everything is ready, I will move on to the hub's code.

TODO:

Add S.M.A.R.T. manager in agent
Show disk and S.M.A.R.T. info in the web UI
Add S.M.A.R.T. failing alert
Documentation & tests

henrygd · 2025-02-23T01:15:47Z

Hi Yifan, thank you very much for your work!

This looks like a great start. Let me get back to you later in the week as I have limited time right now and am trying to get the next release out as soon as possible.

On the hub side we should probably create a new table (PocketBase collection) for this data.

From my limited knowledge, I think parsing smartctl output is a fine approach and should also work on macOS. But I may be wrong.

There's also this Go library which provides SMART information: https://github.com/anatol/smart.go

And a standalone application, Scrutiny, which is written in Go and may be a helpful reference: https://github.com/AnalogJ/scrutiny

As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Again, I appreciate your time and will get back to you as soon as I can.

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

geekifan · 2025-02-23T02:11:53Z

Thank you very much for your detailed response.

First, I have considered using smart.go. If we use smart.go, we will be dependent on all its aspects (such as potential bugs and the possibility that its smart database may not be updated in a timely manner). If such issues arise and it is no longer maintained, all we can do is fork it, fix the bugs, or update the smart database. This would add a significant burden to the maintenance of beszel. In contrast, smartctl is a very widely used tool, with timely updates to the smart database and more prompt maintenance in case of bugs. Its support for JSON-formatted output is a great advantage for data parsing in Go.

Regarding macOS, I also have a macOS machine and will test on it later.

The hardware I currently have available for testing includes NVMe/SATA/SCSI (only testable on Linux) and USB storage, which should cover mainstream hardware. What I really worry about are the corner cases.

Additionally, I have a few issues that I am unsure how to handle:

The SMART data for SATA/SCSI uses the ATA format, while NVMe uses a different format, so the SMART keys and values are inconsistent between them. Other hardware might use still more SMART formats, so I believe we need everyone's help to find appropriate data structures to store and monitor them.
Because hard drives are hot-swappable, if a drive is unplugged, the agent will drop the corresponding data entry when reporting to the hub. But how will the hub handle the missing data? Will it remove the drive from the display, or will it retain the state from the time of unplugging? (Sorry, I am not familiar with PocketBase and some database operations.)

EDIT: I checked the code of https://github.com/AnalogJ/scrutiny. Scrutiny parses the json output of smartctl to get the SMART info.

henrygd · 2025-02-24T00:45:06Z

Sounds good, I agree with the direct smartctl approach.

I don't think there's any reason to worry about corner cases in the first iteration. We'll get sample output and include the most important or common values.

If there's an issue parsing then we'll just log an error. We can add support for more formats as people request them.

Hopefully the JSON structure is consistent and it's just the properties that differ, because dealing with inconsistent JSON is not fun.

The regular non-JSON output looks easy to parse, so we could just use bufio to scan the output line by line for the values we need.
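As a rough illustration of the line-by-line idea, here is a minimal Go sketch that scans text output with bufio and collects "Key: Value" pairs. The names parseKeyValues, leadingInt, and sampleOutput are hypothetical and not taken from the PR; the sample text is a short excerpt in the style of the output below.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleOutput is a short excerpt in the style of `smartctl -a` text output.
const sampleOutput = `=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Temperature:                        34 Celsius
Available Spare:                    100%
Power On Hours:                     1,392
`

// parseKeyValues (hypothetical helper) scans the output line by line and
// keeps every "Key: Value" pair. Lines without a colon are skipped.
func parseKeyValues(output string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue
		}
		key, value = strings.TrimSpace(key), strings.TrimSpace(value)
		if key == "" || value == "" {
			continue
		}
		fields[key] = value
	}
	return fields
}

// leadingInt (hypothetical helper) converts values such as "1,392",
// "34 Celsius", or "100%" to an int by reading the first field only.
func leadingInt(value string) (int, error) {
	parts := strings.Fields(value)
	if len(parts) == 0 {
		return 0, fmt.Errorf("no numeric field in %q", value)
	}
	cleaned := strings.TrimSuffix(strings.ReplaceAll(parts[0], ",", ""), "%")
	return strconv.Atoi(cleaned)
}

func main() {
	fields := parseKeyValues(sampleOutput)
	hours, err := leadingInt(fields["Power On Hours"])
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(fields["Temperature"], hours)
}
```

Splitting on the first colon keeps values that contain colons themselves (such as the local time line) intact, at the cost of treating every such line as a key/value pair.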

Here's output from my laptop with one nvme drive:

smartctl --scan

/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

smartctl --scan -j

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "-j"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/nvme0",
      "info_name": "/dev/nvme0",
      "type": "nvme",
      "protocol": "NVMe"
    }
  ]
}
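For reference, the scan output above could be decoded in Go roughly like this. It is only a sketch: the type and function names (scanDevice, scanResult, parseScan) are hypothetical, and only the fields visible in the sample are mapped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleScan is a trimmed copy of the `smartctl --scan -j` output above.
const sampleScan = `{
  "devices": [
    {"name": "/dev/nvme0", "info_name": "/dev/nvme0", "type": "nvme", "protocol": "NVMe"}
  ]
}`

// scanDevice mirrors one entry of the "devices" array.
type scanDevice struct {
	Name     string `json:"name"`
	InfoName string `json:"info_name"`
	Type     string `json:"type"`
	Protocol string `json:"protocol"`
}

// scanResult keeps only the part of the document needed here;
// encoding/json ignores the other keys.
type scanResult struct {
	Devices []scanDevice `json:"devices"`
}

// parseScan (hypothetical helper) returns the devices listed in the scan output.
func parseScan(data []byte) ([]scanDevice, error) {
	var result scanResult
	if err := json.Unmarshal(data, &result); err != nil {
		return nil, fmt.Errorf("parse smartctl scan output: %w", err)
	}
	return result.Devices, nil
}

func main() {
	devices, err := parseScan([]byte(sampleScan))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, d := range devices {
		fmt.Printf("%s (type %s, protocol %s)\n", d.Name, d.Type, d.Protocol)
	}
}
```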

sudo smartctl -a /dev/nvme0

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.13.2-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD PC SN810 SDCPNRY-1T00-1006
Serial Number:                      226223861317
Firmware Version:                   HPS2
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001c44 8b25c6eb61
Local Time is:                      Sun Feb 23 19:34:35 2025 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    8.25W       -    0  0  0  0        0       0
 1 +     3.50W    3.50W       -    0  0  0  0        0       0
 2 +     2.60W    2.60W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000   10000
 4 -   0.0035W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    20,427,281 [10.4 TB]
Data Units Written:                 27,523,884 [14.0 TB]
Host Read Commands:                 308,278,905
Host Write Commands:                722,398,619
Controller Busy Time:               2,230
Power Cycles:                       3,086
Power On Hours:                     1,392
Unsafe Shutdowns:                   173
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

sudo smartctl -aj /dev/nvme0

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-aj",
      "/dev/nvme0"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1740357511,
    "asctime": "Sun Feb 23 19:38:31 2025 EST"
  },
  "device": {
    "name": "/dev/nvme0",
    "info_name": "/dev/nvme0",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
  "serial_number": "286223861317",
  "firmware_version": "HPS2",
  "nvme_pci_vendor": {
    "id": 5559,
    "subsystem_id": 5559
  },
  "nvme_ieee_oui_identifier": 5920,
  "nvme_total_capacity": 1024209543168,
  "nvme_unallocated_capacity": 0,
  "nvme_controller_id": 8224,
  "nvme_version": {
    "string": "1.4",
    "value": 66560
  },
  "nvme_number_of_namespaces": 1,
  "nvme_namespaces": [
    {
      "id": 1,
      "size": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "capacity": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "utilization": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "formatted_lba_size": 512,
      "eui64": {
        "oui": 5930,
        "ext_id": 592171146913
      }
    }
  ],
  "user_capacity": {
    "blocks": 2000409264,
    "bytes": 1024209543168
  },
  "logical_block_size": 512,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 34,
    "available_spare": 100,
    "available_spare_threshold": 5,
    "percentage_used": 0,
    "data_units_read": 20427312,
    "data_units_written": 27524011,
    "host_reads": 308279032,
    "host_writes": 722405653,
    "controller_busy_time": 2230,
    "power_cycles": 3086,
    "power_on_hours": 1392,
    "unsafe_shutdowns": 173,
    "media_errors": 0,
    "num_err_log_entries": 0,
    "warning_temp_time": 0,
    "critical_comp_time": 0
  },
  "temperature": {
    "current": 34
  },
  "power_cycle_count": 3086,
  "power_on_time": {
    "hours": 1392
  },
  "nvme_error_information_log": {
    "size": 256,
    "read": 16,
    "unread": 0
  },
  "nvme_self_test_log": {
    "current_self_test_operation": {
      "value": 0,
      "string": "No self-test in progress"
    }
  }
}
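A minimal sketch of how the NVMe report above might map onto Go structs, assuming only the keys shown in this sample. The type names (deviceReport, nvmeHealth) and parseReport are hypothetical; the health log is a pointer so that it stays nil for devices that do not report an NVMe log.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleReport is a trimmed copy of the `smartctl -aj` output above.
const sampleReport = `{
  "device": {"name": "/dev/nvme0", "type": "nvme", "protocol": "NVMe"},
  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
  "serial_number": "286223861317",
  "smart_status": {"passed": true},
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 34,
    "available_spare": 100,
    "percentage_used": 0,
    "unsafe_shutdowns": 173,
    "media_errors": 0
  },
  "power_on_time": {"hours": 1392}
}`

// nvmeHealth holds a subset of the NVMe health log fields.
type nvmeHealth struct {
	CriticalWarning int    `json:"critical_warning"`
	Temperature     int    `json:"temperature"`
	AvailableSpare  int    `json:"available_spare"`
	PercentageUsed  int    `json:"percentage_used"`
	UnsafeShutdowns uint64 `json:"unsafe_shutdowns"`
	MediaErrors     uint64 `json:"media_errors"`
}

// deviceReport holds the fields a first iteration might care about.
type deviceReport struct {
	ModelName    string `json:"model_name"`
	SerialNumber string `json:"serial_number"`
	Device       struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"device"`
	SmartStatus struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
	// Nil when the report has no NVMe health log.
	NVMeHealth *nvmeHealth `json:"nvme_smart_health_information_log"`
}

// parseReport (hypothetical helper) decodes one device report.
func parseReport(data []byte) (*deviceReport, error) {
	var report deviceReport
	if err := json.Unmarshal(data, &report); err != nil {
		return nil, fmt.Errorf("parse smartctl report: %w", err)
	}
	return &report, nil
}

func main() {
	report, err := parseReport([]byte(sampleReport))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(report.ModelName, "passed:", report.SmartStatus.Passed)
	if report.NVMeHealth != nil {
		fmt.Println("temperature:", report.NVMeHealth.Temperature)
	}
}
```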

If a drive is unplugged and not in current updates, we'll just keep the record for some predefined time, like a week.

So the data would remain the same as when the drive was unplugged. We could show a 'last updated' time or up/down indicator.

I'll use a scheduled job to delete records that haven't had an update in a week. We could also give users an option to delete the drive themselves.
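To make the retention idea concrete, here is a small in-memory sketch in Go. It is not PocketBase code; driveRecord, pruneStale, and the example values are hypothetical, and the one-week window is the figure mentioned above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// retention is the one-week window mentioned in the thread.
const retention = 7 * 24 * time.Hour

// driveRecord is a hypothetical stand-in for a stored drive row.
type driveRecord struct {
	Serial      string
	LastUpdated time.Time
}

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2025, time.March, 1, 12, 0, 0, 0, time.UTC)

// exampleRecords returns one recently updated drive and one stale drive.
func exampleRecords() map[string]driveRecord {
	return map[string]driveRecord{
		"SN-FRESH": {Serial: "SN-FRESH", LastUpdated: exampleNow.Add(-2 * time.Hour)},
		"SN-STALE": {Serial: "SN-STALE", LastUpdated: exampleNow.Add(-8 * 24 * time.Hour)},
	}
}

// pruneStale deletes records whose last update is older than maxAge and
// returns the removed serial numbers in sorted order.
func pruneStale(records map[string]driveRecord, now time.Time, maxAge time.Duration) []string {
	var removed []string
	for serial, rec := range records {
		if now.Sub(rec.LastUpdated) > maxAge {
			delete(records, serial) // deleting during range is safe in Go
			removed = append(removed, serial)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	records := exampleRecords()
	removed := pruneStale(records, exampleNow, retention)
	fmt.Println("removed:", removed, "remaining:", len(records))
}
```

Until a record is pruned, its LastUpdated value is also what a 'last updated' label or up/down indicator could be derived from.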

You can keep the scope of this PR as narrow as you'd like. Just having something working on the agent side is a huge help! I can handle the rest of it no problem.

There's also no rush as I have two other big PRs in the queue as well.

sym0nd0 · 2025-04-13T10:17:02Z

As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

Let me know what you need (and more so how to pull it) and I'll happily provide from across my drives.

geekifan · 2025-04-18T12:00:19Z

Recently, I've been occupied with other projects and haven't been able to dedicate much time to the SMART feature development. However, I may now be able to allocate some time to work on this, particularly on the front-end and database aspects (though I can't guarantee significant progress at this stage).

Regarding the front-end implementation, I'd like to get your thoughts @henrygd : Do you think we should display the SMART data in a separate page, tab, or pop-up window? If so, where would you recommend placing it for optimal user experience?

I don't have much expertise in UI/UX design, so I'd be happy to hear any suggestions or ideas from anyone ;).

henrygd · 2025-04-19T00:22:04Z

No worries Yifan! Please only work on it if you want to. Don't feel any obligation. What you've already done will be helpful even if you don't do anything more.

We don't need to commit to a specific design right now, but my first thought is to put the SMART data on its own page.

Here's how Scrutiny does it for reference: https://imgur.com/a/5k8qMzS

Maybe on /system/system-name/smart we can have a table similar to the 'All Systems' table that lists all the system's drives with the most useful info. Then clicking on a row will bring you to system/system-name/smart/drive-name with details.

Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

In the future maybe we can include a table on the home page that lists all drives from all systems as well.

IMO the most important part is getting the data where we need it. The layout can always be improved later.

Edit: We use shadcn so you might find something here that fits well: https://ui.shadcn.com

evrial · 2025-04-19T12:02:16Z

Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

I think this would be perfect, at least to start. One panel with a table, one drive per row. Temperature sensor data could be added to the Temperature panel.

wesgeorge · 2025-06-12T19:23:56Z

Thought it might be useful to provide some output from a system with a large number of drives.

My output is from the same commands as above, just filtered through grep -v serial. I've included only the JSON version for smartctl, and both the tabular and JSON versions of the scan.

smartscan.txt
smartscanjson.txt

This setup is a total of 10 drives in the following configuration:

8 disks hanging off of Dell PERC RAID CARD in JBOD mode, running Debian 12:
-- SATA disks: sda-sdd, sdg, sdh
smartctl-sda-j.txt
smartctl-sdb-j.txt
smartctl-sdc-j.txt
smartctl-sdd-j.txt
smartctl-sdg-j.txt
smartctl-sdh-j.txt
-- SAS SSDs: sde and sdj
smartctl-sde-j.txt
smartctl-sdj-j.txt
2 SATA disks connected to motherboard:
-- sdf, sdi
smartctl-sdf-j.txt
smartctl-sdi-j.txt

So the megaraid_disk_0n output in the scan is duplicate, and in the strictest sense, not all of the devices listed are actually SCSI devices. Probably doesn't matter if you're just using it to pull your list of devices for the output of smartctl, but I know that my /etc/smartd.conf (where the tests are configured) definitely cares that you specify the right type of disk (sat vs scsi) when invoking tests.

Also, I suggest that you key your data on the SN of the disk (or /dev/disk/by-id) rather than the device ID, because sometimes disks can change drive letters at boot when you have this many spread across multiple devices.
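A small Go sketch of that suggestion, assuming the serial number is available for every drive. The driveInfo type, the upsert helper, and the serial and device values are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// driveInfo is a hypothetical record for one physical drive.
type driveInfo struct {
	Serial string // stable across reboots
	Device string // current device path; may change between boots
	Model  string
}

// upsert stores info under its serial number, so a drive that comes back
// under a different device path updates the same entry instead of adding one.
func upsert(drives map[string]driveInfo, info driveInfo) error {
	if info.Serial == "" {
		return errors.New("drive has no serial number")
	}
	drives[info.Serial] = info
	return nil
}

func main() {
	drives := make(map[string]driveInfo)

	// First boot: the drive shows up as /dev/sdb.
	_ = upsert(drives, driveInfo{Serial: "EXAMPLE-SN-1", Device: "/dev/sdb", Model: "Example Disk"})
	// After a reboot the same drive shows up as /dev/sdc.
	_ = upsert(drives, driveInfo{Serial: "EXAMPLE-SN-1", Device: "/dev/sdc", Model: "Example Disk"})

	fmt.Println(len(drives), drives["EXAMPLE-SN-1"].Device)
}
```

Keying on the device path instead would leave two entries after the reboot, one of them stale.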

geekifan · 2025-06-16T10:13:37Z

@wesgeorge Thank you very much for providing the data and suggestions. I will modify the code for the agent part to make it more robust.

Besides, I finished a front-end demo using some hard drive data (with fake serial numbers) I have on hand. Does anyone have any suggestions? Personally, I prefer displaying all disks in a list format and showing more detailed SMART information by clicking on the corresponding row (just like Proxmox VE).

EDIT1: I finished the SMART detail dialog.

zero77 · 2025-06-19T10:34:56Z

@geekifan
This looks really good, but could you give some indication of warnings? Maybe an extra column showing the number and type of warnings, or something similar.

geekifan · 2025-06-19T10:44:09Z

@zero77 Thank you for your suggestion! I will display the "When Failed" attribute in the SMART information table and highlight the failed attributes in red (or add an error icon) based on this property.

Updated the SmartManager's methods to use the device's serial number as the key in the SmartDataMap instead of the device name.

Introduced a new Disks tab in the SystemDetail component to display disk information and S.M.A.R.T. data. The tab includes a table for visualizing disk attributes and their statuses. Also added SmartData and SmartAttribute interfaces to support the new functionality.

muro-dot · 2025-06-27T08:05:06Z

It's encouraging to see that the feature I was going to suggest is already in the works.
However, I was wondering if the temperature of the HDD could be integrated into the existing temperature tab to track its history?

geekifan · 2025-06-27T13:06:11Z

@muro-dot On the agent side, the hard drive temperatures are read via SMART data and then incorporated into the system temperature readings. Therefore, you can find the temperature curves of different hard drives in the temperature sensor charts.

muro-dot · 2025-07-02T01:21:30Z

Is there any chance to try it out before henrygd officially releases it? I'm really excited about the new features.
Sorry for the comment not being directly related to the development.

geekifan · 2025-07-20T04:52:08Z

Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

svenvg93 · 2025-07-24T09:15:24Z

Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

I agree with starting with alerts based on the smartd output. Alerts based on other metrics can always be added later if there is a need for them.

RikudouGoku · 2025-10-01T14:56:33Z

Is there any ETA for when this is added? Looks awesome. I would just change the "Type SAT" to "Type SATA" though. SAT sounds weird lol. "SAS" is a different type though.

henrygd · 2025-10-01T16:47:06Z

I'll try to get this in soon. Hopefully this month.

RikudouGoku · 2025-10-01T16:49:04Z

I'll try to get this in soon. Hopefully this month.

Awesome!

geekifan · 2025-10-02T01:58:21Z

@henrygd Thanks hank! This PR is almost done except for the SMART monitor alerts. I'm busy with my academic work, so I have no time to finish them. I would appreciate it if you could finish the rest. This PR is now review-ready.

henrygd · 2025-10-02T16:23:57Z

No worries Yifan, I'll finish it off. Thanks again for your work 👍

M3rcur-x · 2025-10-02T18:28:15Z

Hi,
Would it be possible to add smartctl's "-n standby" parameter to avoid waking up "sleepy" disks? Maybe this could be configurable.

henrygd · 2025-10-24T23:06:18Z

This has finally been added. I need to finish the documentation and clean up a few things, but I'll try to have a release out this weekend.

Thanks again for your efforts, Yifan! Sincerely appreciated.

@M3rcur-x I added standby handling so it should only wake disks once. Then if the disk is sleeping again it will use the previous data.
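The behavior described here can be sketched in Go as follows. This is not the merged implementation; standbyCache, readFunc, and fakeDisk are hypothetical, and how standby is actually detected (for example via the "-n standby" option requested above) is left to the injected read function.

```go
package main

import "fmt"

// readFunc is a hypothetical reader. When skipIfStandby is true it must not
// wake a sleeping disk and reports standby=true instead of returning data.
type readFunc func(device string, skipIfStandby bool) (data string, standby bool, err error)

// standbyCache remembers the last successful reading per device.
type standbyCache struct {
	read readFunc
	last map[string]string
}

func newStandbyCache(read readFunc) *standbyCache {
	return &standbyCache{read: read, last: make(map[string]string)}
}

// collect reads a device. The first read is unconditional (it may wake the
// disk once); later reads skip sleeping disks and fall back to cached data.
func (c *standbyCache) collect(device string) (string, error) {
	prev, seen := c.last[device]
	data, standby, err := c.read(device, seen)
	if err != nil {
		return "", err
	}
	if standby {
		return prev, nil
	}
	c.last[device] = data
	return data, nil
}

// fakeDisk simulates a disk that can be asleep and counts wake-ups.
type fakeDisk struct {
	asleep bool
	wakes  int
	reads  int
}

func (d *fakeDisk) read(device string, skipIfStandby bool) (string, bool, error) {
	if d.asleep {
		if skipIfStandby {
			return "", true, nil
		}
		d.asleep = false
		d.wakes++
	}
	d.reads++
	return fmt.Sprintf("%s reading #%d", device, d.reads), false, nil
}

func main() {
	disk := &fakeDisk{asleep: true}
	cache := newStandbyCache(disk.read)

	first, _ := cache.collect("/dev/sda") // wakes the disk once
	disk.asleep = true                    // the disk spins down again
	second, _ := cache.collect("/dev/sda")

	fmt.Println(first == second, "wakes:", disk.wakes)
}
```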

geekifan force-pushed the feat-smart branch from d32e3cd to c4e98b9 Compare June 15, 2025 08:02

add agent smart support

d0ea9f6

geekifan force-pushed the feat-smart branch from c4e98b9 to d0ea9f6 Compare June 15, 2025 08:22

geekifan added 4 commits June 20, 2025 10:47

refactor(system): update JSON tags in SmartData struct

0f8a336

refactor(agent): use serial number as the key of SmartDataMap

a3c2672

refactor: use raw values in smart attributes for nvme devices

95d7ef6

henrygd force-pushed the main branch from 943ea23 to 71f081d Compare July 9, 2025 00:40

svenvg93 mentioned this pull request Aug 29, 2025

[Feature]: Monitor storage disks temps using binary version /Linux #1087

Open

henrygd mentioned this pull request Sep 11, 2025

[Feature] Basic systemd service monitoring #1153

Merged

github-actions bot added the Stale label Sep 23, 2025

github-actions bot closed this Sep 30, 2025

github-project-automation bot moved this from Next to Done in Beszel Roadmap Sep 30, 2025

henrygd reopened this Sep 30, 2025

henrygd removed the Stale label Sep 30, 2025

henrygd moved this from Done to Next in Beszel Roadmap Sep 30, 2025

geekifan marked this pull request as ready for review October 2, 2025 01:58

henrygd self-assigned this Oct 22, 2025

henrygd moved this from Next to In Progress in Beszel Roadmap Oct 22, 2025

henrygd changed the base branch from main to 614-smart October 24, 2025 13:48

henrygd merged commit 16d5ec2 into henrygd:614-smart Oct 24, 2025

github-project-automation bot moved this from In Progress to Done in Beszel Roadmap Oct 24, 2025


10 participants
