
S.M.A.R.T support #614

Merged
henrygd merged 5 commits into henrygd:614-smart from geekifan:feat-smart
Oct 24, 2025

Conversation

@geekifan
Contributor

@geekifan geekifan commented Feb 22, 2025

I followed the pattern of GPUManager to add S.M.A.R.T. support to the agent. Since I am not a Go expert and do not have enough physical devices for testing, I hope someone can do a basic review of my code and test it on their own devices. Once everything is ready, I will move on to the hub's code.

TODO:

  • Add S.M.A.R.T. manager in agent
  • Show disk and smart info in web ui
  • Add S.M.A.R.T. failing alert
  • Documentation & tests

@henrygd
Owner

henrygd commented Feb 23, 2025

Hi Yifan, thank you very much for your work!

This looks like a great start. Let me get back to you later in the week as I have limited time right now and am trying to get the next release out as soon as possible.

On the hub side we should probably create a new table (PocketBase collection) for this data.

From my limited knowledge, I think parsing smartctl output is a fine approach and should also work on macOS. But I may be wrong.

There's also this Go library which provides SMART information: https://github.com/anatol/smart.go

And a standalone application, Scrutiny, which is written in Go and may be a helpful reference: https://github.com/AnalogJ/scrutiny

As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Again, I appreciate your time and will get back to you as soon as I can.

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

@geekifan
Contributor Author

geekifan commented Feb 23, 2025

Thank you very much for your detailed response.

First, I have considered using smart.go. If we use smart.go, we will be dependent on all its aspects (such as potential bugs and the possibility that its smart database may not be updated in a timely manner). If such issues arise and it is no longer maintained, all we can do is fork it, fix the bugs, or update the smart database. This would add a significant burden to the maintenance of beszel. In contrast, smartctl is a very widely used tool, with timely updates to the smart database and more prompt maintenance in case of bugs. Its support for JSON-formatted output is a great advantage for data parsing in Go.

Regarding macOS, I also have a macOS machine and will test on it later.

The hardware I currently have available for testing includes NVMe/SATA/SCSI (testable only on Linux) and USB storage, which should cover mainstream hardware. What I really worry about are the corner cases.

Additionally, I have a few issues that I am unsure how to handle:

  1. The SMART data for SATA/SCSI uses the ATA format, while NVMe uses a different format, leading to inconsistencies in SMART key values. Other hardware might have more SMART formats, so I believe we need everyone's help to find the appropriate data structures to store and monitor them.
  2. Because hard drives are hot-swappable, if a drive is unplugged, the agent will delete the corresponding data entry when it reports to the hub. But how will the hub handle the missing data? Will it delete the corresponding drive data from the display, or will it retain the state at the time of unplugging? (Sorry, I am not familiar with PocketBase and some database operations.)

EDIT: I checked the code of https://github.com/AnalogJ/scrutiny. Scrutiny parses the json output of smartctl to get the SMART info.

@henrygd
Owner

henrygd commented Feb 24, 2025

Sounds good, I agree with the direct smartctl approach.

I don't think there's any reason to worry about corner cases in the first iteration. We'll get sample output and include the most important or common values.

If there's an issue parsing then we'll just log an error. We can add support for more formats as people request them.

Hopefully the JSON structure is consistent and it's just the properties that differ, because dealing with inconsistent JSON is not fun.

The regular non-JSON output looks easy to parse, so we could just use bufio to scan the output line by line for the values we need.
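
A minimal sketch of that line-by-line approach, assuming the text output keeps the `Key: Value` layout seen in the sample that follows. `parseKeyValues` is a hypothetical helper written for illustration, not code from this PR; it splits each line at the first colon and skips lines that have none (banners, table rows).

```go
package main

import (
	"bufio"
	"strings"
)

// parseKeyValues scans smartctl's plain-text output line by line and collects
// "Key: Value" pairs. Lines without a colon are skipped. (Hypothetical helper.)
func parseKeyValues(output string) (map[string]string, error) {
	values := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		// Split at the first colon only, so values that contain colons
		// (for example timestamps) stay intact.
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue
		}
		values[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return values, scanner.Err()
}
```

Values come back as raw strings such as `34 Celsius` or `1,392`, so units and thousands separators would still need to be stripped before converting to numbers.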

Here's output from my laptop with one nvme drive:

smartctl --scan
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
smartctl --scan -j
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "-j"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/nvme0",
      "info_name": "/dev/nvme0",
      "type": "nvme",
      "protocol": "NVMe"
    }
  ]
}
sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.13.2-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD PC SN810 SDCPNRY-1T00-1006
Serial Number:                      226223861317
Firmware Version:                   HPS2
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001c44 8b25c6eb61
Local Time is:                      Sun Feb 23 19:34:35 2025 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    8.25W       -    0  0  0  0        0       0
 1 +     3.50W    3.50W       -    0  0  0  0        0       0
 2 +     2.60W    2.60W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000   10000
 4 -   0.0035W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    20,427,281 [10.4 TB]
Data Units Written:                 27,523,884 [14.0 TB]
Host Read Commands:                 308,278,905
Host Write Commands:                722,398,619
Controller Busy Time:               2,230
Power Cycles:                       3,086
Power On Hours:                     1,392
Unsafe Shutdowns:                   173
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
sudo smartctl -aj /dev/nvme0
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-aj",
      "/dev/nvme0"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1740357511,
    "asctime": "Sun Feb 23 19:38:31 2025 EST"
  },
  "device": {
    "name": "/dev/nvme0",
    "info_name": "/dev/nvme0",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
  "serial_number": "286223861317",
  "firmware_version": "HPS2",
  "nvme_pci_vendor": {
    "id": 5559,
    "subsystem_id": 5559
  },
  "nvme_ieee_oui_identifier": 5920,
  "nvme_total_capacity": 1024209543168,
  "nvme_unallocated_capacity": 0,
  "nvme_controller_id": 8224,
  "nvme_version": {
    "string": "1.4",
    "value": 66560
  },
  "nvme_number_of_namespaces": 1,
  "nvme_namespaces": [
    {
      "id": 1,
      "size": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "capacity": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "utilization": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "formatted_lba_size": 512,
      "eui64": {
        "oui": 5930,
        "ext_id": 592171146913
      }
    }
  ],
  "user_capacity": {
    "blocks": 2000409264,
    "bytes": 1024209543168
  },
  "logical_block_size": 512,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 34,
    "available_spare": 100,
    "available_spare_threshold": 5,
    "percentage_used": 0,
    "data_units_read": 20427312,
    "data_units_written": 27524011,
    "host_reads": 308279032,
    "host_writes": 722405653,
    "controller_busy_time": 2230,
    "power_cycles": 3086,
    "power_on_hours": 1392,
    "unsafe_shutdowns": 173,
    "media_errors": 0,
    "num_err_log_entries": 0,
    "warning_temp_time": 0,
    "critical_comp_time": 0
  },
  "temperature": {
    "current": 34
  },
  "power_cycle_count": 3086,
  "power_on_time": {
    "hours": 1392
  },
  "nvme_error_information_log": {
    "size": 256,
    "read": 16,
    "unread": 0
  },
  "nvme_self_test_log": {
    "current_self_test_operation": {
      "value": 0,
      "string": "No self-test in progress"
    }
  }
}
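
For comparison, the JSON form can be decoded directly with `encoding/json`. The sketch below is illustrative only: the struct and function names are hypothetical, and it picks out just a few of the fields that appear in the NVMe sample above. Unknown fields are ignored by `json.Unmarshal`, so the struct can stay small.

```go
package main

import "encoding/json"

// NvmeReport holds a small subset of the fields from `smartctl -aj` output
// for an NVMe drive. The JSON keys match the sample; the Go names are made up.
type NvmeReport struct {
	ModelName    string `json:"model_name"`
	SerialNumber string `json:"serial_number"`
	SmartStatus  struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Health struct {
		Temperature    int   `json:"temperature"`
		PercentageUsed int   `json:"percentage_used"`
		PowerOnHours   int64 `json:"power_on_hours"`
		MediaErrors    int64 `json:"media_errors"`
	} `json:"nvme_smart_health_information_log"`
}

// decodeNvmeReport parses raw smartctl JSON into an NvmeReport.
func decodeNvmeReport(data []byte) (NvmeReport, error) {
	var report NvmeReport
	err := json.Unmarshal(data, &report)
	return report, err
}
```

Other device types report different top-level keys, so a real parser would likely need one such struct per format or a shared struct with optional sections.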

If a drive is unplugged and not in current updates, we'll just keep the record for some predefined time, like a week.

So the data would remain the same as when the drive was unplugged. We could show a 'last updated' time or up/down indicator.

I'll use a scheduled job to delete records that haven't had an update in a week. We could also give users an option to delete the drive themselves.
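
The selection step of such a cleanup job could look roughly like this. It is a sketch under assumptions: `DriveRecord`, `retention`, and `staleSerials` are hypothetical names, and the real hub would query PocketBase rather than filter an in-memory slice.

```go
package main

import "time"

// retention is the assumed one-week window mentioned above.
const retention = 7 * 24 * time.Hour

// DriveRecord is a hypothetical stand-in for a stored drive entry.
type DriveRecord struct {
	Serial      string
	LastUpdated time.Time
}

// staleSerials returns the serial numbers of records whose last update is
// older than the retention window, measured from now.
func staleSerials(records []DriveRecord, now time.Time) []string {
	var stale []string
	for _, r := range records {
		if now.Sub(r.LastUpdated) > retention {
			stale = append(stale, r.Serial)
		}
	}
	return stale
}
```

Passing `now` in as a parameter, instead of calling `time.Now()` inside the function, keeps the logic easy to test.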

You can keep the scope of this PR as narrow as you'd like. Just having something working on the agent side is a huge help! I can handle the rest of it no problem.

There's also no rush as I have two other big PRs in the queue as well.

@sym0nd0

sym0nd0 commented Apr 13, 2025

> As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).
>
> Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

Let me know what you need (and more so how to pull it) and I'll happily provide from across my drives.

@geekifan
Contributor Author

Recently, I've been occupied with other projects and haven't been able to dedicate much time to the SMART feature development. However, I may now be able to allocate some time to work on this, particularly on the front-end and database aspects (though I can't guarantee significant progress at this stage).

Regarding the front-end implementation, I'd like to get your thoughts @henrygd : Do you think we should display the SMART data in a separate page, tab, or pop-up window? If so, where would you recommend placing it for optimal user experience?

I don't have much expertise in UI/UX design, so I'd be happy to hear any suggestions or ideas from anyone ;).

@henrygd
Owner

henrygd commented Apr 19, 2025

No worries Yifan! Please only work on it if you want to. Don't feel any obligation. What you've already done will be helpful even if you don't do anything more.

We don't need to commit to a specific design right now, but my first thought is to put the SMART data on its own page.

Here's how Scrutiny does it for reference: https://imgur.com/a/5k8qMzS

Maybe on /system/system-name/smart we can have a table similar to the 'All Systems' table that lists all the system's drives with the most useful info. Then clicking on a row will bring you to system/system-name/smart/drive-name with details.

Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

In the future maybe we can include a table on the home page that lists all drives from all systems as well.

IMO the most important part is getting the data where we need it. The layout can always be improved later.

Edit: We use shadcn so you might find something here that fits well: https://ui.shadcn.com

@evrial
Contributor

evrial commented Apr 19, 2025

> Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

I think this would be perfect, at least for a start: one panel with a table, one drive per row. Temperature sensor data could be added to the Temperature panel.

@wesgeorge

Thought it might be useful to provide some output from a system with a large number of drives.

My output comes from the same commands as above, filtered through grep -v serial. For smartctl I included only the JSON version; for the scan, both the tabular and JSON versions.

smartscan.txt
smartscanjson.txt

This setup is a total of 10 drives in the following configuration:

So the megaraid_disk_0n output in the scan is duplicate, and in the strictest sense, not all of the devices listed are actually SCSI devices. Probably doesn't matter if you're just using it to pull your list of devices for the output of smartctl, but I know that my /etc/smartd.conf (where the tests are configured) definitely cares that you specify the right type of disk (sat vs scsi) when invoking tests.

Also, I suggest that you key your data on the SN of the disk (or /dev/disk/by-id) rather than the device ID, because sometimes disks can change drive letters at boot when you have this many spread across multiple devices.
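
A small sketch of keying by serial number, which also collapses the duplicate scan entries described above. All names here are hypothetical and the struct is reduced to two fields; it is not the PR's actual data structure.

```go
package main

// DriveInfo is a hypothetical, reduced view of one scanned drive.
type DriveInfo struct {
	Serial string
	Device string // device path such as /dev/sda; may change between boots
}

// indexBySerial builds a map keyed by serial number, so a drive keeps its
// identity even if its device path changes. When the same serial shows up
// more than once, the first entry wins. Drives without a serial are skipped.
func indexBySerial(drives []DriveInfo) map[string]DriveInfo {
	bySerial := make(map[string]DriveInfo, len(drives))
	for _, d := range drives {
		if d.Serial == "" {
			continue
		}
		if _, seen := bySerial[d.Serial]; seen {
			continue
		}
		bySerial[d.Serial] = d
	}
	return bySerial
}
```

Whether the first or the last duplicate should win is a design choice; the first is kept here only to make the behavior deterministic.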

@geekifan
Contributor Author

geekifan commented Jun 16, 2025

@wesgeorge Thank you very much for providing the data and suggestions. I will modify the code for the agent part to make it more robust.

Besides, I finished a front-end demo using some hard drive data (with fake serial numbers) I have on hand. Does anyone have any suggestions? Personally, I prefer displaying all disks in a list format and showing more detailed SMART information by clicking on the corresponding row (just like Proxmox VE).

image

EDIT1: I finished the SMART detail dialog.

image

@zero77

zero77 commented Jun 19, 2025

@geekifan
This looks really good, but could you give some indication of warnings? Maybe an extra column showing the number and type of warnings, or something similar.

@geekifan
Contributor Author

@zero77 Thank you for your suggestion! I will display the "When Failed" attribute in the SMART information table and highlight the failed attributes in red (or add an error icon) based on this property.
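
On the data side, picking out the attributes to highlight could be sketched as below. This assumes the ATA JSON layout of smartctl, where attributes sit under `ata_smart_attributes.table` and `when_failed` is an empty string for healthy attributes; no ATA sample appears in this thread, so treat those key names as an assumption to verify against real output. The Go names are hypothetical.

```go
package main

import "encoding/json"

// AtaAttribute is a reduced view of one row of the ATA attribute table.
// JSON keys are assumed from smartctl's ATA output, not from this thread.
type AtaAttribute struct {
	ID         int    `json:"id"`
	Name       string `json:"name"`
	Value      int    `json:"value"`
	Thresh     int    `json:"thresh"`
	WhenFailed string `json:"when_failed"`
}

// ataReport mirrors only the part of the report that holds the table.
type ataReport struct {
	Attributes struct {
		Table []AtaAttribute `json:"table"`
	} `json:"ata_smart_attributes"`
}

// failedAttributes returns the attributes whose when_failed field is set,
// i.e. the rows a UI would mark in red.
func failedAttributes(data []byte) ([]AtaAttribute, error) {
	var report ataReport
	if err := json.Unmarshal(data, &report); err != nil {
		return nil, err
	}
	var failed []AtaAttribute
	for _, attr := range report.Attributes.Table {
		if attr.WhenFailed != "" {
			failed = append(failed, attr)
		}
	}
	return failed, nil
}
```

The length of the returned slice could also feed a per-drive warning count column like the one suggested above.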

geekifan added 4 commits June 20, 2025 10:47
Updated the SmartManager's methods to use the device's serial number as the key in the SmartDataMap instead of the device name.
Introduced a new Disks tab in the SystemDetail component to display disk information and S.M.A.R.T. data. The tab includes a table for visualizing disk attributes and their statuses.

Also added SmartData and SmartAttribute interfaces to support the new functionality.
@muro-dot

muro-dot commented Jun 27, 2025

It's encouraging to see that the feature I was going to suggest is already in the works.
However, I was wondering if the temperature of the HDD could be integrated into the existing temperature tab to track its history?

@geekifan
Contributor Author

@muro-dot On the agent side, the hard drive temperatures are read via SMART data and then incorporated into the system temperature readings. Therefore, you can find the temperature curves of different hard drives in the temperature sensor charts.

@muro-dot

muro-dot commented Jul 2, 2025

Is there any chance to try it out before henrygd officially releases it? I'm really excited about the new features.
Sorry for the comment not directly related to the development.

@geekifan
Contributor Author

Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

@svenvg93
Collaborator

> Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

I agree with starting with alerts based on the smartd output. Alerts based on other metrics can always be added later if there is a need for them.

@henrygd henrygd removed the Stale label Sep 30, 2025
@henrygd henrygd moved this from Done to Next in Beszel Roadmap Sep 30, 2025
@RikudouGoku

Is there any ETA for when this is added? Looks awesome. I would just change the "Type SAT" to "Type SATA" though. SAT sounds weird lol. "SAS" is a different type though.

@henrygd
Owner

henrygd commented Oct 1, 2025

I'll try to get this in soon. Hopefully this month.

@RikudouGoku

> I'll try to get this in soon. Hopefully this month.

Awesome!

@geekifan
Contributor Author

geekifan commented Oct 2, 2025

@henrygd Thanks, Hank! This PR is almost done except for the SMART monitoring alerts. I'm busy with my academic work, so I have no time to finish them. I would appreciate it if you could finish the rest. This PR is now review-ready.

@geekifan geekifan marked this pull request as ready for review October 2, 2025 01:58
@henrygd
Owner

henrygd commented Oct 2, 2025

No worries Yifan, I'll finish it off. Thanks again for your work 👍

@M3rcur-x

M3rcur-x commented Oct 2, 2025

Hi,
Would it be possible to add smartctl's "-n standby" parameter to avoid waking up sleeping disks? Maybe this could be configurable.

@henrygd henrygd self-assigned this Oct 22, 2025
@henrygd henrygd moved this from Next to In Progress in Beszel Roadmap Oct 22, 2025
@henrygd henrygd changed the base branch from main to 614-smart October 24, 2025 13:48
@henrygd henrygd merged commit 16d5ec2 into henrygd:614-smart Oct 24, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Beszel Roadmap Oct 24, 2025
@henrygd
Owner

henrygd commented Oct 24, 2025

This has finally been added. I need to finish the documentation and clean up a few things, but I'll try to have a release out this weekend.

Thanks again for your efforts, Yifan! Sincerely appreciated.

@M3rcur-x I added standby handling so it should only wake disks once. Then if the disk is sleeping again it will use the previous data.
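
One way to picture the behavior described here, as a sketch rather than the code that was merged: keep the last snapshot per device, query without `-n standby` only until a first snapshot exists, and fall back to the cached snapshot whenever a later query is skipped because the disk is asleep. `StandbyCache`, `SmartSnapshot`, and both methods are hypothetical names; how the caller detects that smartctl skipped a sleeping disk is left out.

```go
package main

// SmartSnapshot is a hypothetical, reduced SMART reading for one device.
type SmartSnapshot struct {
	Temperature int
}

// StandbyCache remembers the last snapshot seen for each device path.
type StandbyCache struct {
	last map[string]SmartSnapshot
}

// NewStandbyCache returns an empty cache.
func NewStandbyCache() *StandbyCache {
	return &StandbyCache{last: make(map[string]SmartSnapshot)}
}

// Args builds smartctl arguments for a device. "-n standby" is added only
// once a snapshot is cached, so a disk is woken at most for its first read.
func (c *StandbyCache) Args(device string) []string {
	args := []string{"-a", "-j"}
	if _, ok := c.last[device]; ok {
		args = append(args, "-n", "standby")
	}
	return append(args, device)
}

// Resolve returns the snapshot to report. Pass nil for fresh when the disk
// was asleep and no new data was read; the cached snapshot is then reused.
// The boolean is false when neither fresh nor cached data exists.
func (c *StandbyCache) Resolve(device string, fresh *SmartSnapshot) (SmartSnapshot, bool) {
	if fresh != nil {
		c.last[device] = *fresh
		return *fresh, true
	}
	cached, ok := c.last[device]
	return cached, ok
}
```

A caller would run smartctl with `Args(device)`, parse the result if the disk answered, and pass either the parsed snapshot or nil to `Resolve`.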

10 participants