
Subjective Failure Criteria #637

Open
Ziraya opened this issue Jan 5, 2025 · 6 comments

Comments

@Ziraya

Ziraya commented Jan 5, 2025

I'm dealing with a large volume of drives in extremely varied condition. In almost every batch I have one or more drives that, by all concrete objective criteria, function, but are doing so at, say, 30kb/s. If I were looking at the screen I would know that such a drive is actually dead and does not need to continue. Worse, I have found that drives performing this badly are often polluting the SCSI bus, confusing the controller, or stuck in some other underlying problem state, and making every other drive in the same grouping run much worse.

I'm slowly figuring out how to identify the drive and manually disconnect it, but it would be nice if I could configure nwipe to recognise obvious numerical conditions that indicate a drive is not worth continuing with, so that it can terminate that drive's job and maybe disable the drive entirely. I have not yet figured out how that last bit is done in Linux, or whether it can be done in BusyBox.

Some criteria I can immediately think of:

  • Transfer rate below X for Y minutes/hours
  • Transfer drops below X, Y or more times per hour
  • ETA exceeds X standard deviations above disks of a similar size or the same model
  • Temperature outside of W..X range for Y minutes

I don't know what other information is easily available; these criteria use information that already appears on screen. If device resets and errors can be tracked, they would also be good candidates for quotas. The temperature limit would additionally help protect the batch against a cooling failure, since the drives would all fail out instead of cooking until they finish or die.

At least the four criteria I've thought up would both reduce power consumption and potentially speed up the batch, and if set tighter they would also serve as a rough grading filter for drives still worth using.
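To make the first criterion concrete, here is a minimal, hypothetical sketch of a "transfer rate below X for Y minutes" check; the function name and threshold values are illustrative assumptions only, not an existing or proposed nwipe interface.

```c
/* Hypothetical "transfer rate below X for Y minutes" check.
 * Names and thresholds are illustrative only. */
#include <stdbool.h>
#include <time.h>

#define MIN_RATE_BYTES_PER_SEC (1024 * 1024)   /* X: e.g. 1 MiB/s */
#define MAX_SLOW_SECONDS       (10 * 60)       /* Y: e.g. 10 minutes */

/* Called periodically with the current averaged transfer rate;
 * returns true once the drive has stayed below X for Y continuously. */
bool should_abandon_drive(double rate_bytes_per_sec)
{
    static time_t slow_since = 0;

    if (rate_bytes_per_sec >= MIN_RATE_BYTES_PER_SEC) {
        slow_since = 0;                 /* back above the threshold: reset */
        return false;
    }
    if (slow_since == 0)
        slow_since = time(NULL);        /* just dropped below the threshold */

    return (time(NULL) - slow_since) >= MAX_SLOW_SECONDS;
}
```

The other criteria (drop counts per hour, ETA outliers, temperature windows) would follow the same shape: sample a value the UI already tracks and compare it against a configured limit over a time window.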

@PartialVolume
Collaborator

PartialVolume commented Jan 6, 2025

While your suggestions are all valid, there is a fundamental problem that means none of them would work with nwipe at the moment. It comes down to the I/O write method used: cached I/O versus direct I/O.

Currently nwipe uses cached I/O, so every block of data supposedly written to the disk is actually stored in the Linux disk cache, and the cache manager then writes the block to disk at some point. When nwipe issues a write command, the return status does not provide any information about whether the write to the drive was successful, other than that it was successfully written to cache. nwipe can only detect a failure by then issuing an fdatasync, which tells the cache it MUST flush all cached blocks to the drive and return a success or fail status. The problem is that you can't issue an fdatasync for every block you write, as the disc transfer speed drops drastically, so you issue an fdatasync every X number of blocks written. In our case I think it's about every 100,000 blocks, which was found to be a good compromise between speed and checking for I/O errors, which then happens every few seconds.
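Purely for illustration, here is a minimal sketch of that cached-I/O pattern; it is not nwipe's actual code, and the device path, block size and sync interval are placeholder assumptions. Each write() only lands in the page cache, and it is the periodic fdatasync() that surfaces real I/O errors.

```c
/* Minimal sketch of buffered writes with a periodic fdatasync();
 * not nwipe's source code. Device path and sizes are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE    4096
#define SYNC_INTERVAL 100000UL   /* flush and check for errors every N blocks */

int main(void)
{
    const char *dev = "/dev/sdX";            /* placeholder device */
    int fd = open(dev, O_WRONLY);            /* buffered (cached) I/O */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = calloc(1, BLOCK_SIZE);       /* pattern block to write */
    if (!buf) { close(fd); return 1; }

    for (unsigned long i = 0; i < 1000000UL; i++) {
        if (write(fd, buf, BLOCK_SIZE) < 0) {
            perror("write");                 /* rarely fails: data only went to cache */
            break;
        }
        if ((i + 1) % SYNC_INTERVAL == 0) {
            /* Force the cached blocks out to the drive. This is where a
             * failing disc either returns EIO or, as described below,
             * may never return at all. */
            if (fdatasync(fd) < 0) { perror("fdatasync"); break; }
        }
    }

    free(buf);
    close(fd);
    return 0;
}
```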

Now, what's happening in your case is that you see the speed of the drive drop to 30kb/s, which is an average over time; in fact it's very likely there is no transfer occurring at all. nwipe issued an fdatasync and the cache manager tried to flush the cache to the disc, but it can't because the disc isn't responding, so the cache manager just keeps trying. There is no timeout in the cache manager, and I can't set one as far as I'm aware, so fdatasync never returns control to nwipe's wipe thread for that drive; effectively the wipe thread for that drive has hung. There is no command I can issue to the cache manager to drop the cache, stop trying to write to the faulty disc, and return control to nwipe's wipe thread. Interestingly, this doesn't happen with all drives; other times the fdatasync will report an I/O error after a few seconds, which the wipe thread can already handle by printing an I/O error on screen and incrementing the I/O error or pass error counter.

Now there is a possible solution. To resolve this problem, nwipe needs to switch from cached I/O to direct I/O as its default I/O method. nwipe would then manage writing large blocks of data to the disc, transferring immediately from CPU memory to disc. With direct I/O you have the advantage of a return status on the write function that confirms whether the write was successful, unlike cached I/O where you have little fine-tuned control over cache management, and direct I/O does not require the use of fdatasync. Direct I/O does make the code more complex, but I don't think that's a problem, as I believe there are more pros than cons. It's also possible to set a timeout on the write call to catch any drives that fail to respond within a set period of time.
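Again purely as an illustrative sketch of the direct-I/O approach (not nwipe's implementation; the device path, chunk size and alignment are assumptions): with O_DIRECT the buffer must be suitably aligned, and the success or failure of each write() reflects the transfer to the device itself, with no fdatasync() needed.

```c
/* Hypothetical direct-I/O sketch; not nwipe's implementation. */
#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)       /* assumed 1 MiB write size */

int main(void)
{
    const char *dev = "/dev/sdX";                 /* placeholder device */
    int fd = open(dev, O_WRONLY | O_DIRECT);      /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = NULL;
    /* O_DIRECT requires the buffer, length and file offset to be aligned,
     * typically to the device's logical block size. */
    if (posix_memalign(&buf, 4096, CHUNK) != 0) { close(fd); return 1; }
    memset(buf, 0, CHUNK);                        /* pattern to write */

    ssize_t n = write(fd, buf, CHUNK);
    if (n < 0)
        perror("write");          /* an I/O error now arrives on the write itself */
    else
        printf("wrote %zd bytes directly to %s\n", n, dev);

    free(buf);
    close(fd);
    return 0;
}
```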

Can you tell me more about the hardware you have? The output of lspci would be useful, but also details about the discs that fail, e.g. make/model and whether they are SATA/SAS/IDE/SCSI. I've only come across one or two discs that fail without nwipe reporting I/O errors. I think a hang is more likely to occur on an old SCSI disc using an LSI Logic SCSI controller, though I think I've seen it happen on IDE, and not so much on SATA. But maybe it depends on what type of fault the drive has.

@PartialVolume
Collaborator

PartialVolume commented Jan 6, 2025

Just thinking of possible workarounds for you. It's possible to run multiple nwipe instances simultaneously on ALT-F2 in ShredOS, each nwipe wiping just one drive (although it doesn't have to be just one). Then, when one drive causes a problem, you just kill that one nwipe and the others carry on. However, each nwipe started this way must be given the drive you want to wipe on the command line, e.g. nwipe /dev/sda. If you don't do this and just type nwipe for the second instance, the second or subsequent nwipe will hang and not display its GUI (caused by an issue with libata). So if you don't know tmux you will need to read up on it; it allows you to run multiple terminals within a virtual terminal.

Here's tmux running 4 terminals in ShredOS. You are not limited to four; I've run 16 terminals, limited only by monitor size and resolution.

Whether killing the nwipe process would kill the disc cache I/O for a drive that's not responding, and free up the bus to allow the other drives to continue, I don't know. I guess you would have to try it and see.

[screenshot: tmux running four nwipe terminals in ShredOS]

@PartialVolume
Collaborator

PartialVolume commented Jan 6, 2025

Of course, you may not be using ShredOS but a regular distro, in which case you could just open multiple terminals and run nwipe in each one: sudo nwipe /dev/sda, sudo nwipe /dev/sdb, sudo nwipe /dev/sdc, etc. Then you don't need to use tmux.

@Ziraya
Author

Ziraya commented Jan 7, 2025

I am running ShredOS, though I may at some point learn how to build it from scratch so I can include a fully functional copy of lsscsi and an up to date version of sg3_utils.

I feel like I'll need to re-read this later to be sure I've comprehended everything. My presumption is that this logic would operate on the speed values as they're written to the screen; that should be sufficient even if there are some layers of abstraction.

I'll try to get lspci output tomorrow; I don't have network to the machine, so I'll have to copy it onto the flash drive it's booting from. Broad sweep: it's a Dell- or HP-branded LSI SAS2 HBA with LSI firmware. Previously I had a bunch of miniSAS to 4x SATA cables; now I've got a Supermicro SAS2-216EL1 (24x U.2 and a severe bottleneck, did some bad math), and this was occurring in both setups. I had two such drives today, one a Toshiba and one a Seagate, both from their respective generic 2.5" SATA series. It's primarily SATA drives.

@Ziraya
Author

Ziraya commented Jan 9, 2025

I expected this to say more for some reason.

lspci.txt

Callouts:

02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
06:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)
41:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)

There are 3 SAS controllers: one built into the motherboard and 2 cards. This is, I believe, an HP Z640 workstation, because I didn't have anything else with good PCIe. It's dual socket with a pair of middling-spec Xeons and 64GB of registered whichever-generation-this-is DDR. Each processor gets its own dedicated SAS2116, one of which is currently not connected to anything.

@PartialVolume
Collaborator

> I expected this to say more for some reason.

lspci -k would also have shown which driver was loaded for each hardware component, which is preferable.

Thanks for that info, it's useful to know which LSI controllers you have.

As for a resolution, I'm afraid it will have to wait until I can develop the direct I/O option for you to try. I can then implement your earlier suggestions. As to a timescale for a direct I/O version, I can't provide one at the moment.
