no temperature for SAS drives #497
after having fixed the bustype detection I looked for a way to get the drive temperature (and only the drive temperature). And another question: the interactive display of nwipe seems to refresh quite often, and the only load on the machine is the disks being wiped. Especially when thinking about getting drive temps via SMART it seems useful to me to reduce the rate of asking for temperatures, as these do not tend to change within milliseconds.
Yes, may well be worth looking at the hddtemp code for ideas. Regarding screen refresh, both screen refresh and keyboard response are the same in nwipe. The screen refreshes and the keyboard is checked for a key hit no faster than every 250ms. Any faster and it's unnecessary. Any slower and the user will notice the keyboard feeling less responsive. This is accomplished by these three lines that you will see in most gui functions.
However the temperature updates, although inside these loops, happen far less frequently. If you look in the gui.c file at the compute_stats function you will find the if statement as shown below. This if statement limits how often nwipe obtains the temperature values from hwmon or from smartctl to only once every 60 seconds. So calling smartctl (if we used that) won't have the same impact on resources, as we are not reading temperatures at screen/keyboard refresh speeds.
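The original snippet isn't reproduced above; a minimal sketch of that kind of throttle inside compute_stats(), assuming the drive context keeps a last-read timestamp (the temp1_time field name is illustrative; nwipe_update_temperature() is the existing function mentioned later in this thread), would be:

```c
/* Sketch of a once-per-60-seconds temperature read; needs <time.h>.
 * The temp1_time field name is an assumption, not nwipe's exact identifier. */
time_t now = time( NULL );

if( now - c->temp1_time > 60 )
{
    /* read the temperatures (hwmon, or smartctl as a fallback) for this drive */
    nwipe_update_temperature( c );

    c->temp1_time = now;
}
```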
The temperature code is quite complicated as it also handles the temperature limits of the device and changes the colour and blink rate depending on whether the critical upper and lower temperature limits have been reached. However it would be relatively easy to add additional code for smartctl or hddtemp by adding extra code in temperature.c; you wouldn't need to worry about the GUI code.
Basically you just need a function that obtains the drive temperatures from SAS and updates the relevant temperature fields in the drive context structure for each drive. This function would be called if the bus type is SAS and the temperature field in the drive context is null, i.e. not found by hwmon. The SAS function would be called from within the
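In other words, something along these lines at the call site. This is a sketch only; the bus-type constant, temperature field and fallback function names are placeholders, not existing nwipe identifiers:

```c
/* Hypothetical call site; the bus-type constant, temperature sentinel and the
 * fallback function are placeholders, not existing nwipe identifiers. */
if( c->device_bus == BUS_SAS && c->temp1_input == TEMPERATURE_UNKNOWN )
{
    /* hwmon found nothing for this drive, so read the temperature
     * directly from the SAS device instead */
    nwipe_update_temperature_sas( c );
}
```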
tnx for that great reply.
I was just looking at the hddtemp code https://github.com/guzu/hddtemp/tree/master/src and I think maybe it's easier to use smartctl. We already use smartctl for various things and already check it exists on the system prior to using it. We note in the logs if it's missing. In the nwipe code there are already examples of obtaining the output from smartctl. Looking at hddtemp, it looks too much like a steep learning curve to extract the relevant bits we need, and incorporating all of its code into nwipe looks like too much unnecessary code. So I'd prefer to go with smartctl unless you can write a succinct low level function to access the smart data direct from the drive.
it's true that the use of smartctl would be easier to implement. But those HP SAS drives in my current server tend to produce a significant delay when queried that way. So I think it's worth the work to integrate the hddtemp source code.
I agree, those sort of delays would be unacceptably long. Especially the 2.7 seconds, bad enough wiping one drive let alone 28 simultaneously. So hddtemp looks like a better option. What would be perfect would be a small low level function that just accesses the temperature from the smart data. I wrote a low level function that is used to obtain hpa/dco max sector information from the drive in much the same way hddtemp and smartctl do. I would expect a function written like this to extract the temperature to be the fastest by far, assuming the drives themselves don't have significant response delays while writing. If you want to take a look at that low level access to a drive that I use to obtain the hpa/dco real max sectors, take a look in the file hpa_dco.c for the function. I believe SAS uses either the SCSI command set or the STP protocol. Maybe this link would be useful too. https://www.scsita.org/sas_and_serial_ata_tunneling_protocol_stp/ Here's the idea of the low level function as an example: basically it reads a 512-byte block using an ioctl and then from the contents of that block I extract the real max sectors. The block content format is defined in the ATA standards document, which is where you also find the smart data format.
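The function itself isn't shown here; the sketch below only illustrates the general technique (HDIO_DRIVE_CMD with ATA IDENTIFY DEVICE, then decoding words 100-103 of the 512-byte block). It is not the actual code from hpa_dco.c, and reading the native max sectors hidden behind an HPA would need further ATA commands on top of this:

```c
#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Read the 512-byte ATA IDENTIFY DEVICE block and return the current
 * addressable sector count (identify words 100-103).  Sketch only. */
static int ata_identify_sectors( const char* device, uint64_t* sectors )
{
    /* HDIO_DRIVE_CMD layout: command, sector number, feature, sector count,
     * followed by the data returned by the drive. */
    unsigned char args[4 + 512];
    int fd;

    memset( args, 0, sizeof( args ) );
    args[0] = WIN_IDENTIFY;   /* 0xEC, ATA IDENTIFY DEVICE */
    args[3] = 1;              /* one 512-byte sector of data expected */

    if( ( fd = open( device, O_RDONLY | O_NONBLOCK ) ) < 0 )
        return -1;

    if( ioctl( fd, HDIO_DRIVE_CMD, args ) != 0 )
    {
        close( fd );
        return -1;
    }
    close( fd );

    /* Identify data starts at args+4; each word is 16-bit little endian. */
    const unsigned char* id = args + 4;
    uint64_t count = 0;
    for( int w = 103; w >= 100; w-- )
        count = ( count << 16 ) | (uint64_t) ( id[2 * w] | ( id[2 * w + 1] << 8 ) );

    *sectors = count;
    return 0;
}
```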
wow, cool stuff.
Well, finally we'll see what happens when we test the code on my system, with these lovely HP HDDs ;-) Regarding your low-level suggestions: I do not yet fully understand why you do the complete cycle of trying to read the whole set of ... And: we have the ... And last, but not least: hddtemp knows/differentiates (at least in the header) between ATA and SATA. Could this be the same as IDE and ATA in nwipe? (I find it just a very little bit confusing to read ATA instead of the expected SATA in the drive overview.)
Yes, the temperature/limits logging in verbose mode is excessive to say the least. It's left over from when I wrote the code; I have been meaning to do something about it, just never got around to tidying it up.
I think the problem here is libparted can't tell the difference between SATA and IDE. It identifies them as ATA as that's the command set they all use. I don't know whether there is a way of differentiating between SATA and IDE? Any suggestions for improving this are welcome, although I don't want to just change ATA to SATA as IDE drives would then be called SATA.
ok, so we go ahead with ATA. And I'll try the separation into an init function that would fill the limit values, and the "normal" readings during wiping. Stay tuned ;-)
There is a function in temperature.c that's called once at startup. I'm not sure what your init function is going to do, but maybe it could fit inside this function?
exactly, that's where it will be dropped in :-)
@ggruber Just so we don't end up duplicating effort, I'm working on the PDF enable/disable and PDF preview config options on the GUI config page. The values of these will be loaded at startup from nwipe.conf but if the user specifies PDF enable/disable on the command line that will override the value in nwipe.conf.
fine for me
Just found that sg3-utils contains a shell wrapper script, scsi_temperature, which basically calls ... OK, I have the trip temp from within the hddtemp code.
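For what it's worth, a self-contained sketch of what such a tool boils down to at the ioctl level, a LOG SENSE for the SCSI temperature log page (0x0D) issued through SG_IO, could look like this. It is a sketch under those assumptions, not code taken from sg3-utils, hddtemp or nwipe:

```c
#include <fcntl.h>
#include <scsi/sg.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Issue LOG SENSE for the temperature log page (0x0D) and pull out the
 * current temperature in degrees Celsius.  Sketch only. */
static int scsi_read_temperature( const char* device, int* celsius )
{
    unsigned char cdb[10] = { 0x4D, 0x00, 0x40 | 0x0D, 0, 0, 0, 0, 0, 0xFF, 0 };
    unsigned char buf[255];
    unsigned char sense[32];
    struct sg_io_hdr io;
    int fd;
    int rc = -1;

    if( ( fd = open( device, O_RDONLY | O_NONBLOCK ) ) < 0 )
        return -1;

    memset( buf, 0, sizeof( buf ) );
    memset( &io, 0, sizeof( io ) );
    io.interface_id = 'S';
    io.cmd_len = sizeof( cdb );
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.dxferp = buf;
    io.dxfer_len = sizeof( buf );
    io.sbp = sense;
    io.mx_sb_len = sizeof( sense );
    io.timeout = 5000; /* milliseconds */

    if( ioctl( fd, SG_IO, &io ) == 0 && io.status == 0 )
    {
        /* Response: 4-byte page header, then the first log parameter
         * (code 0x0000 = temperature): a 4-byte parameter header followed by
         * a reserved byte and the temperature byte.  0xFF means "no data". */
        if( buf[9] != 0xFF )
        {
            *celsius = buf[9];
            rc = 0;
        }
    }

    close( fd );
    return rc;
}
```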
Is there a special reason to refresh the screen by constantly polling the temperatures during disk selection? I'll try to find and fix this. It seems the limit (get temperature only once every 60 seconds) is not applied here.
It's looking good. I'm getting an implicit declaration warning when I compile that needs fixing.
At the time I think it was so that if the drive was too hot, nearing critical, you could see the temperature change more quickly in drive selection as you attempt to cool the drive down before starting a wipe. As hwmon is relatively quick it didn't matter, but in practice there is really no reason why drive selection shouldn't poll less frequently, at once every sixty seconds like drive wiping does.
I was just looking at the drive selection GUI code and there is a bit of code there that refreshes the temperature every 1 second. That one-second interval is important for the flashing HPA messages that appear when determining HPA/DCO drive status, so that code shouldn't be changed.
However ... At line 828 of the master code, the following line exists
this should really be in an if statement so that it updates every 60 seconds rather than 1 second. Or even a new function, called nwipe_temperature_update_60s, that contains the if statement, as this code is duplicated elsewhere. I'm surprised there is much of a delay when reading temperatures on SAS drives. How many milliseconds is it taking to read the temperature on each SAS drive?
To measure the execution time of the existing nwipe_update_temperature() function I placed some timing code before and after the function and had it print the execution time to the log.
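The timing code itself isn't shown in the thread; something along these lines would do it. This is a sketch: the CLOCK_MONOTONIC wrapper is an assumption, and nwipe_log()/NWIPE_LOG_INFO are taken to be nwipe's existing logging facility.

```c
#include <time.h>

/* Wrap the existing call in a pair of clock_gettime() readings and log the
 * difference in microseconds; a sketch, not the code actually used. */
struct timespec t0, t1;
long elapsed_us;

clock_gettime( CLOCK_MONOTONIC, &t0 );
nwipe_update_temperature( c );
clock_gettime( CLOCK_MONOTONIC, &t1 );

elapsed_us = ( t1.tv_sec - t0.tv_sec ) * 1000000L + ( t1.tv_nsec - t0.tv_nsec ) / 1000L;

nwipe_log( NWIPE_LOG_INFO, "nwipe_update_temperature() took %ld microseconds", elapsed_us );
```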
I ran nwipe and left it at the drive selection screen then aborted it after 10 seconds. Here are the function execution times.
So on my 20-core i7 we are talking about anything from 91µs to 316µs; however I have only one drive and it's an NVMe WD Blue SN570 1TB. Would be interesting if you ran the same test on your multiple SAS drives.
Only those old HP drives cause real pain. I'll interrupt the running nwipe and put some other disks in the box, among them also NVMe. I'll integrate your suggestions and tidy up the code a bit.
Nice 👍 I like to see lots of flashing lights, you know things are happening. Could also be because I like old 50s & 60s science fiction movies where the computers always had an abundance of flashing lights 🤣 It also demonstrates a 'feature' of nwipe, where it pauses when it's writing as opposed to syncing. This pause is more noticeable when it's doing a PRNG pass, and I always thought it could be reduced by having the PRNG stream generated in a free-running thread that writes into a buffer: when the next PRNG block is needed it has already been calculated, so the wipe thread doesn't have to wait for it. Each wipe thread would need to create a PRNG thread, so for a 20-drive wipe, with each drive having its own thread, there would be 20 PRNG threads; nwipe would be running 40 wipe/PRNG threads plus the GUI status thread. This is one of those things that has been talked about in the past but we never got round to optimising that part of the wipe thread. One day we will get round to doing it, but adding ATA secure erase is the next priority for me at least.
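To make the idea concrete, here is a minimal sketch of such a free-running PRNG producer with a two-slot buffer handed over to the wipe thread. Everything in it is illustrative (the block size, the rand() placeholder, the names); it is not nwipe code:

```c
#include <pthread.h>
#include <stdlib.h>

#define PRNG_BLOCK ( 1024 * 1024 )

/* Two-slot handoff buffer shared by one PRNG thread and one wipe thread. */
typedef struct
{
    unsigned char buf[2][PRNG_BLOCK];
    int ready[2];                      /* 1 = filled, waiting to be written */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} prng_pipe_t;

/* Stand-in for the real PRNG; rand() is only a placeholder. */
static void fill_prng_block( unsigned char* dst, size_t len )
{
    for( size_t i = 0; i < len; i++ )
        dst[i] = (unsigned char) rand();
}

/* Free-running producer: keeps the next block ready before the writer asks. */
static void* prng_thread( void* arg )
{
    prng_pipe_t* p = arg;
    int idx = 0;

    for( ;; )   /* a real implementation would also check a stop flag */
    {
        pthread_mutex_lock( &p->lock );
        while( p->ready[idx] )                  /* wait until slot consumed */
            pthread_cond_wait( &p->cond, &p->lock );
        pthread_mutex_unlock( &p->lock );

        fill_prng_block( p->buf[idx], PRNG_BLOCK );

        pthread_mutex_lock( &p->lock );
        p->ready[idx] = 1;
        pthread_cond_broadcast( &p->cond );
        pthread_mutex_unlock( &p->lock );

        idx ^= 1;                               /* alternate between the slots */
    }
    return NULL;
}

/* Called from the wipe thread: returns a block that is already filled. */
static unsigned char* next_prng_block( prng_pipe_t* p, int idx )
{
    pthread_mutex_lock( &p->lock );
    while( !p->ready[idx] )                     /* normally already filled */
        pthread_cond_wait( &p->cond, &p->lock );
    pthread_mutex_unlock( &p->lock );
    return p->buf[idx];
}

/* Called from the wipe thread after the block has been written to disk. */
static void release_prng_block( prng_pipe_t* p, int idx )
{
    pthread_mutex_lock( &p->lock );
    p->ready[idx] = 0;                          /* slot free for refilling */
    pthread_cond_broadcast( &p->cond );
    pthread_mutex_unlock( &p->lock );
}
```

The wipe thread would alternate idx between 0 and 1: call next_prng_block(), write the block to the drive, then release_prng_block(), so the producer can refill that slot while the other one is being written.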
that's why I used the chance, with the arrival of 12 SATA disks which I want to wipe, to change some more disks. I'm having some trouble getting the box up as the NVMe drives are not blank but contain a former version of the OS that is now installed. ZFS doesn't like to find identically named pools, especially not from the initramfs.
here is a logfile with the results from your suggested time measurement code included.
Those timings look pretty good to me, the worst being 1.360 milliseconds; in percentage terms that means 0.136% of every second is taken up obtaining the drive temperature, even if writing to the drive stopped during that period (which I don't think it does, but there is always that possibility). Extrapolating that out to the time period of a full wipe, say 4 hrs (240 minutes or 14400 seconds), the wipe might take an extra 14400 / 100 * 0.136 = 19.58 seconds to complete, and that's checking the temperature every second. Would you agree with that? If I've not made a big mistake in my calculations, then I would say checking temperature at 1-second intervals is fine; however, is that what happens in practice? So on your hardware, if you wiped a single drive twice to completion, once with a 1-second interval between temperature checks and then repeated the wipe with 60 seconds between checks, is there much difference?
Also, the temperature checking is operating in a different thread to the actual disk wipe, so on a multi-core processor it should not delay the wipe thread too much.
Unrelated to the temperature, but thinking about the pause I could see when the drives were writing rather than syncing, I was just wondering: when you have 28 drives wiping and using the PRNG method, what is
I think you are correct about having the smart data as a single column rather than two, so it occupies pages two and three. I just did a wipe on a disk and the left column text overlaps with the right column in places and the right column gets cropped to the right. So yes, the two columns need to be changed to a single column.
that should go to the PDF issue, and fits with my suggestions. I'll continue with the timing measurements and will integrate temperature min/max saving for SAS drives. And I prepared a cleanup function that closes the open fd and frees malloced memory (for a single SAS drive). Still looking for the best place to call it.
another gui question: the firmware of my LSI/Avago/Broadcom HBA distinguishes between SAS, SATA, SAS-SSD and SATA-SSD.
Yes, I remember now, on some drives you have to set the two values as reported by ... And in some other makes/models the dco-restore may not be necessary. Other drives don't care about having to set the HPA, you can just do a dco-restore. It will certainly be a challenge to automate this in nwipe to behave consistently for any make/model of drive. Quite possible, but a challenge.
done.
could stop us for a longer time.
Resetting the HPA didn't work, it's back to 468862128 blocks rather than 468862129. Could just be a 'feature' of this drive. If it behaves the same on other hardware, I guess the drive's implementation of HPA/DCO isn't great. Perhaps it hides a block for its own purposes, whatever that might be. Anyway the code in nwipe is detecting the hidden block correctly so I'll leave it at that.
what about SSD secure erase? Is it on the roadmap for the next release?
I think we'll freeze any new features so I can get this release out. I'm happy with what we've done, so I think we are ready for release v0.35 to be published.
Should I have a (quick) look for a way to sort the disk list? And what happened to the temp reading thread?
Yes, that is at the top of the list for the v0.36 release
I nearly forgot about that! Thanks for reminding me. Yes, that has to be in this release (0.35).
I'll work on the temp thread if you want to take a look at the sorting.
The sorted device list is in my repo now; would be glad if someone would test it @PartialVolume. Btw, for testing purposes I added another disk, so we have
Tested, I built it on the server and compared the original listing with the sorted one. All looks good. I randomly picked a drive, /dev/sdq, checked its contents (0x74), then started a wipe on just that drive and aborted. Rechecked that the contents of /dev/sdq were now 0x00, which they were. Looks good, as far as I'm concerned you can do a PR on that. Nice bit of code. Looks like you are sorting the pointers to the contexts based on drive name? I don't think I'm going to be quite as quick as you with the temperature thread. If I don't get it completed this evening, it may be a couple of days as I'm very busy over the weekend.
tnx for testing/confirmation, PR started
That's what I did
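In outline that is just a qsort over the array of context pointers with a name comparator, something like the sketch below. The device_name field is an assumption; the real nwipe context may name it differently:

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort over an array of drive-context pointers; device_name
 * is an assumed field name, not necessarily nwipe's exact identifier. */
static int compare_context_name( const void* a, const void* b )
{
    nwipe_context_t* ca = *(nwipe_context_t* const*) a;
    nwipe_context_t* cb = *(nwipe_context_t* const*) b;
    return strcmp( ca->device_name, cb->device_name );
}

/* usage, with c being the nwipe_context_t* array and count the number of drives:
 *     qsort( c, count, sizeof( nwipe_context_t* ), compare_context_name );
 */
```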
Temperature thread and sort committed. Looks good; it looks like those lags in the GUI have gone. Could you check it out? Thanks. nwipe-2023-10-22_01.00.49.mp4
Looks impressive in comparison to the behaviour before; will have to check the blinking ;-) I pimped eraseHDD a bit in the meantime; I could think about renaming it to "Coarse Disk Analyser", with information gathering as its main function and an optional (exchangeable or configurable) wiping backend.
If by blinking you mean the disk activity light blinking, this is a consequence of using fdatasync periodically to check that the data is being written without any errors. When an fdatasync is issued the Linux disc cache is forced to flush for a given drive; this is done to detect I/O errors. If fdatasync is disabled by setting --sync=0 in the options you will probably find the disk activity light stays on permanently, i.e. it's continually writing. However, if --sync=0 then nwipe can't detect an I/O error, so the only way you may know a drive has failed is that the throughput for that particular drive will start to drop away. You can increase the period between fdatasyncs, however tests done in the past showed that the default setting, I think 100,000 writes, was a good compromise. An alternative to using the Linux disc cache method of writing is to use dual-buffered direct I/O, where no Linux cache is involved and fdatasyncs are unnecessary. Nwipe would be immediately informed of an I/O error and there would be no momentary pause in writing as the cache is flushed by fdatasync. This has been discussed a lot in the past if you look back through the issues. Dual-buffer direct I/O is something I would like to add to nwipe but other priorities always mean I never get around to working on it. If that's something you fancy working on, please read through all the past discussions about this; if you can't find them let me know and I'll find the links. In the meantime try wiping discs with --sync=0; you may find there is maybe a 5% increase in speed, but if any discs have an I/O error nwipe may hang. One fix for that hang may be for nwipe to monitor throughput and terminate a drive's thread if its throughput is ridiculously low.
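For anyone following along, the write/sync pattern being described is roughly the fragment below. It is a sketch: fd, buffer and blocksize come from the surrounding wipe loop, and the interval of 100,000 writes is the default mentioned above, not a value taken from the source.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Flush the kernel page cache every N writes so that any underlying I/O
 * error surfaces promptly; names and structure are illustrative only. */
unsigned long writes_since_sync = 0;
const unsigned long sync_interval = 100000;     /* default mentioned above */

if( write( fd, buffer, blocksize ) != (ssize_t) blocksize )
{
    /* handle short write or write error here */
}

if( ++writes_since_sync >= sync_interval )
{
    writes_since_sync = 0;

    if( fdatasync( fd ) != 0 )
    {
        /* the flush surfaces an I/O error the page cache was hiding */
        fprintf( stderr, "fdatasync failed: %s\n", strerror( errno ) );
    }
}
```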
I see. Thanks for explaining this. For just 5% better performance: well, I'd prioritise smart SSD wiping much higher.
Uploaded a new video of the blinking lights from the server, with one drive that failed. The screenshot from eraseHDD shows that I would have thrown away/mechanically destroyed this disk instead of wiping it, as it showed >900 reallocated sectors. I will try --sync=0, but I'm not sure yet when I will have a deeper look into this (abyss? ;-) ). Interestingly enough, I find that the SMART-ERROR-flagged sde apparently performs completely normally, with its 2890 reallocated sectors. (But I do not trust this disk; it'll find its place in the graveyard soon.)
Yes, I can see why having reallocated sectors visible at nwipe's drive selection screen would be a useful feature to have. Something for 0.36. Yes, definitely an abyss, any changes to how nwipe writes to disc would require extensive testing. Not something I want to do until after SSD erase is implemented. I'll check out the new video.
wouldn't it be a kind of oddly satisfying video if I installed a webcam showing the blinking activity lights of the disks being wiped ;-)
a thought that came to me: what if we put the pre-flight check in an easily modifiable/adaptable Perl program (I mentioned before it could also be renamed) and use nwipe just for the wiping, as we did years ago? Btw, with only 7 disks now being actively wiped the LEDs blink very evenly; same for the spinning bar. Throughput is only 564 MB/s.
@PartialVolume: is there anything left I could contribute to the 0.36 release?
@ggruber You mean the 0.35 release; I don't think there is. I'm just updating CHANGELOG.md, then I'll publish the release. There will be more to do after I've published the 0.35 release, which will go into 0.36, such as
Shouldn't we declare smartmontools mandatory in the README?
I've just updated the README.md file with information about smartmontools and new features in 0.35.
Just one bug to fix before I release 0.35. I was just testing under Debian SID and nwipe runs OK as superuser, but if you are not superuser a message should be displayed saying you need to run nwipe as superuser; unfortunately it segfaults rather than displaying the message and exiting gracefully.
ok, figured out what the problem was: it segfaults because it cannot create /etc/nwipe/customers.csv when it's not running as root. Once you run it as root and the file gets created, it no longer segfaults when you don't run as root because customers.csv already exists. Should be a quick fix.
you mean something like the following?
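(The exact snippet isn't preserved in this thread; a hypothetical reconstruction of that kind of guard, with the message text as a placeholder, would be:)

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical reconstruction, not the snippet that was actually merged:
 * bail out with a message instead of segfaulting later when we are not
 * root and therefore cannot create /etc/nwipe/customers.csv. */
if( geteuid() != 0 )
{
    fprintf( stderr, "nwipe must be run as root, unable to create /etc/nwipe/customers.csv\n" );
    exit( EXIT_FAILURE );
}
```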
Thanks, just added your code snippet.
Originally posted by @PartialVolume in #426 (comment)
Yes, via the kernel (hwmon) would be the preferred method, which nwipe already supports, but it needs somebody with the SAS hardware who is proficient in C and prepared to learn about low-level access to SAS and the standards used. I'm hoping somebody who's looking for a project could add the code to hwmon to support SAS. The maintainer is happy to accept commits but doesn't have the hardware, or maybe the time, to do this himself. If anybody wants to get involved, the source for hwmon can be found here: https://github.com/torvalds/linux/blob/master/drivers/hwmon/hwmon.c
Until such time as hwmon supports SAS we may have to use smartctl or, alternatively, write a low-level function ourselves.