no temperature for SAS drives #497

Closed
ggruber opened this issue Oct 5, 2023 · 115 comments

@ggruber
Contributor

ggruber commented Oct 5, 2023

Originally posted by @PartialVolume in #426 (comment)

Yes, via the kernel (hwmon) would be the preferred method, which nwipe already supports, but it needs somebody with the SAS hardware who is proficient in C and prepared to learn about low-level access to SAS and the standards used. I'm hoping somebody that's looking for a project could add the code to hwmon to support SAS. The maintainer is happy to accept commits but doesn't have the hardware, or maybe the time, to do this himself. If anybody wants to get involved, the source for hwmon can be found here: https://github.com/torvalds/linux/blob/master/drivers/hwmon/hwmon.c

Until such time that hwmon supports SAS we may have to use smartctl or alternatively write a low level function ourselves.

@ggruber
Contributor Author

ggruber commented Oct 5, 2023

After having fixed the bustype detection I looked for a way to get the drive temperature (and only the drive temperature).
smartctl doesn't seem to have an option to get just the temperature.
But I found hddtemp, which doesn't give me the temps for my SATA drives but does for all the different SAS drives.
Unfortunately the program never left beta status, and bookworm doesn't have it in the repos anymore.
So maybe its code gives us a way to include an alternative way of fetching the temperature for SAS drives.

And another question: the interactive display of nwipe seems to refresh quite often, and the machine is already loaded by the disks being wiped.
Polling the drive temperature via SMART at maximum speed doesn't seem like the best idea to me.
Or is there already a mechanism that limits the screen refresh to, e.g., a maximum of 5 times per second?

Especially when thinking about getting the drive temperature via SMART, it seems useful to me to reduce the rate at which temperatures are polled, as they do not tend to change within milliseconds.
Or am I wrong about this?

@PartialVolume
Collaborator

PartialVolume commented Oct 6, 2023

Yes, it may well be worth looking at the hddtemp code for ideas.

Regarding screen refresh: screen refresh and keyboard response are tied together in nwipe. The screen refreshes and the keyboard is checked for a key hit no faster than every 250ms. Any faster is unnecessary; any slower and the user will notice the keyboard feeling less responsive. This is accomplished by these three lines that you will see in most GUI functions.

            timeout( GETCH_BLOCK_MS );  // block getch() for ideally about 250ms.
            keystroke = getch();  // Get user input.
            timeout( -1 );  // Switch back to blocking mode.

However, the temperature updates, although inside these loops, happen far less frequently.

If you look in gui.c at the compute_stats function you will find the if statement shown below. This if statement limits how often nwipe obtains the temperature values from hwmon (or from smartctl) to once every 60 seconds. So calling smartctl (if we used that) won't have the same impact on resources, as we are not reading temperatures at screen/keyboard refresh speeds.

       /* Read the drive temperature values */
        if( nwipe_time_now > ( c[i]->temp1_time + 60 ) )
        {
            nwipe_update_temperature( c[i] );
        }

The temperature code is quite complicated as it also handles the temperature limits of the device and changes the colour and blink rate depending on whether the critical upper and lower temperature limits have been reached.

However, it would be relatively easy to add support for smartctl or hddtemp by adding extra code in temperature.c; you wouldn't need to worry about the GUI code.

@PartialVolume
Collaborator

PartialVolume commented Oct 6, 2023

Basically you just need a function that obtains the drive temperatures from SAS and updates the relevant temperature fields in the drive context structure for each drive. This function would be called if the bus type is SAS and the temperature field in the drive context is null, i.e. it has not been found by hwmon. The SAS function would be called from within the nwipe_update_temperature function in temperature.c.
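For illustration only, a minimal sketch of how that dispatch inside nwipe_update_temperature might look; the field names (temp1_hwmon_path, device_bus, device_name, temp1_input) and the two helper functions are assumptions, not nwipe's actual context structure or API:

/* Hypothetical sketch only - field and helper names are illustrative. */
void nwipe_update_temperature( nwipe_context_t* c )
{
    /* hwmon found this drive at init? Keep using the existing hwmon read. */
    if( c->temp1_hwmon_path != NULL )
    {
        nwipe_update_temperature_from_hwmon( c );
        return;
    }

    /* Otherwise, for SAS drives, fall back to a direct SCSI query. */
    if( c->device_bus == NWIPE_DEVICE_SAS )
    {
        int temp_celsius;

        if( sas_read_temperature( c->device_name, &temp_celsius ) == 0 )
        {
            c->temp1_input = temp_celsius;  /* current temperature in degrees C */
        }
    }
}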

@ggruber
Contributor Author

ggruber commented Oct 6, 2023

tnx for that great reply.
I will try to extract the SAS temperature code from hddtemp (it's in two source files) and integrate it into nwipe. As it is also GPLv2 there should be no real issues with the license.
For the docs I suggest adding smartmontools as a mandatory required package.

@PartialVolume
Collaborator

I was just looking at the hddtemp code https://github.com/guzu/hddtemp/tree/master/src

I think maybe it's easier to use smartctl. We already use smartctl for various things and check that it exists on the system prior to using it. We note in the logs if it's missing.

In the nwipe code there are already examples of obtaining the output from smartctl. Looking at hddtemp, it seems like a steep learning curve to extract the relevant bits we need, and incorporating all of its code into nwipe looks like too much unnecessary code.

So I'd prefer to go with smartctl unless you can write a succinct low-level function to access the SMART data directly from the drive.
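For context, a minimal sketch of what the smartctl approach could look like, spawning smartctl and scraping its output; the helper name is made up, and the "Current Drive Temperature" line is what smartctl prints for SAS/SCSI drives, while ATA drives would need attribute 194/190 parsing instead:

#include <stdio.h>

/* Illustrative only - not nwipe's existing smartctl handling.
 * Returns the temperature in degrees C, or -1 on failure.
 * Assumes smartctl is installed and in PATH. */
int smartctl_read_temperature( const char* device )
{
    char cmd[256];
    char line[256];
    int temp = -1;

    snprintf( cmd, sizeof( cmd ), "smartctl -A %s", device );

    FILE* fp = popen( cmd, "r" );
    if( fp == NULL )
        return -1;

    while( fgets( line, sizeof( line ), fp ) != NULL )
    {
        /* SAS/SCSI drives typically report e.g. "Current Drive Temperature:     33 C" */
        if( sscanf( line, "Current Drive Temperature: %d", &temp ) == 1 )
            break;
        /* For ATA drives the temperature is in attribute 194 (or 190) instead. */
    }

    pclose( fp );
    return temp;
}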

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

It's true that using smartctl would be easier to implement. But those HP SAS drives in my current server tend to produce a significant delay: time smartctl -A /dev/sdb shows 0.05s for an SSD, but 0.4s to 0.6s for HDDs and 2.7s(!) for those HP SAS HDDs. hddtemp for such an HP SAS HDD takes 0.2s, and for other SAS drives 0.005s.
Besides that, there is significant overhead in the fork/exec sequence it takes to run another program from nwipe. Fine for one-off calls, but in a loop I'd try to avoid it.

So I think it's worth the work to integrate the hddtemp source code.

@PartialVolume
Collaborator

I agree, those sorts of delays would be unacceptably long, especially the 2.7 seconds; bad enough when wiping one drive, let alone 28 simultaneously.

So hddtemp looks like a better option. What would be perfect would be a small low-level function that just accesses the temperature from the SMART data.

I wrote a low-level function that obtains HPA/DCO max sector information from the drive in much the same way hddtemp and smartctl do. I would expect a function written like this to extract the temperature to be the fastest by far, assuming the drives themselves don't have significant response delays while writing.

If you want to see the low-level drive access I use to obtain the HPA/DCO real max sectors, take a look in hpa_dco.c at the function nwipe_read_dco_real_max_sectors( char* device ); it may help in understanding some of hddtemp's code.

I believe SAS uses either the SCSI command set or the STP protocol. Maybe this link would be useful too: https://www.scsita.org/sas_and_serial_ata_tunneling_protocol_stp/

Here's the low-level function as an example. Basically it reads a 512-byte block using the SG_IO ioctl and then extracts the real max sectors from the contents of that block. The block format is defined in the ATA standards document, which is also where you find the SMART data format.

u64 nwipe_read_dco_real_max_sectors( char* device )
{
    /* This function sends a device configuration overlay identify command 0xB1 (dco-identify)
     * to the drive and extracts the real max sectors. The value is incremented by 1 and
     * then returned. We rely upon this function to determine real max sectors as there
     * is a bug in hdparm 9.60, including possibly earlier or later versions but which is
     * fixed in 9.65, that returns an incorrect (negative) value
     * for some drives that are possibly over a certain size.
     */

    /* TODO Add checks in case of failure, especially with recent drives that may not
     * support drive configuration overlay commands.
     */

#define LBA_SIZE 512
#define CMD_LEN 16
#define BLOCK_MAX 65535
#define LBA_MAX ( 1 << 30 )
#define SENSE_BUFFER_SIZE 32

    u64 nwipe_real_max_sectors;

    /* This command issues command 0xb1 (dco-identify) 15th byte */
    unsigned char cmd_blk[CMD_LEN] = { 0x85, 0x08, 0x0e, 0x00, 0xc2, 0, 0x01, 0, 0, 0, 0, 0, 0, 0x40, 0xb1, 0 };

    sg_io_hdr_t io_hdr;
    unsigned char buffer[LBA_SIZE];  // Received data block
    unsigned char sense_buffer[SENSE_BUFFER_SIZE];  // Sense data

    /* three characters represent one byte of sense data, i.e
     * two characters and a space "01 AE 67"
     */
    char sense_buffer_hex[( SENSE_BUFFER_SIZE * 3 ) + 1];

    int i, i2;  // index
    int fd;  // file descriptor

    if( ( fd = open( device, O_RDWR ) ) < 0 )
    {
        /* Unable to open device */
        return -1;
    }

    /******************************************
     * Initialise the sg header for reading the
     * device configuration overlay identify data
     */
    memset( &io_hdr, 0, sizeof( sg_io_hdr_t ) );
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof( cmd_blk );
    io_hdr.mx_sb_len = sizeof( sense_buffer );
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = LBA_SIZE;
    io_hdr.dxferp = buffer;
    io_hdr.cmdp = cmd_blk;
    io_hdr.sbp = sense_buffer;
    io_hdr.timeout = 20000;

    if( ioctl( fd, SG_IO, &io_hdr ) < 0 )
    {
        nwipe_log( NWIPE_LOG_ERROR, "IOCTL command failed retrieving DCO" );
        i2 = 0;
        for( i = 0, i2 = 0; i < SENSE_BUFFER_SIZE; i++, i2 += 3 )
        {
            /* IOCTL returned an error */
            snprintf( &sense_buffer_hex[i2], sizeof( sense_buffer_hex ), "%02x ", sense_buffer[i] );
        }
        sense_buffer_hex[i2] = 0;  // terminate string
        nwipe_log( NWIPE_LOG_DEBUG, "Sense buffer from failed DCO identify cmd:%s", sense_buffer_hex );
        close( fd );
        return -2;
    }

    /* Close the device */
    close( fd );

    /***************************************************************
     * Extract the real max sectors from the returned 512 byte block.
     * Assuming the first word/byte is 0. We extract the bytes & switch
     * the endian. Words 3-6(bytes 6-13) contain the max sector address
     */
    nwipe_real_max_sectors = (u64) ( (u64) buffer[13] << 56 ) | ( (u64) buffer[12] << 48 ) | ( (u64) buffer[11] << 40 )
        | ( (u64) buffer[10] << 32 ) | ( (u64) buffer[9] << 24 ) | ( (u64) buffer[8] << 16 ) | ( (u64) buffer[7] << 8 )
        | buffer[6];

    /* Don't really understand this but hdparm adds 1 to
     * the real max sectors too, counting zero as sector?
     * but only increment if it's already greater than zero
     */
    if( nwipe_real_max_sectors > 0 )
    {
        nwipe_real_max_sectors++;
    }

    nwipe_log(
        NWIPE_LOG_INFO, "func:nwipe_read_dco_real_max_sectors(), DCO real max sectors = %lli", nwipe_real_max_sectors );

    return nwipe_real_max_sectors;
}

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

wow, cool stuff.

I wrote a low level function that is used to obtain hpa/dco max sector information from the drive in much the same way
hddtemp and smartctl do. I would expect such a function written like this to extract temperature would be the fastest by far,
assuming the drives themselves don't have significant response delays while writing.

Well, finally we'll see what happens when we test the code on my system. With these lovely HP HDDs ;-)

Regarding your low-level suggestions:
I dug into the code tonight, and I'm currently thinking top-down:
I introduced a stub for switching to an alternate SCSI temperature reading in temperature.c.
But I found that it would be useful to also implement a kind of init function that reads the trip temperatures and sets them as temp1_crit.
That would give your temperature watching during erasure a little help.

I do not yet fully understand why you go through the complete cycle of trying to read the whole set of "temp1_crit", "temp1_highest", "temp1_input", "temp1_lcrit", "temp1_lowest", "temp1_max", "temp1_min" whenever it's time to "read the temp", even if you already know the drive does not provide this data.
Next: detection of the hwmon paths could already write values to temp1_crit and temp1_lcrit, and you could save the read ops during erasure. (I still have to look up the meaning/difference of temp1_highest vs. temp1_max and temp1_lowest vs. temp1_min.)

And: we have the nwipe_context_t_ structure; I suggest adding a flag has_hwmon_data, which would at least allow us to shrink the log file ;-)

And last but not least: hddtemp knows/differentiates (at least in the header) between ATA and SATA. Could this be the same as IDE and ATA in nwipe? (I find it just a little bit confusing to read ATA instead of the expected SATA in the drive overview.)

@PartialVolume
Collaborator

PartialVolume commented Oct 7, 2023

I do not yet understand fully, why you do the complete cycle of trying to read the whole set of "temp1_crit", "temp1_highest", "temp1_input", "temp1_lcrit", "temp1_lowest", "temp1_max", "temp1_min" whenever it's time to "read the temp". Even if you know already the drive does not provide these data.
Next: Detection of the hwmon paths could write values to temp1_crit, temp1_lcrit, and you could save the read ops during erasure. (Have to look for the meaning/difference of temp1_highest vs. temp1_max resp. temp1_lowest vs temp1_min yet.) And: we have the nwipe_context_t_ structure: I suggest to add a flag has_hwmon_data which would at least allow to shrink the log file ;-)

Yes, the temperature/limits logging in verbose mode is excessive to say the least. It's left over from when I wrote the code; I have been meaning to do something about it but never got around to tidying it up.
If you fancy cleaning that up please go ahead, and also make the addition to the context structure as you suggest.

And last, but not least: hddtemp knows/differentiates (at least in the header) between ATA und SATA. Could this be the same as IDE and ATA in nwipe? (Find it just a very little bit confusing to read ATA instead of the expected SATA in the drive overview.)

I think the problem here is that libparted can't tell the difference between SATA and IDE. It identifies them both as ATA, as that's the command set they all use. I don't know whether there is a way of differentiating between SATA and IDE. Any suggestions for improving this are welcome, although I don't want to just change ATA to SATA, as IDE drives would then be called SATA.

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

ok, so we go ahead with ATA.

And I'll try separating this into an init function that fills the limit values, and the "normal" readings during wiping.

Stay tuned ;-)

@PartialVolume
Collaborator

There is a function in temperature.c called nwipe_init_temperature whose job is to determine which drives support temperatures and where in the hwmon directory structure the temperature data can be found. The temperature limit values are also initialised here.

I'm not sure what your init function is going to do but maybe it could fit inside this function? This function is called once at startup.

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

exactly there it will be dropped in :-)

@PartialVolume
Collaborator

PartialVolume commented Oct 7, 2023

@ggruber Just so we don't end up duplicating effort, I'm working on the PDF enable/disable and PDF preview config options on the GUI config page. The values of these will be loaded at startup from nwipe.conf but if the user specifies PDF enable/disable on the command line that will override the value in nwipe.conf.

[screenshot: GUI config page showing the PDF enable/disable and preview options]

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

fine for me

@ggruber
Contributor Author

ggruber commented Oct 7, 2023

Just found that sg3-utils contains a shell wrapper script, scsi_temperature, which basically calls sg_logs -t <device>.
I will possibly use it to get the trip temperature, aka reference temperature, which is temp1_crit (imho).
Reading this trip temperature gave me headaches as it is not included in hddtemp.
But I also found the SCSI Commands Reference Manual (from Seagate), which contains the information about the temperature log page that I was missing.

OK, I now have the trip temperature from within the hddtemp code.
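For anyone following along, here is a rough, untested sketch of reading the SCSI Temperature log page (0x0D) directly with a LOG SENSE command via SG_IO, which is essentially what sg_logs -t and hddtemp do. The byte offsets assume the two parameters appear in the order given in the SCSI Commands Reference Manual (0x0000 current temperature, 0x0001 reference/trip temperature); a robust version would walk the parameter list instead of hard-coding offsets:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <scsi/sg.h>
#include <sys/ioctl.h>

/* Sketch only: returns 0 on success and fills in the current and trip
 * temperatures in degrees C (0xff means "not available"). */
int scsi_read_temperature( const char* device, int* temp, int* trip_temp )
{
    /* LOG SENSE (0x4d), PC=01b (current cumulative), page 0x0d, 255-byte allocation */
    unsigned char cdb[10] = { 0x4d, 0, 0x40 | 0x0d, 0, 0, 0, 0, 0, 0xff, 0 };
    unsigned char buf[255];
    unsigned char sense[32];
    sg_io_hdr_t io_hdr;
    int fd;

    if( ( fd = open( device, O_RDONLY | O_NONBLOCK ) ) < 0 )
        return -1;

    memset( &io_hdr, 0, sizeof( io_hdr ) );
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof( cdb );
    io_hdr.mx_sb_len = sizeof( sense );
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = sizeof( buf );
    io_hdr.dxferp = buf;
    io_hdr.cmdp = cdb;
    io_hdr.sbp = sense;
    io_hdr.timeout = 20000;

    if( ioctl( fd, SG_IO, &io_hdr ) < 0 )
    {
        close( fd );
        return -1;
    }
    close( fd );

    /* 4-byte page header, then 6-byte parameters: 0x0000 = temperature,
     * 0x0001 = reference (trip) temperature. */
    *temp = buf[9];
    *trip_temp = buf[15];

    return 0;
}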

@ggruber
Contributor Author

ggruber commented Oct 10, 2023

Seems I have a first running version, with the hddtemp code included.

2023-10-10

@ggruber
Contributor Author

ggruber commented Oct 10, 2023

Is there a special reason to refresh the screen with permanent polling of the temperatures during disk selection?
From those lovely HP disks 4 to 5 readings are gathered per second, and selecting the disks is heavily delayed, almost unusable.

I will try to find and fix this. It seems the limit (get the temperature only once every 60 seconds) is not applied here.
Well, it appears to be applied in gui.c.
I left this as it is for now and added a check in temperature.c:

    int idx;
    int result;

+    time_t nwipe_time_now = time( NULL );
+    if( nwipe_time_now - c->temp1_time < 60 )
+    {
+        return;
+    }
    /* try to get temperatures from hwmon, standard */
    if( c->templ_has_hwmon_data == 1 )
    {

@PartialVolume
Collaborator

PartialVolume commented Oct 10, 2023

It's looking good. I'm getting an implicit declaration warning when I compile that needs fixing:

hddtemp_scsi/get_scsi_temp.c:80:9: warning: implicit declaration of function ‘scsi_get_temperature’; did you mean ‘nwipe_init_temperature’? [-Wimplicit-function-declaration]
   80 |     if( scsi_get_temperature( dsk ) == GETTEMP_SUCCESS )
      |         ^~~~~~~~~~~~~~~~~~~~
      |         nwipe_init_temperature

Is there a special reason to refresh the screen with permanent polling of the temperatures during disk selection?
From those lovely HP disk 4 to 5 are gathered per second, and to select the disks is heavily delayed, almost unusable.

At the time I think it was so that if the drive was too hot, nearing critical, you could see the temperature change more quickly in drive selection while you attempt to cool the drive down before starting a wipe. As hwmon is relatively quick it didn't matter, but in practice there is no real reason why drive selection shouldn't poll less frequently, at once every sixty seconds like drive wiping does.

@PartialVolume
Collaborator

PartialVolume commented Oct 10, 2023

I was just looking at the drive selection GUI code and there is this bit of code below that refreshes temperature every 1 second. The one second is important for the flashing HPA messages that appear when determining HPA/DCO drive status. So this code shouldn't be changed.

            /* Update the selection window every 1 second specifically
             * so that the drive temperatures are updated and also the line toggle that
             * occurs with the HPA status and the drive size & temperature.
             */
            if( time( NULL ) > ( temperature_check_time + 1 ) )
            {
                temperature_check_time = time( NULL );
                validkeyhit = 1;
            }

However ...

At line 828 of the master code, the following line exists

                /* Read the drive temperature values */
                nwipe_update_temperature( c[i + offset] );

This should really be in an if statement so that it updates every 60 seconds rather than every 1 second. Or even a new function that contains the if statement, called nwipe_temperature_update_60s, as this code is duplicated elsewhere.
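A minimal sketch of what that wrapper could look like (names follow the suggestion above; that nwipe_update_temperature itself refreshes c->temp1_time is an assumption):

/* Illustrative sketch of the suggested wrapper - not the actual nwipe code. */
void nwipe_temperature_update_60s( nwipe_context_t* c )
{
    time_t nwipe_time_now = time( NULL );

    /* Only query hwmon/smartctl/SG_IO once per drive every 60 seconds,
     * regardless of how fast the calling GUI loop runs. */
    if( nwipe_time_now > ( c->temp1_time + 60 ) )
    {
        nwipe_update_temperature( c );  /* assumed to refresh c->temp1_time */
    }
}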

I'm surprised there is much of a delay when reading temperatures on SAS drives. How many milliseconds is it taking to read the temperature on each SAS drive?

@PartialVolume
Collaborator

To measure the execution time of the existing nwipe_update_temperature() function I placed some timing code before and after the function and had it print the execution time to the log.

                // NOTE temporary timing code
                clock_t t;
                t = clock();

                /* Read the drive temperature values */
                nwipe_update_temperature( c[i + offset] );

                // NOTE temporary timing code
                t = clock() - t;
                double time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds
                nwipe_log( NWIPE_LOG_INFO, "nwipe_update_temperature() took %f seconds", time_taken);

I ran nwipe and left it at the drive selection screen then aborted it after 10 seconds. Here are the function execution times.

[2023/10/10 15:13:31]    info: nwipe_update_temperature() took 0.000091 seconds
[2023/10/10 15:13:33]    info: nwipe_update_temperature() took 0.000231 seconds
[2023/10/10 15:13:35]    info: nwipe_update_temperature() took 0.000316 seconds
[2023/10/10 15:13:37]    info: nwipe_update_temperature() took 0.000172 seconds
[2023/10/10 15:13:39]    info: nwipe_update_temperature() took 0.000204 seconds
[2023/10/10 15:13:41]    info: nwipe_update_temperature() took 0.000216 seconds

So on my 20-core i7 we are talking about anything from 91µs to 316µs; however, I have only one drive and it's an NVMe WD Blue SN570 1TB.

Would be interesting if you ran the same test on your multiple SAS drives.

@ggruber
Contributor Author

ggruber commented Oct 10, 2023

time smartctl -A /dev/sdb shows 0.05s for a SSD drive, but 0.4s, 0.6s for HDDs and 2.7s(!) for those HP SAS HDDs. hddtemp for such a HP SAS HDD takes 0.2s, for other SAS drives 0.005s.

Only those old HP drives cause real pain.

I'll interrupt the running nwipe and put some other disks in the box, among them also NVMe.

I'll integrate your suggestions and tidy up the code a bit.
gimme two hours, please.
For your amusement you could download https://support.edv2g.de/owncloud/index.php/s/a2mpN5edRHY9TWA

@PartialVolume
Collaborator

Nice 👍 I like to see lots of flashing lights, you know things are happening. Could also be because I like old 50s & 60s science fiction movies where the computers always had an abundance of flashing lights 🤣

It also demonstrates a 'feature' of nwipe, where it pauses when it's writing as opposed to syncing. This pause is more noticeable when it's doing a PRNG pass. I always thought this pause could be reduced by having the PRNG data calculated in a free-running thread writing into a buffer, so that when the next PRNG block is needed it has already been calculated and the wipe thread doesn't have to wait for it. Each wipe thread would need to create a PRNG thread, so for a 20-drive wipe, each drive having its own thread, there would be 20 PRNG threads. So nwipe would be running 40 wipe/PRNG threads plus the GUI status thread.

This is one of those things that has been talked about in the past, but we never got round to optimising that part of the wipe thread.

It's one of those things that one day we will get round to doing, but adding ATA secure erase is the next priority, for me at least.
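For the record, a minimal sketch of the free-running PRNG producer idea described above, using a two-buffer ping-pong protected by a mutex and condition variables. All names are made up, prng_fill() stands in for whatever PRNG nwipe would actually call, and the block size is arbitrary:

#include <pthread.h>
#include <stddef.h>

#define PRNG_BLOCK_SIZE ( 1024 * 1024 )

typedef struct
{
    unsigned char buf[2][PRNG_BLOCK_SIZE];
    int ready[2];  /* 1 = filled by the producer, waiting to be written */
    int stop;
    pthread_mutex_t lock;
    pthread_cond_t filled;
    pthread_cond_t drained;
} prng_pipe_t;

extern void prng_fill( unsigned char* buf, size_t len );  /* placeholder for the real PRNG */

/* Producer thread: keeps refilling whichever buffer has been drained. */
static void* prng_producer( void* arg )
{
    prng_pipe_t* p = arg;
    int idx = 0;

    pthread_mutex_lock( &p->lock );
    while( !p->stop )
    {
        while( p->ready[idx] && !p->stop )
            pthread_cond_wait( &p->drained, &p->lock );
        if( p->stop )
            break;

        pthread_mutex_unlock( &p->lock );
        prng_fill( p->buf[idx], PRNG_BLOCK_SIZE );  /* expensive work, lock released */
        pthread_mutex_lock( &p->lock );

        p->ready[idx] = 1;
        pthread_cond_signal( &p->filled );
        idx ^= 1;
    }
    pthread_mutex_unlock( &p->lock );
    return NULL;
}

/* Wipe thread side: block until buffer idx is ready, then hand it back. */
static unsigned char* prng_take( prng_pipe_t* p, int idx )
{
    pthread_mutex_lock( &p->lock );
    while( !p->ready[idx] )
        pthread_cond_wait( &p->filled, &p->lock );
    pthread_mutex_unlock( &p->lock );
    return p->buf[idx];
}

static void prng_release( prng_pipe_t* p, int idx )
{
    pthread_mutex_lock( &p->lock );
    p->ready[idx] = 0;
    pthread_cond_signal( &p->drained );
    pthread_mutex_unlock( &p->lock );
}

The wipe thread would alternate prng_take( p, 0 ) / prng_take( p, 1 ), write the block to disc, then prng_release() it; whether a two-buffer ping-pong or a deeper ring buffer is worth the extra threads would need profiling.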

@ggruber
Contributor Author

ggruber commented Oct 10, 2023

Its one of those things that one day we will get round to doing but for me adding ATA secure erase is the next priority for me at least.

That's why I used the chance, with the arrival of 12 SATA disks which I want to wipe, to change some more disks.
In addition to the 2 Intel SSDs and the 2 SAS SSDs there are now 3 SATA consumer SSDs in the box, plus 2 NVMe SATA and 2 NVMe PCIe.

I'm having some trouble getting the box up, as the NVMe drives are not blank but contain a former version of the OS which is now installed. ZFS doesn't like finding identically named pools, especially not from the initramfs.

@ggruber
Contributor Author

ggruber commented Oct 10, 2023

Here is a logfile with the results from your suggested time measurement code included.
Please note that there are only 3 of those super-slow HP drives left.
And: it seems to me there is a difference between the update_temperature duration in idle mode and under the load of wiping.

grep nwipe_update_temperature nwipe4.log | awk '{print $6 "\t" $9}' | sort -n -u | less
gave me interesting views

nwipe4.log.gz

@PartialVolume
Collaborator

PartialVolume commented Oct 11, 2023

Those timings look pretty good to me, the worst being 1.360 milliseconds; in percentage terms that means 0.136% of every second is taken up obtaining the drive temperature. That assumes writing to the drive stops during that period, which I don't think it does, but there is always that possibility. Extrapolating that out over a full wipe, say 4 hours (240 minutes or 14,400 seconds), the wipe might take an extra 14400 / 100 * 0.136 = 19.58 seconds to complete, and that's checking the temperature every second.

Would you agree with that?

If I've not made a big mistake in my calculations, then I would say checking the temperature at 1-second intervals is fine; however, is that what happens in practice? On your hardware, if you wiped a single drive to completion twice, once with a 1-second interval between temperature checks and then again with 60 seconds between checks, is there much difference?

@PartialVolume
Collaborator

Also, the temperature checking operates in a different thread from the actual disk wipe, so on a multi-core processor it should not delay the wipe thread too much.

@PartialVolume
Collaborator

PartialVolume commented Oct 11, 2023

Unrelated to the temperature, but thinking about the pause I could see when the drives were writing rather than syncing: when you have 28 drives wiping using the PRNG method, what is htop -d 1 reporting for core usage? I was wondering whether the CPU cores are hitting 100% when it's calculating the next PRNG block.

@PartialVolume
Collaborator

PartialVolume commented Oct 11, 2023

I think you are correct about having the SMART data as a single column rather than two, so it occupies pages two and three. I just did a wipe on a disk and the left column text overlaps with the right column in places, and the right column gets cropped on the right.

So yes, the two columns need to be changed to a single column.

[screenshot: report page two SMART data showing overlapping and cropped text]

@ggruber
Contributor Author

ggruber commented Oct 12, 2023

That should go to the PDF issue, and it fits with my suggestions.

I'll continue with the timing measurements and will integrate temperature min/max saving for SAS drives.

And I prepared a cleanup function that closes open fds and frees malloc'ed memory (for a single SAS drive). Still looking for the best place to call it.

@ggruber
Contributor Author

ggruber commented Oct 12, 2023

Another GUI question: the firmware of my LSI/Avago/Broadcom HBA distinguishes between SAS, SATA, SAS-SSD and SATA-SSD.
I'd find it nice, and also kind of helpful during drive selection in the GUI, if I could spot the SSDs, e.g. this way.
(But I know about the fragility of the layout.)

@PartialVolume
Collaborator

Yes, I remember now: on some drives you have to set the two values reported by sudo hdparm -N /dev/sdh to equal one another using the sudo hdparm -N p468862129 /dev/sdh command. Then power cycle the drive; it then accepts a dco-restore without the I/O error.

And for some other makes/models the dco-restore may not be necessary. Other drives don't care about having the HPA set first; you can just do a dco-restore.

It will certainly be a challenge to automate this in nwipe to behave consistently for any make/model of drive. Quite possible, but a challenge.

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

Can you pull those two drives out and in. Thanks.

done.

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

It will certainly be a challenge to automate this in nwipe to behave consistently for any make/model of drive. Quite possible, but a challenge.

That could hold us up for quite a while.
I suggest implementing one working solution for certain drives, maybe not even in the upcoming release, and enhancing it on demand.

@PartialVolume
Collaborator

PartialVolume commented Oct 20, 2023

Resetting the HPA didn't work; it's back to 468862128 blocks rather than 468862129.

Could just be a 'feature' of this drive. If it behaves the same on other hardware, I guess the drive's implementation of HPA/DCO isn't great. Perhaps it hides a block for its own purposes, whatever that might be.

Anyway, the code in nwipe is detecting the hidden block correctly, so I'll leave it at that.

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

what about SSD smart erase? is it on the roadmap for the next release?

@PartialVolume
Collaborator

PartialVolume commented Oct 20, 2023

I suggest implementing one working solution for certain drives, maybe even not that in the upcoming release. And enhance it on demand.

I think we'll freeze any new features so I can get this release out. I'm happy with what we've done, so I think we are ready for release v0.35 to be published.

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

Should I have a (quick) look for a way to sort the disk list?

And what happened to the temperature reading thread?

@PartialVolume
Collaborator

what about SSD smart erase? is it on the roadmap for the next release?

Yes, that is at the top of the list for the v0.36 release

And what happened to the temperature reading thread?

I nearly forgot about that! Thanks for reminding me. Yes, that has to be in this release (0.35).

@PartialVolume
Collaborator

I'll work on the temp thread if you want to take a look at the sorting

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

The sorted device list is in my repo now; I would be glad if someone would test it.

@PartialVolume btw, for testing purposes I added another disk, so we have sdaa. Please do not wipe it, but enjoy its presence in the list ;-)

@PartialVolume
Collaborator

Tested. I built it on the server and compared the original listing with the sorted one. All looks good. I randomly picked a drive, /dev/sdq, checked its contents (0x74), then started a wipe on just that drive and aborted it. I rechecked that the contents of /dev/sdq were now 0x00, which they were.

Looks good, as far as I'm concerned you can do a PR on that.

Nice bit of code. Looks like you are sorting the pointers to the contexts based on drive name?

I don't think I'm going to be quite as quick as you with the temperature thread. If I don't get it completed this evening, it may be a couple of days as I'm very busy over the weekend.

@ggruber
Contributor Author

ggruber commented Oct 20, 2023

tnx for testing/confirmation, PR started

Nice bit of code. Looks like you are sorting the pointers to the contexts based on drive name?

That's what I did
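For anyone reading along, a minimal sketch of that approach. The array and field names are assumptions rather than the actual PR, and note that a plain strcmp would place sdaa between sda and sdb, so the comparison below sorts shorter names first:

#include <stdlib.h>
#include <string.h>

/* Hedged sketch - identifiers are illustrative, not the code in the PR. */
static int compare_context_by_name( const void* a, const void* b )
{
    const nwipe_context_t* ca = *(nwipe_context_t* const*) a;
    const nwipe_context_t* cb = *(nwipe_context_t* const*) b;

    /* Shorter names first so /dev/sdz sorts before /dev/sdaa, then alphabetical. */
    size_t la = strlen( ca->device_name );
    size_t lb = strlen( cb->device_name );
    if( la != lb )
        return ( la < lb ) ? -1 : 1;
    return strcmp( ca->device_name, cb->device_name );
}

static void sort_contexts( nwipe_context_t** contexts, size_t count )
{
    qsort( contexts, count, sizeof( nwipe_context_t* ), compare_context_by_name );
}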

@PartialVolume
Collaborator

Temperature thread and sort committed. Looks good; it looks like those lags in the GUI have gone. Could you check it out? Thanks.

nwipe-2023-10-22_01.00.49.mp4

@ggruber
Contributor Author

ggruber commented Oct 22, 2023

Looks impressive in comparison to the behaviour before. Will have to check the blinking ;-)

I pimped eraseHDD a bit in the meantime; one could think about renaming it to "Coarse Disk Analyser", with information gathering as its main function and an optional (exchangeable or configurable) wiping backend.

@PartialVolume
Collaborator

PartialVolume commented Oct 23, 2023

If by blinking you mean the disk activity light blinking, this is a consequence of using fdatasync periodically to check that the data is being written without any errors.

When an fdatasync is issued, the Linux disc cache is forced to flush for the given drive; this is done to detect I/O errors. If fdatasync is disabled by setting --sync=0 in the options you will probably find the disk activity light stays on permanently, i.e. it's continually writing.

However, if --sync=0 then nwipe can't detect an I/O error, so the only way you may know a drive has failed is that the throughput for that particular drive will start to drop away.

You can increase the period between fdatasyncs; however, tests done in the past showed that the default setting, I think 100,000 writes, was a good compromise.
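To illustrate the pattern being described (purely a sketch, not nwipe's actual wipe loop; the names and the sync interval are placeholders):

#include <stddef.h>
#include <unistd.h>

/* Write blocks_total blocks, flushing the kernel cache every sync_interval
 * writes so that a failing drive surfaces an I/O error promptly.
 * Returns 0 on success, -1 on a write or sync error. */
int wipe_pass( int fd, const unsigned char* block, size_t block_size,
               unsigned long long blocks_total, unsigned long sync_interval )
{
    unsigned long long i;

    for( i = 0; i < blocks_total; i++ )
    {
        if( write( fd, block, block_size ) != (ssize_t) block_size )
            return -1;  /* short write or error reported by the cache layer */

        /* Periodic flush: with --sync=0 this would simply be skipped. */
        if( sync_interval && ( i % sync_interval ) == ( sync_interval - 1 ) )
        {
            if( fdatasync( fd ) != 0 )
                return -1;  /* I/O error surfaced by the flush */
        }
    }

    return fdatasync( fd ) == 0 ? 0 : -1;  /* final flush */
}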

An alternative to writing via the Linux disc cache is to use dual-buffered direct I/O, where no Linux cache is involved and fdatasyncs are unnecessary. Nwipe would be immediately informed of an I/O error and there would be no momentary pause in writing while the cache is flushed by fdatasync.

This has been discussed a lot in the past if you look back through the issues.

Dual-buffered direct I/O is something I would like to add to nwipe, but other priorities always mean I never get around to working on it.

If that's something you fancy working on, please read through all the past discussions about this. If you can't find them let me know and I'll find the links.

In the meantime try wiping discs with --sync=0; you may find there is maybe a 5% increase in speed, but if any discs have an I/O error nwipe may hang. One fix for that hang may be for nwipe to monitor throughput and terminate a drive's thread if its throughput is ridiculously low.

@ggruber
Contributor Author

ggruber commented Oct 23, 2023

I see. Thanks for explaining this. For just 5% better performance: well, I'd prioritise smart SSD wiping much higher.

@ggruber
Contributor Author

ggruber commented Oct 24, 2023

Uploaded a new video of the blinking lights from the server, with one drive that failed. The screenshot from eraseHDD shows that I would have thrown away/mechanically destroyed this disk instead of wiping it, as it showed >900 reallocated sectors.
The drive had 922 reallocated sectors before the start of nwipe and has 925 now. So the reallocated sector count was a sufficient indicator to watch when we wiped a few hundred disks some years ago.

I will try --sync=0, but I'm not sure yet when I will have a deeper look into this (abyss? ;-) ).

Interestingly enough, I find that the SMART-ERROR-flagged sde apparently performs completely normally, with its 2890 reallocated sectors. (But I do not trust this disk; it'll find its place in the graveyard soon.)

@PartialVolume
Collaborator

Yes, I can see why having reallocated sectors visible on nwipe's drive selection screen would be a useful feature to have. Something for 0.36.

Yes, definitely an abyss; any changes to how nwipe writes to disc would require extensive testing. Not something I want to do until after SSD erase is implemented.

I'll check out the new video.

@ggruber
Contributor Author

ggruber commented Oct 24, 2023

Wouldn't it be a kind of oddly satisfying video if I installed a webcam showing the blinking activity lights of the disks being wiped ;-)
Make a webpage around the video box with a selection window for which disks should be wiped -> insert coins to see the show ;-)

@ggruber
Contributor Author

ggruber commented Oct 24, 2023

Yes, I can see why having reallocated sectors visible at nwipes drive selection screen would be a useful feature to have. Something for 0.36.

A thought that came to me: what if we keep the pre-flight check in an easily modifiable/adaptable Perl program (I mentioned before it could also be renamed) and use nwipe just for the wiping, as we did years ago?
I find it quite useful to switch to another screen, have a look at the eraseHDD output, and then switch back to the nwipe screen.

btw, with only 7 disks now being actively wiped the LEDs blink very evenly. Same for the spinning bar. Throughput is only 564MB/s.

@ggruber
Contributor Author

ggruber commented Nov 1, 2023

@PartialVolume : is there anything left I could contribute to the 0.36 release?

@PartialVolume
Collaborator

PartialVolume commented Nov 1, 2023

@ggruber You mean the 0.35 release? I don't think there is. I'm just updating CHANGELOG.md, then I'll publish the release.

There will be more to do after I've published the 0.35 release, which will go into 0.36, such as

  • Secure Erase
  • Reverse wipe: instead of wiping from the start to the end of the disc, it wipes from the end to the start. Useful when you have a drive with bad sectors near the start (as is often the case) and you want to make sure as much of the workable part of the drive as possible is wiped.
  • Add default wipe methods to nwipe.conf, so that if you want nwipe to always start with PRNG or zeros it can be set by the user.
  • Add temperature support for USB devices, assuming an appropriate adapter is being used that supports ATA pass-through/SMART data retrieval.
  • Add an option that allows the user to create a PDF with only the first page, i.e. no SMART data. Something like PDF_content = Certificate or PDF_content = Certificate+Smart.
  • Check for mounted drives, with a warning in the GUI.
  • Any other issues mentioned by other users.
  • HPA/DCO reset.

@ggruber
Contributor Author

ggruber commented Nov 2, 2023

Shouldn't we declare smartmontools mandatory in the README?
It will be called from several pieces of code (devices.c, PDF creation, ...).

@PartialVolume
Collaborator

PartialVolume commented Nov 2, 2023

Shouldn't we declare smartmontools mandatory in the README? It will be called from several pieces of code (devices.c, PDF creation, ...).

I've just updated the README.md file with information about smartmontools and new features in 0.35

@PartialVolume
Collaborator

PartialVolume commented Nov 2, 2023

Just one bug to fix before I release 0.35. I was just testing under Debian Sid: nwipe runs OK as superuser, but if you are not superuser a message should be displayed saying you need to run nwipe as superuser; unfortunately it segfaults rather than displaying the message and exiting gracefully.

@PartialVolume
Collaborator

PartialVolume commented Nov 2, 2023

Just one bug to fix before I release 0.35. I was just testing under Debian Sid: nwipe runs OK as superuser, but if you are not superuser a message should be displayed saying you need to run nwipe as superuser; unfortunately it segfaults rather than displaying the message and exiting gracefully.

OK, figured out what the problem was: it segfaults because it cannot create /etc/nwipe/customers.csv when it's not running as root. Once you run it as root and the file gets created, it no longer segfaults when you don't run as root, because customers.csv already exists. Should be a quick fix.

@ggruber
Contributor Author

ggruber commented Nov 2, 2023

you mean something like

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main( int argc, char** argv )
{
    if( geteuid() != 0 )
    {
        printf( "nwipe must run with root permissions, which is not the case.\nAborting\n" );
        exit( 99 );
    }
    /* ... */
}

?

@PartialVolume
Collaborator

you mean something like

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main( int argc, char** argv )
{
    if( geteuid() != 0 )
    {
        printf( "nwipe must run with root permissions, which is not the case.\nAborting\n" );
        exit( 99 );
    }
    /* ... */
}

?

Thanks, just added your code snippet.
