do not abort on write error, and, thousands separation #550
I remember the idea of restarting from the end, but just continuing on forward is simpler logic and I believe gives the same result. It also handles the case of multiple bad areas.
(Sent from mobile)
________________________________
From: Firminator
Sent: Wednesday, February 21, 2024 8:43:42 PM
Subject: Re: [martijnvanbrummelen/nwipe] do not abort on write error, and, thousands separation (Issue #550)
1. Sounds like a good improvement that PartialVolume has already been thinking about in the past, somewhere around here. Although it was more along the lines of: if bad sectors are encountered, stop the current wipe direction and start wiping from the end of the drive... or skip, say, 100 blocks and continue the wipe.
I'm all for it, and if you can offer coding help and collaborate with him, that would be even more awesome.
This is a real-world problem, since we usually wipe (soft) failing drives (as in: a SMART value for remapped sectors met a threshold and triggered a SMART warning, which then triggers an alert in whatever storage-system OS is in use) before returning the drives for warranty replacement.
I found the comment @ #497 (comment), so yeah, this was already on the boilerplate. I also found the old thread that had a different idea/approach with non-linear wiping @ #10.
@mdcato By all means please fork. I have a box full of disks that all fail in weird and wonderful ways to test your code. @Firminator is correct: my preference is that the first I/O error that occurs triggers a reverse wipe.

This whole discussion is also closely related to why, for sequentially writing a block device, the Linux disc caching is OK (but not great) for wiping a disc, yet is not what you want when dealing with discs that have I/O errors. Take a system with 16 GB of CPU memory and, as an example, start the wipe on one disc. Very quickly the memory fills up with about 12 GB of cached writes to the disc. We periodically flush the cache to detect an error; you can't detect the error on the write itself because it's not direct I/O. However, when we issue the fdatasync to check that the drive is working correctly, the fdatasync won't return until the entire 12 GB has been flushed to disc (hence why direct I/O would be faster, by about 5-20%).

Now say the fdatasync detects an I/O error and returns: we don't know the actual block that caused the I/O error, only that it was somewhere in the last 12 GB of data that was written. So we have to go back x blocks, to a block number we don't really know, to try to find where the bad block actually is, and perform a single 512-byte block write followed by an fdatasync to flush the block and detect the error. If that wasn't complicated enough, some drives don't even fail nicely and instead cause fdatasync to never return, which hangs the thread.

This is why I've always wanted to move away from using the Linux disc cache in nwipe and instead perform direct I/O with the disc, so that nwipe has total control over the disc access. With direct I/O, the disk write itself returns an error if the block write fails. Nwipe would write a block of 200K bytes (I think the ideal block size for speed was discussed in the past) and, if it failed, we would know exactly where to start the process of 512- or 4096-byte block writes to close in on and locate the bad sector.

I did start a direct I/O branch, which I've kept private as it needs more work, but it would take care of trying to get past bad blocks, either by writing a single block at a time until it got past the bad section or, preferably, by doing a reverse wipe, as this would wipe the bulk of the drive as fast as possible until it reached the bad sectors again, but from the other end.
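To make the difference concrete, here is a minimal, hypothetical C sketch of the direct I/O idea described above (O_DIRECT writes in large blocks, dropping to sector-sized writes around an error). It is not code from the private branch; the block sizes are illustrative and alignment/tail handling is glossed over.

```c
/*
 * Sketch only, not nwipe code: with O_DIRECT the page cache is bypassed, so a
 * failed write() reports an error at the offset just written, instead of
 * fdatasync() failing somewhere inside gigabytes of cached data.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BIG_BLOCK (256 * 1024)  /* large writes for throughput (illustrative size) */
#define SECTOR    4096          /* fall back to single logical blocks on error    */

static void wipe_forward(int fd, off_t dev_size)
{
    void *buf;
    if (posix_memalign(&buf, 4096, BIG_BLOCK) != 0)  /* O_DIRECT needs aligned buffers */
        return;
    memset(buf, 0, BIG_BLOCK);                        /* zero pattern, for simplicity */

    for (off_t off = 0; off < dev_size; ) {
        size_t want = (dev_size - off) < BIG_BLOCK ? (size_t)(dev_size - off) : BIG_BLOCK;

        if (pwrite(fd, buf, want, off) == (ssize_t)want) {
            off += want;                              /* fast path: big write succeeded */
            continue;
        }
        /* Big write failed: close in with sector-sized writes, log the bad
         * sectors, and keep going rather than aborting the whole wipe. */
        for (size_t done = 0; done < want; done += SECTOR) {
            if (pwrite(fd, buf, SECTOR, off + done) != SECTOR)
                fprintf(stderr, "bad sector near offset %lld\n",
                        (long long)(off + done));
        }
        off += want;
    }
    free(buf);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    off_t size = lseek(fd, 0, SEEK_END);  /* crude way to get the device size */
    wipe_forward(fd, size);
    fsync(fd);
    close(fd);
    return 0;
}
```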
Yes, ShredOS uses all recent libraries & tools.

An I/O error, if you continued wiping as much of the disc as possible, would also affect the verification, which would fail, and the PDF report, which should also show a failure. In addition, the PDF shows the actual number of bytes successfully wiped at least once; this value would need to be calculated correctly when either writing through the bad blocks or reverse wiping after an I/O error.

> [Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.

I do have one or two failed drives that fail within minutes of starting the wipe, which is really handy. It would be a pain if all the faulty test drives failed after x number of hours, but then you could simulate a failed block in code to test the code to a certain degree.
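As a purely illustrative aside on the "bytes successfully wiped at least once" figure mentioned above, the bookkeeping when a forward pass stops at an error and a reverse pass runs back toward it could look roughly like this (function and parameter names are hypothetical, not nwipe's):

```c
/* Illustration only: count bytes covered by a forward pass [0, forward_end)
 * plus a reverse pass [reverse_end, dev_size), minus sectors that still
 * failed inside the covered ranges. */
#include <stdint.h>
#include <stdio.h>

uint64_t bytes_wiped_once(uint64_t dev_size, uint64_t forward_end,
                          uint64_t reverse_end, uint64_t skipped)
{
    uint64_t covered;
    if (reverse_end <= forward_end)
        covered = dev_size;                          /* the two passes met or overlapped */
    else
        covered = forward_end + (dev_size - reverse_end);
    return covered - skipped;
}

int main(void)
{
    /* hypothetical 500 GB drive: forward pass stopped at 200 GB, reverse pass
     * got back down to 210 GB, 8 KiB of sectors were unwritable in between */
    uint64_t total = bytes_wiped_once(500000000000ULL, 200000000000ULL,
                                      210000000000ULL, 8192ULL);
    printf("%llu bytes wiped at least once\n", (unsigned long long)total);
    return 0;
}
```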
I've added three features to my project list; I'll give them a priority once I get through the priority zeros and ones.
With reverse wiping, what would happen if there are multiple non-consecutive bad sectors? Wouldn't this approach leave potentially good sectors in the middle not wiped?
No, the reverse wipe would continue to write through bad sectors up to the point where the forward wipe aborted. In practice, trying to write through bad blocks means single 512- or 4096-byte block writes, and writing through potentially bad blocks like that will drastically slow the transfer speed. So the write method would switch from, say, 100,000-byte writes for maximum transfer speed to single-block writes when it detects I/O errors, then back to large block writes once X number of blocks have transferred correctly. This would be experimental, so it may well change based on what we find happens in practice.

I would imagine that for those wiping hundreds or thousands of drives for resale, the slow-down in speed would be a waste of time, so, as currently happens, if the drive has I/O errors or reallocated sectors it is pulled and physically destroyed, as it would be unsuitable for resale or use. A forward wipe followed by a reverse wipe on error, with write-through of bad blocks, would probably work for somebody who just wants to wipe as much as possible but doesn't want to physically destroy the platters, just place the drive in electrical waste. Time is maybe not an issue for them.
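For illustration only, a hypothetical sketch of that mode-switching idea (large writes, single-sector writes after an error, back to large writes once a run of sectors succeeds). The constants and the try_write callback are invented for the example and are not nwipe APIs:

```c
/* Sketch of adaptive write sizing: FAST mode uses big writes; an I/O error
 * drops us into CAREFUL mode (sector at a time, logging bad sectors but not
 * aborting); a streak of good sectors switches back to FAST mode. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAST_BLOCK               (128 * 1024)
#define SECTOR                   512
#define GOOD_SECTORS_TO_RECOVER  256   /* tunable; purely illustrative */

/* Hypothetical callback: returns true if [off, off+len) was written OK. */
typedef bool (*try_write_fn)(uint64_t off, size_t len, void *ctx);

uint64_t adaptive_wipe(uint64_t dev_size, try_write_fn try_write, void *ctx)
{
    uint64_t off = 0, bad_sectors = 0, good_streak = 0;
    bool careful = false;              /* false = FAST mode, true = sector-at-a-time */

    while (off < dev_size) {
        if (!careful) {
            size_t len = (dev_size - off) < FAST_BLOCK ? (size_t)(dev_size - off) : FAST_BLOCK;
            if (try_write(off, len, ctx)) {
                off += len;
            } else {
                careful = true;        /* error: re-try this region sector by sector */
                good_streak = 0;
            }
        } else {
            size_t len = (dev_size - off) < SECTOR ? (size_t)(dev_size - off) : SECTOR;
            if (try_write(off, len, ctx)) {
                good_streak++;
            } else {
                bad_sectors++;         /* log it, but keep wiping past the bad spot */
                good_streak = 0;
            }
            off += len;
            if (good_streak >= GOOD_SECTORS_TO_RECOVER)
                careful = false;       /* drive looks healthy again: speed back up */
        }
    }
    return bad_sectors;
}
```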
@PartialVolume,
I've embarked on some changes but wanted to get your thoughts before forking.
I don't think the above ideas are too controversial; however, I haven't looked at the build environment for ShredOS in regard to #2 to see whether you're using a different lib (for smaller size?) that may not be up to date (i.e. older than POSIX.1-2008 or the Single UNIX Specification) and so doesn't support the ' (apostrophe) flag in the format string. I'm assuming you're using recent/latest tools, but it's worth asking.
At this stage, I'm only changing calls to the log to use the "%'llu" format specifier, as the nwipe GUI space is more restricted and you already have large numbers converted to, for example, "10 T". The log needs a more precise number if you're trying to determine over how large an area the errors are occurring.
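For reference, a tiny standalone sketch of what the thousands-grouping flag involves; %'llu only produces separators once a locale with grouping information has been selected via setlocale() (the locale and the example value are just illustrations):

```c
/* The ' (apostrophe) printf flag is a POSIX/SUS extension: it groups digits
 * according to the current LC_NUMERIC locale. In the default "C" locale it
 * prints no separators at all. */
#include <locale.h>
#include <stdio.h>

int main(void)
{
    unsigned long long bytes = 4000787030016ULL;   /* e.g. a 4 TB drive */

    setlocale(LC_NUMERIC, "");   /* pick up the user's locale, e.g. en_US.UTF-8 */

    printf("wiped %'llu bytes\n", bytes);   /* -> "wiped 4,000,787,030,016 bytes" */
    return 0;
}
```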
I have not explored whether these changes, especially no longer treating a write error as fatal, would affect the PDF report.
Let me know if you have suggestions/reservations/exclamations/cautions/etc on this experiment.
[Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.