do not abort on write error, and, thousands separation #550

Open
mdcato opened this issue Feb 22, 2024 · 7 comments

@mdcato

mdcato commented Feb 22, 2024

@PartialVolume,

I've embarked on some changes but wanted to get your thoughts before forking.

  1. I'm changing the logic in pass.c/nwipe_static_pass() (others later) to NOT abort on a write() or fdatasync() error. My thinking is that if the rest of the drive, after getting past the error-prone area(s), can be wiped, the drive will be that much more secure (even though it's likely to be physically destroyed). There's not much time lost since, most likely, other drives will continue being wiped in the same run. This could be controlled by a command line option if it's thought that not everyone would want this behavior change.
  2. I've also changed some of the calls to nwipe_perror() and nwipe_log() to show the current offset where the write() or fdatasync() failed, so the operator can know how large an area is affected and how many areas there are, and I have changed the printf formats to "%'llu" so that the long offset numbers are comma-separated. I've also changed those log calls to "Warning" instead of "Fatal" since the wipe will continue over the rest of the drive. The ' flag also requires including <locale.h> and calling setlocale( LC_ALL, "" ) in nwipe.c/main(); the thousands separator will thus be locale-specific (see the sketch below).
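
For reference, a minimal sketch (not nwipe code) of the locale-aware thousands grouping described in item 2, using the POSIX ' printf flag; the offset value is just an example:

```c
#include <locale.h>
#include <stdio.h>

int main( void )
{
    /* Pick up the environment's locale so the grouping character matches it. */
    setlocale( LC_ALL, "" );

    unsigned long long offset = 1234567890123ULL;

    /* With e.g. en_US.UTF-8 this prints 1,234,567,890,123;
     * in the default "C" locale no grouping is applied. */
    printf( "write error at offset %'llu\n", offset );
    return 0;
}
```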

I don't think the above ideas are too controversial; however, I haven't looked at the build environment for ShredOS in regard to #2 to see if you're using a different lib (for smaller size?) that may not be up-to-date (i.e. older than POSIX.1-2008 or the Single UNIX Specification) and so doesn't support the ' flag in the format string. I'm assuming you're using recent/latest tools, but it's worth asking.

At this stage, I'm only changing the calls to the log to use the "%'llu" format specifier, as the nwipe GUI space is more restricted and you already convert large numbers to "10 T", for example. The log needs a more precise number if you're trying to determine how large an area the errors occur in.

I have not explored whether these changes, especially not causing a fatal error, would affect the PDF report.

Let me know if you have suggestions/reservations/exclamations/cautions/etc on this experiment.

[Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.

@Firminator
Contributor

  1. Sounds like a good improvement that PartialVolume has already been thinking about in the past, somewhere here. Although it was more along the lines of: if bad sectors are encountered, stop the current wipe direction and start wiping from the end of the drive... or skip something like 100 blocks and continue the wipe.
    I'm all for it, and if you can offer coding help and collaborate with him that would be even more awesome.

This is a real-world problem, since we usually wipe (soft) failing drives (as in: the SMART value for remapped sectors has met a threshold and triggered a SMART warning, which then triggers an alert in whatever storage system OS is in use) before returning the drives for warranty replacement.

@mdcato
Author

mdcato commented Feb 22, 2024 via email

@Firminator
Contributor

I found the comment @ #497 (comment)

  • Reverse wipe: instead of wiping start to end of disc, it wipes end to start of disc. Useful when you have a drive with bad sectors near the start (as is often the case) and you want to make sure as much of the workable parts of the drive as possible is wiped.

so yeah this was on the boilerplate already.

Also found the old thread that had a different idea/approach with non-linear wiping @ #10

@PartialVolume
Collaborator

@mdcato By all means please fork. I have a box full of disks that all fail in weird and wonderful ways to test your code.

@Firminator is correct, my preference is that the first I/O error that occurs triggers a reverse wipe. This whole discussion is also closely related to why, when sequentially writing a block device, using the Linux disc caching is OK (but not great) for wiping a disc, but is not what you want when dealing with discs with I/O errors.

If you take a system with 16 GB of memory and, as an example, start the wipe on one disc, the memory will very quickly fill up with about 12 GB of cached writes to the disc. We periodically flush the cache to detect an error; you can't detect the error from the write itself because it's not direct I/O. However, when we issue the fdatasync to check that the drive is working correctly, the fdatasync won't return until the entire 12 GB is flushed to disc (hence why direct I/O will be faster by about 5-20%).

Now say the fdatasync detects an I/O error and returns: we don't know the actual block that caused the I/O error, only that it was somewhere in the last 12 GB of data that was written. So now we have to go back x blocks, to a block whose number we really don't know, to try to find where the bad block actually is, and perform single 512 byte block writes each followed by a fdatasync to flush the block and detect the error. If that wasn't complicated enough, some drives don't even fail nicely and instead cause fdatasync to not return, causing the thread to hang.
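
To make the buffered-I/O problem concrete, here is a minimal sketch (not nwipe code) of large buffered writes with periodic fdatasync, dropping back to single-sector writes to localise an error; the device path, chunk size and sync interval are illustrative assumptions:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK      ( 1024 * 1024 ) /* large buffered write         */
#define SECTOR     512             /* fallback single-block write  */
#define SYNC_EVERY 64              /* chunks written between syncs */

/* Re-write [start, end) one sector at a time, syncing each one, to find
 * roughly where the error lies. Returns the first failing offset or -1. */
static off_t localise_error( int fd, off_t start, off_t end, const char* buf )
{
    for( off_t pos = start; pos < end; pos += SECTOR )
    {
        if( pwrite( fd, buf, SECTOR, pos ) != SECTOR || fdatasync( fd ) != 0 )
            return pos;
    }
    return -1;
}

int main( void )
{
    int fd = open( "/dev/sdX", O_WRONLY ); /* placeholder device */
    if( fd < 0 ) { perror( "open" ); return 1; }

    char* buf = calloc( 1, CHUNK );
    off_t offset = 0, last_synced = 0;

    for( int n = 0; n < 1024; n++ ) /* demo: first 1 GiB only */
    {
        if( write( fd, buf, CHUNK ) != CHUNK )
            break; /* buffered writes rarely report the error here */
        offset += CHUNK;

        if( ( n + 1 ) % SYNC_EVERY == 0 )
        {
            if( fdatasync( fd ) != 0 )
            {
                /* Error is somewhere in (last_synced, offset]; narrow it down. */
                off_t bad = localise_error( fd, last_synced, offset, buf );
                fprintf( stderr, "I/O error near offset %lld\n", (long long) bad );
                break;
            }
            last_synced = offset;
        }
    }
    free( buf );
    close( fd );
    return 0;
}
```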

This is why I've always wanted to move away from using the Linux disc cache in nwipe and instead perform direct I/O with the disc, so that nwipe has total control over disc access. With direct I/O the disc write provides us with an error if the block write fails. Nwipe would write a block of 200K bytes (I think the ideal block size for speed was discussed in the past) and if it failed we would know exactly where to start the process of 512 or 4096 byte block writes to close in on and locate the bad sector.
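
A minimal sketch (not the private direct I/O branch) of opening a block device with O_DIRECT so that each write reports failure at a known offset; the device path, block size and alignment are illustrative assumptions:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main( void )
{
    /* O_DIRECT bypasses the page cache, so errors come back from the write itself. */
    int fd = open( "/dev/sdX", O_WRONLY | O_DIRECT ); /* placeholder device */
    if( fd < 0 ) { perror( "open" ); return 1; }

    /* O_DIRECT requires the buffer (and normally size/offset) to be aligned
     * to the logical block size; 4096 is a common safe choice. */
    size_t blksz = 4096, chunk = 256 * 1024;
    void* buf;
    if( posix_memalign( &buf, blksz, chunk ) != 0 ) { close( fd ); return 1; }
    memset( buf, 0, chunk );

    off_t offset = 0;
    for( int n = 0; n < 4096; n++ ) /* demo: first 1 GiB only */
    {
        if( pwrite( fd, buf, chunk, offset ) != (ssize_t) chunk )
        {
            /* Unlike buffered I/O, the failing offset is known right here, and
             * we could drop to 512/4096 byte writes to pinpoint the bad sector. */
            fprintf( stderr, "write failed at offset %lld\n", (long long) offset );
            break;
        }
        offset += chunk;
    }
    free( buf );
    close( fd );
    return 0;
}
```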

I did start a direct I/O branch, which I've kept private as it needs more work, but it would take care of trying to get past bad blocks either by writing a single block at a time until it could get past the bad section or, preferably, by doing a reverse wipe, as this would wipe out the bulk of the drive as fast as possible until it reached the bad sectors again, but from the other end.

I don't think the above ideas are too controversial; however, I haven't looked at the build environment for ShredOS in regard to #2 to see if you're using a different lib (for smaller size?) that may not be up-to-date (i.e. older than POSIX.1-2008 or the Single UNIX Specification) and so doesn't support the ' flag in the format string. I'm assuming you're using recent/latest tools, but it's worth asking.

Yes, ShredOS uses all recent libraries & tools.

An I/O error, if you continued wiping as much of the disc as possible, would also affect the verification, which would fail, and the PDF report, which should also show the failure. In addition, the PDF shows the actual number of bytes successfully wiped at least once; this value would need to be calculated correctly when either writing through the bad blocks or reverse wiping after an I/O error.

[Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.

I do have one or two failed drives that fail within minutes of starting the wipe, which is really handy. It would be a pain if all the faulty test drives failed after x number of hours, but then you could simulate a failed block in code to test the code to a certain degree.

@PartialVolume
Collaborator

I've added three features to my project list, I'll give them a priority once I get through the priority zeros and ones.
https://github.com/users/PartialVolume/projects/1/views/1

@gorbiWTF

gorbiWTF commented Mar 5, 2024

With reverse wiping, what would happen if there are multiple non-consecutive bad sectors? Wouldn't this approach leave potentially good sectors in the middle not wiped?

@PartialVolume
Collaborator

No, the reverse wipe would continue to write through bad sectors up to the point where the forward wipe aborted. In practice, as trying to write through bad blocks means single 512 or 4096 byte block writes, writing through potentially bad blocks will drastically slow the transfer speed. So the write method would switch from, say, 100,000 byte writes for maximum transfer speed to single block writes when it detects I/O errors, then back to large block writes once X blocks have transferred correctly. This would be experimental, so it may well change based on what we find happens in practice.
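
A minimal sketch of that adaptive write-size idea; the sizes and the "X good blocks" threshold are illustrative assumptions, not values from nwipe:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define FAST_CHUNK       ( 100 * 1000 ) /* large write for maximum throughput       */
#define SECTOR           512            /* single-block write near suspect areas    */
#define GOOD_BEFORE_FAST 128            /* X clean sector writes before speeding up */

struct write_state
{
    bool   slow;        /* true while crawling through a suspect area    */
    size_t good_streak; /* consecutive clean sector writes while crawling */
};

/* Decide the size of the next write from the result of the previous one. */
static size_t next_write_size( struct write_state* s, bool last_write_failed )
{
    if( last_write_failed )
    {
        s->slow = true; /* I/O error: drop to single sectors */
        s->good_streak = 0;
    }
    else if( s->slow && ++s->good_streak >= GOOD_BEFORE_FAST )
    {
        s->slow = false; /* area looks clean again: speed back up */
    }
    return s->slow ? SECTOR : FAST_CHUNK;
}

int main( void )
{
    struct write_state st = { false, 0 };
    printf( "%zu\n", next_write_size( &st, false ) ); /* 100000: fast path   */
    printf( "%zu\n", next_write_size( &st, true ) );  /* 512: error hit      */
    printf( "%zu\n", next_write_size( &st, false ) ); /* 512: still crawling */
    return 0;
}
```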

I would imagine that for those wiping hundreds or thousands of drives for resale the slowdown in speed would be a waste of time, so, as currently happens, if the drive has I/O errors or reallocated sectors the drive is pulled and physically destroyed, as it would be unsuitable for resale or use.

Forward wipe followed by reverse wipe on error, with writing through bad blocks, would probably work for somebody who just wants to wipe as much as possible but doesn't want to physically destroy the platters, just place the drive in electrical waste. Time is maybe not an issue for them.
