-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atomic saving #241
Comments
Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka): On unix it's usually done like this: copy, change, fsync, rename. I don't think mutagen needs any change for this to work or am I missing something? Having example code (or a public helper function if this is complicated enough/platform depended) in the docs for this seems like a good idea though. |
Original comment by Daniel Plachotich (Bitbucket: danpla, GitHub: danpla): Copying file is not the solution for several reasons.
As you see, there is no way to do this without changes in mutagen. I don't see any reasons to afraid this changes — nothing will be broken. I'm sure that both developers and end users of programs will be happy, because data safety is paramount. |
Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):
I'm confused; your proposed algorithm copies the file. What am I missing?
...
If you want to prevent data loss on disk errors/power loss you need a temporary file. You never know if a write() succeeds and how much is written until the error.
I wouldn't want every application using mutagen to get slow because of this so this is expected.
rename doesn't work across mount points. |
Original comment by Daniel Plachotich (Bitbucket: danpla, GitHub: danpla): If I clearly understand your previous post, you suggested firs make a copy and only then insert the new tags. Inserting require shift the whole data, so there are x2 more IO operations than actually needed: (1) the copying itself and (2) the shifting. In the my algorithm, I suggest to create a new empty temporary file, write new headers and then copy data from the original. Simply saying, it uses only copying, in a one pass; shifting is not needed.
Only the tags can be damaged in this way, but the audio data itself will be untouched. More or less easy, the corrupt tags can be simply removed.
As I said before, the amount of IO operations will be more or less the same as in regular shift using read()-write() or in a one copy operation, so I don't think that there will be any noticeable performance loss.
Yes, I know — I actually meant that temporary directory is one possible example solution when there is no space on the current partition and it's requires moving (not renaming, of course) the file when the mount points are different. But instead it can be better to just raise an exception. Probably the links I posted in the previous post are better explain the idea:
|
Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):
OK, now I get what you mean. Note that with the recent padding changes it is unlikely that we do need to resize. FileType.save() currently takes a filename argument to save to another file but it is deprecated (as it's confusing that only tags get saved..). We could allow passing a non-existing/empty file path to FileType and make it copy the file + changed tags there... but this change probably isn't easy because most code assumes it saves to a valid file.
What do you think?
This depends on the file type and parser error handling.
We currently only replace a few kb without resizing in the common case.. so this will be 100x slower. Just saying that the default behavior as it is now can't change in mutagen. |
Original comment by Daniel Plachotich (Bitbucket: danpla, GitHub: danpla): As I mentioned before, mutagen will behave as now in case of suitable amount of padding, i.e. tags will written directly without resizing. I think such code is not a solution: it makes impossible to use advantage of an existing padding in most of cases, because Actually, the solution I proposed is just a "safe" alternative to internally used |
Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):
What if writing fails in the middle of this?
flush + os.fsync is needed before close and rename to ensure the data gets written to disk. |
Original comment by Daniel Plachotich (Bitbucket: danpla, GitHub: danpla):
Like any other software (and Mutagen in particular) does — nothing. The audio is not corrupted, so this is not a big issue. But it may be useful to add a global option (disabled by default) so that temporary file will be used in all cases.
Closing a file does this things automatically. flush and os.fsync manually called when file doesn't closed, but a data needs to be immediately written on disk (to be available for other processes, etc.). |
Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):
The audio may not be corrupted but the file is (in case of mp4 the audio will too be corrupted as we need to update the offsets). I don't see any benefit in a data integrity feature which only works in the rare case of a resize.
man close says otherwise: "A successful close does not guarantee that the data has been success‐ |
Original comment by Daniel Plachotich (Bitbucket: danpla, GitHub: danpla): This improvement is primarily for audio integrity. Padding in ID3 and FLAC added especially for this purpose, i.e. don't rewrite file if possible. If padding in MP4 doesn't allow do this without the risk of audio corruption, it's not an issue to use temp each time for it. My apologies — Python does only flush on close, but not fsync. Although I don't sure it really needed. For example, neither |
FYI: this link discusses the create/write/fsync/rename idiom: regarding this code:
it says:
also see this old project: and this notes from microsoft: |
Thanks. I've implemented something similar in quodlibet: https://github.com/quodlibet/quodlibet/blob/master/quodlibet/quodlibet/util/atomic.py |
Proposed API:
|
these are implementation details and shouldn't be in the API. atomicity is the key property here. implementation strategy and durability should not be specified.
IMHO this is the wrong API. no file system in real use that i know of is fully transactional. this API provides and abstraction that is not well supported underneath. in the end, it will create a copy of the file in all supported OSes; this is why i think it is the wrong API. a more useful API would be saveAs(self, file=None) that saves pending changes to a copy of the file somewhere else. API help would include a note that, if 'file' parameter is missing, it will save to a temp file and then replace the original as atomically as possible. (and stating that this saving mode is safer and slower than the usual save is OK; but it should be noted that the usual save is safe enough for normal use.)
yes, the saveAs() API could be added. but the important issue here is that it doesn't fix the problem with the current implementation of save()! furthermore, you leave it to the app programmer to decide what to do, and they know little or nothing about risks. what do you expect them to do? wouldn't all of them either choose replace()/saveAs() or ask you a question personally? plus the 'safe save' is only needed when the ID3v2 tag is added or resized, because loosing the tag is not as bad a loosing the payload, and the tag is so small that a partial save is really unlikely. and with replace/saveAs you force a copy on each save, when 99% of the time it is not needed. what is really needed is a safe 'payload scroll' only when resizing the ID3v2 tag. IMO, the normal save should be upgraded to behave in a safe enough manner, and that is it. this entails copying the file but only when a payload move is needed. more free space is needed by the new save? too bad. i rather have a reasonably safe save that complains the disk is full than a dangerous save. if the disk is full, users will have to deal with that very soon, so let them deal with it now. so to recap, my proposal is:
|
and comparatively the save/replace solution has problems:
keep in mind that the shortcoming of the current save implementation is not a 'designed feature'. it is not a calculated compromise of performance and safety; it is a plain-old bug in the implementation, arising from not considering certain failure modes. its behavior is not something anyone would want to keep, but a bug that should be fixed. fortunately the save() API is the right API for the fixed implementation (a safe enough save), so you can fix this without impacting clients. |
Thanks for your input.
I didn't want to be too specific as in reality the drive/file system can lie about flushing and on Windows we don't know what's happening exactly. We could say it does an atomic replace if the OS/hardware supports it, if not it will at least reduce the probability of data corruption as far as possible (?)
Why is it wrong?
There might be a misunderstanding here. I planed exactly what you propose with replace().
Good question. I'll have to think about that.
I don't like multiple shades of "safe". If we somehow leave the file in an invalid but playable state, the next tagger might get tricked into corrupting the file, or some players wont accept the file. And I really don't want to think at every operation in the code about what would happen if there was a crash right after. And is there any guarantee that file operations get flushed to disk in the order I execute them?
I'm not sure, if we talk probability here, how often the "safe enough" approach would result in a recoverable file (from the user POV, not starting up the HEX editor and cleaning things up by hand). (Don't forget all the other tagging formats..). I agree that free space, permissions, and path limits reached by the temp file should just be ignored/error out.
I think you are exaggerating a bit. What amount of programs do atomic replace for your documents? If your kernel crashes a lot you have lots of other problems too. I at least hope that kernel bugs like the one you reported are rare.
Regarding the naming. I prefer replace() as it highlights the the file gets replaced as a whole, while when I read saveAs() I think of saving to another location (like the "save as" menu entry in many file menus), and while that is a step of the process, it's just an implementation detail. |
well, what can i say, my view is different. IMHO save() is broken because it causes a very high probability of loosing data, compared to any other saves the users normally do. to worsen things, the data lost is not the data being edited (the audio), and this is the worst scenario and something users won't understand or be lenient with. save() has to be fixed because it is broken and has not use case as it is. there is no point in introducing another API and leaving a broken one up. and expecting existing clients to change is not realistic. replace() is not the solution, as it causes massive performance loss all the time, when most of the time it is not needed. the probability that a 1KB structure gets partially written is abysmally low compared to the probability that a 100MB file gets partially written; maybe 100.000 times smaller. (i listen to dj sets mostly, so my individual files are typically over 100MB.) this means that around 100.000 corruption incidents of moved tracks would be observed before 1 tag-only corruption happens (if both operations are equally likely to be executed). but of course, before 100.000 corruptions happen your library would be eliminated from any successful app. i guess 100.000 times safer than the current implementation can be called 'safe enough', given that you are not currently bombarded by thousands of corruption reports.
this is exactly what saveAs() does: saves a new file. i proposed that omitting the argument causes save as temp file and replacement of the original. ok, maybe this is not the best interface. so instead: saveAs() must be called with a file argument, and the file must not exist. saveAtomically(), saveViaCopy(), atomicSave(), saveSafely(), safeSave() are all possible names of "saveAs() to temp and then atomically replace". the docs should state that although atomicSave() is safer, save() should be is safe enough for all uses. (implementation of atomicSave() has a low priority IMHO because it will almost never be used. fixing save() has a high priority.) replace() is a bad name, and the semantics are not at all clear. why would you call replace("xxx.mp3") to save a new xxx.mp3 file? and if you want to save to a new file, why would the provided API silently replace any existing file instead of complaining that it exists? and most importantly: if the file exists, would replace() replace it atomically? if not, how would you document that? if yes, then there is a 3rd filename involved that is not being specified! my semantics:
all of them i suppose. the problem is that some apps didn't do an fsync so data losses happened all the same when people changed file systems. you can read about it: https://lwn.net/Articles/322823/ the point is that data loss happens a lot. if your library is used that is; if nobody uses it, then yes, whatever you do here is just fine :). but the idea is that a library must be robust just in case it becomes successful. this is why IMHO save() has to be fixed. i know that adding code to implement a decision to copy or not before starting the save process to the existing code base can be painful. but you don't need to do that. you can start saving in-place and only create a temp when you need to scroll big data. after the scroll, you immediately replace atomically, and then continue to save in-place. this is an easy change (only the scroll operation has to be modified) and yet it adds a lot of reliability. this highlights that saveAs()/saveAtomically() are not necessarily needed to solve (and are not the solution to) save()'s issues. saveAs()/saveAtomically(), if desired, can be added at a later stage. (of course, in a perfect world you would detect scroll before starting save() and use saveAs() instead.) |
Yeah, but I appreciate your input. I will ask some of the larger mutagen users and see what semantics they would expect or what API they want. |
@lazka any news about that? |
Originally reported by: Daniel Plachotich (Bitbucket: danpla, GitHub: danpla)
Resizing files "in place" is pretty dangerous way, especially for the large ones like audio. Both memory mapping and ordinary IO can fail during the process for many reasons and leave a corrupt file.
The only safe approach is to use temporary file. This is already used in some players and tagging libraries, for example, in Foobar2000 [1][2][3].
The algorithm is pretty simple:
The text was updated successfully, but these errors were encountered: