File write corrupts files #3
Hi Alexey. I'm looking into this, but at the moment I'm running into an error with something else that may or may not be related (there's a problem with the conversion of doubles to integers). As soon as I get that figured out, we'll see whether that addresses your issue as well, or whether your issue is unrelated.
Hi @NCrusher74
That error probably has the same root cause as the other one, which is that I'm not quite parsing (or putting together) files that use 64-bit integers for values such as media and chapter duration. I'm working on figuring it out, but it may take a few days.
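(For context, a hedged sketch rather than SwiftTaggerMP4's actual code: MP4 stores multi-byte integers big-endian, and version 1 of atoms such as `mvhd` and `mdhd` widens the duration fields from 32 to 64 bits, which is the case long audiobook files tend to hit. Reading such a field might look like this:)

```swift
import Foundation

// Hypothetical helpers, not SwiftTaggerMP4's API: MP4 integers are
// big-endian, and version 1 atoms carry 64-bit durations.
func bigEndianUInt64(_ bytes: [UInt8]) -> UInt64 {
    precondition(bytes.count == 8, "expected 8 bytes")
    return bytes.reduce(0) { ($0 << 8) | UInt64($1) }
}

func bigEndianUInt32(_ bytes: [UInt8]) -> UInt32 {
    precondition(bytes.count == 4, "expected 4 bytes")
    return bytes.reduce(0) { ($0 << 8) | UInt32($1) }
}

// 0x0000000102030405 == 4_328_719_365, too large for a 32-bit field.
print(bigEndianUInt64([0, 0, 0, 1, 2, 3, 4, 5]))
```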
Hi Alexey! I think I've pinned down the problem. Clone the updated branch and see whether that takes care of it.
Same behavior in which way? The error from your second comment, or the corrupted file from your original post? Can you give me any more information? I can now successfully edit a decrypted Audible .m4b.
You're right, excuse me. The file is not corrupted, but the writing takes almost half an hour (still 100% CPU and 1 GB RAM).
The lengthy writing problem may be an issue with Foundation's `Data.write(to:options:)`. I'm going to tag my mentor in on this. @SDGGiesbrecht, do you know why it might be that the `outputData.write(to: outputLocation, options: .atomic)` line takes so long? The full function is:

```swift
public func write(tag: Tag, to outputLocation: URL) throws {
let mediaData = try self.getMediaData()
try setMetadataAtoms(tag: tag)
setLanguage(tag: tag)
try setChapterTrack(mediaData: mediaData, tag: tag)
try setMdat(mediaData: mediaData, tag: tag)
var outputData = Data()
for atom in self.optimizedRoot {
outputData.append(atom.encode)
}
try outputData.write(to: outputLocation, options: .atomic)
}
```

Everything above the `write(to:options:)` line seems to execute quickly. As for the second problem, could you show me the atom structure of the source file, using AtomicParsley or Yate's raw data log? The problem I solved last night had to do with a couple of unexpected atoms messing up the offset for the chapter title information. I'd like to see if there are any atoms in your file that are in an unexpected place.
If you have tried printing a timestamp immediately before and immediately after that line to prove that that line alone is the culprit, then there is little more you can do about it. All that line does is pass the data off to the operating system to store. The operating system could do any number of things as part of the process, such as compress Time Machine back‐ups to reclaim needed space. The time scale described here still sounds absurd (I can download 1 GB faster than that), so if it persists after you have fixed the rest, I would report it to Apple. If it is pinwheeling during the wait, then the operation can be moved off the main thread in order to keep the interface responsive. Otherwise, evading Foundation's `Data.write(to:options:)` altogether would be the only remaining workaround.
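(One way to evade it, sketched under the assumption that the atoms can be encoded one at a time: stream them to disk through `FileHandle` instead of accumulating a single giant `Data`. This is illustrative, not what the library does, and note it gives up the `.atomic` guarantee of the original.)

```swift
import Foundation

// Sketch: write each encoded atom as it is produced rather than
// concatenating everything into one Data first. `encodedAtoms` stands in
// for the library's real atom encoding; the FileHandle calls are plain
// Foundation API.
func streamWrite(encodedAtoms: [Data], to url: URL) throws {
    FileManager.default.createFile(atPath: url.path, contents: nil)
    let handle = try FileHandle(forWritingTo: url)
    defer { handle.closeFile() }
    for atom in encodedAtoms {
        handle.write(atom) // appends at the current file offset
    }
}
```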
Well...

- Atom structure of the file from the second comment
- Atom structure of the file christie.m4b (the one that cannot be reopened)
Okay, it looks like the bottleneck isn't the writing after all.

```swift
public func write(tag: Tag, to outputLocation: URL) throws {
print("get media data begins: \(Date())") // *1*
let mediaData = try self.getMediaData()
print("set metadata atoms begins: \(Date())") // *2*
try setMetadataAtoms(tag: tag)
print("set language begins: \(Date())") // *3*
setLanguage(tag: tag)
print("set chapter track begins: \(Date())") // *4*
try setChapterTrack(mediaData: mediaData, tag: tag)
print("set mdat begins: \(Date())") // *5*
try setMdat(mediaData: mediaData, tag: tag)
print("output data accumulation begins: \(Date())") // *6*
var outputData = Data()
for atom in self.optimizedRoot {
outputData.append(atom.encode)
}
print("writing begins: \(Date())") // *7*
try outputData.write(to: outputLocation, options: .atomic)
print("writing ends: \(Date())") // *8*
}
```

Results: this is for a 222 MB file. Focusing (for the moment) on the two longest pieces of the process... It makes sense that `getMediaData()` would take a while. Still, yes, it is taking too long, which suggests that I'm doing it wrong and/or inefficiently. This is what `getMediaData()` looks like:

```swift
func getMediaData() throws -> Data {
let chunkOffsets = self.moov.soundTrack.mdia.minf.stbl.chunkOffsetAtom.chunkOffsetTable
let chunkSizes = try self.chunkSizes(stbl: self.moov.soundTrack.mdia.minf.stbl)
guard chunkSizes.count == chunkOffsets.count else {
throw Mp4FileError.ChunkSizeToChunkOffsetCountMismatch
}
let data = self.data
// Now that we know our CHUNK sizes, we can calculate the data to isolate by adding each chunk size to its corresponding offset to create a range for the data
var mediaData = Data()
for (index, entry) in chunkOffsets.enumerated() {
let start = entry
let end = start + chunkSizes[index]
let range = start ..< end
mediaData.append(data.subdata(in: range))
}
return mediaData
}
```

So basically what I'm doing is accumulating the chunks of audio data (which may or may not be consecutive; at least one wacky app out there interleaves chapter title data with the audio data) and concatenating them into one consecutive chunk (because that makes a lot more sense and is easier to work with). Then I return that data.

```swift
func setMdat(mediaData: Data, tag: Tag) throws {
let mdat = try Mdat(mediaData: mediaData,
titleArray: tag.chapterHandler.chapterTitles)
self.mdats = [mdat]
self.moov.soundTrack.mdia.minf.stbl.chunkOffsetAtom.chunkOffsetTable = try calculateNewMediaOffsets()
}
```

What I'm doing here is tacking the chapter title data onto the end of the audio data (some other apps store it up front, ahead of the audio data; some store it in an entirely separate `mdat` atom). Then I replace the file's `mdat` atoms and recalculate the chunk offsets. I guess I don't know enough about how Swift stores and handles these big pieces of data to understand what's involved here and whether there is any way I can optimize this process to make it any faster?
It looks like that error is happening in an extension to `SignedInteger`:

```swift
public extension SignedInteger {
private init<D>(parsingLittleEndian data: D)
where D: Collection, D.Element == UInt8 {
assert(
data.count <= MemoryLayout<Self>.size,
"Data is too large to be expressed with \(Self.self)."
)
self = data.lazy.enumerated()
.reduce(into: 0 as Self) { underConstruction, entry in
let byte = Self(entry.element)
let repositioned = byte << (entry.offset * 8)
underConstruction |= repositioned
}
}
private func encodingLittleEndian() -> Data {
let byteMask: Self = 0xFF
var data = Data()
for offset in 0 ..< MemoryLayout<Self>.size {
let byte = self & (byteMask << (offset * 8))
let repositioned = byte >> (offset * 8)
data.append(UInt8(repositioned))
}
return data
}
}
```

It's possible I did something wrong; can that approach not be used with signed integers?
Both files have
Well, first make sure all the correctness issues are handled. There is no sense optimizing code you already know you might need to fix or replace. Then toy with profiling the code for time efficiency. (“Profile” is one of Xcode’s options alongside “Build”, “Run” and “Test”, and you can look up a tutorial if you need to.) One of the set‐ups will show you exactly how much time is spent in each function. The first two things that come to mind to try are:
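(The list itself did not survive the page scrape. Judging from the follow-up below, where the allocation is reserved "when constructing most atoms", one of the suggestions was along these lines; a sketch, not the original snippet:)

```swift
import Foundation

// Reserving capacity up front means Data does not have to repeatedly
// reallocate and copy as atoms are appended. `expectedByteCount` is
// illustrative; the thread later computes it from the atom sizes.
var outputData = Data()
let expectedByteCount = 222 * 1_024 * 1_024
outputData.reserveCapacity(expectedByteCount)
```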
No, it cannot. The memory representation of a signed integer is completely different.
Ah okay. I don't recall exactly why I thought MP4 required signed instead of unsigned integers for a lot of these values. I think it's because a lot of the time the documentation doesn't specify which, and it's also possible that I saw MP4v2 using signed integers and assumed I needed to. But logic says that in 99.9 percent of the cases where it's being used, unsigned will work just fine, because there won't be any negative values. So that should be a fairly easy (though extensive) fix. So far, everything is building after removing the use of signed integers. Typically, I do know how big the data will get, because I have to have a size for each atom anyway. Usually, however, I get it by putting all the data together and counting the bytes, and I've actually wondered before if that was an inefficient way to get the size, because it essentially means building the atom, counting the bytes, then discarding the atom I just built. For example, this is how I initialize an `stsc` atom:

```swift
init() throws {
// init properties
// build payload to get size
var payload = Data()
payload.append(self.version)
payload.append(self.flags)
payload.append(entryCount.uInt32.beData)
payload.append(defaultFirstChunk)
payload.append(defaultSamplesPerChunk)
payload.append(defaultDescriptionID)
let size = payload.count + 8
// use size to init super
try super.init(identifier: "stsc", size: size, payload: payload)
}
```

Is there a better way to get that size?
Alexey, can I ask what your source is for the two files with those atoms?
I do not think so.
Well, the code you posted does not discard anything, so you are not doing anything twice. What you have is what I would always write to start with, because it is simple. Computing the size in advance will only provide a benefit if it enables something else, such as reserving capacity so the buffer never has to be reallocated.
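(For illustration, and assuming the fixed layout shown above: the `stsc` size can be computed arithmetically, without assembling the payload at all. Whether that is worth doing is exactly the question of whether it enables something else.)

```swift
// Field widths follow the init above: a 1-byte version, 3-byte flags,
// and four 4-byte fields, plus the standard 8-byte atom header
// (4-byte size + 4-byte identifier).
let payloadSize = 1 + 3 + 4 + 4 + 4 + 4
let atomSize = payloadSize + 8 // 28, with no payload built and discarded
```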
You need to know the exact bitwise scheme from the specification and respect it (just like with the endianness we needed to watch for unsigned numbers, only with more variations). See here for a comparison.
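(The comparison link did not survive the scrape; the gist, as a sketch:)

```swift
// The same byte pattern means different things depending on signedness
// (two's complement):
let byte: UInt8 = 0xFF
print(byte)                   // 255
print(Int8(bitPattern: byte)) // -1

// Widening a negative value requires sign extension, which a plain
// shift-and-OR loop never performs:
print(Int16(0xFF))                   // 255 (naive widening)
print(Int16(Int8(bitPattern: byte))) // -1  (sign-extended)
```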
christie.m4b was produced by me using AudioBookBinder. I could pass you these files for debugging if we can find a way (WeTransfer or something else; 788 MB).
I found a file with those atoms, and it doesn't appear those are the problem, so that probably won't be necessary. However, it's been my experience that AudiobookBinder produces files that... don't play very nicely with other apps. Admittedly, it's been a couple of years since I've tried to use it, but at the time, I remember noting that trying to edit the metadata with Subler on a file assembled with AudiobookBinder resulted in a corrupt file. So that very well could be the source of our problems with these files. (You can see the issue I filed with the developer of AudiobookBinder at the time.)
Oops, bad news... 99% of my files are created using Binder :-(
Without knowing more about how AudiobookBinder puts files together, I don't know what else I can do. I've tested against files from other sources without hitting this problem. However, before giving up entirely, I did make some adjustments to the integer handling. @SDGGiesbrecht, it doesn't appear that memory allocation was the issue. I've reserved the allocation when constructing most atoms (particularly the large ones) and it hasn't made any difference to the delay. I did manage to reduce the memory usage by 1/2 to 1/3 by removing the parameter used to pass in the media data for this function and one other:

```swift
// getMediaData() is already a method of the same class, so passing it in
// as a parameter was unnecessary.
func setMdat(tag: Tag) throws {
let mdat = try Mdat(mediaData: getMediaData(),
titleArray: tag.chapterHandler.chapterTitles)
self.mdats = [mdat]
self.moov.soundTrack.mdia.minf.stbl.chunkOffsetAtom.chunkOffsetTable = try calculateNewMediaOffsets()
}
public func write(tag: Tag, to outputLocation: URL) throws {
try setMetadataAtoms(tag: tag)
setLanguage(tag: tag)
try setChapterTrack(tag: tag)
let mdatStart = Date()
try setMdat(tag: tag)
if #available(OSX 10.15, *) {
print("mdat time: \(mdatStart.distance(to: Date()))")
} else {
// Fallback on earlier versions
}
let outputStart = Date()
var outputData = Data()
let size = optimizedRoot.map({$0.size}).sum()
outputData.reserveCapacity(size)
for atom in self.optimizedRoot {
outputData.append(atom.encode)
}
if #available(OSX 10.15, *) {
print("output time: \(outputStart.distance(to: Date()))")
} else {
// Fallback on earlier versions
}
try outputData.write(to: outputLocation, options: .atomic)
}
```

So I'm not passing around quite as many copies of the media data at encoding time. All that accomplished, however, was removing the ~45 seconds it took to store the media data separately. I'm trying to find information on profiling, as you recommended, and while there are a few tutorials about using it with apps, I can't figure out how to use it for a library. I've gotten Instruments running and recorded Xcode as it was running the test that demonstrates how long writing a file takes (roughly 3 minutes for a 222 MB file), but I can't find anything in there that tells me what is causing the problem.
You cannot “run” a library. Write a short command line tool that exercises the slow code. Then launch that with Instruments attached.
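(A minimal sketch of such a harness, reusing the names from the snippet in the original report; the target layout, paths, and the write call's exact public spelling are assumptions, not verbatim library API:)

```swift
// main.swift of a small executable target depending on the library,
// so Instruments has a process to attach to and profile.
import Foundation
import SwiftTagger // module name assumed

let input = URL(fileURLWithPath: "/tmp/sample.m4b")      // illustrative path
let output = URL(fileURLWithPath: "/tmp/sample-out.m4b") // illustrative path

var source = try AudioFile(location: input)
source.title = "Profiling run"
try source.write(outputLocation: output) // the slow path under investigation
```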
Okay, I have a trace file now. Out of the 2:15 it takes to write a 222 MB file, a good third of that is a single branch. Drilling down into it, it looks like the delay there is coming at the call to get the size of the media data in order to establish the offsets. Which, again, should be really straightforward math, so the fact that it is taking so long is unexpected. And this is actually a function that you helped me write, so maybe you'll have some insight?

```swift
private var mediaDataCount: Int {
do {
return try self.chunkSizes(stbl: self.moov.soundTrack.mdia.minf.stbl).sum() // just math, right?
} catch {
fatalError(error.localizedDescription)
}
}
func chunkSizes(stbl: Stbl) throws -> [Int] {
let sampleToChunkTable = stbl.stsc.sampleToChunkTable
var sampleSizeTable = [Int]()
if stbl.stsz.sampleSize == 0 {
sampleSizeTable = stbl.stsz.sampleSizeTable
} else {
var count = stbl.stsz.entryCount
while count > 0 {
sampleSizeTable.append(stbl.stsz.sampleSize)
count -= 1
}
}
var remainingSamples = sampleSizeTable
var chunkSizes: [Int] = []
for (index, group) in sampleToChunkTable.enumerated() {
let knownChunks: Int
if let nextGroup = sampleToChunkTable[index...].dropFirst().first {
knownChunks = nextGroup.firstChunk - group.firstChunk
} else {
// If there is no following group,
// we don’t know how many chunks this applies to,
// but there must be at least one.
knownChunks = 1
}
for _ in 0 ..< knownChunks {
var chunkSize = 0
for _ in 0 ..< group.samplesPerChunk {
guard !remainingSamples.isEmpty else {
throw Mp4FileError.MissingSample
}
chunkSize += remainingSamples.removeFirst()
}
chunkSizes.append(chunkSize)
}
}
// Handle any remaining chunks according to the last group.
while !remainingSamples.isEmpty {
guard let samples = sampleToChunkTable.last?.samplesPerChunk else {
throw Mp4FileError.MissingChunk
}
var chunkSize = 0
for _ in 0 ..< samples {
guard !remainingSamples.isEmpty else {
throw Mp4FileError.MissingSample
}
chunkSize += remainingSamples.removeFirst()
}
chunkSizes.append(chunkSize)
}
return chunkSizes
}
```

All of this is just math. We're not even counting bytes of media data at any point. We're taking the sample sizes from the `stsz` atom and grouping them into per-chunk totals according to the `stsc` sample-to-chunk table. (We will later combine that information with the chunk offsets in order to isolate the chunks of data themselves, but we're not doing that HERE, or at any point in the `chunkSizes(stbl:)` function.) The only thing I can see about this that might cause it to take a long time is the sheer number of samples/chunks. When you're dealing with a large file like an audiobook, there are probably tens, if not hundreds, of thousands of samples:
So we're just doing calculations here, but we're doing A LOT of them. And I have no idea how to optimize that any further. So it may be that this particular part of the process can't be trimmed down. It's just a whole lotta calculations. Moving on. The other branch that is taking a significant portion of the time is this one. And I notice off the bat that almost half the time involved is in doing... the same huge chunk of calculations we did for the other branch?

```swift
func setMdat(tag: Tag) throws {
let mdat = try Mdat(mediaData: getMediaData(), // <- calls `chunkSizes(stbl:)` again
titleArray: tag.chapterHandler.chapterTitles)
self.mdats = [mdat]
self.moov.soundTrack.mdia.minf.stbl.chunkOffsetAtom.chunkOffsetTable = try calculateNewMediaOffsets()
recalculateSoundTrackSizes()
}
func getMediaData() throws -> Data {
let chunkOffsets = self.moov.soundTrack.mdia.minf.stbl.chunkOffsetAtom.chunkOffsetTable
let chunkSizes = try self.chunkSizes(stbl: self.moov.soundTrack.mdia.minf.stbl)
guard chunkSizes.count == chunkOffsets.count else {
throw Mp4FileError.ChunkSizeToChunkOffsetCountMismatch
}
let reserve = chunkSizes.sum()
let data = self.data
// Now that we know our CHUNK sizes, we can calculate the data to isolate by adding each chunk size to its corresponding offset to create a range for the data
var mediaData = Data()
mediaData.reserveCapacity(reserve)
for (index, entry) in chunkOffsets.enumerated() {
let start = entry
let end = start + chunkSizes[index]
let range = start ..< end
mediaData.append(data[range])
}
return mediaData
}
```

So, I guess I'm going to... create a stored property for the chunk sizes, so that calculation only happens once. And that... saved me all of 0.05 minutes of processing time. So, um. Hmph. Guess that wasn't the magic bullet after all. ETA: Scratch that. I missed one call to the `chunkSizes(stbl:)` function. ETA2: Nope. Still taking at least 2.1 minutes to process. No significant reduction in processing time.
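(A sketch of the caching described above; the property name is mine, and the cache would need invalidating if the sample tables ever change:)

```swift
// Compute the chunk sizes once and reuse them from both mediaDataCount
// and getMediaData(), instead of re-running chunkSizes(stbl:) each time.
private lazy var cachedChunkSizes: [Int] = {
    do {
        return try self.chunkSizes(stbl: self.moov.soundTrack.mdia.minf.stbl)
    } catch {
        fatalError(error.localizedDescription)
    }
}()
```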
Holy schnikes! Okay, I didn't realize I would have to restart Xcode to get a fresh measurement. YES! (Of that 43.43 seconds, almost 40 are spent doing those many, many calculations I mentioned. If there is a way to streamline that, this thing will fly.) Unfortunately, however, the gains for larger files do not appear to be in proportion. While a 222 MB file took 43.43 seconds, a 1.07 GB file is at 19+ minutes and still going. (I gave up on it at 28+ minutes.) Alexey, if you could, can you try pulling the updated branch and seeing if that improves your performance as well?
Notice the hierarchy.
That does not surprise me, because `removeFirst()` is being called on an array, and an array places its elements into memory at the front. When you remove the first element, it needs to copy the whole array backwards one space. If you start with an array of 603 857 elements, and you repeatedly remove the first until it is empty, you are performing 603 856 + 603 855 + 603 854 + 603 853... copies. Consider how the work being done would grow more and more significantly as the size of the file grows. It is not your method, so you cannot improve it, but you can dodge it. Try this change:
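(The suggested snippet itself was lost in the scrape. A change consistent with the diagnosis, given that for n = 603 857 those copies add up to roughly n²/2 ≈ 1.8 × 10¹¹ element moves: work on an `ArraySlice` instead of an `Array`.)

```swift
// In chunkSizes(stbl:), take a slice instead of a copy of the array.
// Array.removeFirst() shifts every remaining element (O(n) per call);
// ArraySlice.removeFirst() just narrows the slice's bounds (O(1)).
var remainingSamples = sampleSizeTable[...] // ArraySlice<Int>
// ...the rest of the function stays the same; every
// remainingSamples.removeFirst() is now constant-time.
```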
That will turn the quadratic pile of copies into a single linear pass. You can also try inverting the call stack in Instruments’ report in case that reveals anything more.
Hi! The blade.m4b issue (the file that couldn't be opened, with the BAD_INSTRUCTION break) seems to be resolved; now the file can be opened.
LOL yep, that did it. Processing time is now down to 3 seconds for the 222 MB file, and 16 seconds for the 1.07 GB file. Thank you so very much! Alexey, I'm glad the problem with at least some of your files is fixed. If you would like to go ahead and share one of your problem files, I will take a look at it and see if I can figure out where the problem is. I can't guarantee anything, and it may take me a few days because I do my normal (i.e. paying) gig from Wednesday to Sunday nights. So I might not be able to have a look at it until next week, but I'll see what I can do. In the meantime, enjoy the new, improved, super-zippy version of SwiftTagger! (As soon as I merge these changes.)
Actually, scratch that. I just re-ran my other tests and I've broken something with the offsets, so chapter titles are getting mangled. I'll work on fixing that tomorrow.
Alexey, I'm not sure if this will address YOUR issue, but I did fix a bug that was causing string offset problems (I forgot to count the null terminators when calculating the size of the chapter title strings). If not, like I said, I will try to take a look at it after the weekend.
Sure, let me know and I'll test all the cases.
Right now the current version is 1.7.1. If that fixes your issue, great. If not, go ahead and share a link to one of your problem files and I will take a look at it after the weekend.
With 1.7.1 the file is not corrupted, but the file-opening issue persists.
Hi @NCrusher74!
I find SwiftTagger very useful; great library! Unfortunately, the file writing (after modifying the tags) takes 10-15 minutes and corrupts files (probably an Xcode 12.4 or macOS Big Sur 11.2.3 issue).
Here is a code snippet:

```swift
let url = URL(fileURLWithPath: "/Users/alexey/Documents/adLibTest/1-01 The Fellowship of the Ring.m4b")
var source = try? AudioFile(location: url)
print(source?.title) // was "The Fellowship of the Ring"
source?.title = "The Fellowship of the Ring-1"
```
Both overwriting the existing file and writing a new file lead to the same result.
Alexey