GlobalTrackCache SIGSEGV: Improve 2-phase deletion of cached track objects#1536
Conversation
The deletion is split into 2 phase: - 1st phase: evict + save - 2nd phase: deallocate Deallocation is independent of the cache and triggered by the custom deleter of the newly created TrackPointer. The TrackPointer is passed as an argument to the saver callback. The cache releases ownership on the track object by passing it to the saver.
daschuer
left a comment
There was a problem hiding this comment.
In general it is a bad idea to code around a double free issue, without knowing the root cause.
This crash is a very good indicator that there is something wrong in our code or the compiler. I trust the heap allocator here and If we can't clearly identify the crash along our code, it is probably the optimizer bug, you have found before.
I have not get the point where this PR change the general way the usage of the Track is tracked, so it will probably not change any root cause of the crash.
I also do not understand how this PR will be save against the two tracks at the same memory location you have described here. The 2.1 state was a result of a good review process, and just changing some things now might hide symptoms of the issue, making the real issue harder to spot and more dangerous.
| Track* plainPtr) { | ||
| DEBUG_ASSERT(plainPtr); | ||
|
|
||
| // Nearly everything is possible until the cache is locked!!! |
There was a problem hiding this comment.
This comment is useless. Better explain what is allowed after locking, but not before.
| if (!cachedTrack->second.expired()) { | ||
| // We have handed out (revived) this track again while waiting | ||
| // at the lock at the beginning of this function or a new track | ||
| // object has been allocated with the same memory address. |
There was a problem hiding this comment.
How could the later have happened? The underlying reference counter object of a shared/weak pointer pair is deleted after the last referencing weak pointer is dropped. Since we still hold here weak pointer, the counter object cannot be violated.
If the later happens anyway, there have happened a miscount, which should not happen by definition of the STL.
All this sound more like the odd optimizer bug: https://github.com/mixxxdj/mixxx/pull/1492/files#diff-5149c75827e74d344c21b2adcec7e444R179 than an issue in our code.
If this actually still happens, we had the situation before, that a TrackPointer points to an already deleted Track or a track silently changes under the a used TrackPointer.
| evictAndSave(strongPtr); | ||
| } else { | ||
| // Track has already been evicted and can be deallocated now. | ||
| m_allocatedTracks.erase(allocatedTrack); |
There was a problem hiding this comment.
Here is the place where the last reference to the reference counting object falls to zero and is deleted.
This is quite save, because the expired check still works on the second run of the deleter.
This is an extra safety net when deactivate(). In the solution from this PR we have no guarantee that we do not end up in two independent SharedPointer with a deleter.
| const CachedTracks::iterator cachedTrack = | ||
| m_cachedTracks.find(plainPtr); | ||
| if (cachedTrack == m_cachedTracks.end()) { | ||
| // We have already deleted this track while waiting at |
There was a problem hiding this comment.
// We have already evicted this track ...
The second smart pointer for deleting may still live.
|
I have thought about the second pointer with the same address issue. And I think that can actually happen, but It should be save in this PR and before. If the race condition happens that Track A is revived while the first instance is locked inside This delete is either aborted because |
|
I still have now idea what could be the source of the double delete. Since it is so hard to think about the corner cases, I think we are doing something wrong here. The double shared pointer solution is not that bad here. I can now remember why I wanted to keep a track of the pointer during saving. It is because you can loose user Data without it. If we create a new Track object for the same track, it is constructed with the old unsaved data. If this is stored later, it overwrites the changes of the Track of the first instance. Unfortunately the check for this was not implemented. How about create the second smart pointer with the deleter just after creating the Track? |
|
OK, the GCC optimizer bug was a different case. It did not produce a crash, it just looses drops a reference normally, right? |
|
I am working on a new attempt getting rid of the race condition before holdling the lock, based on this branch. |
|
I'm now able to proof why and how the current implementation is broken:
I suppose there is also a double deletion that causes the SIGSEGV, but I'm currently not able to deduce the corresponding invocation sequence. Nevertheless we have at least a memory leak and concurrent access to the same file that the cache should prevent. This new version does not suffer from these deficiencies. The rare race condition that might load outdated data from the database before the modified data could be saved cannot be solved unless the database access is performed within the callback scope in the same thread while the cache is still locked. This is a different issue and this PR does not change the current behavior. |
|
I am convinced that with these changes we are right on track now. By avoiding to enter the cache again during the 2nd phase many issues simply disappear. |
|
Ah, I understand. In your steps above, we will have the issue, that the changes in P1 are lost, because P2 might be created before P1 is saved. Can we merge this as an intermediate version? |
|
That's what I propose ;) This commit fixes the memory leak and maybe also the supposed double free that might have caused the crash. With the example it should be clear that the current implementation is plain wrong and dangerous. This fix addresses only some serious technical issues. The rare race conditions that just risk to lose some recently modified data but are expected and otherwise harmless can be investigated later. I don't expect that there is a solution without unintended side effects. The race condition will disappear when multi-threaded database access is available, although at the cost of additional lock contention on the cache. |
Follow-up of #1492
I experienced 2 crashes when using Mixxx with the new analysis framework. Unfortunately I was only able to reproduce the crash once, but never again. The logs indicate that the same memory address has been reused for another track. Deletion of the first track object finished as expected, but Mixxx crashed when trying to delete the second object that was allocated at the same memory address.
This fix basically resembles my previous version, although simplified and much more elegantly and efficiently. After evicting a track object the cache passes ownership to the saver callback. This is accomplished with a custom deleter that eventually deallocates the track object by invoking QObject::deleteLater() after all references have been dropped. The cache is not involved any more and no race conditions can occur. Erasing the pointer from all internal data structures ensures that evicted pointers have completely disappeared from the cache. Even the case when a delayed deletion callback arrives for the same memory address that already stores a newly allocated object is handled correctly. This callback will simply be rejected, unless the corresponding shared_ptr has expired.
The modified code should ensure that neither double free nor any use after free will occur. It should be simpler than before and easier to reason about.