lp1744550: Fix use-after-free (and more serious issues) in GlobalTrackCache#1492
lp1744550: Fix use-after-free (and more serious issues) in GlobalTrackCache#1492daschuer merged 81 commits intomixxxdj:2.1from uklotzde:lp1744550_globaltrackcache
Conversation
Debug logs enabled during compile time are more appropriate and reliable.
|
Should this be targeted at 2.1? Also, this has commits from #1491. |
|
@Be-ing Sorry, created in a rush. Retargeted to 2.1 of course! |
daschuer
left a comment
There was a problem hiding this comment.
I am glad you have found this itching bug.
However I have not yet understand why the original deletelater() version fails and how this fail produces the crash.
Due to the fail of delete, the Track is not deleted and an access to it can't crash.
On the other hand, if the track is connected to QT signals, the std::shared pointer does not count these references. So I am afraid we delete a Track that is still connected. The deleteLater() puts the object on the QT signal queue to prevent prevent delivering queued signals to the object. Are we sure that this is solved?
| DEBUG_ASSERT(m_trackRef.getId() == trackId); | ||
| m_pTrack->initId(trackId); | ||
| if (m_trackRef.getId().isValid()) { | ||
| DEBUG_ASSERT(m_trackRef.getId() == trackId); |
There was a problem hiding this comment.
Re-init with the same Id is allowed? Is there a use-case for this? If it is not the same Id, ca we continue silently in the release version? A comment can be useful.
| DEBUG_ASSERT((nullptr == pEvictedTrack) || | ||
| (pEvictedTrack == (*trackById).plainPtr)); | ||
| // ...and continue like it has not been found | ||
| auto trackRefPointer = lookupInternal(trackId); |
There was a problem hiding this comment.
IMHO the ISO CPP guideline to use std::pair for two value return function is a mistake. A named struct would be a better choice.
I like the "out" keyword in c# for output parameters as well. We can consider to model it via a "out" decoration template.
Do you know if std::pair or a struct has any performance drawback compared to a pointer parameter as second return value?
There was a problem hiding this comment.
I just needed a tuple with 2 elements to pass those internal results around in order to avoid redundant code. The 2 values belong together and shouldn't be torn apart into obscure output parameters. Creating a new type that is not used anywhere else is tedious and not really worth the effort. Unfortunately crufty, old C++ has no first class support for tuples.
Well, maybe I can write a dedicated TrackRefPointer type. But writing so much unnecessary code that needs top be maintained is one of the reasons why I feel uncomfortable with C++ and want to move on ;)
|
The Trusty gcc build shows an unexpected debug assertion in GlobalTrackCacheTest that never occurred in any of my local builds. Very strange. |
|
Advanced Brainf*ck |
|
This is getting surreal. We should rewrite Mixxx 4 in Rust! I'm loosing my faith in C++, it's simply too unsafe to use. |
|
Hopefully I'm done now. Waiting for the builds to finish and doing some local testing. I've spend hours to figure out what was actually happening and found various issues. The new stress test revealed more issues than expected, including double deletions. Simple explanation: The underlying object of the pointer that is passed to the custom deleter function must not be accessed until we reach the locked cache instance and are able to decide if this object still exists. The deleter might even be invoked with a pointer that has already been deleted before, totally legal. Due to those (expected) race conditions anything you could ever think of is possible, a mine field!!! The multi-threaded stress test and tracing helped me to understand all the invocation patterns that we need to handle. It takes 1s for 100k rounds within the main thread (~30-40k within the worker) and I think it is worth to keep it enabled. Even if results are random and not reproducible, any failure will indicate an open issue that needs to be fixed. |
|
|
||
| typedef std::set<Track*> AllocatedTracks; | ||
|
|
||
| bool evictInternal( |
There was a problem hiding this comment.
It is already private like most functions.
There was a problem hiding this comment.
Oh, sorry for the noise. I got confused by the public keyword for the Item class.
|
@daschuer Thanks for the clever save-before-deletion solution 👍 I didn't notice that we could achieve a better separation of concerns and simplify the code. I just made some minor changes on top of your commits and adjusted the naming to better reflect the actual purpose. |
|
Nice! I am glad we are done. Thank very much for making this peace of unhandy code, tame. |
|
LGTM |
|
phew! 😅 Hopefully the new beta builds will be available soon to fix this nasty bug for all our brave beta testers. |
|
New mixxx user here. I've been lurking this changeset and will test it tonight (building right now). |
|
@thomasjfox could you test #1413? Tips for testing PRs are on the wiki. |
|
@thomasjfox I strongly recommend using #1413 for analysis. This is the version I am using and it has proven to be stable and responsive. While working on this issue I noticed that the library scanner illegally accesses the database from another thread. But we won't be able to fix that for 2.1. The only recommendation is to leave Mixxx alone while rescanning your library. |
|
@Be-ing I think you intended to mention @daschuer Indeed, this was a tough battle ;) But now we should have a solid solution with most of the rough edges ironed out. This serves as the foundation for future improvements, which don't need to be done in a rush. The topic is simply too complex and sensitive as we have learned. |
Does that mean there is a chance of Mixxx crashing when scanning the library and doing other things with Mixxx? If so, maybe we should make a quick hack for 2.1 to block other database operations while the library scanner is running. Haha yes, thanks for pointing that out. |
|
@Be-ing I added some assertions to TrackDAO that I had to comment out, because they are violated when rescanning the library. The only solution would be to block any interaction with Mixxx while the scan is running. |
|
@Be-ing No, sorry, my fault. The LibraryScanner is ok! I just noticed that LibraryScanner uses its own instance of TrackDAO. We instead need to adjust the assertions that must compare the current thread with the thread that constructed the TrackDAO. I will prepare a fix. |
|
@daschuer Unfortunately the recent reorganization of evicting, saving, and finally deallocating has introduced a new issue that caused another crash. I suppose reusing the lookup functions from the cache while evicting and saving the track object is wrong in evictOrDelete(). We must not revive the cached object, but instead just wrap the plain pointer into a new, unshared strong pointer for the saver callback. Preparing a fix... |
|
I had just a close look to the code, but I cannot spot the faulty point. Do you have a backtrace of the crash? |
|
The backtrace doesn't help at all and even the logs don't give much insight. What I see is that one memory address is reused for a new allocation. At least I was able to reproduce the crash once and need to test my fix before publishing it. The deletion will be split into into two phases with different deleters. During the 2nd phase the cache is not involved anymore. The code will get simpler and more efficient, which hopefully is an indicator for correctness ;) |
https://bugs.launchpad.net/mixxx/+bug/1744550
This one is a real killer that sooner or later crashes Mixxx and might cause serious memory corruption!!
There was a race condition in the custom deleter that must be resolved within GlobalTrackCache. The new multi-threaded regression test reliably crashes with debug assertions and finally a SIGSEGV without the fix applied.
I also found a memory leak: We cannot use QObject::deleteLater() for temporary Track objects! Otherwise those objects will never be deleted.Nope, we actually should use deleteLater()!