Skip to content

lp1744550: Fix use-after-free (and more serious issues) in GlobalTrackCache#1492

Merged
daschuer merged 81 commits intomixxxdj:2.1from
uklotzde:lp1744550_globaltrackcache
Feb 24, 2018
Merged

lp1744550: Fix use-after-free (and more serious issues) in GlobalTrackCache#1492
daschuer merged 81 commits intomixxxdj:2.1from
uklotzde:lp1744550_globaltrackcache

Conversation

@uklotzde
Copy link
Copy Markdown
Contributor

@uklotzde uklotzde commented Jan 23, 2018

https://bugs.launchpad.net/mixxx/+bug/1744550

This one is a real killer that sooner or later crashes Mixxx and might cause serious memory corruption!!

There was a race condition in the custom deleter that must be resolved within GlobalTrackCache. The new multi-threaded regression test reliably crashes with debug assertions and finally a SIGSEGV without the fix applied.

I also found a memory leak: We cannot use QObject::deleteLater() for temporary Track objects! Otherwise those objects will never be deleted. Nope, we actually should use deleteLater()!

@Be-ing
Copy link
Copy Markdown
Contributor

Be-ing commented Jan 23, 2018

Should this be targeted at 2.1? Also, this has commits from #1491.

@uklotzde uklotzde changed the base branch from master to 2.1 January 23, 2018 18:39
@uklotzde
Copy link
Copy Markdown
Contributor Author

@Be-ing Sorry, created in a rush. Retargeted to 2.1 of course!

Copy link
Copy Markdown
Member

@daschuer daschuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am glad you have found this itching bug.
However I have not yet understand why the original deletelater() version fails and how this fail produces the crash.

Due to the fail of delete, the Track is not deleted and an access to it can't crash.

On the other hand, if the track is connected to QT signals, the std::shared pointer does not count these references. So I am afraid we delete a Track that is still connected. The deleteLater() puts the object on the QT signal queue to prevent prevent delivering queued signals to the object. Are we sure that this is solved?

DEBUG_ASSERT(m_trackRef.getId() == trackId);
m_pTrack->initId(trackId);
if (m_trackRef.getId().isValid()) {
DEBUG_ASSERT(m_trackRef.getId() == trackId);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-init with the same Id is allowed? Is there a use-case for this? If it is not the same Id, ca we continue silently in the release version? A comment can be useful.

Comment thread src/track/globaltrackcache.cpp Outdated
DEBUG_ASSERT((nullptr == pEvictedTrack) ||
(pEvictedTrack == (*trackById).plainPtr));
// ...and continue like it has not been found
auto trackRefPointer = lookupInternal(trackId);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the ISO CPP guideline to use std::pair for two value return function is a mistake. A named struct would be a better choice.
I like the "out" keyword in c# for output parameters as well. We can consider to model it via a "out" decoration template.
Do you know if std::pair or a struct has any performance drawback compared to a pointer parameter as second return value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just needed a tuple with 2 elements to pass those internal results around in order to avoid redundant code. The 2 values belong together and shouldn't be torn apart into obscure output parameters. Creating a new type that is not used anywhere else is tedious and not really worth the effort. Unfortunately crufty, old C++ has no first class support for tuples.

Well, maybe I can write a dedicated TrackRefPointer type. But writing so much unnecessary code that needs top be maintained is one of the reasons why I feel uncomfortable with C++ and want to move on ;)

@uklotzde
Copy link
Copy Markdown
Contributor Author

The Trusty gcc build shows an unexpected debug assertion in GlobalTrackCacheTest that never occurred in any of my local builds. Very strange.

@uklotzde uklotzde changed the title [PRIO] lp1744550: Fix use-after-free in GlobalTrackCache [WiP] lp1744550: Fix use-after-free in GlobalTrackCache Jan 23, 2018
@uklotzde
Copy link
Copy Markdown
Contributor Author

Advanced Brainf*ck

@uklotzde
Copy link
Copy Markdown
Contributor Author

This is getting surreal. We should rewrite Mixxx 4 in Rust! I'm loosing my faith in C++, it's simply too unsafe to use.

@uklotzde
Copy link
Copy Markdown
Contributor Author

uklotzde commented Jan 24, 2018

Hopefully I'm done now. Waiting for the builds to finish and doing some local testing.

I've spend hours to figure out what was actually happening and found various issues. The new stress test revealed more issues than expected, including double deletions.

Simple explanation: The underlying object of the pointer that is passed to the custom deleter function must not be accessed until we reach the locked cache instance and are able to decide if this object still exists. The deleter might even be invoked with a pointer that has already been deleted before, totally legal. Due to those (expected) race conditions anything you could ever think of is possible, a mine field!!!

The multi-threaded stress test and tracing helped me to understand all the invocation patterns that we need to handle. It takes 1s for 100k rounds within the main thread (~30-40k within the worker) and I think it is worth to keep it enabled. Even if results are random and not reproducible, any failure will indicate an open issue that needs to be fixed.

@uklotzde uklotzde changed the title [WiP] lp1744550: Fix use-after-free in GlobalTrackCache lp1744550: Fix use-after-free (and more serious issues) in GlobalTrackCache Jan 24, 2018
Comment thread src/track/globaltrackcache.h Outdated

typedef std::set<Track*> AllocatedTracks;

bool evictInternal(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be private?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already private like most functions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry for the noise. I got confused by the public keyword for the Item class.

@Be-ing Be-ing added the library label Jan 24, 2018
@uklotzde
Copy link
Copy Markdown
Contributor Author

@daschuer Thanks for the clever save-before-deletion solution 👍 I didn't notice that we could achieve a better separation of concerns and simplify the code.

I just made some minor changes on top of your commits and adjusted the naming to better reflect the actual purpose.

@daschuer
Copy link
Copy Markdown
Member

Nice! I am glad we are done. Thank very much for making this peace of unhandy code, tame.

@daschuer
Copy link
Copy Markdown
Member

LGTM

@daschuer daschuer merged commit 483182a into mixxxdj:2.1 Feb 24, 2018
@uklotzde
Copy link
Copy Markdown
Contributor Author

phew! 😅 Hopefully the new beta builds will be available soon to fix this nasty bug for all our brave beta testers.

@thomasjfox
Copy link
Copy Markdown
Contributor

New mixxx user here. I've been lurking this changeset and will test it tonight (building right now).
Usually I was able to crash mixxx 2.1 beta within a few minutes by placing a few hundred files in the analyzer + keep using the UI. The computer has 6*2 cores and 48GB RAM, so it was surprising to see funky memory allocation error messages :)

@Be-ing
Copy link
Copy Markdown
Contributor

Be-ing commented Feb 24, 2018

Big thanks to @uklotzde and @daschuer for your hard work on this!

@Be-ing
Copy link
Copy Markdown
Contributor

Be-ing commented Feb 24, 2018

@thomasjfox could you test #1413? Tips for testing PRs are on the wiki.

@thomasjfox
Copy link
Copy Markdown
Contributor

@Be-ing: Luckily uklotzde already updated the lp1737537 branch with the changes from #1492.
I'll do the bleeding edge thing and run with the changes from #1413, too :o) Thanks for the tip!
If there are any crashes, I'll report it in #1413.

@uklotzde
Copy link
Copy Markdown
Contributor Author

@thomasjfox I strongly recommend using #1413 for analysis. This is the version I am using and it has proven to be stable and responsive.

While working on this issue I noticed that the library scanner illegally accesses the database from another thread. But we won't be able to fix that for 2.1. The only recommendation is to leave Mixxx alone while rescanning your library.

@uklotzde
Copy link
Copy Markdown
Contributor Author

uklotzde commented Feb 24, 2018

@Be-ing I think you intended to mention @daschuer

Indeed, this was a tough battle ;) But now we should have a solid solution with most of the rough edges ironed out. This serves as the foundation for future improvements, which don't need to be done in a rush. The topic is simply too complex and sensitive as we have learned.

@Be-ing
Copy link
Copy Markdown
Contributor

Be-ing commented Feb 24, 2018

While working on this issue I noticed that the library scanner illegally accesses the database from another thread. But we won't be able to fix that for 2.1. The only recommendation is to leave Mixxx alone while rescanning your library.

Does that mean there is a chance of Mixxx crashing when scanning the library and doing other things with Mixxx? If so, maybe we should make a quick hack for 2.1 to block other database operations while the library scanner is running.

@Be-ing I think you intended to mention @daschuer

Haha yes, thanks for pointing that out.

@uklotzde
Copy link
Copy Markdown
Contributor Author

@Be-ing I added some assertions to TrackDAO that I had to comment out, because they are violated when rescanning the library. The only solution would be to block any interaction with Mixxx while the scan is running.

@uklotzde
Copy link
Copy Markdown
Contributor Author

uklotzde commented Feb 24, 2018

@Be-ing No, sorry, my fault. The LibraryScanner is ok! I just noticed that LibraryScanner uses its own instance of TrackDAO. We instead need to adjust the assertions that must compare the current thread with the thread that constructed the TrackDAO. I will prepare a fix.

@uklotzde
Copy link
Copy Markdown
Contributor Author

uklotzde commented Mar 4, 2018

@daschuer Unfortunately the recent reorganization of evicting, saving, and finally deallocating has introduced a new issue that caused another crash. I suppose reusing the lookup functions from the cache while evicting and saving the track object is wrong in evictOrDelete(). We must not revive the cached object, but instead just wrap the plain pointer into a new, unshared strong pointer for the saver callback. Preparing a fix...

@daschuer
Copy link
Copy Markdown
Member

daschuer commented Mar 4, 2018

I had just a close look to the code, but I cannot spot the faulty point.
The Idea was that the cache mutex guarantees that we have no pointer handed out in the meanwhile.
Is there a gap that this check fails? A separate strong pointer might not work if this gap still exist?
In fact, the code assumes that the it is already a separate strong pointer during evicting. Is this assumption wrong?

Do you have a backtrace of the crash?

@uklotzde
Copy link
Copy Markdown
Contributor Author

uklotzde commented Mar 4, 2018

The backtrace doesn't help at all and even the logs don't give much insight. What I see is that one memory address is reused for a new allocation.

At least I was able to reproduce the crash once and need to test my fix before publishing it. The deletion will be split into into two phases with different deleters. During the 2nd phase the cache is not involved anymore. The code will get simpler and more efficient, which hopefully is an indicator for correctness ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants