
GlobalTrackCache SIGSEGV: Improve 2-phase deletion of cached track objects#1536

Merged
daschuer merged 1 commit into mixxxdj:2.1 from uklotzde:globaltrackcache_2phase_deletion on Mar 6, 2018

Conversation

@uklotzde
Contributor

@uklotzde uklotzde commented Mar 4, 2018

Follow-up of #1492

I experienced 2 crashes when using Mixxx with the new analysis framework. Unfortunately I was able to reproduce the crash only once, never again. The logs indicate that the same memory address was reused for another track. Deletion of the first track object finished as expected, but Mixxx crashed when trying to delete the second object that was allocated at the same memory address.

This fix basically resembles my previous version, although simplified and made much more elegant and efficient. After evicting a track object the cache passes ownership to the saver callback. This is accomplished with a custom deleter that eventually deallocates the track object by invoking QObject::deleteLater() after all references have been dropped. The cache is no longer involved and no race conditions can occur. Erasing the pointer from all internal data structures ensures that evicted pointers have completely disappeared from the cache. Even the case where a delayed deletion callback arrives for the same memory address that already stores a newly allocated object is handled correctly: this callback will simply be rejected, unless the corresponding shared_ptr has expired.

The modified code should ensure that neither a double free nor any use after free can occur. It should be simpler than before and easier to reason about.

The deletion is split into 2 phases:
- 1st phase: evict + save
- 2nd phase: deallocate

Deallocation is independent of the cache and triggered by the
custom deleter of the newly created TrackPointer. The TrackPointer
is passed as an argument to the saver callback. The cache releases
ownership on the track object by passing it to the saver.
Member

@daschuer daschuer left a comment


In general it is a bad idea to code around a double-free issue without knowing the root cause.
This crash is a very good indicator that there is something wrong in our code or the compiler. I trust the heap allocator here, and if we can't clearly identify the crash in our code, it is probably the optimizer bug you have found before.

I have not yet grasped where this PR changes the general way the usage of a Track is tracked, so it will probably not change any root cause of the crash.

I also do not understand how this PR is safe against the two tracks at the same memory location you have described here. The 2.1 state was the result of a good review process, and just changing some things now might hide symptoms of the issue, making the real issue harder to spot and more dangerous.

Track* plainPtr) {
DEBUG_ASSERT(plainPtr);

// Nearly everything is possible until the cache is locked!!!
Member


This comment is useless. Better explain what is allowed after locking, but not before.

if (!cachedTrack->second.expired()) {
// We have handed out (revived) this track again while waiting
// at the lock at the beginning of this function or a new track
// object has been allocated with the same memory address.
Member


How could the latter have happened? The underlying reference counter object of a shared/weak pointer pair is deleted only after the last referencing weak pointer is dropped. Since we still hold a weak pointer here, the counter object cannot be invalidated.

If the latter happens anyway, a miscount must have occurred, which should be impossible by definition of the STL.
All this sounds more like the odd optimizer bug (https://github.com/mixxxdj/mixxx/pull/1492/files#diff-5149c75827e74d344c21b2adcec7e444R179) than an issue in our code.

If this actually still happens, then even before this PR a TrackPointer could point to an already deleted Track, or a track could silently change underneath a TrackPointer in use.

evictAndSave(strongPtr);
} else {
// Track has already been evicted and can be deallocated now.
m_allocatedTracks.erase(allocatedTrack);
Member


Here is the place where the last reference to the reference counting object drops to zero and it is deleted.
This is quite safe, because the expired check still works on the second run of the deleter.
This is an extra safety net for deactivate(). In the solution from this PR we have no guarantee that we do not end up with two independent shared pointers, each with its own deleter.

const CachedTracks::iterator cachedTrack =
m_cachedTracks.find(plainPtr);
if (cachedTrack == m_cachedTracks.end()) {
// We have already deleted this track while waiting at
Member


// We have already evicted this track ...

The second smart pointer used for deleting may still be alive.

@daschuer
Member

daschuer commented Mar 5, 2018

I have thought about the second-pointer-with-the-same-address issue. I think it can actually happen, but it should be safe both in this PR and before.

Suppose the race condition happens that Track A is revived while the first instance is locked inside evictAndSave(), and the second instance runs evictAndSave() first. Then the first, waiting instance points to an invalid memory address. If this address now becomes valid again because it is reused for Track B, the first instance of Track A ends up trying to delete Track B.

This delete is either aborted by the if (cachedTrack == m_cachedTracks.end()) and if (!cachedTrack->second.expired()) checks, or it is executed because Track B is waiting at the lock for deletion or eviction, which will be blocked later.

@daschuer
Member

daschuer commented Mar 5, 2018

I still have no idea what could be the source of the double delete.

Since it is so hard to think about the corner cases, I think we are doing something wrong here.

The double shared pointer solution is not that bad here. I now remember why I wanted to keep track of the pointer during saving: without it you can lose user data. If we create a new Track object for the same track, it is constructed with the old, unsaved data. If this is stored later, it overwrites the changes of the first Track instance.

Unfortunately the check for this was not implemented.

How about creating the second smart pointer with the deleter just after creating the Track?

@daschuer
Member

daschuer commented Mar 6, 2018

OK, the GCC optimizer bug was a different case. It did not produce a crash, it just silently dropped a reference, right?

@daschuer
Member

daschuer commented Mar 6, 2018

I am working on a new attempt at getting rid of the race condition before taking the lock, based on this branch.

@uklotzde
Contributor Author

uklotzde commented Mar 6, 2018

I'm now able to prove why and how the current implementation is broken:

  1. Start with an empty cache.
  2. By resolving a track with ID1, the cache allocates a new track object T1 at memory address (= plain pointer) P1. A new shared pointer SP1 is returned to the caller.
  3. After the last reference of SP1 is dropped, P1 is evicted from the cache. The pointer is removed from the two indices, but is still contained in m_allocatedTracks. After eviction no track with ID1 is found.
  4. P1 is wrapped into a new shared pointer SP1' and passed to the saver callback.
  5. While the track is being saved, another thread tries to resolve the same track with ID1. Since no cached object with this id is found, a new track object T2 is allocated at memory address P2. This is returned as shared pointer SP2 to the caller.
  6. After T1 has been saved, the last reference of SP1' is dropped. The custom deleter calls GlobalTrackCache::evictOrDelete() with the argument P1 once again.
  7. P1 is still found in m_allocatedTracks and the corresponding weak pointer has expired, as expected. But now bad things happen...
  8. We try to obtain a strong pointer for ID1 and will indeed get one. But instead of a shared pointer that wraps P1, we find P2, which is still in use!!!
  9. Instead of deleting P1 we now evict and save P2 instead. P1 is lost and will never be deleted, i.e. we have created a memory leak. It will never be removed from m_allocatedTracks.
  10. After eviction SP2 is reused and passed to the saver callback for saving, although it is still in use by another thread.
  11. After saving we enter the 2nd phase again, this time with P2. The deletion is aborted this time, because P2 has not yet expired and has pending references.

I suppose there is also a double deletion that causes the SIGSEGV, but I'm currently not able to deduce the corresponding invocation sequence. Nevertheless, we have at least a memory leak and concurrent access to the same file, which the cache is supposed to prevent. This new version does not suffer from these deficiencies.

The rare race condition that might load outdated data from the database before the modified data could be saved cannot be solved unless the database access is performed within the callback scope in the same thread while the cache is still locked. This is a different issue and this PR does not change the current behavior.

@uklotzde
Contributor Author

uklotzde commented Mar 6, 2018

I am convinced that with these changes we are on the right track now. By avoiding re-entering the cache during the 2nd phase, many issues simply disappear.

@daschuer
Member

daschuer commented Mar 6, 2018

Ah, I understand.

In your steps above, we still have the issue that the changes in P1 are lost, because P2 might be created before P1 is saved.

Can we merge this as an intermediate version?

@uklotzde
Contributor Author

uklotzde commented Mar 6, 2018

That's what I propose ;) This commit fixes the memory leak and maybe also the supposed double free that might have caused the crash. With the example it should be clear that the current implementation is plain wrong and dangerous.

This fix addresses only some serious technical issues. The rare race conditions that merely risk losing some recently modified data are expected and otherwise harmless; they can be investigated later. I don't expect that there is a solution without unintended side effects. The race condition will disappear once multi-threaded database access is available, although at the cost of additional lock contention on the cache.

@daschuer daschuer merged commit 1426419 into mixxxdj:2.1 Mar 6, 2018
@uklotzde uklotzde deleted the globaltrackcache_2phase_deletion branch March 11, 2018 19:14
