Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix device context pool deadlock scenario #751

Open
wants to merge 5 commits into
base: uwp/main
Choose a base branch
from

Conversation

benstevens48
Copy link

Fix a deadlock scenario that can occur in DeviceContextPool::TakeLease which is used in various places including ICanvasImage.GetBounds(). This fixes one of the issues identified in #748.

@damyanp
Copy link
Member

damyanp commented Jan 21, 2020

This seems a reasonable change to me, thank you! A lower-churn would be to add a call to lock.unlock() just before the call to CreateDeviceContext. This avoids rearranging the function and gives a more natural place to attach the comment explaining why the unlock is required as well.

@benstevens48
Copy link
Author

Made suggested changes. (Reason for original version was because often locks don't have an unlock, but I agree that since this one does we might as well use it!)

@benstevens48
Copy link
Author

benstevens48 commented Jan 27, 2020

@damyanp, @shawnhar - would you consider accepting this, and ideally releasing an updated NuGet package? Recently I've been making an animated GIF/WebP player using Win2D and I'm now regularly running into deadlocks because of this. The problem is that any function which calls GetResourceCreationDeviceContext() can trigger the deadlock (assuming another thread already has a lock on the device and also tries to call this function). Worse, the workaround is to take an explicit lock on the device before calling any function which calls this function, and there are lots of such functions, and the workaround actually makes deadlock more likely if I miss one of these functions.

Of course, I can build my own version of Win2D, but I don't really want to diverge from the official version.

Ideally this would be addressed at the some time as the other pull request I created to fix GetOrCreate, but that one might take longer to work out, so I'd be happy if just this fix got included first. If you don't want to ship a new NuGet, then it would make me less concerned about building my own version if this pull request was merged, then at least I know my code matches the official version which will eventually be updated.

@shawnhar
Copy link
Member

Apologies for the delay getting back to you on this. It's on my list to look at carefully, but I'm swamped this week. Really appreciate your help digging into these issues and putting changes together! Short term I would suggest just going ahead and using a locally built version to unblock your app.

@benstevens48
Copy link
Author

Thanks for reply. Building a local version is what I'll do. As part of the structuring of that, I've moved my forked repository, but it looks like it hasn't broken anything.

@benstevens48 benstevens48 mentioned this pull request Jul 7, 2020
@Sergio0694
Copy link
Member

The change here seems reasonable, and I'm not seeing risks with this. Even if a thread races against this and disposes the pool concurrently before the first thread resumes, it'll just return a device context on the original device, which is the same as another thread keeping a device alive for the entire duration of another thread concurrently closing the pool.

FYI @getrou it might make sense to also look into including this PR in the next batch of changes, as with the new public APIs to access the device context pool in #894 (the implementation is in one of my PRs in the internal repo), this problem might become potentially more frequent (given that it's a problem that is already present and just less frequent).

@benstevens48 two questions:

  1. At this line:
auto localDevice = m_d2dDevice;

I assume this also does an implicit AddRef during copy, right? Not 100% familiar with the exact C++ semantics here so figured I'd ask. Because the pool might be destroyed after the lock, so we do need to increment the ref count before unlocking.

  1. Could you also add a unit test and verify that it does deadlock before these changes and that it doesn't after? You should add it to the appropriate file in the internal unit tests project. Essentially just replicate that lock inversion scenario you mentioned so that we can validate this change does in prevent the deadlock from happening.

Thank you! 🙂

@benstevens48
Copy link
Author

@Sergio0694 and @getrou - thanks for working on this. I'm on holiday right now so will have a look when I get back, but here is some overall context.

Firstly to answer your point 1. - it looks like this is just a raw pointer to me so no AddRef. However the DeviceContextPool class itself is just using a raw pointer. I believe that currently the take lease can only be accessed through the device itself so that this is not a problem. If the device context pool is made accessible publicly then perhaps it will need changing and maybe the class member as well - depends on the public API I suppose.

Now on to the deadlocks. There are actually 3 changes I made relating to this - the other two are in #753.

It took me a while to decide what order locks should be taken in. The issue is D2D has an internal lock, then there are Win2d-level locks. But the D2D lock is exposed by D2D and also Win2D because certain operations involving Direct3D etc need to be manually protected by the lock. At first I thought maybe since the D2D lock is exposed to users of Win2D then that would have to be the outermost lock. However, it would probably be difficult to rewrite Win2D to ensure that no D2D locks (implicit or explicit) are taken within Win2D locks. Therefore the locking order should be Win2D lock then D2D lock. (Ideally this means that any Win2D functions which need to be called within a manual D2D lock should not internally use a Win2D lock and this should be added to the documentation - I'd need to investigate what these functions likely are).

Ignoring the potential of a consumer of Win2D using the manual Win2D lock, the only thing that actually needs changing to get rid of the Win2D deadlock is the second change I made as part of #753. That involves releasing the manually taken D2D lock needed when creating a drawing session on a swap chain earlier so that it isn't held while trying to take Win2D lock. The first part of #753 is more complicated but avoids holding a static Win2D-level lock whilst constructing Win2D objects from D2D ones, which involves calling some D2D functions and is generally a bad idea in my opinion because there's a whole long list of constructor functions that may be called and it's difficult to keep track of what they might do and they may be relative expensive so holding a static lock for the whole time seems like a bad idea. My change will result in an extra Win2D object being created an immediately discarded in the very rare case of trying to wrap the same D2D object twice at the same time, but avoids holding the lock except when accessing/updating the map of Win2D objects. Note that the GetOrCreate function is actually called a lot internally e.g. in effects etc.

The third change is the one you are referring to above. On its own it doesn't prevent deadlock but reduces the risk by holding a lock for less time (which also improves concurrency).

Thanks for implementing - #888 - I considered requesting this but for the time being the PixelShaderEffect was sufficient for me. Also the suggestion in #894 is good. I have requested access to the device context pool before here - #755 - so that will close this issue. (At the moment I am using my own device context pool).

@Sergio0694
Copy link
Member

"it looks like this is just a raw pointer to me so no AddRef. However the DeviceContextPool class itself is just using a raw pointer. I believe that currently the take lease can only be accessed through the device itself so that this is not a problem."

Mmmh right, because even if another thread tried to release the Win2D device (ie. the Win2D device wrapper) concurrently after another thread had released the Win2D lock before creating a device context, that same thread would still be keeping the Win2D device alive, so the context pool would also not be invalidated 🤔

"If the device context pool is made accessible publicly then perhaps it will need changing and maybe the class member as well - depends on the public API I suppose."

That shouldn't be an issue, as the pool itself isn't being exposed. #894 just adds the new ID2D1DeviceContextPool COM interface to CanvasDevice, which allows external consumers to access the pool functionality, but that will still go through the device and not the internal pool directly. As in, you just QI from the device for that interface and get a context lease from it.

I will say though, given the changes here are essentially just a slight optimization and not a fix for the deadlock, and given this is overall a tricky area, it might make sense to focus on the actual fix first if we can, and eventually go back to this after that.

That said, #753 definitely seems like a tricky PR, not because it's large but because it's very tricky and touches a critical component that everything else relies on, so the risk is relatively high. You'll definitely want to add some unit tests to go along with those changes as well, again you can add those to the internal unit tests (if there's a file with tests for the resource manager you can just add them in there along with the others) 😄

@benstevens48
Copy link
Author

benstevens48 commented Dec 30, 2022

Mmmh right, because even if another thread tried to release the Win2D device (i.e. the Win2D device wrapper) concurrently after another thread had released the Win2D lock before creating a device context, that same thread would still be keeping the Win2D device alive, so the context pool would also not be invalidated

Sort of - this means Win2D wrapper will not be closed due to no references, and additionally, calling Close explicitly at the same time as another operation is an error. In fact Win2D relies on this being an error in other places e.g. the ResourceWrapper, which does not lock when removing the reference to the wrapped resource upon close. I did actually try to argue in one issue here that calling Close simultaneously might not be considered an error, but Win2D does not allow for that (and it's probably for the best that ResourceWrapper is lock-free when accessing the underlying resource, especially given what I mentioned about the possible lock inversion when taking an explicit D2D lock). (Although in the DeviceContextPool case I think the lock does actually protect against Close being called simultaneously). To be honest I hadn't thought about the implications of not having an AddRef here and releasing the lock earlier, so I could add an AddRef or use a ComPtr, but due to the above discussion I don't think it's needed, and if we were worried about that it might be better for the DeviceContextPool to store a ComPtr containing the D2D device pointer instead of a raw pointer.

Regarding the other PR, I will look at splitting out the second change I made in it, which is kind of separate. However, it will be some time before I have time to spare to work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants