Prevent writes during EventSource.Disable #55862

Closed

Conversation

@josalem (Contributor) commented Jul 16, 2021

fixes #55441

There is a race in EventSource.Disable: any active writer that has passed its IsEnabled() check can end up accessing internal structures while they are being disposed. This patch adds a two-phase disable as part of dispose that prevents writes from happening while resources are being cleaned up.
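A rough sketch of the pattern, using a standalone illustrative type rather than the actual EventSource changes in this PR (type and field names here are invented for the example):

```csharp
using System;
using System.Threading;

// Illustrative only: a standalone type showing the two-phase disable idea,
// not the real EventSource internals touched by this PR.
sealed class DrainOnDisposeWriter : IDisposable
{
    private volatile bool _enabled = true; // stand-in for m_eventSourceEnabled
    private int _activeWrites;             // stand-in for m_activeWritesCount

    public void Write(string message)
    {
        // Register as an active writer before touching shared state.
        Interlocked.Increment(ref _activeWrites);
        try
        {
            // Re-check after registering: Dispose may have flipped the flag
            // between the caller's IsEnabled()-style check and this point.
            if (_enabled)
            {
                Console.WriteLine(message); // stand-in for the real write path
            }
        }
        finally
        {
            Interlocked.Decrement(ref _activeWrites);
        }
    }

    public void Dispose()
    {
        // Phase 1: stop new writers from getting past the enabled check.
        _enabled = false;

        // Phase 2: wait (bounded) for writers already past the check to drain;
        // only then is it safe to free whatever those writers might touch.
        SpinWait.SpinUntil(() => Volatile.Read(ref _activeWrites) == 0, 1000);
    }
}
```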

TODO:

  • performance analysis. Adding a spin-wait could be expensive, and even with the timeout it's possible for this to bite us. I want to run this through BenchmarkDotNet (bdn) to see whether there is a serious impact on EventSource performance, and I also want to check whether this impacts performance on shutdown (see the benchmark sketch below). I'll leave this as a draft until I'm either convinced I don't need the data, or I've collected it.
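For reference, a minimal BenchmarkDotNet harness of the kind that could be used for that measurement; the event source and benchmark class are placeholders, not code from this PR:

```csharp
using System.Diagnostics.Tracing;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Placeholder event source used only to exercise the write path.
[EventSource(Name = "Demo-WritePath")]
sealed class DemoEventSource : EventSource
{
    public static readonly DemoEventSource Log = new DemoEventSource();

    [Event(1)]
    public void Ping(int value) => WriteEvent(1, value);
}

public class EventSourceWriteBenchmark
{
    // Measures the cost of a single event write; running this before and
    // after the change shows whether the extra bookkeeping is visible.
    [Benchmark]
    public void WriteEvent() => DemoEventSource.Log.Ping(42);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<EventSourceWriteBenchmark>();
}
```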

CC @brianrob @noahfalk @davmason

John Salem added 2 commits July 16, 2021 16:21
* prevent writers from accessing internal data while it is being disposed
@josalem josalem added this to the 6.0.0 milestone Jul 16, 2021
@josalem josalem self-assigned this Jul 16, 2021
@ghost commented Jul 16, 2021

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Labels: area-System.Diagnostics.Tracing
Milestone: 6.0.0

@@ -1457,6 +1462,10 @@ protected virtual void Dispose(bool disposing)
catch { } // If it fails, simply give up.
m_eventSourceEnabled = false;
}

// Wait till active writes have finished (stop waiting at 1 second)
SpinWait.SpinUntil(() => m_activeWritesCount == 0, 1000);
Member

How do we know 1 second is the right value to wait?

Contributor Author (josalem)

This should have a big old TODO over it saying "until I find evidence for a better value".

I'm not even sure the timeout is appropriate. My fear is that there is a code path after a call to one of the core Write methods that takes a lock, and having this spin indefinitely would be asking for a deadlock of some kind.

@AaronRobinsonMSFT (Member) commented Jul 28, 2021

Let me make a better suggestion here. When I was writing the ETW collector provided by VS, this problem was endemic to all ETW collector designs. A solution we came up with was to insert an event into the ETW stream during shutdown; when we observed that event, we knew we were "synchronized" with the request. This isn't entirely true for reasons I can explain offline, but it is close enough that VS's ETW collector has never missed events due to this logic. Timeouts like this in ETW should be avoided at all costs. One second actually is a very reasonable value, though; I can also explain that offline if interested.
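A rough in-process illustration of that sentinel-event idea, using an EventListener as a stand-in for a real out-of-process ETW collector (all names here are invented for the example):

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Threading;

[EventSource(Name = "Demo-Sentinel")]
sealed class SentinelEventSource : EventSource
{
    public static readonly SentinelEventSource Log = new SentinelEventSource();

    [Event(1)]
    public void Work(int id) => WriteEvent(1, id);

    // The marker written at shutdown; seeing it means everything written
    // before it has already flowed through the stream.
    [Event(2)]
    public void FlushMarker() => WriteEvent(2);
}

sealed class SentinelListener : EventListener
{
    private readonly ManualResetEventSlim _markerSeen = new ManualResetEventSlim();

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Demo-Sentinel")
            EnableEvents(source, EventLevel.Verbose);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        // Ordinary events would be processed here; event 2 is the sentinel.
        if (e.EventId == 2)
            _markerSeen.Set();
    }

    // Called at shutdown: write the marker, then block (bounded) until the
    // collector side has observed it, i.e. we are "synchronized".
    public bool Synchronize(TimeSpan timeout)
    {
        SentinelEventSource.Log.FlushMarker();
        return _markerSeen.Wait(timeout);
    }
}
```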

@noahfalk (Member)

@josalem and I chatted earlier, but just to record it - most BCL types don't provide a thread-safety guarantee for Dispose() and I wonder if it is necessary here. The synchronization necessary to provide that guarantee comes with perf costs I'm hopeful we can avoid.
- If the runtime will automatically call Dispose() on a thread of its choosing at shutdown, then we do need some thread-safety guarantee, but does the runtime need to be calling Dispose() at all? I'm not sure what benefit it gives us, but research may reveal that it is necessary.
- Since the runtime automatically calls Write for EventCounters we would need to modify Dispose() to ensure that we stopped the EventCounter Write() calls prior to freeing the provider.

@brianrob (Member)

I agree. I was trying to figure out what I was concerned about with regard to this change and I couldn't quite verbalize it. I think you hit it on the head @noahfalk.

@noahfalk (Member)

I was looking at a related bug and wound up doing some research on why we call EventSource.Dispose() at shutdown. So far I see two reasons:

  • Avoiding native->managed callbacks after we've made it illegal to call back into managed code here
  • Logging ETW manifests at the end of the trace here

With a little refactoring we could create a new internal API like EventSource.HandleShutdown() that handles only these concerns while still leaving us in a state where concurrent calls to Write don't generate failures (whether any data is logged is optional). Then we would call this new API and not Dispose() in the shutdown path. Dispose() would continue to do what it does now including freeing memory, but unlike HandleShutdown() the onus could be on the caller to ensure they don't make any concurrent calls to other EventSource APIs during/after the call to Dispose().

In the case of EventPipe, HandleShutdown() could disable the callbacks without deleting the provider. In the case of ETW past precedent suggests concurrent calls to EventUnregister and EventWrite are already safe (we've presumably been doing it for many years with no reported issues).
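A speculative sketch of the proposed split, using an illustrative standalone type (HandleShutdown() is only a proposal here; none of this is actual runtime code):

```csharp
using System;

sealed class ShutdownAwareSource : IDisposable
{
    private volatile bool _callbacksEnabled = true;

    // Runtime-shutdown path: do only the work that must happen at shutdown,
    // leaving the object in a state where concurrent Write calls don't fail
    // (whether their data is still logged is optional).
    internal void HandleShutdown()
    {
        WriteManifest();            // get the manifest to the end of the trace
        _callbacksEnabled = false;  // stop native->managed callbacks without
                                    // deleting the provider or freeing buffers
    }

    // Regular Dispose still frees everything; the onus is on the caller not
    // to make concurrent calls during/after it, so no drain logic is needed.
    public void Dispose()
    {
        // ... unregister providers, free buffers ...
    }

    public void Write(string message)
    {
        if (_callbacksEnabled)
            Console.WriteLine(message); // stand-in for the real emit path
    }

    private void WriteManifest()
    {
        // Stand-in for logging the ETW manifest event.
    }
}
```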

@AaronRobinsonMSFT @vitek-karas @elinor-fung - The comment in the source references throwing a COMPLUS_BOOT_EXCEPTION and I find no reference to it any longer. I do recall that we used to block native->managed calls at some point during AppDomain shutdown, but do you know if that constraint is a relic of the desktop runtime that is no longer an issue on CoreCLR?

@josalem (Contributor, Author) commented Jul 28, 2021

The consensus is that we don't want to add protection of this nature due to the cost and that we should find an alternate way to prevent this. Rather than discuss this on a draft PR, I'll migrate our conversation to the issue describing the root problem (#55441).

@josalem josalem closed this Jul 28, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 27, 2021
Linked issue: Rare race condition in EventSource dispose/finalizer (#55441)