Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Diagnostics.DiagnosticSource.HttpHandlerDiagnosticListener bug fixes & edge cases #33732

Conversation

CodeBlanch
Copy link
Contributor

@CodeBlanch CodeBlanch commented Mar 18, 2020

Fixes:

  • HttpHandlerDiagnosticListener did not send DiagnosticSource Stop event on netfx in W3C mode corefx#40777 missed a couple spots where RequestId was being checked for telemetry but not TraceParent. This could lead to duplicate request events or no response events being fired when W3C mode is turned on.
  • If an exception is thrown creating the connection (DNS issue, TLS issue, etc.) the listener will send Start event but nothing else. There is now an Exception event so tracers can stop/cleanup their telemetry.
  • Activity is stopped at the end of the request event so Activity.Current is unavailable in response & exception events.
  • Fixed a couple race conditions:
    • If request is canceled/aborted during the Start event processing, we will end up leaking a connection because m_WriteList.Add fires after the abort logic looks to remove/clear the request.
    • Start event fires before headers are added so if the cancel/abort logic is triggered a Stop or Exception event can fire on another thread which will be looking for headers the start logic might not have added yet.

…n during connection creation. Fixed a couple spots where RequestIdHeaderName was checked but not TraceParentHeaderName.
@noahfalk noahfalk requested a review from davidsh March 19, 2020 05:28
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CodeBlanch! This looked mostly fine to me with a few questions inline and some requests to beef up the testing.
Historically I think @lmolkova and/or @davidsh would have looked over this code and have more experience spotting issues. @lmolkova is out on parental leave but I am hoping @davidsh is available to take a look. In a pinch I can be the approver though.

@@ -583,7 +610,7 @@ private HttpHandlerDiagnosticListener() : base(DiagnosticListenerName)

private void RaiseRequestEvent(HttpWebRequest request)
{
if (request.Headers.Get(RequestIdHeaderName) != null)
if (IsRequestInstrumented(request))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anything bad happen if we don't change this line?

If there was an instrumenter that used traceparent previously they could rely on this code (maybe even unintentionally) to set tracestate, correlationcontext, and send a started event. Now we have surpressed all that behavior it may be breaking their scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noahfalk You are right, it could be breaking. BUT, check out dotnet/corefx#40777. That just went out with .NET 5 Preview 1 (pretty sure?). That PR fixes/breaks 1 of the 2 events. This PR adds 1 new event and then makes it consistent for all 3. IMO better to have the consistency than 1 working differently? LMK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is consistency just for consistency's sake I'd prioritize the back-compat risk higher. If the lack of consistency results in a specific harm to some customer or tool I'd want to give that greater consideration.

check out dotnet/corefx#40777

I am making an educated guess that adding new events where they were previously missing/absent has a much lower risk of breaking compat than removing functionality does. As far as I could tell this was the only spot where we potentially stop doing work that happened before whereas all the other edits in this PR and in 40777 are adding things. Also I expect it is relatively easy to workaround getting too many events/activities because tools can ignore them whereas working around a missing event is very hard.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say taking dependency on that behavior was wrong... Let's make it consistent and predictable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say taking dependency on that behavior was wrong

Valuing back-compat means I won't second-guess whether dependencies on our current behavior are wise, reasonable, or intentional. I only care how likely it is that the dependencies exist and weigh the harm it would cause customers to have their working app break vs. the value the change brings.

Let's make it consistent and predictable

These aren't the attributes I value much when modifying existing shipping code. If there are specific scenarios that customers are likely to hit and making this change will benefit them then that could be compelling. Otherwise I am quite content sticking with the devil I know : )

@@ -706,6 +740,8 @@ private static void PrepareReflectionObjects()
s_connectionListField = s_connectionGroupType?.GetField("m_ConnectionList", BindingFlags.Instance | BindingFlags.NonPublic);
s_connectionType = systemNetHttpAssembly?.GetType("System.Net.Connection");
s_writeListField = s_connectionType?.GetField("m_WriteList", BindingFlags.Instance | BindingFlags.NonPublic);
s_errorField = s_connectionType?.GetField("m_Error", BindingFlags.Instance | BindingFlags.NonPublic);
s_innerExceptionField = s_connectionType?.GetField("m_InnerException", BindingFlags.Instance | BindingFlags.NonPublic);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have confidence these fields exist in every past version of HttpClient? (I don't have any specific reason to believe there is an issue, solely making sure we've done due diligence)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noahfalk Good call. I checked the .NET 4.5.1 reference source, they are present (ndp\fx\src\Net\System\Net_Connection.cs). Happy to check others, but not sure where to find them. Here's where I got 4.5.1: https://referencesource.microsoft.com/download.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed they exist in 4.5 and the library doesn't target anything earlier than 4.5. For releases after 4.5.1 there are more buttons on the reference source page you linked.

}

public override int Add(object value)
{
// Add before firing events so if some user code cancels/aborts the request it will be found in the outstanding list.
int index = base.Add(value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice any test that would fail if this code got reverted, I would add one.

Copy link
Contributor Author

@CodeBlanch CodeBlanch Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If reverted TestCanceledRequest will fail because no Exception event is fired. To cancel the request it does this: new EventObserverAndRecorder(_ => { cts.Cancel(); })

Before:

  1. m_WriteList.Add is called
  2. Start event fired
  3. User code called (cts.Cancel)
  4. Request is aborted
  5. Connection looks for an active request in m_WriteList, finds nothing (https://referencesource.microsoft.com/#system/net/System/Net/_Connection.cs,2816). If it finds something a lot more code happens including a potential call to m_WriteList.Clear. No Exception event is fired
  6. Request is added to the ArrayList. It might be leaked at this point, no cleanup will happen because abort/cleanup has already completed?

After:

  1. m_WriteList.Add is called
  2. Request is added to the ArrayList
  3. Start event fired
  4. User code called (cts.Cancel)
  5. Request is aborted
  6. Connection looks for a request to remove in m_WriteList, successfully finds the active one. Is able to do its cleanup which causes the Exception event to fire correctly

@davidsh davidsh requested review from stephentoub and a team March 19, 2020 15:03
@CodeBlanch
Copy link
Contributor Author

@noahfalk I've been testing this a bunch, there are cases where an exception is thrown during connection where it will automatically retry. If the retry succeeds, then you will get a response event. If the retry fails, you will get another exception event. Both cases I think will lead to confusion. There is also an issue with the activity being stopped at the end of Start. This kills OpenTelemetry, where I really want to utilize this listener. I think I have a solution, but still working on it. I'm going to close PR and I'll reopen when I get it working.

@CodeBlanch CodeBlanch closed this Mar 21, 2020
@noahfalk
Copy link
Member

@CodeBlanch - no problem : )

@CodeBlanch CodeBlanch reopened this Mar 25, 2020
@CodeBlanch
Copy link
Contributor Author

@noahfalk A few days of hair-pulling and I'm back!

This got a whole lot better or a whole lot worse, depending on how you look at it :) I was able to hook into the AsyncCallback used by the HttpWebRequest. That made this infinitely more reliable because it fires at the end of the request processing for both sync & async flows and gives us a nice place to do our work. Hours of trying stuff before I found that little magical beauty.

Anyway, everything is nice and consistent now and follows really how a DiagnosticListener should work IMO. It's not without thorns, though. Ex.Stop event is gone. Now we get a proper response for "no-content" so we don't need that extra event. Exception event is added. Checking IsEnabled before sending all events. That is SOP for DiagnosticListener, no? Some of these things could cause breakage but really, from my seat, it's fixing bugs and inconsistencies that probably shouldn't have ever been in there from the beginning.

LMK what you think. Going to do more testing.


private bool IsRequestInstrumented(HttpWebRequest request)
{
return request.Headers.Get(TraceParentHeaderName) != null || request.Headers.Get(RequestIdHeaderName) != null;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check what mode we are operating in? It may allow less look ups in the collection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Adding the check opened up a new avenue where you can have RequestId on there already but we'll still add TraceParent if the W3C mode is enabled so I added a test case for that.

@noahfalk
Copy link
Member

noahfalk commented Apr 1, 2020

@CodeBlanch @SergeyKanzhelev - Is the goal to consume this code from OpenTelemetry? I'm open to discussing alternatives but how would you guys feel about putting your own (improved) implementation of the listener into OT and not relying the one in DiagnosticSource?

This is how things look from my point of view:

  1. The runtime team only does limited bug fixing on full framework, back-compat is a high-priority whereas new funtionality has virtually zero priority.
  2. This updated round of changes looks like more than low risk bug fixes. Event names are being added/removed/changed, the integration points are changing and there are plenty of spots where code dependent on details of the old behavior could stop working. It suggests you are pursuing a path where you want new functionality and are willing to accept more risk to achieve it. If the code is in OT you have the freedom to take whatever change/risk you deem appropriate and can continue to do so in the future.
  3. I am not an expert on the internals of HttpClient. Unless I spend a bunch of time learning those internals you won't be getting much value from my code review of the instrumentation.
  4. Given that the framework is taking very little change at this point the risk of a future framework servicing update breaking you appears quite low.

@davidsh - I'd love to get your feedback on this too. Thanks!

@CodeBlanch
Copy link
Contributor Author

@noahfalk My thoughts for what little they are worth :)

I have this version of HttpHandlerDiagnosticListener mirrored in OT already so if we don't want to take it into dotnet/runtime, that's totally cool. I'll probably rename it and change the namespace there if that ends up being the case, but not a big deal.

The reason I opened this PR is I don't think the existing version really does its job well enough to be in the official lib. It has a lot of issues! Not communicating Exceptions thrown establishing the connection is huge for me as far as instrumenting my production code. I want to know about those issues ASAP. Other than that, for HttpClient, it's going to work reasonably well. If you try to use it for the underlying HttpWebRequest though, forget it! HttpWebRequest throws exceptions for Http StatusCodes other than success. Since there's not an Exception event, you won't get any of that data in your telemetry. What I needed was the ability to add OT to a legacy website and automatically catch all its outgoing HttpWebRequests.

I don't know of anything in here that would break anyone, per se. They should continue to get Start/Stop events as they did before (with the same data) and they can just ignore the Exception event. But I totally understand the risk aversion to breakage!

@noahfalk
Copy link
Member

noahfalk commented Apr 2, 2020

I have this version of HttpHandlerDiagnosticListener mirrored in OT already so if we don't want to take it into dotnet/runtime, that's totally cool

That was easier than I expected, but glad to hear it : )

I don't think the existing version really does its job well enough to be in the official lib. It has a lot of issues!

Yeah I can believe it : / So far nobody has raised an issue (aside from you guys) so I have to assume either they aren't using it, aren't affected by it, or they worked around it already and might have brittle dependencies on the existing behavior. If we did see broader feedback that other customers would also benefit from having an updated version of this instrumentation in a centralized lib I am not closing the door on that option, but I'd prefer to wait-and-see rather than doing it eagerly since it isn't blocking you guys.

I do appreciate the time you spent working on the PR and I hope in hindsight the majority of it was still time well spent to get a good implementation for OpenTelemetry. I'm going to leave this open a little longer in hopes we hear something from @davidsh but I am anticipating we'll close the issue without taking the PR at this point. Thanks @CodeBlanch!

@davidsh
Copy link
Contributor

davidsh commented Apr 2, 2020

So far nobody has raised an issue (aside from you guys) so I have to assume either they aren't using it, aren't affected by it, or they worked around it already and might have brittle dependencies on the existing behavior.

I have always been concerned about this diagnostic listener using reflection against .NET Framework internals in order to provide telemetry. But it is something I agreed to a few years ago based on the need for it at that time.

We do not anticipate making any changes to .NET Framework that might upset the use of reflection that this listener uses. We are not making any changes to .NET Framework except security fixes and those fixes that are directly driven by customer feedback where the fix is needed due to broad impact to a service or business.

That said I am worried that this change might break someone else that is dependent on the current behavior of this listener. Since OT has already created its own copy of the listener, they are in the best position to modify it and fully test it. So, I would not support modifying this in .NET Core unless we see other customer feedback.

@noahfalk
Copy link
Member

noahfalk commented Apr 2, 2020

Thanks @davidsh! It sounds like we are all agreed that we will leave DiagnosticSource as-is (bugs and all) and continue to monitor customer feedback in this area. In the meantime OpenTelemetry will apply the improvements here to their own private copy of the listener which satisfies the telemetry needs there.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants