-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System.Diagnostics.DiagnosticSource.HttpHandlerDiagnosticListener bug fixes & edge cases #33732
System.Diagnostics.DiagnosticSource.HttpHandlerDiagnosticListener bug fixes & edge cases #33732
Conversation
…n during connection creation. Fixed a couple spots where RequestIdHeaderName was checked but not TraceParentHeaderName.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CodeBlanch! This looked mostly fine to me with a few questions inline and some requests to beef up the testing.
Historically I think @lmolkova and/or @davidsh would have looked over this code and have more experience spotting issues. @lmolkova is out on parental leave but I am hoping @davidsh is available to take a look. In a pinch I can be the approver though.
.../System.Diagnostics.DiagnosticSource/src/System/Diagnostics/HttpHandlerDiagnosticListener.cs
Outdated
Show resolved
Hide resolved
.../System.Diagnostics.DiagnosticSource/src/System/Diagnostics/HttpHandlerDiagnosticListener.cs
Show resolved
Hide resolved
@@ -583,7 +610,7 @@ private HttpHandlerDiagnosticListener() : base(DiagnosticListenerName) | |||
|
|||
private void RaiseRequestEvent(HttpWebRequest request) | |||
{ | |||
if (request.Headers.Get(RequestIdHeaderName) != null) | |||
if (IsRequestInstrumented(request)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anything bad happen if we don't change this line?
If there was an instrumenter that used traceparent previously they could rely on this code (maybe even unintentionally) to set tracestate, correlationcontext, and send a started event. Now we have surpressed all that behavior it may be breaking their scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noahfalk You are right, it could be breaking. BUT, check out dotnet/corefx#40777. That just went out with .NET 5 Preview 1 (pretty sure?). That PR fixes/breaks 1 of the 2 events. This PR adds 1 new event and then makes it consistent for all 3. IMO better to have the consistency than 1 working differently? LMK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is consistency just for consistency's sake I'd prioritize the back-compat risk higher. If the lack of consistency results in a specific harm to some customer or tool I'd want to give that greater consideration.
check out dotnet/corefx#40777
I am making an educated guess that adding new events where they were previously missing/absent has a much lower risk of breaking compat than removing functionality does. As far as I could tell this was the only spot where we potentially stop doing work that happened before whereas all the other edits in this PR and in 40777 are adding things. Also I expect it is relatively easy to workaround getting too many events/activities because tools can ignore them whereas working around a missing event is very hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say taking dependency on that behavior was wrong... Let's make it consistent and predictable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say taking dependency on that behavior was wrong
Valuing back-compat means I won't second-guess whether dependencies on our current behavior are wise, reasonable, or intentional. I only care how likely it is that the dependencies exist and weigh the harm it would cause customers to have their working app break vs. the value the change brings.
Let's make it consistent and predictable
These aren't the attributes I value much when modifying existing shipping code. If there are specific scenarios that customers are likely to hit and making this change will benefit them then that could be compelling. Otherwise I am quite content sticking with the devil I know : )
@@ -706,6 +740,8 @@ private static void PrepareReflectionObjects() | |||
s_connectionListField = s_connectionGroupType?.GetField("m_ConnectionList", BindingFlags.Instance | BindingFlags.NonPublic); | |||
s_connectionType = systemNetHttpAssembly?.GetType("System.Net.Connection"); | |||
s_writeListField = s_connectionType?.GetField("m_WriteList", BindingFlags.Instance | BindingFlags.NonPublic); | |||
s_errorField = s_connectionType?.GetField("m_Error", BindingFlags.Instance | BindingFlags.NonPublic); | |||
s_innerExceptionField = s_connectionType?.GetField("m_InnerException", BindingFlags.Instance | BindingFlags.NonPublic); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have confidence these fields exist in every past version of HttpClient? (I don't have any specific reason to believe there is an issue, solely making sure we've done due diligence)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noahfalk Good call. I checked the .NET 4.5.1 reference source, they are present (ndp\fx\src\Net\System\Net_Connection.cs). Happy to check others, but not sure where to find them. Here's where I got 4.5.1: https://referencesource.microsoft.com/download.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed they exist in 4.5 and the library doesn't target anything earlier than 4.5. For releases after 4.5.1 there are more buttons on the reference source page you linked.
.../System.Diagnostics.DiagnosticSource/src/System/Diagnostics/HttpHandlerDiagnosticListener.cs
Outdated
Show resolved
Hide resolved
.../System.Diagnostics.DiagnosticSource/src/System/Diagnostics/HttpHandlerDiagnosticListener.cs
Outdated
Show resolved
Hide resolved
} | ||
|
||
public override int Add(object value) | ||
{ | ||
// Add before firing events so if some user code cancels/aborts the request it will be found in the outstanding list. | ||
int index = base.Add(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice any test that would fail if this code got reverted, I would add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If reverted TestCanceledRequest
will fail because no Exception event is fired. To cancel the request it does this: new EventObserverAndRecorder(_ => { cts.Cancel(); })
Before:
- m_WriteList.Add is called
- Start event fired
- User code called (cts.Cancel)
- Request is aborted
- Connection looks for an active request in m_WriteList, finds nothing (https://referencesource.microsoft.com/#system/net/System/Net/_Connection.cs,2816). If it finds something a lot more code happens including a potential call to m_WriteList.Clear. No Exception event is fired
- Request is added to the ArrayList. It might be leaked at this point, no cleanup will happen because abort/cleanup has already completed?
After:
- m_WriteList.Add is called
- Request is added to the ArrayList
- Start event fired
- User code called (cts.Cancel)
- Request is aborted
- Connection looks for a request to remove in m_WriteList, successfully finds the active one. Is able to do its cleanup which causes the Exception event to fire correctly
@noahfalk I've been testing this a bunch, there are cases where an exception is thrown during connection where it will automatically retry. If the retry succeeds, then you will get a response event. If the retry fails, you will get another exception event. Both cases I think will lead to confusion. There is also an issue with the activity being stopped at the end of Start. This kills OpenTelemetry, where I really want to utilize this listener. I think I have a solution, but still working on it. I'm going to close PR and I'll reopen when I get it working. |
@CodeBlanch - no problem : ) |
…tResponse blowing up with modified object.
@noahfalk A few days of hair-pulling and I'm back! This got a whole lot better or a whole lot worse, depending on how you look at it :) I was able to hook into the AsyncCallback used by the HttpWebRequest. That made this infinitely more reliable because it fires at the end of the request processing for both sync & async flows and gives us a nice place to do our work. Hours of trying stuff before I found that little magical beauty. Anyway, everything is nice and consistent now and follows really how a DiagnosticListener should work IMO. It's not without thorns, though. Ex.Stop event is gone. Now we get a proper response for "no-content" so we don't need that extra event. Exception event is added. Checking IsEnabled before sending all events. That is SOP for DiagnosticListener, no? Some of these things could cause breakage but really, from my seat, it's fixing bugs and inconsistencies that probably shouldn't have ever been in there from the beginning. LMK what you think. Going to do more testing. |
…on another thread.
|
||
private bool IsRequestInstrumented(HttpWebRequest request) | ||
{ | ||
return request.Headers.Get(TraceParentHeaderName) != null || request.Headers.Get(RequestIdHeaderName) != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check what mode we are operating in? It may allow less look ups in the collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Adding the check opened up a new avenue where you can have RequestId on there already but we'll still add TraceParent if the W3C mode is enabled so I added a test case for that.
@CodeBlanch @SergeyKanzhelev - Is the goal to consume this code from OpenTelemetry? I'm open to discussing alternatives but how would you guys feel about putting your own (improved) implementation of the listener into OT and not relying the one in DiagnosticSource? This is how things look from my point of view:
@davidsh - I'd love to get your feedback on this too. Thanks! |
@noahfalk My thoughts for what little they are worth :) I have this version of HttpHandlerDiagnosticListener mirrored in OT already so if we don't want to take it into dotnet/runtime, that's totally cool. I'll probably rename it and change the namespace there if that ends up being the case, but not a big deal. The reason I opened this PR is I don't think the existing version really does its job well enough to be in the official lib. It has a lot of issues! Not communicating Exceptions thrown establishing the connection is huge for me as far as instrumenting my production code. I want to know about those issues ASAP. Other than that, for HttpClient, it's going to work reasonably well. If you try to use it for the underlying HttpWebRequest though, forget it! HttpWebRequest throws exceptions for Http StatusCodes other than success. Since there's not an Exception event, you won't get any of that data in your telemetry. What I needed was the ability to add OT to a legacy website and automatically catch all its outgoing HttpWebRequests. I don't know of anything in here that would break anyone, per se. They should continue to get Start/Stop events as they did before (with the same data) and they can just ignore the Exception event. But I totally understand the risk aversion to breakage! |
That was easier than I expected, but glad to hear it : )
Yeah I can believe it : / So far nobody has raised an issue (aside from you guys) so I have to assume either they aren't using it, aren't affected by it, or they worked around it already and might have brittle dependencies on the existing behavior. If we did see broader feedback that other customers would also benefit from having an updated version of this instrumentation in a centralized lib I am not closing the door on that option, but I'd prefer to wait-and-see rather than doing it eagerly since it isn't blocking you guys. I do appreciate the time you spent working on the PR and I hope in hindsight the majority of it was still time well spent to get a good implementation for OpenTelemetry. I'm going to leave this open a little longer in hopes we hear something from @davidsh but I am anticipating we'll close the issue without taking the PR at this point. Thanks @CodeBlanch! |
I have always been concerned about this diagnostic listener using reflection against .NET Framework internals in order to provide telemetry. But it is something I agreed to a few years ago based on the need for it at that time. We do not anticipate making any changes to .NET Framework that might upset the use of reflection that this listener uses. We are not making any changes to .NET Framework except security fixes and those fixes that are directly driven by customer feedback where the fix is needed due to broad impact to a service or business. That said I am worried that this change might break someone else that is dependent on the current behavior of this listener. Since OT has already created its own copy of the listener, they are in the best position to modify it and fully test it. So, I would not support modifying this in .NET Core unless we see other customer feedback. |
Thanks @davidsh! It sounds like we are all agreed that we will leave DiagnosticSource as-is (bugs and all) and continue to monitor customer feedback in this area. In the meantime OpenTelemetry will apply the improvements here to their own private copy of the listener which satisfies the telemetry needs there. |
Fixes: