datadog: honor extracted sampling decisions #30577
RyanTheOptimist merged 6 commits into envoyproxy:main
- If a trace sampling decision is extracted from request headers, use it regardless of Envoy's sampling configuration.
- Refactor Tracer::startSpan to avoid the move assignment operator of datadog::tracing::Span, which is deleted in newer versions of dd-trace-cpp.
- Alter the comments above Tracer::startSpan to more accurately reflect changes made previously in envoyproxy#29932.
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
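To illustrate the second bullet, here is a minimal sketch of one way to structure a function so that it never needs move assignment. The `Span` class and the `createChildSpan` / `createRootSpan` helpers are hypothetical stand-ins, not the real `datadog::tracing::Span` or dd-trace-cpp API, and the actual refactor in this PR may differ.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a span type that can be move-constructed
// but whose move assignment operator is deleted.
class Span {
public:
  explicit Span(std::string name) : name_(std::move(name)) {}
  Span(Span&&) = default;
  Span& operator=(Span&&) = delete;
  const std::string& name() const { return name_; }

private:
  std::string name_;
};

// Hypothetical helpers standing in for "continue an extracted trace"
// and "start a new trace".
Span createChildSpan(const std::string& operation) { return Span("child:" + operation); }
Span createRootSpan(const std::string& operation) { return Span("root:" + operation); }

Span startSpan(const std::string& operation, bool has_parent_context) {
  // Pattern that no longer compiles once operator=(Span&&) is deleted:
  //   Span span = createRootSpan(operation);
  //   if (has_parent_context) span = createChildSpan(operation);
  // Instead, each branch constructs and returns its own span.
  if (has_parent_context) {
    return createChildSpan(operation);
  }
  return createRootSpan(operation);
}
```

Because every branch returns a freshly constructed object, only construction is required (with guaranteed copy elision in C++17), never assignment.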
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@cgilmour, my teammate who figured out what the problem was.
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
/assign-from @envoyproxy/envoy-maintainers
@envoyproxy/envoy-maintainers assignee is @snowp
Hi @snowp, it'd be appreciated to have some movement on this one. Also I think assigning Matt Klein for review wasn't intended exclusively and another maintainer could take that on. Thanks!
/assign-from @envoyproxy/envoy-maintainers
@envoyproxy/envoy-maintainers assignee is @RyanTheOptimist
RyanTheOptimist left a comment
Wow, thanks for the investigation and bug fix! The PR description is excellent and really made it easy to understand. LGTM!
Please also update the changelog to reflect this fix.
Thanks for the kind words, @RyanTheOptimist. I'll update the changelog. Also, what is the process for having this PR considered for backporting onto v1.27 and v1.28 as a bug fix?
Good question. @phlax, how do we usually handle bugfix backports?
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Force-pushed: 3755761 to d4c4468
/backport
We mark it for backport, and then if a PR hasn't been raised already, I look at it when raising backports. @dgoffredo, if you are up for doing the backports, the target branches are:
Will do.
I'm working on the backport PRs for Should we first merge these changes into
…onto v1.27) Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
…onto v1.28) Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
I've created two additional pull requests, one for backporting these changes to the v1.27 release branch, and another for the v1.28 release branch:
@phlax are we OK to merge this PR or is there any additional process we need to follow?
…27) (#30750) Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
…28) (#30751) Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Matt merged the backports onto @phlax Are we good to merge this into
…onto v1.28) (envoyproxy#30751) Signed-off-by: David Goffredo <david.goffredo@datadoghq.com> Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
Commit Message: honor extracted sampling decisions in the datadog tracer
Additional Description: See below.
Risk Level: low
Testing: See below.
Docs Changes: N/A
Release Notes (under bug fixes): honor extracted sampling decisions in the datadog tracer
Platform Specific Features: N/A
The Issue and Its Solution
Envoy users within Datadog noticed another bug caused by the migration of the Datadog tracer from OpenTracing to dd-trace-cpp that was released in Envoy 1.27.
Sometimes the Envoy spans are missing from the flame graphs of traces, and sometimes the entire subtree starting at the Envoy span is missing from the flame graphs of traces.
Datadog tracing in Envoy used to use the OpenTracing driver, but it has not since Envoy v1.27.
The OpenTracing driver sets the "sampling priority" to zero ("drop the trace") whenever bool Envoy::Tracing::Decision::traced is false. It does this regardless of whether a sampling decision was extracted from incoming request headers (e.g. x-b3-sampled or x-datadog-sampling-priority).
I followed suit in the non-OpenTracing Datadog tracer, but I missed something. The OpenTracing-based Datadog tracer, dd-opentracing-cpp, ignores sampling changes whenever a sampling decision has already been extracted from an incoming request.
So, in the OpenTracing-based code, when a sampling decision is extracted from an incoming request, here's what happened:
The result is that extracted sampling decisions are always honored.
With the newer dd-trace-cpp based code, when a sampling decision is extracted from an incoming request, here's what happens:
So, if Envoy's configured sampling rate is less than 100%, users will see incomplete traces whenever Envoy is not the "root service." In some such traces, Envoy spans and possibly its descendants are missing.
This is a bug.
This revision restores the logic from the OpenTracing-based tracer: An Envoy decision of "drop" forces a dropped trace only if we haven't extracted a decision from an incoming request.
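The restored rule can be sketched as a small decision function. This is an illustration only: `SamplingOverride` and `chooseOverride` are hypothetical names, not Envoy's or dd-trace-cpp's API, and the sketch assumes that "no override" means the tracer keeps either the extracted decision or its own sampler's decision.

```cpp
#include <optional>

// Hypothetical result type: either leave the tracer's decision alone,
// or force the trace to be dropped.
enum class SamplingOverride { None, ForceDrop };

// extracted_priority: sampling priority parsed from incoming request
//   headers, or std::nullopt if no decision was extracted.
// envoy_traced: Envoy's own decision (a stand-in for
//   Envoy::Tracing::Decision::traced).
constexpr SamplingOverride chooseOverride(std::optional<int> extracted_priority,
                                          bool envoy_traced) {
  if (extracted_priority.has_value()) {
    // A decision arrived with the request: honor it, whatever Envoy says.
    return SamplingOverride::None;
  }
  // No extracted decision: Envoy's "drop" forces a dropped trace.
  return envoy_traced ? SamplingOverride::None : SamplingOverride::ForceDrop;
}

// The four combinations, checked at compile time (requires C++17).
static_assert(chooseOverride(std::nullopt, false) == SamplingOverride::ForceDrop);
static_assert(chooseOverride(std::nullopt, true) == SamplingOverride::None);
static_assert(chooseOverride(0, false) == SamplingOverride::None);
static_assert(chooseOverride(1, false) == SamplingOverride::None);
```

The only combination that forces a drop is "nothing extracted and Envoy says drop"; every other combination leaves the decision untouched.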
Testing
I added a unit test that checks a table of possibilities.
I also built Envoy and configured it to reverse proxy httpbin.org/headers with 1% sampling. Then I hit it with curl using a variety of propagation headers, such as x-datadog-sampling-priority and traceparent, and verified that the resulting sampling decision was consistent with what I sent.
When I don't send sampling information in the request headers, the resulting sampling decisions are consistent with Envoy's configured 1% rate.
Bug Fix
Please consider this PR as a bug fix so that it might be backported onto the v1.27 and v1.28 release branches.