Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XRay Sampler] - XRay Sampler Statistics reports much higher RequestCount/SampleCount than expected for Sampling Rules #79

Closed
jj22ee opened this issue Oct 1, 2024 · 1 comment · Fixed by #80 or aws-observability/aws-otel-python-instrumentation#269
Assignees
Labels
bug Something isn't working

Comments

@jj22ee
Copy link
Contributor

jj22ee commented Oct 1, 2024

There is a bug of XRay Sampler Implementation in all OTel SDK Languages. This impacts Sampling Statistics reported by XRay sampler, particularly the SamplingStatisticsDocuments’s metrics of [BorrowCount, RequestCount, SampledCount], which is sent to XRay Service via the /SamplingTargets endpoint.

Example: If a single request to an Application using XRay Sampler creates 1 Server Span (makes decision Sampled=true) with 8 child Internal Spans (copies parent's Sampled value) :

  • Actual: XRay Sampler will produce Sampler Metrics of RequestCount=9, SampledCount=9
    • XRay Sampler produces Sampler metrics based on number of Spans of any kind
  • Expected: XRay Sampler should produce Sampler Metrics of RequestCount=1, SampledCount=1
    • XRay Sampler should produce Sampler metrics based on number of Traces (aka the root Span that makes the Sampling Decision)

In summary, the bug is that while Spans with a parent Span will copy their parent’s XRay Sampling Decision, they will still make additions to the XRay Sampling metric (they shouldn’t because they do not make the decision, only copy it)

How does this negatively affect OTel users of XRay Sampler?

  • If user observes Sampling Rules Console, a Sampling Rule’s Trend Graph will report metric numbers that are much greater than the actual number of Traces sampled.

Given the above example, the graph will incorrectly show 9 req matching the rule, with 9 sampled instead of the actual 1 req matching the rule, with 1 sampled.

Root cause should be because only subcomponents of XRay Sampler uses ParentBased logic to short-circuit the Parent's Sampling Decision, while the XRay Sampling Statistics recording logic is not skipped when a Parent's Sampling Decision is found.

  • Thus the fix is to wrap the entire XRay Sampler Internal Logic under a single ParentBased Sampler, which will skip recording the "extra" sampling statistics when a Parent's Sampling Decision is available.
@jj22ee
Copy link
Contributor Author

jj22ee commented Oct 2, 2024

In Python, this should also significantly reduce statistics lock contention in the XRay Sampler when there are a lot of child spans.

jj22ee added a commit to aws-observability/aws-otel-python-instrumentation that referenced this issue Oct 7, 2024
…arentBased logic (#269)

### Issue #, if available:

This PR fixes
aws-observability/aws-otel-js-instrumentation#79,
but in Python
This fixes the bug where:
- While the subcomponents of XRay Sampler uses ParentBased logic to
short-circuit the Parent's Sampling Decision, the XRay Sampling
Statistics recording logic is not skipped when a Parent's Sampling
Decision is found. This causes XRay Sampler to produce Sampling
Statistics based on number of Spans (regardless of Parent's Sampling
Decision), while XRay Sampler should produce Sampling Statistics based
on number of Traces (aka the number of root Spans that makes the
Sampling Decision)

###  Description of changes:
1. Wrap entire XRay Sampler Internal Logic under a single ParentBased
Sampler
a. This will reduce lock contention and not over-count sampling
statistics.
3. Remove use of redundant ParentBased Sampler logic for internal
subcomponents for XRay Sampler.

### Testing:
1. Added unit tests to verify ParentBased sampling decisions
2. Original Sampler functionality tested locally using [Sampler Test
Bed](https://github.com/aws-observability/aws-otel-community/tree/master/centralized-sampling-tests)

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

Co-authored-by: Ping Xiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment