Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry TraceIdRatioBased sampler requirements following OTEP 235 #4166

Closed
wants to merge 36 commits into from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 29, 2024

Fixes #1413.

Changes

Updates Trace SDK and TraceState handling specifications with OTEP 235 sampling thresholds. This PR depends on #4162 to introduce the concept of Trace Randomness. This PR is the second part of two, it focuses on thresholds.

  • Revise TraceIdRatioBased algorithm section. The existing TODO implies this is not a breaking change.
  • Change text about TraceIdRatioBased construction
  • Move text about TraceIdRatioBased description (leave unmodified).

The content of OTEP 235 was revised for clarity by @kalyanaj in open-telemetry/oteps#261. I've heavily copied from the final text in that still-unmerged OTEP. I introduced new content explaining how to compute thresholds from probabilities with use of variable precision, referring to the OTel Collector-Contrib pkg/sampling reference implementation. The new (Golang) demonstration code is validated here, https://go.dev/play/p/7eLM6FkuoA5.

A proof of concept for this specification along with #4162 can be found in open-telemetry/opentelemetry-go#5645.

Part of #3602.

Product of the Sampling SIG members @kentquirk @kalyanaj @oertl @PeterF778 and myself.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 30, 2024

Feedback from the OTel Spec SIG meeting discussion cc/ @jsuereth:

  • Please add a migration guide to explain how transitioning samplers will work; in particular, it's not safe to begin using non-root independent sampling until TraceIdRatioBased samplers are replaced everywhere in a trace. Until then, only safe to continue using ParentBased sampling w/ root TraceIdRatioBased decision.

Update: 68fa270

Copy link

github-actions bot commented Aug 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot removed the Stale label Aug 8, 2024
jmacd added a commit that referenced this pull request Aug 15, 2024
This reduces the number of lines of diff in PR 4166, which replaces the
entire `tracestate-probability-sampling.md` file with new contents.

Part of #4166.

## Changes

Move a file, place a link to it and explain that a change is in
progress.
@jmacd
Copy link
Contributor Author

jmacd commented Aug 15, 2024

@kalyanaj @PeterF778 @oertl @kentquirk Please take another look at this PR, especially the file tracestate-probability-sampling.md which now reads as a new file, not as a major rewrite. The contents are derived from open-telemetry/oteps#261.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 15, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor Author

jmacd commented Oct 21, 2024

@PeterF778 @oertl @kentquirk @kalyanaj @yuanyuanzhao3, please take another look.

92876f9
1855839
44c8190

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 29, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Oct 30, 2024

I've made a small but substantial change based on feedback from @yuanyuanzhao3, see e6dc409.

I will place this PR in draft while the Sampling SIG discusses at least one more detail raised by @yuanyuanzhao3, which is (in my words) a concern about the ambiguity between th:0 and unset th. In earlier debates and discussions over this feature, we have discussed a "zero adjusted count" value for the threshold, but we removed this feature when we transitioned from "acceptance threshold" to "rejection threshold". In the rejection threshold formulation, we can't represent zero adjusted count, this will be discussed in tomorrow's Sampling SIG.

@github-actions github-actions bot removed the Stale label Oct 31, 2024

#### AlwaysOn

* Returns `RECORD_AND_SAMPLE` always.
* Description MUST be `AlwaysOnSampler`.
* If the incoming TraceState has a valid OpenTelemetry TraceState `th` sub-key, the the returned TraceState is unmodified.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part would be correct if the AlwaysOnSampler decided to sample by looking at the parent sampled flag. But it is not supposed to. It is expected to always sample, also when the parent was not sampled. Thus th:0 is always correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed--
As discussed in the SIG today, we are missing a specification for a ConsistentParentBased sampler. We should probably not use the AlwaysOn sampler in this case.

Assertion: Consistent probability samplers should not inspect the sampled flag.

ConsistentParentBased: when it is invoked in child context, it simply copies the Th value and returns the sampled flag as its decision. If there is a sampled flag and no th value: leave th unset and respect the sampled flag. This is an error case.

I will revert commit e6dc409

Then, I will make a new PR to specify the ConsistentParentBased sampler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that span metrics in this service would be skewed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if service B uses AlwaysOnSampler, all of its spans will get sampled (regardless of the sampling decisions for service A upstream). This might mean incomplete traces, but the metrics for service B will be correct with th:0.

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024
…metry#4168)

This reduces the number of lines of diff in PR 4166, which replaces the
entire `tracestate-probability-sampling.md` file with new contents.

Part of open-telemetry#4166.

## Changes

Move a file, place a link to it and explain that a change is in
progress.
@jmacd jmacd marked this pull request as draft November 1, 2024 20:52
* For root spans, always sample a new context.
* For child spans, take the decision of the parent context.

By using the ParentBased sampler by default, users can change sampling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be an "aside" or non-normative callout. I'm not sure we have precedence here, but could you move this to some different markdown structure, perhaps > quote ?


When a TraceIdRatioBased Sampler makes a decision for a non-root Span
using TraceID randomness, but the Trace random flag was not set, the
SDK SHOULD issue a one-time warning statement in its log with a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: one-time warning - You may need to include criteria of what consistutes a "time", i.e. is it once per line of code, once per span name...

I think this is a bit too vague to implement consistently.

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 13, 2024
@jmacd jmacd removed the Stale label Nov 13, 2024
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 21, 2024
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 28, 2024
@jmacd jmacd reopened this Dec 3, 2024
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 17, 2024
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review approach & specify algorithm for TraceIdRatioBasedSampler (ProbabilitySampler)