Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composite policy doesn't allow for overflow between subpolicies if a policy sets Sampled but has gone over its max rate #36959

Open
cancub opened this issue Dec 24, 2024 · 2 comments
Labels
bug Something isn't working needs triage New item requiring triage processor/tailsampling Tail sampling processor

Comments

@cancub
Copy link

cancub commented Dec 24, 2024

Component(s)

processor/tailsampling

What happened?

Description

Within the algorithm for processing of subpolicies within the composite policy, the following may happen:

  1. the current subpolicy being processed has set the current trace as Sampled
  2. we check if allowing this sampling would put this subpolicy over its specific spans/sec limit or the global limit
  3. if either check evaluates to true, we return NotSampled and exit the function, thereby never checking any of the other subpolicies.

My personal opinion is that this is a bug, but I can see how this might be construed as a feature.

Steps to Reproduce

  1. start up an otelcol with the provided config
  2. send a bunch of traces in (for example, by using xk6-client-tracing)

Expected Result

The always sample subpolicy (ASSP) receives the traces first, and any unsampled traces then move on to the probabilistic subpolicy (PSP).

If a ramping increase of spans/sec is used, we will at first see a 100% sampling rate for this Composite policy followed by an inflection point where the ASSP saturates at 100 spans/sec and 80 % of spans above this point are not sampled. Similarly, when we look at the exporter's rate/sec rate, we can see that there is an inflection at around this time, where the slope of the graph after the inflection is 20% of the slop before.

Actual Result

ASSP always receives all traces and PSP never receives any traces. The spans per second flatlines at the maximum number of spans per second to be allocated to ASSP

Collector version

v0.116.0

Environment information

Environment

OS: Ubuntu 22.04
Compiler(if manually compiled): [email protected] go1.23.4

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:14317"

processors:
  tail_sampling:
    policies:
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 10_000
          policy_order:
            - always-sample-policy
            - probabilistic-policy
          composite_sub_policy:
            - name: always-sample-policy
              type: always_sample
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 20
          rate_allocation:
            - policy: always-sample-policy
              percent: 5
            - policy: probabilistic-policy
              percent: 95

exporters:
  otlp:
    endpoint: "localhost:4317"
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  telemetry:
    logs:
      level: DEBUG
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]

Log output

2024-12-24T17:29:51.319-0500	debug	[email protected]/processor.go:271	Sampling Policy Evaluation ticked	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}


### Additional context

We need two changes

## Return On Global Max Right Away

We should move the check for

spansInSecondIfSampled <= c.maxTotalSPS

Out of the loop. There's no need to bother with processing any of the subpolicies if the trace will put the global system over global limit regardless of which subpolicy samples it.

I actually think this inequality might be a mistake.
`spansInSecondIfSampled` [only takes into account the number of sampled spans *for this subpolicy* and the number of spans in the current trace](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L106). It doesn't make sense to compare this to the `maxTotalSPS` of the full `Composite` policy.

## Modify Logic for `Sampled`/`NotSampled` for Subpolicies

[This logic](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L116-L120) is counter-intuitive and -- I would argue -- restricts the utility of the `Composite` policy. It's saying that if any one of the subpolicies is over its limit then we should give up on this trace.

We should get rid of this `return NotSampled, nil` and allow the remaining subpolicies to process the trace. We still have a `return NotSampled, nil` at the end of the function in case no policy wants to sample the trace or no policy is able to sample the trace based on its SPS limits.
@cancub cancub added bug Something isn't working needs triage New item requiring triage labels Dec 24, 2024
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Dec 24, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@cancub
Copy link
Author

cancub commented Dec 24, 2024

The graphs showing the flatline at the ASSP maximum rate.
image

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage New item requiring triage processor/tailsampling Tail sampling processor
Projects
None yet
Development

No branches or pull requests

1 participant