Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/tailsampling] invert_match precedence #34085

Closed
hyang023 opened this issue Jul 16, 2024 · 9 comments · Fixed by #36673
Closed

[processor/tailsampling] invert_match precedence #34085

hyang023 opened this issue Jul 16, 2024 · 9 comments · Fixed by #36673
Assignees
Labels
bug Something isn't working processor/tailsampling Tail sampling processor

Comments

@hyang023
Copy link

Component(s)

processor/tailsampling

What happened?

Description

a recent fix went in to address #33656 and align with expected behavior documented at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
When there's an "inverted not sample" decision, the trace is not sampled

however, in the same README, the tail_sampling.policies[] example policy list includes this backwards compatibility policy

      {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      },

and using a collector with this fix means that service.name values [list, of, services, using, tail_sampling] no longer can have other sampling policies applied to them since their InvertNotSampled condition will override any applicable Sampled or InvertSampled

Is this change in behavior intended? And if so, what is the new recommended way to define a backwards compatibility policy?

Steps to Reproduce

  1. use the tail sampling processor
  2. define a set of tail sampling policies including a backwards compatibility policy

Expected Result

Expect to be able to define sampling policies for service.name values [list, of, services, using, tail_sampling]
and always_sample for any other service.name not in this list (or as specified in other policies)

Actual Result

no traces for service.name values [list, of, services, using, tail_sampling] can be exported since all are excluded as a result of InvertNotSampled

Collector version

0.104.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@hyang023 hyang023 added bug Something isn't working needs triage New item requiring triage labels Jul 16, 2024
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jul 16, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@savar
Copy link

savar commented Jul 30, 2024

Another (possible) issue with invert_match precedence is (or I am just not able to do that), to do something like:

  1. probabilistic-ally catch http.route == /health for 0.1%
  2. probabilistic-ally catch http.route != /health for 20%

The 1st one works easily, but the second one would need to not apply to /health and therefore an invert_match seems to be required, which, even in an and type will create a do not sample decision which will override the 1st rule.

I think the issue is the same as the backwards compatibility example.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@avrong
Copy link

avrong commented Oct 1, 2024

The issue @savar mentioned is present for my use case.

What I want to do is a sampling of grpc requests traces that would depend on condition. It is useful when an application generates lots of traces with one of requests. In this one we're not really interested in, but it is still better to be shown in smaller scale. The problem is that large amount of traces takes more disk space and has impact on Grafana dashboard requests and web UI.

It worked like this: if rpc.method == methodA, then do probabilistic sampling. Otherwise, on all other requests, don't do anything and sample them all. It involved a hacky way of doing and policy with two branches -- one with string_attribute match in one case, and other one with inverted string_match on the same list (with anchor on the match list). With previous behavior it was possible, however after v0.103.0 I couldn't find a way to do that and did not manage to implement it with regular expressions or ottl conditions.

  tail_sampling:
    policies:
      # Applies probabilistic sampling to set of methods
      - name: probabilistic-to-listed-methods
        type: and
        and:
          and_sub_policy:
            - name: spans-with-listed-methods
              type: string_attribute
              string_attribute:
                key: rpc.method
                # List of methods to which probabilistic sampling applies
                values: &probabilistic-methods
                  - methodA
            - name: probabilistic-sampling
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
      # Bypasses other traces
      - name: bypass-others
        type: and
        and:
          and_sub_policy:
            - name: other-spans
              type: string_attribute
              string_attribute:
                key: rpc.method
                values: *probabilistic-methods
                # Accept all except ones in list
                invert_match: true
            - name: catch-all
              type: always_sample

It would be great to have ability to do similar conditions again, probably in some other form.

@karinhawk
Copy link

karinhawk commented Nov 6, 2024

Our team ran into this issue during an upgrade of the collector where our sampling stopped working because of this. The only way so far we've managed to find a workaround for the backwards compatibility policy is a bit horrible but if anyone is in a pinch it works!

Because "inverted not sample" policies take precedence above everything else, don't use it for the backwards compatible policy, instead use a "not sample" decision where you enable regex on an attribute to enable sampling e.g. sampling:true with true being a negative match.

The processor uses grep/re2 syntax, meaning it doesn't support the usual way of negative matching - SO post
However, the correct regex can be generated using this website made by the commenter in that post.

For negative matching on the attribute so services with that attribute in their resource attributes can then be sampled later while services with that value as false are sampled 100%, it looks a little like this:

     [ 
      {
        name: 'backwards-compatibility-policy',
        type: 'and',
        and: {
          and_sub_policy: [
            {
              name: 'services-using-tail_sampling-policy',
              type: 'string_attribute',
              string_attribute: {
                key: 'sampling',
                values: [
                 '^([^t]|t(t|r(t|ut))*([^rt]|r([^tu]|u[^et])))*(t(t|r(t|ut))*(ru?)?)?$',
                ],
                enabled_regex_matching: true
              },
            },
          ],
        },
      },
      ...sampling policies
    ]

It allows you to replicate the behaviour from the backwards compatibility policy in the documentation examples, which is great, even though it's a little ugly.

EDIT: Another way to tackle this problem without being so hacky, is to include the above sampling attribute in instrumentation in services you wish to sample. Then (if you are using the loadbalancing exporter collector -> sampler collector -> exporter collector/endpoints pattern) filter on this attribute in the sampler collector, and have only those traces enter the processor, and send the others straight to a backend of choice in a separate pipeline. This means you only need to define policies in the sampler which keep specified traces, making the need for a backwards compatibility policy redundant.

@hyang023
Copy link
Author

hyang023 commented Nov 6, 2024

it could be helpful to hear from a code owner (@jpkrohling) what the intended/expected behavior is for invert sampling inside of a composite policy (e.g. should the invert-not-sample from a subpolicy apply to the scope of only that composite policy as a whole, as it did before, or should its scope extend beyond that, as it does now?)
and if so, what would be the new recommendation for implementing "backwards compatibility" policies given this breaking change?

@jpkrohling
Copy link
Member

Our team ran into this issue during an upgrade of the collector where our sampling stopped working because of this

I'm sorry about that, this shouldn't have happened. In fact, I'm open to reverting that PR and go back to the drawing board for the original issue it intended to solve.

what the intended/expected behavior is for invert sampling inside of a composite policy

I believe the result of the combination of AND policies should be either SAMPLE or NOT SAMPLE. If the AND policy had an invert attribute, then it would be OK to return an invert sample/not sample result.

@hyang023
Copy link
Author

I'm open to reverting that PR and go back to the drawing board for the original issue it intended to solve.

I think reverting the PR first, and then having any additional discussions after, sounds like a good idea

the combination of AND policies should be either SAMPLE or NOT SAMPLE. If the AND policy had an invert attribute, then it would be OK to return an invert sample/not sample result.

This makes sense! Thanks for clarifying

@mrsimo
Copy link

mrsimo commented Nov 20, 2024

Not to pile in, but we discovered we've been hitting this problem as well. We had one policy for known high volume endpoints that we want to sample less often (0.1%), and one for all the others (5%). We had implemented this with AND policies and string_attribute against the list of endpoints, one of them with invert_match.

Since we have a few endpoints I don't think @karinhawk's solution works for us.

I've been racking my brains to find an alternative but I can't come up with one. I've tried using an ottl_condition, but also not possible. It would be possible to increase the amount of sampling under a condition, but not the opposite.

A revert would be welcome. Having an invert_match on the and policy itself would make a lot of sense to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working processor/tailsampling Tail sampling processor
Projects
None yet
7 participants