27 changes: 27 additions & 0 deletions .chloggen/stratified.yaml
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: processor/tailsamplingprocessor

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: "Add a stratified sampling policy to the tail sampling processor"

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [40917]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: The existing probabilistic policy in the tail sampling processor samples a percentage of traces at random, which does not guarantee that every application workflow is represented in the sampled set. Each user-facing operation triggers its own subset of service components in a defined order. In a microservices-based e-commerce application, for example, browsing and adding to a cart invoke different microservices. We call the subgraph of service components that serves a request its trajectory. For the sampled traces to represent the application and be of value to downstream tasks, every trajectory must appear at least once in the sampled set for each sampling interval. The new stratified sampling policy always samples a trajectory the first time it is encountered within a sampling interval. For a trajectory already observed in that interval, the policy falls back to probabilistic sampling at the configured sampling percentage. This prioritizes newly encountered trajectories while previously seen trajectories remain subject to probabilistic sampling.

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
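
The behaviour described in the subtext above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the evaluator added by this PR: the `stratifiedSampler` type, the `trajectoryKey` helper, the `resetInterval` method, and the service names are all hypothetical, and the trajectory is assumed to be the ordered list of services a trace visited.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// stratifiedSampler is a hypothetical illustration, not the PR's evaluator.
type stratifiedSampler struct {
	percentage float64             // sampling percentage for already-seen trajectories
	seen       map[string]struct{} // trajectories observed in the current interval
	rng        *rand.Rand
}

func newStratifiedSampler(percentage float64, seed int64) *stratifiedSampler {
	return &stratifiedSampler{
		percentage: percentage,
		seen:       make(map[string]struct{}),
		rng:        rand.New(rand.NewSource(seed)),
	}
}

// trajectoryKey builds a key from the ordered services of a trace.
// Assumption: the diff shown here does not say how the trajectory is derived.
func trajectoryKey(services []string) string {
	return strings.Join(services, ">")
}

// shouldSample always keeps the first trace of a trajectory in the interval and
// falls back to a probabilistic decision for later traces of that trajectory.
func (s *stratifiedSampler) shouldSample(services []string) bool {
	key := trajectoryKey(services)
	if _, ok := s.seen[key]; !ok {
		s.seen[key] = struct{}{}
		return true
	}
	return s.rng.Float64()*100 < s.percentage
}

// resetInterval forgets all trajectories at the start of a new sampling interval.
func (s *stratifiedSampler) resetInterval() {
	s.seen = make(map[string]struct{})
}

func main() {
	s := newStratifiedSampler(0, 1) // 0%: only first-seen trajectories are kept
	browse := []string{"frontend", "catalog"}
	cart := []string{"frontend", "cart", "inventory"}

	fmt.Println(s.shouldSample(browse)) // true: first time in this interval
	fmt.Println(s.shouldSample(browse)) // false: already seen, percentage is 0
	fmt.Println(s.shouldSample(cart))   // true: a different trajectory
	s.resetInterval()
	fmt.Println(s.shouldSample(browse)) // true: first time in the new interval
}
```

With a percentage of 0 the sketch degenerates to pure at-least-once sampling per interval, which makes the first-seen rule easy to observe in isolation.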
16 changes: 16 additions & 0 deletions processor/tailsamplingprocessor/config.go
@@ -27,6 +27,8 @@ const (
// StringAttribute sample traces that an attribute, of type string, matching
// one of the listed values.
StringAttribute PolicyType = "string_attribute"
// StratifiedProbabilistic samples a given percentage of traces while also taking the trace trajectory into account.
StratifiedProbabilistic PolicyType = "stratified"
// RateLimiting allows all traces until the specified limits are satisfied.
RateLimiting PolicyType = "rate_limiting"
// Composite allows defining a composite policy, combining the other policies in one
@@ -60,6 +62,8 @@ type sharedPolicyCfg struct {
NumericAttributeCfg NumericAttributeCfg `mapstructure:"numeric_attribute"`
// Configs for probabilistic sampling policy evaluator.
ProbabilisticCfg ProbabilisticCfg `mapstructure:"probabilistic"`
// Configs for stratified probabilistic sampling policy evaluator.
StratifiedProbabilisticCfg StratifiedProbabilisticCfg `mapstructure:"stratified"`
// Configs for status code filter sampling policy evaluator.
StatusCodeCfg StatusCodeCfg `mapstructure:"status_code"`
// Configs for string attribute filter sampling policy evaluator.
@@ -170,6 +174,18 @@ type ProbabilisticCfg struct {
SamplingPercentage float64 `mapstructure:"sampling_percentage"`
}

// StratifiedProbabilisticCfg holds the configurable settings to create a stratified probabilistic
// sampling policy evaluator.
type StratifiedProbabilisticCfg struct {

Contributor

From the description, I am not sure the term "Stratified" has quite been earned, though the PR's description of at-least-once sampling suggests there is something useful here. Note that this component's composite sampling policy is similar to what you're proposing, except (IIUC) that you're adding an at-least-once fallback instead of a default-bucket approach.

If, as I take it, what you're trying to achieve is not based specifically on this at-least-once principle, and you are instead aiming for good coverage across all values in a key space, then I support it, but it leaves me with questions about this configuration struct. I would imagine wanting a rate-limited sampler that tries to achieve balance, which means estimating the most frequent values in the set and assigning (somehow) the percentage to use for the remaining bunch. (This is what the composite sampler policy in this component does.) From looking into this problem, I believe the best answer would be to configure only a rate limit and nothing else: let the component figure out which sampling probabilities to use for which strata, and let it also control the relative weight of the "other bunch", that is, how much of the distribution falls into the default bucket versus how much is explicitly managed with a fixed-size lookup table used to calculate the probability that will achieve the intended rate.

// HashSalt configures the salt used when hashing. This matters when multiple layers of collectors
// have different sampling rates: if they use the same salt, all traces passing one layer may pass the other
// even though the rates differ. Configuring different salts avoids that.
HashSalt string `mapstructure:"hash_salt"`
Comment on lines +180 to +183

Contributor

Please consider using the pkg/sampling support in this repository instead of a hash-based approach. OpenTelemetry systems are expected to observe the W3C TraceContext Level 2 specification, which means there are 56 bits of randomness available in one of two ways, both implemented by that library. We do not encourage hash-based sampling; see the approach we took when upgrading the probabilistic sampling processor, which is also the subject of this (current) blog post draft: open-telemetry/opentelemetry.io#7735.

Moreover, there are other probability samplers in this component's configuration: I would expect them all to use the same approach, whatever it is, and would prefer to keep this code as simple as possible.
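
For readers unfamiliar with the randomness-based decision mentioned above, here is a minimal standard-library sketch. It is not the pkg/sampling API: the `randomness`, `rejectionThreshold`, and `shouldSample` helpers are hypothetical names, and the sketch assumes the 56 random bits are the rightmost 7 bytes of the trace ID and that a trace is kept when its randomness is at or above the rejection threshold. It covers only the trace-ID path, not the other way of carrying randomness.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// maxAdjusted is the number of distinct 56-bit randomness values (2^56).
const maxAdjusted = uint64(1) << 56

// randomness extracts the 56 least-significant bits of a 16-byte trace ID.
func randomness(traceID [16]byte) uint64 {
	return binary.BigEndian.Uint64(traceID[8:]) & (maxAdjusted - 1)
}

// rejectionThreshold maps a sampling probability in [0, 1] to a threshold T.
// A trace is kept when its randomness R satisfies R >= T.
func rejectionThreshold(probability float64) uint64 {
	if math.IsNaN(probability) || probability <= 0 {
		return maxAdjusted // R is always below 2^56, so nothing is kept
	}
	if probability >= 1 {
		return 0 // every R passes
	}
	return uint64(math.Round((1 - probability) * float64(maxAdjusted)))
}

// shouldSample makes the decision without hashing and without a salt.
func shouldSample(traceID [16]byte, probability float64) bool {
	return randomness(traceID) >= rejectionThreshold(probability)
}

func main() {
	var low, high [16]byte // low has R = 0
	for i := 9; i < 16; i++ {
		high[i] = 0xFF // high has R = 2^56 - 1
	}

	fmt.Println(shouldSample(low, 0.25))  // false: R = 0 is below the threshold
	fmt.Println(shouldSample(high, 0.25)) // true: R is the largest possible value
	fmt.Println(shouldSample(low, 1))     // true: the threshold is 0
}
```

Because the decision depends only on the trace's own randomness and the threshold, a trace kept at a lower probability is also kept at any higher probability, so no per-layer salt is needed in this sketch.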

// SamplingPercentage is the percentage of traces to sample. Defaults to zero, i.e., no traces are sampled.
// Values greater than or equal to 100 are treated as "sample all traces".
SamplingPercentage float64 `mapstructure:"sampling_percentage"`
}

// StatusCodeCfg holds the configurable settings to create a status code filter sampling
// policy evaluator.
type StatusCodeCfg struct {