Add Trace Span Pruning Processor #45617
Signed-off-by: Sean Porter <portertech@gmail.com>
@csmarchbanks has agreed to be a code owner 🎉 I am now seeking another code owner external to our org (Grafana).

Thanks Sean! Converting this PR to draft until the proposal is accepted.

@andrzej-stencel the wonderful @jmacd is keen to sponsor, I've updated the proposal 👍

To be compatible with Consistent Probability Sampling, only the spans with identical TraceState should be aggregated. The description of the solution does not mention TraceState at all.

@PeterF778 excellent point, going to do some testing 👍

@PeterF778 implementation now accounts for tracestate. I would love your thoughts on it.

Looks good! Thanks!

@andrzej-stencel do you think we can accept this as one PR, or should it be broken into skeleton, config, docs, impl, etc.?

At over 9k new lines of code, this is nearly impossible to review. I'd 100% be in favour of breaking this up into manageable chunks.

@jmacd how do you propose we decompose it?

Perhaps one PR for the skeleton with the MVP pruner? No outlier detection or loss analysis.

I am going to try breaking it down this afternoon 👍

I did manage to reduce the component to its core, stripping out additional capabilities. Diff: main...portertech:opentelemetry-collector-contrib:trace-span-pruning-mvp

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@portertech #45617 (comment) looks good to me. Much better!

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Closed as inactive. Feel free to reopen if this PR is still being worked on.

## Summary

This PR introduces the `spanpruningprocessor`, a new trace processor that reduces trace storage costs while preserving observability value. It intelligently identifies and aggregates repetitive leaf spans within traces, replacing groups of similar operations with single summary spans that capture the full statistical picture.

This is a reduced-scope MVP of #45617 (now closed), focusing on the core aggregation algorithm. Advanced features like outlier detection, outlier preservation, histogram buckets, attribute loss analysis, and byte-size metrics will follow in subsequent PRs once the foundation is merged.

Component donation issue: #45654

## The Problem

Modern distributed systems generate enormous volumes of trace data. A significant portion consists of repetitive, similar spans -- think N+1 database queries, batch HTTP calls, or fan-out operations. Storing every individual span is expensive and often provides diminishing analytical value beyond the first few instances.

Current solutions are inadequate:

- **Head sampling** loses entire traces, breaking root cause analysis
- **Tail sampling** helps but still keeps every span in sampled traces
- **Manual instrumentation changes** require code modifications across services

## The Solution

The Span Pruning Processor identifies duplicate or similar leaf spans within a single trace, groups them, and replaces each group with a single aggregated summary span. When leaf spans are aggregated, the processor also recursively aggregates their parent spans if all children of those parents are being aggregated.

**Leaf spans** are spans that are not referenced as a parent by any other span in the trace. They typically represent the last actions in an execution call stack (e.g., individual database queries, HTTP calls to external services).

Spans are grouped by:

1. **Span name** - spans must have the same name
2. **Span kind** - spans must have the same kind (Internal, Server, Client, Producer, Consumer)
3. **Status code** - spans must have the same status (OK, Error, or Unset)
4. **TraceState** - spans must have identical TraceState values (for Consistent Probability Sampling compatibility)
5. **Configured attributes** - spans must have matching values for attributes specified in `group_by_attributes`
6. **Parent span name** - leaf spans must share the same parent span name to be grouped together

Parent spans are eligible for aggregation when all of their children are aggregated, they share the same name, kind, and status code, and they are not root spans.

## Use Cases

- **Database query optimization**: When an application makes many similar database queries (e.g., N+1 queries), aggregate them into a single summary span
- **Batch operations**: Consolidate many similar leaf operations into a single representative span
- **Cost reduction**: Reduce trace storage costs by eliminating redundant span data

## Configuration

```yaml
processors:
  spanpruning:
    # Attributes to use for grouping similar leaf spans (supports glob patterns)
    # Spans with the same name AND same values for matching attributes will be grouped
    # Examples:
    #   - "db.*" matches db.operation, db.name, db.statement, etc.
    #   - "http.request.*" matches http.request.method, http.request.header, etc.
    #   - "db.operation" matches only the exact key "db.operation"
    group_by_attributes:
      - "db.*"
      - "http.method"

    # Minimum number of similar leaf spans required before aggregation
    # Default: 5
    min_spans_to_aggregate: 3

    # Maximum depth of parent span aggregation above leaf spans
    # 0 = only aggregate leaf spans (no parent aggregation)
    # -1 = unlimited depth
    # Default: 1
    max_parent_depth: 1

    # Prefix for aggregation statistics attributes
    # Default: "aggregation."
    aggregation_attribute_prefix: "batch."
```

## Configuration Options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `group_by_attributes` | []string | [] | Attribute patterns for grouping (supports glob patterns like `db.*`) |
| `min_spans_to_aggregate` | int | 5 | Minimum group size before aggregation occurs |
| `max_parent_depth` | int | 1 | Max depth of parent aggregation (0=none, -1=unlimited) |
| `aggregation_attribute_prefix` | string | "aggregation." | Prefix for aggregation statistics attributes |

### Glob Pattern Support

The `group_by_attributes` field supports glob patterns for matching attribute keys:

| Pattern | Matches |
|---------|---------|
| `db.*` | `db.operation`, `db.name`, `db.statement`, etc. |
| `http.request.*` | `http.request.method`, `http.request.header.content-type`, etc. |
| `rpc.*` | `rpc.method`, `rpc.service`, `rpc.system`, etc. |
| `db.operation` | Only the exact key `db.operation` |

When multiple attributes match a pattern, they are all included in the grouping key (sorted alphabetically for consistency).

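The grouping rules and glob matching described above can be sketched in a few lines of Go. This is a minimal illustration, not the processor's actual code: the `span` struct, `matchesAny`, and `groupingKey` are hypothetical names, and the standard library's `path.Match` stands in for whatever glob matcher the component really uses.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// span is a hypothetical, simplified stand-in for a pdata span.
type span struct {
	Name, Kind, Status, TraceState, ParentName string
	Attrs                                      map[string]string
}

// matchesAny reports whether key matches at least one glob pattern.
// path.Match is an assumption made for this sketch; its "*" matches any
// run of characters other than "/", so "db.*" matches "db.operation".
func matchesAny(patterns []string, key string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, key); err == nil && ok {
			return true
		}
	}
	return false
}

// groupingKey builds a deterministic key from the six grouping criteria:
// name, kind, status, TraceState, parent name, and the matching attributes.
// Attribute keys are sorted so map iteration order cannot change the key.
func groupingKey(s span, patterns []string) string {
	keys := make([]string, 0, len(s.Attrs))
	for k := range s.Attrs {
		if matchesAny(patterns, k) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, f := range []string{s.Name, s.Kind, s.Status, s.TraceState, s.ParentName} {
		fmt.Fprintf(&b, "%q|", f) // %q quoting avoids delimiter collisions
	}
	for _, k := range keys {
		fmt.Fprintf(&b, "%q=%q|", k, s.Attrs[k])
	}
	return b.String()
}

func main() {
	patterns := []string{"db.*", "http.method"}
	a := span{Name: "SELECT", Kind: "Client", Status: "OK", ParentName: "handler",
		Attrs: map[string]string{"db.operation": "select", "db.name": "users", "net.peer.port": "5432"}}
	b := a
	b.Attrs = map[string]string{"db.name": "users", "db.operation": "select", "net.peer.port": "6543"}
	c := a
	c.Status = "Error"

	fmt.Println(groupingKey(a, patterns) == groupingKey(b, patterns)) // true: only a non-matching attribute differs
	fmt.Println(groupingKey(a, patterns) == groupingKey(c, patterns)) // false: status codes differ
}
```

Spans `a` and `b` land in the same group because `net.peer.port` matches no configured pattern, while `c` is kept apart by its status code.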
## Summary Span

When spans are aggregated, the summary span includes:

### Properties

- **Name**: Original span name (e.g., `SELECT`)
- **TraceID**: Same as original spans
- **SpanID**: Newly generated unique ID
- **ParentSpanID**: Same as original spans (common parent)
- **Kind**: Same as template span (inherited from slowest span)
- **StartTimestamp**: Earliest start time of all spans in the group
- **EndTimestamp**: Latest end time of all spans in the group
- **Status**: Same as original spans (spans are grouped by status code)
- **TraceState**: Inherited from the template span (preserved for Consistent Probability Sampling compatibility)
- **Attributes**: Inherited from the slowest span in the group
- **Events**: Inherited from the template (slowest) span
- **Links**: Inherited from the template span

> **Note**: The summary span's duration (`EndTimestamp - StartTimestamp`) represents the total time window covered by all aggregated spans, which may exceed `duration_max_ns`. For example, if spans overlap or are staggered, the time range can be larger than any individual span's duration. Use `duration_max_ns` to find the slowest individual operation.

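To make the note above concrete, here is a rough Go sketch of how the duration statistics and the time window could be computed in one pass. The `span`, `summary`, and `summarize` names are hypothetical and not taken from the PR. The three durations (10 ms, 15 ms, 12 ms) mirror the OK `SELECT` spans in the Basic Example, and the start offsets are invented for illustration.

```go
package main

import "fmt"

// span is a hypothetical stand-in holding start/end times in nanoseconds.
type span struct {
	StartNs, EndNs int64
}

// summary mirrors the statistics attached to a summary span.
type summary struct {
	Count, MinNs, MaxNs, AvgNs, TotalNs int64
	StartNs, EndNs                      int64 // time window covered by the group
	Template                            int   // index of the slowest span
}

// summarize computes all statistics in a single pass over the group.
// It returns false for an empty group.
func summarize(spans []span) (summary, bool) {
	if len(spans) == 0 {
		return summary{}, false
	}
	d0 := spans[0].EndNs - spans[0].StartNs
	s := summary{MinNs: d0, MaxNs: d0, StartNs: spans[0].StartNs, EndNs: spans[0].EndNs}
	for i, sp := range spans {
		d := sp.EndNs - sp.StartNs
		s.Count++
		s.TotalNs += d
		if d < s.MinNs {
			s.MinNs = d
		}
		if d > s.MaxNs {
			s.MaxNs = d
			s.Template = i // the slowest span becomes the template
		}
		if sp.StartNs < s.StartNs {
			s.StartNs = sp.StartNs
		}
		if sp.EndNs > s.EndNs {
			s.EndNs = sp.EndNs
		}
	}
	s.AvgNs = s.TotalNs / s.Count // integer division, as in the int64 attributes
	return s, true
}

func main() {
	const ms = int64(1_000_000)
	// Durations of 10ms, 15ms, and 12ms with staggered (assumed) start times.
	group := []span{{0, 10 * ms}, {5 * ms, 20 * ms}, {30 * ms, 42 * ms}}
	s, _ := summarize(group)
	fmt.Println(s.Count, s.MinNs, s.MaxNs, s.AvgNs, s.TotalNs) // 3 10000000 15000000 12333333 37000000
	fmt.Println(s.EndNs-s.StartNs, s.Template)                 // 42000000 1
}
```

With these assumed offsets, the summary span covers a 42 ms window even though the slowest individual span took 15 ms, which is why `duration_max_ns` is the value to use when looking for the slowest operation.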
### What Gets Aggregated Away

When spans are aggregated into a summary span, the following data from non-template spans is **lost**:

| Data | Behavior |
|------|----------|
| **Span Events** | Only the template (slowest) span's events are preserved |
| **Span Links** | Only the template span's links are preserved |
| **Attributes** | Non-matching attribute values are lost |
| **Individual Timestamps** | Original start/end times replaced by the group's time range |
| **SpanIDs** | Original SpanIDs are replaced by a single summary SpanID |

### Aggregation Attributes

The following attributes are added to the summary span (shown with default `aggregation_attribute_prefix: "aggregation."`):

| Attribute | Type | Description |
|-----------|------|-------------|
| `<prefix>is_summary` | bool | Always `true` to identify summary spans |
| `<prefix>span_count` | int64 | Number of spans that were aggregated |
| `<prefix>duration_min_ns` | int64 | Minimum duration in nanoseconds |
| `<prefix>duration_max_ns` | int64 | Maximum duration in nanoseconds |
| `<prefix>duration_avg_ns` | int64 | Average duration in nanoseconds |
| `<prefix>duration_total_ns` | int64 | Total duration in nanoseconds |

## Pipeline Placement

This processor is designed to work best when placed after processors that ensure complete traces are available:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, spanpruning, batch]
      exporters: [otlp]
```

Or with tail sampling:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, spanpruning, batch]
      exporters: [otlp]
```

## Examples

### Basic Example

A trace with repeated database queries (some failing):

**Before Processing:**

```
root-span (parent)
├── SELECT (leaf) - duration: 10ms, db.operation: select, status: OK
├── SELECT (leaf) - duration: 15ms, db.operation: select, status: OK
├── SELECT (leaf) - duration: 12ms, db.operation: select, status: OK
├── SELECT (leaf) - duration: 50ms, db.operation: select, status: Error
├── SELECT (leaf) - duration: 45ms, db.operation: select, status: Error
└── INSERT (leaf) - duration: 20ms, db.operation: insert, status: OK
```

**After Processing (with `min_spans_to_aggregate: 2`):**

```
root-span (parent)
├── SELECT (summary, status: OK)
│   - aggregation.is_summary: true
│   - aggregation.span_count: 3
│   - aggregation.duration_min_ns: 10000000
│   - aggregation.duration_max_ns: 15000000
│   - aggregation.duration_avg_ns: 12333333
├── SELECT (summary, status: Error)
│   - aggregation.is_summary: true
│   - aggregation.span_count: 2
│   - aggregation.duration_min_ns: 45000000
│   - aggregation.duration_max_ns: 50000000
│   - aggregation.duration_avg_ns: 47500000
└── INSERT (unchanged - only 1 span, below threshold)
```

Note: Spans with different status codes are grouped separately, preserving error information.

### Recursive Parent Aggregation Example

When spans are aggregated, the processor also checks if their parent spans can be aggregated. Parent spans are eligible for aggregation when:

1. All of their children are being aggregated
2. They share the same name, kind, and status code with other eligible parents
3. They are not root spans (must have a parent)
4. At least 2 parents meet the criteria

**Before Processing (with `min_spans_to_aggregate: 2`, `group_by_attributes: ["db.op"]`):**

```
root
├── handler (status: OK)
│   └── SELECT (db.op=select, status: OK) ───┐
├── handler (status: OK)                     │ leaf group A: 3 OK SELECTs
│   └── SELECT (db.op=select, status: OK) ───┤
├── handler (status: OK)                     │
│   └── SELECT (db.op=select, status: OK) ───┘
├── handler (status: Error)
│   └── SELECT (db.op=select, status: Error) ┐ leaf group B: 2 Error SELECTs
├── handler (status: Error)                  │
│   └── SELECT (db.op=select, status: Error) ┘
├── handler (status: OK)
│   └── INSERT (db.op=insert, status: OK) ──── only 1, below threshold
└── worker (status: OK)
    └── SELECT (db.op=select, status: OK) ──── different parent name
```

**After Processing:**

```
root
├── handler (summary, status: OK, span_count: 3)
│   └── SELECT (summary, status: OK, span_count: 3)
├── handler (summary, status: Error, span_count: 2)
│   └── SELECT (summary, status: Error, span_count: 2)
├── handler (status: OK)
│   └── INSERT (status: OK) ─────────────────────────── unchanged
└── worker (status: OK)
    └── SELECT (status: OK) ─────────────────────────── unchanged
```

**Why each span was handled this way:**

| Span | Result | Reason |
|------|--------|--------|
| 3x handler (OK) with SELECT children | Aggregated | All children aggregated, same name+kind+status |
| 3x SELECT (OK) under handler | Aggregated | Same name + kind + status + attributes + parent name |
| 2x handler (Error) with SELECT children | Aggregated | All children aggregated, same name+kind+status |
| 2x SELECT (Error) under handler | Aggregated | Same name + kind + status + attributes + parent name |
| handler (OK) with INSERT child | Unchanged | Child not aggregated (only 1 INSERT) |
| INSERT (OK) | Unchanged | Below threshold (only 1 span) |
| worker (OK) | Unchanged | Child not aggregated |
| SELECT (OK) under worker | Unchanged | Different parent name than other SELECTs |

## Consistent Probability Sampling (CPS) Compatibility

The processor is designed to be compatible with [Consistent Probability Sampling](https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/) (CPS). CPS uses TraceState to carry sampling metadata (`ot=th:...;rv:...`) where:

- `th` (threshold) indicates the sampling probability threshold
- `rv` (randomness value) provides consistent randomness for sampling decisions

**Why TraceState matters for aggregation:**

Spans with different TraceState values represent different sampling populations with different "adjusted counts" (weights). Aggregating them together would produce statistically incorrect summaries and break downstream sampling decisions.

The processor uses **exact TraceState matching** (not just the `th` value) because:

- The `rv` value affects sampling decisions
- Vendor-specific keys may have semantic meaning
- Key ordering may be significant

## Limitations

- Requires complete traces for accurate leaf detection
- Summary span inherits attributes from the slowest span in the group
- Parent spans are only aggregated when ALL their children are aggregated

## Telemetry

The processor emits the following metrics to help monitor its operation:

### Counters

| Metric | Description |
|--------|-------------|
| `otelcol_processor_spanpruning_spans_received` | Total number of spans received by the processor |
| `otelcol_processor_spanpruning_spans_pruned` | Total number of spans removed by aggregation |
| `otelcol_processor_spanpruning_aggregations_created` | Total number of aggregation summary spans created |
| `otelcol_processor_spanpruning_traces_processed` | Total number of traces processed |

### Histograms

| Metric | Description |
|--------|-------------|
| `otelcol_processor_spanpruning_aggregation_group_size` | Distribution of the number of spans per aggregation group |
| `otelcol_processor_spanpruning_processing_duration` | Time taken to process each batch of traces (in seconds) |

These metrics can be used to:

- Monitor the effectiveness of span pruning (compare `spans_received` vs `spans_pruned`)
- Track the compression ratio achieved by aggregation
- Identify processing bottlenecks via `processing_duration`
- Understand aggregation patterns via `aggregation_group_size`

## Scope / Future Work

This MVP focuses on the core aggregation engine. The following features from the original PR (#45617) are planned for follow-up PRs:

- **Outlier detection**: IQR and MAD-based statistical outlier detection
- **Outlier preservation**: Keep slow spans as individual spans while aggregating normal ones
- **Attribute correlation**: Identify attributes that correlate with slow operations
- **Histogram buckets**: Latency distribution in summary spans
- **Attribute loss analysis**: Track and report attribute diversity lost during aggregation
- **Byte-size metrics**: Measure serialized trace sizes before/after pruning

## Architecture

The processor operates in three phases per trace:

1. **Tree Construction** (`tree.go`): Builds parent-child relationships, identifies leaves and orphans
2. **Analysis** (`processor.go`, `grouping.go`): Groups similar leaf spans by key, then walks up the tree to find eligible parent spans for recursive aggregation
3. **Execution** (`aggregation.go`): Sorts groups top-down, creates summary spans with preassigned SpanIDs, and batch-removes originals

Key design decisions:

- **Tree-based analysis** avoids O(n^2) parent lookups by pre-computing relationships
- **Type-safe attribute encoding** (`grouping.go`) ensures correct grouping for all pdata value types (maps, slices, bytes)
- **Pooled string builders** minimize allocations in the hot grouping-key path
- **Single-pass statistics** (`stats.go`) computes min/max/avg/total and time ranges without extra traversals

#### Link to tracking issue

Fixes #45654

#### Testing

- Comprehensive unit tests (`processor_test.go`) covering: leaf span aggregation, recursive parent aggregation at multiple depths, grouping by attributes with glob patterns, status code separation, TraceState/CPS compatibility, span kind grouping, edge cases (empty traces, single spans, orphans, multiple roots), configuration validation, and template span selection (events, links, attributes inherited from slowest span)
- Configuration validation tests (`config_test.go`) covering all fields and error cases
- Aggregation logic tests (`aggregation_test.go`) for duration calculation and template selection
- Benchmark tests (`processor_benchmark_test.go`) measuring throughput across varying trace sizes (100-10000 spans) and group counts
- Generated component lifecycle tests and telemetry tests via `mdatagen`

#### Documentation

- Comprehensive `README.md` with configuration reference, glob pattern examples, summary span schema, pipeline placement guidance, before/after examples (including recursive parent aggregation), CPS compatibility notes, limitations, and telemetry reference
- `documentation.md` generated from `metadata.yaml` describing all 6 custom telemetry metrics

---------

Signed-off-by: Sean Porter <portertech@gmail.com>
Summary
This PR introduces the spanpruningprocessor, a new trace processor that reduces trace data volume while preserving information about repeated operations. It identifies repetitive leaf spans within a trace and replaces each group of similar operations with a single summary span carrying aggregate statistics (span count and minimum, maximum, average, and total duration).
Component donation issue: #45654
The Problem
Modern distributed systems generate enormous volumes of trace data. A significant portion consists of repetitive, similar spans—think N+1 database queries, batch HTTP calls, or fan-out operations. Storing every individual span is expensive and often provides diminishing analytical value beyond the first few instances.
Current solutions are inadequate:
The Solution
The Span Pruning Processor identifies duplicate or similar leaf spans within a single trace, groups them, and replaces each group with a single aggregated summary span. When leaf spans are aggregated, the processor also recursively aggregates their parent spans if all children of those parents are being aggregated.
Leaf spans are spans that are not referenced as a parent by any other span in the trace. They typically represent the last actions in an execution call stack (e.g., individual database queries, HTTP calls to external services).
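The leaf definition above can be illustrated with a minimal Go sketch. The `Span` type and `findLeaves` function are hypothetical stand-ins for illustration and are not the processor's actual types, which operate on pdata spans.

```go
package main

import "fmt"

// Span is a simplified, hypothetical stand-in for a trace span.
type Span struct {
	ID       string
	ParentID string // empty for root spans
	Name     string
}

// findLeaves returns the spans whose ID is never referenced as a ParentID
// by any other span in the same trace, preserving input order.
func findLeaves(spans []Span) []Span {
	isParent := make(map[string]bool, len(spans))
	for _, s := range spans {
		if s.ParentID != "" {
			isParent[s.ParentID] = true
		}
	}
	var leaves []Span
	for _, s := range spans {
		if !isParent[s.ID] {
			leaves = append(leaves, s)
		}
	}
	return leaves
}

func main() {
	spans := []Span{
		{ID: "1", Name: "root"},
		{ID: "2", ParentID: "1", Name: "handler"},
		{ID: "3", ParentID: "2", Name: "SELECT"},
		{ID: "4", ParentID: "2", Name: "SELECT"},
	}
	for _, s := range findLeaves(spans) {
		fmt.Println(s.ID, s.Name)
	}
}
```

Because the check depends on seeing every span that could name another as its parent, a sketch like this only gives accurate results on complete traces.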
Spans are grouped by:
- Span name
- Span kind
- Status code
- Parent span name
- Attributes matching `group_by_attributes`

Parent spans are eligible for aggregation when all of their children are aggregated, they share the same name, kind, and status code, and they are not root spans.
Optionally, the processor can detect duration outliers using statistical methods (IQR or MAD) and either annotate summary spans with outlier correlations or preserve outlier spans as individual spans for debugging while still aggregating normal spans.
This processor is useful for reducing trace data volume while preserving meaningful information about repeated operations.
Use Cases
Configuration
Configuration Options
- `group_by_attributes` (glob patterns such as `db.*` are supported)
- `min_spans_to_aggregate`
- `max_parent_depth`
- `aggregation_attribute_prefix`
- `aggregation_histogram_buckets`: `[5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]`
- `enable_attribute_loss_analysis`
- `attribute_loss_exemplar_sample_rate` (when `enable_attribute_loss_analysis` is true)
- `enable_bytes_metrics`
- `enable_outlier_analysis`
- `outlier_analysis.method`
- `outlier_analysis.iqr_multiplier`
- `outlier_analysis.mad_multiplier`
- `outlier_analysis.min_group_size`
- `outlier_analysis.correlation_min_occurrence`
- `outlier_analysis.correlation_max_normal_occurrence`
- `outlier_analysis.max_correlated_attributes`
- `outlier_analysis.preserve_outliers`
- `outlier_analysis.max_preserved_outliers`
- `outlier_analysis.preserve_only_with_correlation`

Glob Pattern Support
The `group_by_attributes` field supports glob patterns for matching attribute keys:

| Pattern | Matches |
|---------|---------|
| `db.*` | `db.operation`, `db.name`, `db.statement`, etc. |
| `http.request.*` | `http.request.method`, `http.request.header.content-type`, etc. |
| `rpc.*` | `rpc.method`, `rpc.service`, `rpc.system`, etc. |
| `db.operation` | `db.operation` |

When multiple attributes match a pattern, they are all included in the grouping key (sorted alphabetically for consistency).
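As a rough illustration of how matched attributes could feed a deterministic grouping key, here is a minimal Go sketch. It assumes string-valued attributes and uses the standard library's `path.Match` as a stand-in glob matcher; the processor's real matcher and key encoding may differ, and `groupingAttrs` is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// groupingAttrs returns "key=value" pairs for every attribute key that matches
// at least one pattern, sorted by key so the result is deterministic.
// Assumption: path.Match stands in for the processor's glob matcher; its "*"
// does not match "/" characters, which is fine for dotted attribute keys.
func groupingAttrs(attrs map[string]string, patterns []string) string {
	var keys []string
	for k := range attrs {
		for _, p := range patterns {
			if ok, err := path.Match(p, k); err == nil && ok {
				keys = append(keys, k)
				break // avoid adding the same key twice
			}
		}
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+attrs[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	attrs := map[string]string{
		"db.operation":        "select",
		"db.name":             "users",
		"http.request.method": "GET",
	}
	fmt.Println(groupingAttrs(attrs, []string{"db.*"}))
	// prints "db.name=users,db.operation=select"
}
```

Sorting the matched keys matters because Go map iteration order is random; without it, two spans with identical attributes could produce different keys.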
Summary Span
When spans are aggregated, the summary span includes:
Properties
`SELECT`)

What Gets Aggregated Away
When spans are aggregated into a summary span, the following data from non-template spans is lost:
To understand attribute loss, enable `enable_attribute_loss_analysis: true`, which adds `diverse_attributes` and `missing_attributes` to summary spans.

Aggregation Attributes
The following attributes are added to the summary span (shown with default `aggregation_attribute_prefix: "aggregation."`):
- `<prefix>is_summary`: `true` to identify summary spans
- `<prefix>span_count`
- `<prefix>duration_min_ns`
- `<prefix>duration_max_ns`
- `<prefix>duration_avg_ns`
- `<prefix>duration_total_ns`
- `<prefix>histogram_bucket_bounds_s`
- `<prefix>histogram_bucket_counts`

Optional Outlier Analysis Attributes
When `enable_outlier_analysis: true`, the following additional attributes are added:
- `<prefix>duration_median_ns`
- `<prefix>outlier_correlated_attributes` (`key=value(outlier%/normal%), ...`)

Histogram Buckets
The histogram provides a latency distribution of the aggregated spans. The buckets are cumulative, meaning each bucket count includes all spans with duration less than or equal to the bucket boundary.
Example with buckets `[10ms, 50ms, 100ms]` and 5 spans with durations `[5ms, 15ms, 25ms, 75ms, 150ms]`:
- `histogram_bucket_bounds_s`: `[0.01, 0.05, 0.1]`
- `histogram_bucket_counts`: `[1, 3, 4, 5]`

Outlier Analysis (Optional)
When `enable_outlier_analysis: true`, the processor detects duration outliers and identifies attributes that correlate with slow spans.

Detection Methods
The processor supports two statistical methods for outlier detection:
- IQR: `threshold = Q3 + (multiplier × IQR)`
- MAD: `threshold = median + (multiplier × MAD × 1.4826)`

When to use each:
How It Works
IQR (Interquartile Range) Method:
MAD (Median Absolute Deviation) Method:
Note: The 1.4826 scale factor makes MAD comparable to standard deviation for normal distributions.
Attribute Correlation (same for both methods):
Configuration Example
Example Output
Interpretation:
`cache_hit=false`, while 0% of normal spans did

This helps identify root causes of latency issues:
When to Use
Performance Impact
`min_group_size: 7` or higher to skip analysis on small groups

Preserving Outlier Spans (Optional)
When `outlier_analysis.preserve_outliers: true`, detected outlier spans are kept as individual spans instead of being aggregated. This provides:

Configuration
Configuration Options
- `preserve_outliers`
- `max_preserved_outliers`
- `preserve_only_with_correlation`

Example Output
Before (10 similar SELECT spans, 2 are outliers):
After (with `preserve_outliers: true`, `max_preserved_outliers: 2`):

Summary Span Attributes (When Preserving Outliers)
- `<prefix>preserved_outlier_count`
- `<prefix>preserved_outlier_span_ids`

Preserved Outlier Span Attributes
- `<prefix>is_preserved_outlier`
- `<prefix>summary_span_id`

Behavior Notes
`min_spans_to_aggregate`), the entire group is left unchanged

Pipeline Placement
This processor is designed to work best when placed after processors that ensure complete traces are available:
Or with tail sampling:
Example
Basic Example
A trace with repeated database queries (some failing):
Before Processing:
After Processing (with `min_spans_to_aggregate: 2`):

Note: Spans with different status codes are grouped separately, preserving error information.
Recursive Parent Aggregation Example
When spans are aggregated, the processor also checks if their parent spans can be aggregated. Parent spans are eligible for aggregation when:
Before Processing (with `min_spans_to_aggregate: 2`, `group_by_attributes: ["db.op"]`):

After Processing:
Why each span was handled this way:
Limitations
Consistent Probability Sampling (CPS) Compatibility
The processor is designed to be compatible with Consistent Probability Sampling (CPS). CPS uses TraceState to carry sampling metadata (`ot=th:...;rv:...`) where:
- `th` (threshold) indicates the sampling probability threshold
- `rv` (randomness value) provides consistent randomness for sampling decisions

Why TraceState matters for aggregation:
Spans with different TraceState values represent different sampling populations with different "adjusted counts" (weights). Aggregating them together would produce statistically incorrect summaries and break downstream sampling decisions.
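To make the separation concrete, the following Go sketch partitions spans by their exact TraceState string, so spans from different sampling populations never land in the same group. The `Span` type, `partitionByTraceState`, and the TraceState values shown are hypothetical illustrations, not the processor's actual code or real sampling metadata.

```go
package main

import "fmt"

// Span is a simplified, hypothetical stand-in for a trace span.
type Span struct {
	Name       string
	TraceState string // raw W3C tracestate header value
}

// partitionByTraceState groups spans by exact TraceState string equality.
// No parsing or normalization is applied, so differences in any key, value,
// or key ordering produce separate groups.
func partitionByTraceState(spans []Span) map[string][]Span {
	groups := make(map[string][]Span)
	for _, s := range spans {
		groups[s.TraceState] = append(groups[s.TraceState], s)
	}
	return groups
}

func main() {
	spans := []Span{
		{Name: "SELECT", TraceState: "ot=th:8"},
		{Name: "SELECT", TraceState: "ot=th:8"},
		{Name: "SELECT", TraceState: "ot=th:c"},
	}
	for ts, group := range partitionByTraceState(spans) {
		fmt.Printf("%s: %d span(s)\n", ts, len(group))
	}
}
```

In a full grouping key, the TraceState would be one component alongside name, kind, status code, parent name, and the selected attributes.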
Example:
The processor uses exact TraceState matching (not just the `th` value) because:
- The `rv` value affects sampling decisions

Telemetry
The processor emits the following metrics to help monitor its operation:
Counters
- `otelcol_processor_spanpruning_spans_received`
- `otelcol_processor_spanpruning_spans_pruned`
- `otelcol_processor_spanpruning_aggregations_created`
- `otelcol_processor_spanpruning_traces_processed`
- `otelcol_processor_spanpruning_outliers_detected` (`enable_outlier_analysis: true`)
- `otelcol_processor_spanpruning_outliers_preserved` (`preserve_outliers: true`)
- `otelcol_processor_spanpruning_outliers_correlations_detected`
- `otelcol_processor_spanpruning_bytes_received` (`enable_bytes_metrics: true`)
- `otelcol_processor_spanpruning_bytes_emitted` (`enable_bytes_metrics: true`)

Histograms
- `otelcol_processor_spanpruning_aggregation_group_size`
- `otelcol_processor_spanpruning_processing_duration`

Optional Attribute Loss Metrics
When `enable_attribute_loss_analysis: true`, the processor also emits metrics about attribute loss during aggregation. These metrics help you understand how much information is being lost when spans are grouped together.

To correlate these metrics back to traces, a configurable fraction of these metric recordings can include trace exemplars via `attribute_loss_exemplar_sample_rate`. Sampling is applied per aggregation group, and the exemplar context is taken from the slowest span in the group.

Histograms (Optional)
- `otelcol_processor_spanpruning_leaf_attribute_diversity_loss`
- `otelcol_processor_spanpruning_leaf_attribute_loss`
- `otelcol_processor_spanpruning_parent_attribute_diversity_loss`
- `otelcol_processor_spanpruning_parent_attribute_loss`

Attribute loss analysis is disabled by default (`enable_attribute_loss_analysis: false`) to reduce overhead. When enabled, the processor:
- Adds `<prefix>diverse_attributes` and `<prefix>missing_attributes` summary attributes to aggregated spans

These metrics can be used to:
- Monitor the effectiveness of span pruning (compare `spans_received` vs `spans_pruned`)
- Track the compression ratio achieved by aggregation
- Identify processing bottlenecks via `processing_duration`
- Understand aggregation patterns via `aggregation_group_size`