[processor/tailsampling] Allow faster decisions after the root span is received by csmarchbanks · Pull Request #44878 · open-telemetry/opentelemetry-collector-contrib

csmarchbanks · 2025-12-10T13:57:03Z

Description

When testing the work done to provide early decisions (#44456), the impact was very limited in some scenarios, specifically when policies look for long traces or for an error anywhere in the trace. It is not possible to know whether an error span will arrive later, so almost all traces end up waiting for the full decision wait, and the savings were only a couple of percent in some environments.

What I found myself wishing for was a way to make decisions more quickly in most cases, and what worked fairly well was to base the timing on whether a root span has been received. This change implements a second, shorter decision wait that applies once the root span has arrived. It still collects any straggler spans that might be present for a trace (e.g. from a second service with a different batch timer before sending), but it can be much shorter than the base decision wait. A good way to think about the two options: decision_wait is the maximum amount of time we will wait for a span to arrive, and decision_wait_after_root_received is the minimum amount of time we will wait for additional spans to come in.

The downside of this approach is that heavily asynchronous traces may not be sampled as expected. However, we find that those do not work very well as it is, since they commonly last longer than the decision wait anyway. The behavior is opt-in, so existing users do not need to change anything.

I wanted to keep the changes in this PR relatively small, but in the future we could re-implement the ID batcher to support moving traces between batches, or possibly use a priority queue where we pop values until some threshold, rather than having two batchers.

Link to tracking issue

Part of #43876

Testing

Added a smoke test for the new functionality.

Documentation

Added documentation explaining the new configuration variable.

We only want to increment the "trace dropped too early" metric when we encounter a trace that did not already have a decision made for it. We can use the cache to determine whether a decision was properly made or the trace was simply dropped. When running with block on overflow enabled there is no need to check, as the code blocks instead of dropping traces that have not yet been processed by one of the two batchers.
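The check described above can be sketched as follows. The `sampler` type, its `decided` map, and `onMissing` are hypothetical stand-ins for the processor's decision cache and metric counter; the block-on-overflow case is not modeled.

```go
package main

import "fmt"

// sampler is a hypothetical stand-in for the processor state.
type sampler struct {
	decided         map[string]bool // trace ID -> sampled?; stands in for the decision cache
	droppedTooEarly int             // stands in for the "trace dropped too early" counter
}

// onMissing is called when a batch yields a trace ID whose spans are no
// longer held in memory.
func (s *sampler) onMissing(traceID string) {
	if _, ok := s.decided[traceID]; ok {
		// A decision was already made (for example by the other batcher),
		// so the trace was released on purpose and nothing was lost.
		return
	}
	s.droppedTooEarly++
}

func main() {
	s := &sampler{decided: map[string]bool{"already-decided": true}}
	s.onMissing("already-decided") // cached decision: not counted
	s.onMissing("never-decided")   // no decision: counted as dropped too early
	fmt.Println(s.droppedTooEarly) // 1
}
```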

crobert-1 · 2025-12-15T18:12:21Z

Pinging @Logiraptor for review as code owner

Copilot

Pull request overview

This PR introduces a new configuration option decision_wait_after_root_received that enables faster sampling decisions after receiving a trace's root span, while still collecting straggler spans. This optimizes tail sampling by allowing most traces to be decided more quickly when their root span arrives, rather than waiting for the full decision_wait period.

Key Changes:

- Added decision_wait_after_root_received configuration parameter for faster decisions after root span arrival
- Refactored Batch type from slice to map for efficient batch merging
- Enhanced trace processing to detect and track root spans

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| processor/tailsamplingprocessor/config.go | Adds new `DecisionWaitAfterRootReceived` configuration field |
| processor/tailsamplingprocessor/README.md | Documents the new configuration option |
| processor/tailsamplingprocessor/processor.go | Implements root span detection and dual-batcher logic for faster decisions |
| processor/tailsamplingprocessor/processor_test.go | Adds test coverage for root-received batcher and updates test utilities |
| processor/tailsamplingprocessor/internal/idbatcher/id_batcher.go | Changes Batch from slice to map for efficient merging |
| processor/tailsamplingprocessor/internal/idbatcher/id_batcher_test.go | Updates tests to work with map-based batches |
| .chloggen/tsp-decision-wait-after-root-received.yaml | Changelog entry for the enhancement |

Copilot · 2025-12-15T18:12:36Z

+		if sp.ParentSpanID().IsEmpty() {
+			rootSpan = &sp
 		}


Taking the address of a loop variable sp is problematic. In Go, loop variables are reused across iterations, so &sp will always point to the last iteration's value. This means rootSpan may reference the wrong span or become invalid. Instead, create a copy of the span or use a different approach to store the root span reference.

csmarchbanks · 2025-12-17

I don't think this is a valid comment, because:

- sp is not the loop variable; it is declared inside the loop body, so a new one is created in each iteration.
- I believe the loop-variable reuse behavior was changed in Go 1.22 anyway.

Leaving this up for someone to double-check, though.
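The distinction can be shown with a small sketch. The `span` type and `findRoot` below are simplified, hypothetical stand-ins for the processor's types. Because `sp` is declared inside the loop body, each iteration gets its own variable, so a pointer to it is not overwritten by later iterations in any Go version.

```go
package main

import "fmt"

// span is a simplified, hypothetical stand-in for a trace span.
type span struct {
	name     string
	parentID string // empty for a root span
}

// findRoot returns a pointer to a copy of the span that has no parent,
// or nil if there is none.
func findRoot(spans []span) *span {
	var root *span
	for i := 0; i < len(spans); i++ {
		sp := spans[i] // declared in the body: a fresh variable every iteration
		if sp.parentID == "" {
			root = &sp
		}
	}
	return root
}

func main() {
	spans := []span{
		{name: "root"},
		{name: "child", parentID: "abc"},
		{name: "grandchild", parentID: "def"},
	}
	fmt.Println(findRoot(spans).name) // root
}
```

Since Go 1.22, variables declared in the `for` clause itself are also per-iteration, so the older pitfall of taking the address of a reused loop variable no longer applies there either.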

Copilot · 2025-12-15T18:12:37Z

+	assert.Less(t, len(allSampledTraces), len(traceIDs)*6/10)
+	assert.Greater(t, len(allSampledTraces), len(traceIDs)*4/10)


The magic numbers 6/10 and 4/10 represent the expected sampling range (40%-60%) around the 50% configured sampling percentage. Consider extracting these as named constants to make the test's intent clearer, such as expectedSamplingPercentage = 0.5, toleranceLower = 0.4, and toleranceUpper = 0.6.
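One way to express that suggestion is sketched below, using integer percentages so the arithmetic matches the original `*6/10` and `*4/10` expressions. The names `lowerPercent`, `upperPercent`, and `withinTolerance` are hypothetical and are not part of the PR.

```go
package main

import "fmt"

const (
	lowerPercent = 40 // lower bound of the accepted sampling range
	upperPercent = 60 // upper bound of the accepted sampling range
)

// withinTolerance reports whether sampled lies strictly between the lower
// and upper bounds, expressed as integer percentages of total.
func withinTolerance(sampled, total int) bool {
	return sampled > total*lowerPercent/100 && sampled < total*upperPercent/100
}

func main() {
	fmt.Println(withinTolerance(512, 1000)) // true
	fmt.Println(withinTolerance(600, 1000)) // false
}
```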

Copilot · 2025-12-15T18:12:37Z

+	if tsp.rootReceivedBatcher != nil {
+		rootBatch, _ := tsp.rootReceivedBatcher.CloseCurrentAndTakeFirstBatch()
+		if batch == nil {
+			batch = rootBatch
+		} else {
+			for id := range rootBatch {
+				batch[id] = struct{}{}
+			}
+		}
+	}


If rootBatch is nil, the code at line 567-569 will panic when attempting to range over it. Add a nil check for rootBatch before attempting to merge it with batch, similar to the check for batch at line 564.

csmarchbanks · 2025-12-17

That's not how looping over a nil map works in Go: a range over a nil map simply executes zero iterations.
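A minimal sketch of the merge, assuming a simplified `batch` type keyed by string (a hypothetical stand-in for the processor's map-based Batch), shows why a nil source batch is harmless. Only writing to a nil map panics, which is why the destination is checked first. The sketch uses `maps.Copy` from the standard library (Go 1.21+), in line with the later "Use maps.Copy instead of for loop" commit.

```go
package main

import (
	"fmt"
	"maps"
)

// batch is a simplified, hypothetical stand-in for a set of trace IDs.
type batch map[string]struct{}

// merge adds every ID in src to dst and returns the result.
func merge(dst, src batch) batch {
	if dst == nil {
		// Writing to a nil map would panic, so reuse src as the result.
		return src
	}
	// Iterating over a nil src is a no-op, so no nil check is needed here.
	maps.Copy(dst, src)
	return dst
}

func main() {
	merged := merge(batch{"a": {}}, nil)
	fmt.Println(len(merged)) // 1

	merged = merge(nil, batch{"b": {}, "c": {}})
	fmt.Println(len(merged)) // 2
}
```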

Logiraptor

Overall seems reasonable. A few nits. I would be interested in a refactor of the idBatcher so it can support re-prioritization, but I don't mind merging this first and leaving that for later.

Logiraptor · 2025-12-18T14:47:40Z

+	}
+	// Wait long enough that we pass the decision wait after a root is received,
+	// but no where near the base decision wait.
+	time.Sleep(2 * time.Second)


Is it possible to use the controller here and waitForTick instead?

csmarchbanks · 2025-12-18

Not easily, mainly because the test controller uses a syncIDBatcher, so any wait for tick is going to pull from the normal batch. I could refactor the test controller to allow multiple batches, but it didn't seem worthwhile since the test only takes a couple of seconds. Plus, it is kind of nice to have a test that uses the real batcher.

Let me know if you want me to look into the refactor though!

csmarchbanks · 2025-12-18T16:08:48Z

Agreed on wanting to refactor the id batcher to either a priority queue or an implementation that allows rebatching. I opened an issue for that work: #45054.

atoulme · 2025-12-31T22:30:55Z

You have approvals from codeowners but CI doesn't pass. Please check CI, push changes and mark ready to review again.

csmarchbanks · 2026-01-05T16:39:33Z

Updated! Just merging main fixed the CI issues.

otelbot · 2026-01-08T16:37:47Z

Thank you for your contribution @csmarchbanks! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

…47535) #### Description Replaces @Logiraptor with @csmarchbanks as a codeowner of the tail sampling processor as I have been more active recently and @Logiraptor is working on other efforts. In addition, adds @carsonip as a new code owner. Some of my TSP efforts: * #42573 * #44878 * #45286 * #46161 TSP work that @carsonip has done: * #43561 * #46762 * #42326

csmarchbanks changed the title ~~Allow faster decisions after the root span is received~~ [processor/tailsampling] Allow faster decisions after the root span is received Dec 10, 2025

csmarchbanks force-pushed the tsp-decision-wait-after-root-received branch from 8dcdb13 to f39e550 Compare December 10, 2025 18:57

csmarchbanks added 3 commits December 11, 2025 08:57

Do not re-decide (and release) a trace

9566672

Fix race condition in test

6f4cd58

csmarchbanks force-pushed the tsp-decision-wait-after-root-received branch from 9f6a321 to dbecfe5 Compare December 11, 2025 16:31

csmarchbanks marked this pull request as ready for review December 12, 2025 16:25

csmarchbanks requested a review from a team as a code owner December 12, 2025 16:25

csmarchbanks requested a review from dashpole December 12, 2025 16:25

github-actions Bot assigned crobert-1 Dec 12, 2025

github-actions Bot added the processor/tailsampling Tail sampling processor label Dec 12, 2025

csmarchbanks mentioned this pull request Dec 12, 2025

[processor/tailsampling] Initial support for early decisions #44456

Closed

Merge branch 'main' into tsp-decision-wait-after-root-received

f80a8b4

crobert-1 requested review from Copilot and portertech December 15, 2025 18:11

Copilot AI reviewed Dec 15, 2025

crobert-1 added the waiting-for-code-owners label Dec 15, 2025

Logiraptor approved these changes Dec 18, 2025

Use maps.Copy instead of for loop

dbd7d9a

csmarchbanks mentioned this pull request Dec 18, 2025

Refactor idbatcher to support rebatching #45054

Closed

Merge branch 'main' into tsp-decision-wait-after-root-received

e281094

portertech approved these changes Dec 18, 2025

Merge remote-tracking branch 'origin/main' into tsp-decision-wait-after-root-received

218ff67

atoulme marked this pull request as draft December 31, 2025 22:30

Merge branch 'main' into tsp-decision-wait-after-root-received

0bc2c3d

csmarchbanks marked this pull request as ready for review January 5, 2026 16:36

github-actions Bot assigned ChrsMark Jan 5, 2026

atoulme approved these changes Jan 8, 2026

atoulme merged commit 6ef0986 into open-telemetry:main Jan 8, 2026
194 checks passed

github-actions Bot added this to the next release milestone Jan 8, 2026

csmarchbanks mentioned this pull request Jan 8, 2026

[processor/tailsampling] Remove static decision wait for all traces #43876

Closed

This was referenced Apr 10, 2026

[processor/tailsampling] Add csmarchbanks and carsonip as codeowners #47535

Merged

REQUEST: New membership for csmarchbanks open-telemetry/community#3363

Closed
