[processor/tailsampling] Drop large traces #45286
Allow users to configure a maximum size that a trace can reach. If a trace exceeds this size before a sampling decision is made, the trace is dropped immediately to avoid memory issues or even abuse.
Shouldn't the Memory Limiter Processor help with that?
Yes, it can help protect the system as well, but it is limited when used with the TSP. If only the memory limiter were used, the collector would apply backpressure and reduce throughput while the large traces just sit in memory until a decision is made for them. The backpressure could also cause incorrect sampling decisions, as many decisions would be made while some of the traces' spans are still blocked waiting for memory to decrease.

I'd expect operators to configure both this feature (if their environment has a chance of traces that are too large) and the Memory Limiter Processor for general backpressure and avoiding OOMs. In our case, the trace storage will also never accept traces above a certain size, so dropping them from the TSP ASAP makes the most sense, as any additional processing is wasted effort.
atoulme
left a comment
Sounds like this could be its own processor - lgtm, approved by codeowner. Merging.
Thank you for your contribution @csmarchbanks! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.
…47535) #### Description Replaces @Logiraptor with @csmarchbanks as a codeowner of the tail sampling processor as I have been more active recently and @Logiraptor is working on other efforts. In addition, adds @carsonip as a new code owner. Some of my TSP efforts: * #42573 * #44878 * #45286 * #46161 TSP work that @carsonip has done: * #43561 * #46762 * #42326
Description
This change adds a new config option, `maximum_trace_size_bytes`, which immediately drops any trace that exceeds the configured size. It allows operators to protect their systems, as otherwise occasional large traces cause spiky memory usage and could even cause unbounded memory usage.

Here is an example of the memory usage before and after configuring a maximum trace size.

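To illustrate the drop-on-size behavior described above, here is a minimal Go sketch. It is not the processor's actual implementation: `traceBuffer`, `newTraceBuffer`, and `addSpans` are hypothetical names, trace IDs are plain strings, and treating a limit of 0 as "disabled" is an assumption made for this example.

```go
package main

// traceBuffer is a hypothetical, simplified stand-in for the in-memory
// store of traces awaiting a sampling decision. It is not the real type
// used by the tail sampling processor.
type traceBuffer struct {
	maxTraceSizeBytes uint64            // assumed: 0 means no limit
	sizes             map[string]uint64 // bytes currently buffered per trace ID
	dropped           map[string]bool   // trace IDs already dropped for size
}

func newTraceBuffer(maxTraceSizeBytes uint64) *traceBuffer {
	return &traceBuffer{
		maxTraceSizeBytes: maxTraceSizeBytes,
		sizes:             make(map[string]uint64),
		dropped:           make(map[string]bool),
	}
}

// addSpans accounts for spanBytes of new spans belonging to traceID and
// reports whether they were kept. Once a trace exceeds the limit, its
// buffered size is released right away and later spans are rejected too.
func (b *traceBuffer) addSpans(traceID string, spanBytes uint64) bool {
	if b.dropped[traceID] {
		return false
	}
	newSize := b.sizes[traceID] + spanBytes
	if b.maxTraceSizeBytes > 0 && newSize > b.maxTraceSizeBytes {
		// Free the accounting immediately instead of waiting for a decision.
		delete(b.sizes, traceID)
		// A real implementation would need to expire these entries eventually.
		b.dropped[traceID] = true
		return false
	}
	b.sizes[traceID] = newSize
	return true
}
```

With a limit of 100 bytes in this sketch, a trace that receives 60 bytes and then 50 bytes is dropped on the second batch, while a trace that totals exactly 100 bytes is kept, since only traces that exceed the limit are dropped.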
Testing
An automated test was added, and this change has been running successfully in multiple internal instances for a few weeks now.
Documentation
New parameter added to the TSP readme.