-
Notifications
You must be signed in to change notification settings - Fork 821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(sdk-trace-base): eager exporting for batch span processor #3458
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3458 +/- ##
==========================================
+ Coverage 90.52% 92.42% +1.90%
==========================================
Files 159 326 +167
Lines 3757 9284 +5527
Branches 835 1967 +1132
==========================================
+ Hits 3401 8581 +5180
- Misses 356 703 +347
|
Reopening after dealing with browser issues |
I remember there was some discussion about this in the spec. Do you know if that was ever resolved or what the current state of that is? |
Here is the spec issue open-telemetry/opentelemetry-specification#849 |
return; | ||
} | ||
|
||
if (this._nextExport === 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'm not a fan of this pattern of "detecting" whether we should start a timer or not -- I perfer more explicit start/reset descriptions.
It took me a while to understand that this it the "flag" you are effectively using to determine whether there is "more" data (or not) and then whether to "reset" the timer.
I also don't like that you can "reset" an already running timer, as it could be very easy for someone to come along later and cause an endless (until it got full) delay if something is getting batched at a regular interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_nextExport
is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle. Think about appending thousands of spans consecutively in the same event loop cycle.
What would your alternative be to resetting an already running timer? Starting an export in a new promise right away when the buffer reaches the threshold can't be done as it would cause both too many concurrent exports and would nullify the concept of max queue size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_nextExport is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle.
this._timer will (should) be undefined when no timer is already running 😄 (as long as you also set the value to undefined (or null) within the setTimeout() implementation as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is basically what the previous implementation was doing with if (this._timer !== undefined) return;
in the _maybeSet function (although I'd also prefer to see a check that nothing is currently batched as well -- to avoid creating the timer in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timer is now always running, basically this new implementation sets the timeout to 0 once the batch size is exceeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if there is nothing in the batch we should not have any timer running...
ie. The timer should only be running if there is something waiting to be sent, otherwise, if an application is sitting doing nothing (because the user walked away) by having a running timer this can cause the device (client) to never go to sleep and therefore use more resource (power / battery) when not necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the spec (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#batching-processor) we could probably simplify this whole processor to just be a setInterval
and a list of spans. During the most recent spec iteration it was made clear that there is no need to wait for the previous export to complete before starting another one. The word "returns" is used very explicitly in the spec and refers to thread safety, not to the actual async task of the export.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't not needing to wait for previous export to complete invalidate the maxQueueSize
parameter in this case? E.g. when starting an export when the batch size has been reached
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplify this whole processor to just be a setInterval
Noooo! Intervals are worse than timeout management as intervals get left behind and cause the APP/CPU to always be busy (at regular intervals)
We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? |
Not that I know of, just needs reviews |
Nothing blocking for me I'm just on vacation. Spec merged last week |
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes. |
And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc. |
This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this? |
I can update my PR either tomorrow or on Wednesday, however getting rid of a free running timer on a 5 second interval seems to be a microoptimisation, is there even any data to back these performance concerns? |
I don't have any public data that I can provide, but when a client has a running timer it restricts the ability of that process being put to sleep/reduce power usage state and therefore to reduce the amount of energy that is being burnt. This issue mostly affects Clients (browser / mobile devices), but will also indirectly affect hosted node instances that will be constantly burning CPU even when it has no events/traffic -- thus costing users $$$ |
I am bumping this thread as it seems to have gone a bit dormant. We know of production impact in NodeJS/GraphQL that this PR resolves. In order to accelerate the customer and get them unblocked from leverage Batch Processor we had them fork the repo, rebase the branch and build it. They did this, and I can report that this has resolved the issue and they are no longer dropping spans. Also, we have not identified any of the issues related to performance concerns being discussed here. We're in agreement that the timer is a micro-optimization and do not believe it should hold this up any longer. If we want to continue debating the timer, I propose we raise a new issue and let this one pass. If anyone would like more information, let me know what you want and I will try to provide it. We'd be really grateful if we could go ahead and see this make its way to being merged soon so we can pick up the fix and get them back into the main branch... Cheers, |
@cfallwell my apologies I seem to have let this fall through the cracks. In the absence of public data from @MSNev I think I agree we should move forward. Any alternative implementation suggestions are welcome or concrete data to quantify the performance impact. I'll put this on my schedule to review tomorrow. |
My original comments still stand, in fact I have had to fix issues related to "background" timers this week for Application Insights because it kept causing the app to "wake up" the App and the burn CPU for no reason (as 99% of the time the queue was empty). This issue was raised by one of our internal partners who have an extremely robust perp framework and runtime measurements. Also by NOT having a running timer in the background (and assuming that the timeout period is not too long), the need to Using standard But in the current release, we use a wrapper (which works for any runtime (Node, Browser, worker) microsoft/ApplicationInsights-JS@d14c15b while this does use Both of these just pump the "log" event into our export pipeline (which has it's own timer (which is only running when there is something waiting)) |
#3958 solved this |
Which problem is this PR solving?
Fixes BSP silently dropping spans.
Fixes #3094
Short description of the changes
Add eager exporting to BSP:
Additionally:
Type of change
Checklist: