fix(sdk-trace-base): eager exporting for batch span processor #3458

seemk · 2022-11-30T18:08:47Z

Which problem is this PR solving?

Fixes BSP silently dropping spans.

Fixes #3094

Short description of the changes

Add eager exporting to BSP:

Start an export as soon as the max batch size is reached after a span has been added to the queue.
The usual export loop runs as-is based on the BSP delay interval.

Additionally:

Unref the timeout timer when exporting.
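
A minimal sketch of the eager-export idea described above, not the code in this PR. The class name `EagerBatcher`, the `exportBatch` callback and the `onDelayElapsed` hook are hypothetical names chosen for illustration. It shows an export starting as soon as the queue reaches the batch size, while the regular delay-based loop keeps flushing whatever is left.

```typescript
// Hypothetical sketch: EagerBatcher and its members are illustrative names,
// not the BatchSpanProcessorBase API.
class EagerBatcher<T> {
  private queue: T[] = [];
  private readonly exportBatch: (batch: T[]) => void;
  private readonly maxBatchSize: number;

  constructor(exportBatch: (batch: T[]) => void, maxBatchSize: number) {
    this.exportBatch = exportBatch;
    this.maxBatchSize = maxBatchSize;
  }

  // Called whenever a finished span (or any item) is handed to the batcher.
  add(item: T): void {
    this.queue.push(item);
    if (this.queue.length >= this.maxBatchSize) {
      // Eager path: a full batch is ready, so export without waiting.
      this.exportOneBatch();
    }
  }

  // Called by the regular export loop each time the delay interval elapses.
  onDelayElapsed(): void {
    this.exportOneBatch();
  }

  private exportOneBatch(): void {
    if (this.queue.length === 0) return;
    this.exportBatch(this.queue.splice(0, this.maxBatchSize));
  }
}
```

The sketch exports synchronously to stay short; a real processor would hand the batch to an asynchronous exporter and handle failures.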

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist:

Followed the style guidelines of this project
Unit tests have been added

codecov · 2022-11-30T18:13:22Z

Codecov Report

Merging #3458 (f2b8d51) into main (2499708) will increase coverage by 1.90%.
The diff coverage is 100.00%.

❗ Current head f2b8d51 differs from pull request most recent head 68247b3. Consider uploading reports for the commit 68247b3 to get more accurate results

@@            Coverage Diff             @@
##             main    #3458      +/-   ##
==========================================
+ Coverage   90.52%   92.42%   +1.90%     
==========================================
  Files         159      326     +167     
  Lines        3757     9284    +5527     
  Branches      835     1967    +1132     
==========================================
+ Hits         3401     8581    +5180     
- Misses        356      703     +347

Files	Coverage Δ
...dk-trace-base/src/export/BatchSpanProcessorBase.ts	`94.16% <100.00%> (+1.24%)`	⬆️

... and 178 files with indirect coverage changes

seemk · 2022-11-30T18:21:44Z

Reopening after dealing with browser issues

dyladan · 2022-12-07T16:51:59Z

I remember there was some discussion about this in the spec. Do you know if that was ever resolved or what the current state of that is?

dyladan · 2022-12-07T19:32:02Z

Here is the spec issue open-telemetry/opentelemetry-specification#849

dyladan · 2022-12-07T19:54:16Z

https://github.com/open-telemetry/opentelemetry-specification/pull/3024/files

MSNev · 2022-12-13T17:02:04Z

packages/opentelemetry-sdk-trace-base/src/export/BatchSpanProcessorBase.ts

+      return;
+    }
+
+    if (this._nextExport === 0) {


Personally, I'm not a fan of this pattern of "detecting" whether we should start a timer or not -- I prefer more explicit start/reset descriptions.

It took me a while to understand that this is the "flag" you are effectively using to determine whether there is "more" data (or not) and then whether to "reset" the timer.

I also don't like that you can "reset" an already running timer, as it could be very easy for someone to come along later and cause an endless (until it got full) delay if something is getting batched at a regular interval.

_nextExport is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle. Think about appending thousands of spans consecutively in the same event loop cycle.

What would your alternative be to resetting an already running timer? Starting an export in a new promise right away when the buffer reaches the threshold can't be done as it would cause both too many concurrent exports and would nullify the concept of max queue size.

> _nextExport is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle.

this._timer will (should) be undefined when no timer is already running 😄 (as long as you also set the value to undefined (or null) within the setTimeout() implementation as well).

Which is basically what the previous implementation was doing with `if (this._timer !== undefined) return;` in the _maybeSet function (although I'd also prefer to see a check that nothing is currently batched as well -- to avoid creating the timer in the first place).

The timer is now always running, basically this new implementation sets the timeout to 0 once the batch size is exceeded.
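
To make the guard and the zero-delay reset discussed in this thread concrete, here is a rough sketch under stated assumptions. It is not the PR's implementation: `ExportScheduler`, `TimerApi`, `immediateQueued` and `requestImmediateExport` are hypothetical names, and a boolean stands in for the role the thread attributes to `_nextExport`. The timer functions are injected so the behaviour can be checked without real clocks.

```typescript
// Hypothetical sketch; none of these names come from the PR.
interface TimerApi {
  set(callback: () => void, delayMillis: number): number;
  clear(handle: number): void;
}

class ExportScheduler {
  private handle: number | undefined;
  // True while an export is already queued for the next cycle.
  private immediateQueued = false;
  private readonly timers: TimerApi;
  private readonly runExport: () => void;
  private readonly delayMillis: number;

  constructor(timers: TimerApi, runExport: () => void, delayMillis: number) {
    this.timers = timers;
    this.runExport = runExport;
    this.delayMillis = delayMillis;
    this.arm(delayMillis); // in this design the periodic timer is always running
  }

  // Call each time the queue reaches the batch-size threshold.
  requestImmediateExport(): void {
    if (this.immediateQueued) return; // already queued: do not reset the timer again
    this.immediateQueued = true;
    this.arm(0);
  }

  private arm(delayMillis: number): void {
    if (this.handle !== undefined) this.timers.clear(this.handle);
    this.handle = this.timers.set(() => {
      this.handle = undefined;
      this.immediateQueued = false;
      this.runExport();
      this.arm(this.delayMillis); // go back to the regular delay
    }, delayMillis);
  }
}
```

With this guard, a thousand threshold hits in the same event-loop cycle reset the timer once rather than a thousand times.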

But if there is nothing in the batch we should not have any timer running...
i.e. the timer should only be running if there is something waiting to be sent. Otherwise, if an application is sitting idle (because the user walked away), a running timer can prevent the device (client) from going to sleep, so it uses more resources (power / battery) than necessary.
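
The alternative argued for here, a timer that exists only while something is waiting to be sent, could look roughly like the following. This is a sketch under assumptions, not code from the SDK or the PR: `LazyTimerQueue`, `send` and `isTimerRunning` are hypothetical names.

```typescript
// Hypothetical sketch; LazyTimerQueue is not part of the OpenTelemetry SDK.
class LazyTimerQueue<T> {
  private items: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly send: (batch: T[]) => void;
  private readonly delayMillis: number;

  constructor(send: (batch: T[]) => void, delayMillis: number) {
    this.send = send;
    this.delayMillis = delayMillis;
  }

  add(item: T): void {
    this.items.push(item);
    this.maybeStartTimer();
  }

  // Sends everything that is queued and leaves no timer behind.
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    this.send(batch);
  }

  isTimerRunning(): boolean {
    return this.timer !== undefined;
  }

  private maybeStartTimer(): void {
    if (this.timer !== undefined) return; // a timer is already armed
    if (this.items.length === 0) return; // nothing waiting, so stay idle
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.flush();
    }, this.delayMillis);
  }
}
```

An idle application never arms a timer in this design, which is the property the comment above asks for.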

Given the spec (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#batching-processor) we could probably simplify this whole processor to just be a setInterval and a list of spans. During the most recent spec iteration it was made clear that there is no need to wait for the previous export to complete before starting another one. The word "returns" is used very explicitly in the spec and refers to thread safety, not to the actual async task of the export.

Wouldn't not needing to wait for the previous export to complete invalidate the maxQueueSize parameter in this case? E.g. when starting an export as soon as the batch size has been reached.
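
The concern can be illustrated with a small bookkeeping sketch. It is hypothetical (`BoundedSpanQueue`, `startExport`, `settle` and `heldInMemory` are made-up names) and only models the counting: if batches leave the queue before the previous export has finished, the number of spans held in memory is the queue length plus everything still in flight, which can exceed the configured queue limit.

```typescript
// Hypothetical bookkeeping model; it does not export anything.
class BoundedSpanQueue<T> {
  private queue: T[] = [];
  private inFlight = 0;
  private dropped = 0;
  private readonly maxQueueSize: number;

  constructor(maxQueueSize: number) {
    this.maxQueueSize = maxQueueSize;
  }

  // Returns false when the queue is full and the item is dropped.
  add(item: T): boolean {
    if (this.queue.length >= this.maxQueueSize) {
      this.dropped++;
      return false;
    }
    this.queue.push(item);
    return true;
  }

  // Removes up to batchSize items; they count as in flight until settle().
  startExport(batchSize: number): T[] {
    const batch = this.queue.splice(0, batchSize);
    this.inFlight += batch.length;
    return batch;
  }

  // Marks a batch as finished, whether the export succeeded or failed.
  settle(batch: T[]): void {
    this.inFlight -= batch.length;
  }

  heldInMemory(): number {
    return this.queue.length + this.inFlight;
  }

  droppedCount(): number {
    return this.dropped;
  }
}
```

Limiting how many exports may be in flight at once is one way to keep `heldInMemory()` bounded; whether that is appropriate is the question being debated in this thread.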

> simplify this whole processor to just be a setInterval

Noooo! Intervals are worse than timeout management as intervals get left behind and cause the APP/CPU to always be busy (at regular intervals)

arbiv · 2023-01-16T11:48:44Z

We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged?
@TimurMisharin FYI

seemk · 2023-01-16T19:14:32Z

> We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI

Not that I know of, just needs reviews

dyladan · 2023-01-16T19:15:53Z

Nothing blocking for me, I'm just on vacation. The spec change merged last week.

cftechwiz · 2023-02-03T17:57:00Z

@dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

dyladan · 2023-02-03T20:34:20Z

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

I had been waiting for a resolution of the conversation between you and @MSNev

MSNev · 2023-02-03T20:39:12Z

> > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

> I had been waiting for a resolution of the conversation between you and @MSNev

And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.

cftechwiz · 2023-02-06T17:58:02Z

> > > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

> > I had been waiting for a resolution of the conversation between you and @MSNev

> And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.

This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?

seemk · 2023-02-06T18:07:30Z

> > > > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

> > > I had been waiting for a resolution of the conversation between you and @MSNev

> > And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.

> This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?

I can update my PR either tomorrow or on Wednesday. However, getting rid of a free-running timer on a 5 second interval seems to be a micro-optimisation. Is there even any data to back these performance concerns?

MSNev · 2023-02-17T23:29:21Z

> I can update my PR either tomorrow or on Wednesday. However, getting rid of a free-running timer on a 5 second interval seems to be a micro-optimisation. Is there even any data to back these performance concerns?

I don't have any public data that I can provide, but when a client has a running timer it restricts the ability of that process to be put into a sleep / reduced-power state, and therefore to reduce the amount of energy that is being burnt.

This issue mostly affects Clients (browser / mobile devices), but will also indirectly affect hosted node instances that will be constantly burning CPU even when it has no events/traffic -- thus costing users $$$

cftechwiz · 2023-03-30T17:13:15Z

I am bumping this thread as it seems to have gone a bit dormant. We know of production impact in NodeJS/GraphQL that this PR resolves. To unblock the customer so they could use the Batch Processor, we had them fork the repo, rebase the branch and build it. They did this, and I can report that it has resolved the issue and they are no longer dropping spans. Also, we have not observed any of the performance concerns being discussed here. We're in agreement that the timer is a micro-optimization and do not believe it should hold this up any longer. If we want to continue debating the timer, I propose we raise a new issue and let this one pass.

If anyone would like more information, let me know what you want and I will try to provide it. We'd be really grateful if we could go ahead and see this make its way to being merged soon so we can pick up the fix and get them back into the main branch...

Cheers,
Colin

dyladan · 2023-04-24T15:15:29Z

@cfallwell my apologies, I seem to have let this fall through the cracks. In the absence of public data from @MSNev I think I agree we should move forward. Any alternative implementation suggestions, or concrete data to quantify the performance impact, are welcome.

I'll put this on my schedule to review tomorrow.

MSNev · 2023-04-24T16:14:04Z

My original comments still stand. In fact, I have had to fix issues related to "background" timers this week for Application Insights because they kept "waking up" the app and burning CPU for no reason (99% of the time the queue was empty). This issue was raised by one of our internal partners who have an extremely robust perf framework and runtime measurements.

Also by NOT having a running timer in the background (and assuming that the timeout period is not too long), the need to unref the timer diminishes as there is no running timer to stop the app from closing.

Using standard setInterval / clearInterval (I could not change to setTimeout in our master repo as we return the interval id as part of the API): microsoft/ApplicationInsights-JS@0d970c7

But in the current release we use a wrapper (which works for any runtime: Node, browser, worker): microsoft/ApplicationInsights-JS@d14c15b. While this does use unref, that is because this background timer is generally only used for internal non-critical log reporting.

Both of these just pump the "log" event into our export pipeline, which has its own timer (which is only running when there is something waiting).
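
For readers unfamiliar with `unref`, here is a small sketch of a timer helper that works whether or not the runtime's timer handle supports it. It is an assumption-laden illustration, not code from this PR or from ApplicationInsights-JS: `unrefIfPossible` and `setUnrefTimeout` are hypothetical names. In Node.js, `setTimeout` returns an object with an `unref()` method; in browsers it returns a number.

```typescript
// Hypothetical helpers; not taken from the PR or from ApplicationInsights-JS.

// Calls unref() on the handle when the runtime provides it (Node.js timers do;
// browser timer ids are plain numbers and do not). Returns whether it was called.
function unrefIfPossible(handle: unknown): boolean {
  if (typeof handle !== "object" || handle === null) return false;
  const candidate = handle as { unref?: unknown };
  if (typeof candidate.unref !== "function") return false;
  candidate.unref.call(handle);
  return true;
}

// Schedules a callback that will not, on its own, keep a Node.js process alive.
function setUnrefTimeout(
  callback: () => void,
  delayMillis: number,
): ReturnType<typeof setTimeout> {
  const handle = setTimeout(callback, delayMillis);
  unrefIfPossible(handle);
  return handle;
}
```

An unref'd timer may never fire if the process exits first, so anything still queued at shutdown would need an explicit flush.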

dyladan · 2023-10-06T14:36:02Z

#3958 solved this

seemk added 3 commits November 30, 2022 13:59

feat(sdk-trace-base): eager exporting for batch span processor

c339cd5

fix: use interval not timeout

48fab8d

chore: cleanup, add test for periodic exports

c12124b

seemk requested a review from a team November 30, 2022 18:08

seemk added 2 commits November 30, 2022 20:08

Merge branch 'main' into bsp-eager-export

1b6fe87

chore: update changelog

e101c00

seemk closed this Nov 30, 2022

seemk added 5 commits December 6, 2022 17:24

fix: use setTimeout for browser compatibility

d79ba67

refactor: spacing

faffbd2

Merge branch 'main' into bsp-eager-export

307ca7c

fix: unref timeout timer

4e01111

Merge branch 'main' into bsp-eager-export

1879569

seemk reopened this Dec 7, 2022

Merge branch 'main' into bsp-eager-export

746457e

seemk added 2 commits December 8, 2022 12:12

Merge branch 'main' into bsp-eager-export

b27af18

Merge branch 'main' into bsp-eager-export

525ad54

MSNev reviewed Dec 13, 2022


seemk added 3 commits December 15, 2022 22:04

Merge branch 'main' into bsp-eager-export

487a3fb

Merge branch 'main' into bsp-eager-export

72f2726

Merge branch 'main' into bsp-eager-export

b3718d1

seemk mentioned this pull request Jan 3, 2023

Eager exporting batch span processor open-telemetry/opentelemetry-specification#3024

Merged

seemk added 4 commits January 4, 2023 00:09

Merge branch 'main' into bsp-eager-export

492458d

Merge branch 'main' into bsp-eager-export

c16e528

Merge branch 'main' into bsp-eager-export

946e33b

Merge branch 'main' into bsp-eager-export

5598d25

Merge branch 'main' into bsp-eager-export

b044f6d

Merge branch 'main' into bsp-eager-export

72fa72a

Merge branch 'main' into bsp-eager-export

2799733

GregLahaye mentioned this pull request Feb 17, 2023

Eager exporting for BatchSpanProcessor #3094

Closed

2 tasks

Merge branch 'main' into bsp-eager-export

a1bfd82

seemk added 2 commits March 22, 2023 11:01

Merge branch 'main' into bsp-eager-export

73732f6

Merge branch 'main' into bsp-eager-export

e6d728b

dyladan added 2 commits April 24, 2023 17:15

Merge branch 'main' into bsp-eager-export

94597e1

Remove duplicate changelog entry

d5a1020

Flarna mentioned this pull request Apr 28, 2023

Batched exporter starts sends just first 2048 spans #3772

Closed

seemk added 2 commits May 9, 2023 10:33

Merge branch 'main' into bsp-eager-export

3c01099

Merge branch 'main' into bsp-eager-export

0bea748

Flarna mentioned this pull request Jul 12, 2023

How to prevent spans being dropped #3980

Closed

1 task

Merge branch 'main' into bsp-eager-export

8ff333e

Flarna mentioned this pull request Sep 27, 2023

Allow BatchSpanProcessor to send early when a full batch is ready #4164

Closed

4 tasks

Merge branch 'main' into bsp-eager-export

68247b3

dyladan closed this Oct 6, 2023


MSNev Dec 13, 2022

seemk Dec 13, 2022

MSNev Dec 13, 2022

MSNev Dec 13, 2022

seemk Dec 13, 2022

MSNev Jan 17, 2023

dyladan Feb 3, 2023

seemk Feb 6, 2023 (edited)

MSNev Apr 24, 2023
