
Optimize batch span processor #2983

Merged

2 commits merged into open-telemetry:main on Mar 19, 2021

Conversation

@sbandadd (Contributor) commented Mar 4, 2021:

Description:
The batch span processor is currently aggressive in the sense that every new span is signaled to the exporter. This incurs a lot of signaling overhead under heavy load, and constant-polling overhead from the exporter thread under light load. This PR makes the exporter thread wait for maxExportBatchSize spans to avoid busy-polling the queue.
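The core idea can be sketched roughly as follows. This is a minimal, single-threaded illustration, not the PR's actual code: class and field names are hypothetical, and a `List` of exported batches stands in for a real `SpanExporter`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the exporter worker sleeps until a whole batch is queued,
// instead of being woken (or polling) for every single span.
class BatchSignalSketch {
  private final BlockingQueue<String> queue;
  private final int maxExportBatchSize;
  private final AtomicInteger added = new AtomicInteger();
  // One-element queue acts as a coalescing wake-up signal for the worker.
  private final BlockingQueue<Boolean> signal = new ArrayBlockingQueue<>(1);
  final List<List<String>> exported = new ArrayList<>();

  BatchSignalSketch(int maxQueueSize, int maxExportBatchSize) {
    this.queue = new ArrayBlockingQueue<>(maxQueueSize);
    this.maxExportBatchSize = maxExportBatchSize;
  }

  boolean addSpan(String span) {
    if (!queue.offer(span)) {
      return false; // queue full: drop the span
    }
    // Wake the worker only once per full batch, amortizing the signaling cost.
    if (added.incrementAndGet() % maxExportBatchSize == 0) {
      signal.offer(Boolean.TRUE);
    }
    return true;
  }

  // One pass of the worker loop: wait (bounded) if no full batch, then drain.
  void workerPass(long maxWaitMillis) {
    if (queue.size() < maxExportBatchSize) {
      try {
        signal.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
    while (queue.size() >= maxExportBatchSize) {
      List<String> batch = new ArrayList<>(maxExportBatchSize);
      queue.drainTo(batch, maxExportBatchSize);
      exported.add(batch); // stands in for SpanExporter.export(batch)
    }
  }
}
```

In the real processor the worker loop would run on a dedicated thread and also honor the schedule-delay timeout; here `workerPass` is called directly for clarity.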

Benchmark results

BatchSpanProcessorMultiThreadBenchmark.java result: [image]

BatchSpanProcessorCpuBenchmark.java result: [image]

@@ -45,6 +45,7 @@ dependencies {
jmh("io.grpc:grpc-api")
jmh("io.grpc:grpc-netty-shaded")
jmh("org.testcontainers:testcontainers") // testContainer for OTLP collector
implementation("org.jctools:jctools-core:3.2.0")
Contributor:

We do not want to take additional external dependencies like this for our SDK.

Contributor:

If we want to use this, we'll need to shade it into our project, rather than have it be exposed as a transitive dependency.

@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch from 0ad85d7 to 046066d on March 4, 2021 20:45
@jkwatson (Contributor) commented Mar 4, 2021:

Can you include the benchmarks, before and after, in this PR description, rather than just in the middle of a very long discussion on the related issue?

@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch from 046066d to 8c2471b on March 4, 2021 22:23
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jctools.queues.MpscArrayQueue;
Contributor Author:

@bogdandrutu Is there an existing jmh benchmark with DisruptorAsyncSpanProcessor?

Member:

Unfortunately not, but I think you can copy-paste the same one that you have and only initialize that SpanProcessor.

@jkwatson (Contributor) commented Mar 4, 2021:

Can you also include the BatchSpanProcessorBenchmark before & after as well?

@sbandadd (Contributor Author) commented Mar 4, 2021:

Can you also include the BatchSpanProcessorBenchmark before & after as well?

I couldn't run this locally (using Docker Desktop on Mac); I might be missing some setup required for JMH. Can you run this through the BatchSpanProcessorBenchmark and post the results?

@@ -92,6 +92,7 @@ val DEPENDENCIES = listOf(
"org.awaitility:awaitility:4.0.3",
"org.codehaus.mojo:animal-sniffer-annotations:1.20",
"org.curioswitch.curiostack:protobuf-jackson:1.2.0",
"org.jctools:jctools-core:3.2.0",
Contributor:

From looking at the benchmarks, I think making the general change of signaling instead of polling makes sense. Let's not add this dependency in this PR, though; we'll need to figure out how to vendor in only the MpscQueue to keep the size down if we find the dependency to be significant (it seems like we might, but let's think about it separately).

Contributor Author:

Yeah, let's get the MpscQueue in a different PR. @anuraaga Any help would be great!

Comment on lines 265 to 268
if (flushRequested.compareAndSet(null, flushResult)) {
lock.lock();
try {
needExport.signal();
} finally {
lock.unlock();
}
}
Member:

Now that you grab the lock, I think it would be a big simplification if flushRequested were protected by the lock.

Contributor Author:

Could you clarify what it would simplify? Are you saying we can get rid of the AtomicReference? Then the exporter thread would have to take the lock inside flush() as well, checking the flushRequested flag inside the lock.
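For concreteness, a sketch of what the lock-protected variant being discussed could look like. All names here are hypothetical, and a `Runnable` stands in for whatever completion handle the real flush() returns:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: flushRequested guarded by the worker's lock instead of an AtomicReference.
class FlushSignalSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition needExport = lock.newCondition();
  private Runnable flushRequested; // guarded by lock
  int flushesHandled;

  // Caller side: record at most one pending flush and wake the exporter thread.
  void requestFlush(Runnable onComplete) {
    lock.lock();
    try {
      if (flushRequested == null) {
        flushRequested = onComplete;
        needExport.signal();
      }
    } finally {
      lock.unlock();
    }
  }

  // Exporter side: called from the worker loop; consumes a pending flush if any.
  void handlePendingFlush() {
    Runnable pending;
    lock.lock();
    try {
      pending = flushRequested;
      flushRequested = null;
    } finally {
      lock.unlock();
    }
    if (pending != null) {
      pending.run();
      flushesHandled++;
    }
  }
}
```

The trade-off raised above is visible here: the worker must take the lock on every pass to check for a pending flush, which is what the AtomicReference avoids.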

@jkwatson (Contributor) commented Mar 5, 2021:

Benchmark results for the BatchSpanProcessorBenchmarks:

Before (main branch): [image]

After (this branch): [image]

I would say that there really isn't a clear winner here. The zero-delay results are within the error bars, and the non-zero-delay results are mixed, at best.

@sbandadd (Contributor Author) commented Mar 5, 2021:

I don't understand the BatchSpanProcessorBenchmark benchmark. Why does it test with forceFlush() on every iteration? Do we expect to forceFlush() after every iteration? And how is this benchmark different from BatchSpanProcessorFlushBenchmark? Diffing the files shows only the class name differs.

diff BatchSpanProcessorBenchmark.java BatchSpanProcessorFlushBenchmark.java
34c34
< public class BatchSpanProcessorBenchmark {
---
> public class BatchSpanProcessorFlushBenchmark {

And of course forceFlush() is a major cost in that benchmark, which we don't expect any improvement in from this change.

@jkwatson (Contributor) commented Mar 5, 2021:

I don't understand the benchmark BatchSpanProcessorBenchmark . Why is it trying to test with forceFlush() on every iteration. Do we expect to forceFlush() after every iteration? How is this benchmark different from BatchSpanProcessorFlushBenchmark. Diffing the files shows only the class name is different.

diff BatchSpanProcessorBenchmark.java BatchSpanProcessorFlushBenchmark.java
34c34
< public class BatchSpanProcessorBenchmark {
---
> public class BatchSpanProcessorFlushBenchmark {

And of course forceFlush() is a major cost in that benchmark which we don't expect any improvements from this change.

It only does a forceFlush() after all the spans have been sent through the processor, to make sure that the BSP has actually handled them and they aren't stuck in the queue at the end of the process.

I have no idea why we have that duplicated benchmark. It should probably be deleted.

@sbandadd (Contributor Author) commented Mar 5, 2021:

It only does a forceFlush() after it has added all the spans.

Still, it does a forceFlush() on every iteration of export(). The benchmark is not testing real scenarios.

  @Benchmark
  @Fork(1)
  @Threads(5)
  @Warmup(iterations = 5, time = 1)
  @Measurement(iterations = 10, time = 1)
  @OutputTimeUnit(TimeUnit.SECONDS)
  public void export() {

@jkwatson (Contributor) commented Mar 5, 2021:

It only does a forceFlush() after it has added all the spans.

Still it does do a forceFlush() on every iteration of the export(). The benchmark is not testing real scenarios.

  @Benchmark
  @Fork(1)
  @Threads(5)
  @Warmup(iterations = 5, time = 1)
  @Measurement(iterations = 10, time = 1)
  @OutputTimeUnit(TimeUnit.SECONDS)
  public void export() {

One forceFlush() per 1000/2000/5000 spans doesn't seem all that significant to me. But please, by all means, write some better benchmarks and contribute them if you feel ours aren't working for you.

private final Collection<SpanData> batch;
private final ReentrantLock lock;
public AtomicLong droppedSpansCounter = new AtomicLong(0);
public AtomicLong exportedSpansCounter = new AtomicLong(0);
Contributor Author:

This is a temporary addition to show the benchmark results, as I couldn't get the metric to work correctly.

@sbandadd (Contributor Author) commented Mar 8, 2021:

Added a benchmark with a different thread configuration and ran the BatchSpanProcessor under different versions. The benchmark clearly shows the overhead of locks/signals/polling.

  • BlockingQueue: the current version. This is the baseline.
  • ConcurrentQueue: using Java's built-in ConcurrentLinkedQueue.
  • ConcurrentQueue with batch signal: using Java's built-in ConcurrentLinkedQueue, but sending a signal to the exporter every 512 adds to the queue.
  • Disruptor with BlockingWait: using Disruptor with TimeOutBlockingWaitStrategy, with the timeout set to scheduleDelayNanos.
  • Disruptor with SleepingWait: using Disruptor with SleepingWaitStrategy and the default Disruptor options.
  • MPSCQueue with batch signal: using MPSCQueue and sending a signal to the exporter every 512 adds to the queue.

[image: benchmark results]

All the benchmarks use the following configuration:

  • maxQueueSize: 2048
  • maxExportBatchSize: 512
  • exportTimeOut: 5000ms
  • Delay: 0ms. This is intentionally set low to surface the bottleneck; higher delays won't reveal it, since the exporter then becomes the slow part. CPU usage shows a clear improvement with higher exporter delays (benchmark data upcoming).
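The "ConcurrentQueue with batch signal" variant described above can be sketched roughly like this (illustrative names, not the benchmarked code; a counter replaces the actual worker wake-up):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: unbounded ConcurrentLinkedQueue with an explicit size bound,
// signaling the exporter only once per maxExportBatchSize adds.
class ConcurrentQueueSketch {
  private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
  private final AtomicInteger size = new AtomicInteger(); // CLQ itself is unbounded
  private final AtomicInteger added = new AtomicInteger();
  private final int maxQueueSize;
  private final int maxExportBatchSize;
  int signalsSent;

  ConcurrentQueueSketch(int maxQueueSize, int maxExportBatchSize) {
    this.maxQueueSize = maxQueueSize;
    this.maxExportBatchSize = maxExportBatchSize;
  }

  boolean addSpan(String span) {
    // Optimistically reserve a slot; back off if the bound would be exceeded.
    if (size.incrementAndGet() > maxQueueSize) {
      size.decrementAndGet();
      return false; // dropped
    }
    queue.offer(span);
    // Signal only once per full batch; in the real worker this wakes the exporter.
    if (added.incrementAndGet() % maxExportBatchSize == 0) {
      signalsSent++;
    }
    return true;
  }
}
```

With, say, a bound of 4 and a batch size of 2, six adds accept 4 spans and fire 2 signals, which is the amortization the variant above is measuring.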

@jkwatson (Contributor) commented Mar 9, 2021:

Is a 0ms exporter delay a "real-life" configuration? Sure, if export is free, then we can improve the BSP's CPU usage; but if export takes any time at all, isn't the BSP's CPU usage swamped by the export, in which case these tweaks are of limited value?

@sbandadd (Contributor Author) commented Mar 9, 2021:

I think you are missing the difference between a throughput benchmark and a CPU one. As mentioned in my previous comment, 0ms is meant for the throughput benchmark and clearly shows the lock/signal/polling overhead. It also shows the impact of forceFlush() on the benchmark.

As mentioned in my previous comment, I am following up with a CPU benchmark, which reflects the "real world" scenario. JMH doesn't support measuring the CPU time of the exporter thread, so one has to use a profiler; in my case I used YJP.

Raw data of CPU (user + kernel) time in ms: [image]

Chart: [image]

Throughput stays the same: [image]

@jkwatson (Contributor) commented Mar 9, 2021:

So, if I read your graphs correctly, "ConcurrentQueue with batch signal" (Java's built-in ConcurrentLinkedQueue, sending a signal to the exporter every 512 adds to the queue) is the best option that doesn't involve us depending on a 3rd-party library. Do you have a PR with that implementation? Does it also do time-based exports so it will work correctly in low-throughput situations?

@sbandadd (Contributor Author) commented Mar 9, 2021:

Yes, it is Java's built-in ConcurrentLinkedQueue. I have a local branch that uses ConcurrentLinkedQueue, and yes, it does time-based export. We can go with this approach for now and pull in MPSCQueue later, since that makes the implementation really clean and most efficient.

I can update this PR if there is consensus.

@@ -33,13 +33,13 @@
@State(Scope.Benchmark)
public class BatchSpanProcessorBenchmark {

private static class DelayingSpanExporter implements SpanExporter {
public static class DelayingSpanExporter implements SpanExporter {
Contributor:

If we want to re-use this, please pull it up out of this class to the top level, so it's clear it's a reusable component for benchmarking.

@jkwatson (Contributor):

I understand all of what you've said. Why are you opposed to having this as a separate incubator implementation for others to try out before we make it the main default implementation?

@sbandadd (Contributor Author):

Also please note that throughput alone is a bad metric in this benchmark; one really has to look at exportedSpans. For example, I could write code that simply returns without adding to the queue, and the throughput would be super high. So the existing BatchSpanProcessorBenchmark is even more meaningless, TBH.

@jkwatson (Contributor):

Also please note that throughput alone is a bad metric in the benchmark. One has to really look at the exportedSpans. For example I can write code that simply returns without adding to the queue and the throughput would be super high. So the existing BatchSpanProcessorBenchmark is even more meaningless TBH

If you feel so strongly that our benchmark is meaningless (btw, please watch your language and be aware that there are human beings who wrote this code), then let's do something else. Please put in a PR that makes the benchmarks meaningful as a separate PR from also changing the implementations.

@sbandadd (Contributor Author):

I am in no way talking about the people who contributed the code. We really appreciate all the work of the people who contributed here. I am only pointing out that the existing benchmark is not helpful, for multiple reasons:

  1. BatchSpanProcessorBenchmark and BatchSpanProcessorFlushBenchmark are duplicates.
  2. A forceFlush() on every loop creates a bottleneck rather than putting stress on the BSP.
  3. Measuring only throughput is not helpful; the number of spans actually exported is more appropriate, and measuring the CPU usage of the exporter thread is also important.

And we have contributed two benchmarks that address the above issues. I will look into the incubator implementation; thanks for pointing it out.

@jkwatson (Contributor):

I am in no way talking about people who contributed to the code. We really appreciate all the work of the people who contributed here. I am only pointing out that the existing benchmark is not helpful for multiple reasons.

  1. BatchSpanProcessorBenchmark and BatchSpanProcessorFlushBenchmark are actually duplicate.
  2. A forceFlush() on every loop creates a bottleneck and not creating stress on the BSP
  3. Measuring only throughput is not helpful since number of spans getting exported is more appropriate. Measuring CPU usage of exporter thread is also important.

And we have contributed two benchmarks that addresses the above issues. I will look into the incubator implementation, thanks for pointing it up.

So, it will be very helpful to have the work separated. Let's have the benchmarks updated to be better, and have general agreement from the maintainers that it is better and more useful. Let's run the existing implementation against them once that is merged. Then, as a separate step, let's have a PR that changes the implementation and shows the change in the output of the benchmarks.

@sbandadd (Contributor Author):

Separate PR for benchmarks https://github.com/open-telemetry/opentelemetry-java/pull/3017/files


private Worker(
SpanExporter spanExporter,
long scheduleDelayNanos,
int maxExportBatchSize,
long exporterTimeoutNanos,
BlockingQueue<ReadableSpan> queue) {
ConcurrentLinkedQueue<ReadableSpan> queue,
Contributor:

I'm trying to get up to speed on this thread: did you compare with ArrayBlockingQueue? I would be surprised if the linked list, with the allocations it needs to do, would perform better, and in any case we want to avoid the complicated counter increment. There is almost no work in the blocking section, so even with contention it should not really put threads to sleep, I guess.

Contributor:

I'm particularly worried about the allocations, since more garbage means unrelated app code ends up being affected by increased GC.

@sbandadd (Contributor Author) commented Mar 15, 2021:

Could you please elaborate on what you meant by "compare with ArrayBlockingQueue"? My benchmarks took ArrayBlockingQueue (the current implementation) as the baseline, and this approach showed clear improvements in throughput and CPU. Do you mean using ArrayBlockingQueue with a batched signal? If so, yes, I did run that benchmark, but it has much lower throughput than the concurrent queue, though there are good CPU overhead improvements.

Regarding the GC, yes, there would be some increased allocation, so ideally we should be pulling in MPSCQueue.

Contributor:

Do you mean using ArrayBlockingQueue with batched signal?

Yeah, I meant this. I don't see it in the graphs; it would be nice to see the numbers to know what the real effect is. If it means simplifying the code (not keeping track of the span count, for example) at a small cost, then it's still worth it, especially given we may try to replace the queue with MPSCQueue.

@sbandadd (Contributor Author) commented Mar 15, 2021:

Here are the benchmark results. Throughput suffers with the blocking queue, but exporter CPU overhead is lower with the batched signal. ConcurrentQueue is better overall, though.

[image]

[image]

I don't see any significant gc.alloc.rate increase. The increase is well within the error bounds.

With blocking queue batch signal: [image]

With concurrent queue batch signal: [image]

if (batch.size() >= maxExportBatchSize || System.nanoTime() >= nextExportTime) {
exportCurrentBatch();
updateNextExportTime();
}
if (queue.isEmpty()) {
lock.lock();
Contributor:

Can we use a blocking queue for the signalling too? I think a one-element queue is mostly equivalent, but it means we don't have to manage the locks ourselves, simplifying the code drastically.
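A quick illustration of why a one-element blocking queue works as a signal (a sketch, not the PR's code): redundant offers coalesce when the slot is already full, and a timed poll doubles as the worker's bounded wait.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Demo: a capacity-1 ArrayBlockingQueue behaves like a coalescing wake-up signal.
class OneElementSignalDemo {
  static int coalesce() {
    BlockingQueue<Boolean> signal = new ArrayBlockingQueue<>(1);
    signal.offer(Boolean.TRUE); // first signal is stored
    signal.offer(Boolean.TRUE); // redundant signal is discarded (offer returns false)
    int wakeUps = 0;
    try {
      // The worker wakes once, not twice; the timed poll is the bounded wait.
      while (signal.poll(1, TimeUnit.MILLISECONDS) != null) {
        wakeUps++;
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return wakeUps;
  }
}
```

This replaces the explicit ReentrantLock/Condition pair: the queue's internal lock does the waiting and signalling, and a missed-signal bug is impossible because an offered element stays in the queue until polled.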

@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch from 4e94955 to c102d8f on March 15, 2021 21:27
} else {
queue.offer(span);
queueSize.incrementAndGet();
if (addedSpansCounter.incrementAndGet() % maxExportBatchSize == 0) {
Contributor:

@jkwatson This reminds me of an issue I filed a long, long time ago:

open-telemetry/opentelemetry-specification#849

I never got clarification on whether we're supposed to eagerly export when we have maxExportBatchSize items. But it's true that this is what our current code does, so we should preserve it for now, which I think is the intent of this line.

droppedSpans.add(1);
} else {
queue.offer(span);
Contributor:

I'm not comfortable with the thread-safety issues of this: it seems the queue can easily go over maxQueueSize, which is an invariant we need to preserve. Since ArrayBlockingQueue is a queue with a max capacity, let's just use it; it's designed for our job here. Throughput with a 0ms delay is sort of interesting theoretically, but it's not an actual use case for this class. We can still enjoy the CPU improvements when idle thanks to the signalling, which is what matters.

Contributor Author:

Agreed, the queue can temporarily go over the maxQueueSize constraint. How about optimistically trying to insert into the queue and backing off if maxQueueSize is reached?

      if (size.incrementAndGet() > maxQueueSize) {
        size.decrementAndGet();
        return false;
      }
      queue.offer(span);
      return true;

ConcurrentLinkedQueue does have better behavior under contention, with reduced exporter CPU overhead and writer-thread overhead.

If MPSCQueue is the way forward as an extension, then I feel we are good with a blocking queue and batch signaling. Is it realistic to assume a batch span processor based on MPSCQueue could be supported soon?

@anuraaga (Contributor) commented Mar 17, 2021:

@sbandadd I discussed a bit with @jkwatson, and let's go ahead and split this into two parts, in this order:

  1. Replace constant polling with signaling.
  2. Improve the concurrency of the queuing.

We don't need to bundle them into a single PR. So can we update this one to use signaling but stick with ArrayBlockingQueue, which you've demonstrated does improve CPU significantly? In a followup we can address concurrency, most likely by vendoring in MPSCQueue, as I don't see that being a big problem.

Contributor Author:

Sure thing. This sounds like a solid plan to me. Thanks for reaching a consensus!

@jkwatson (Contributor):

@sbandadd would it be possible for you to attend one of our SIG meetings to discuss this all in more detail? Or jump onto CNCF slack and discuss in more real-time?

@sbandadd (Contributor Author):

Sure, I can join the SIG meetings. I joined the CNCF opentelemetry channel as well. Thanks!

Comment on lines 200 to 217
while (!queue.isEmpty() && batch.size() < maxExportBatchSize) {
batch.add(queue.poll().toSpanData());
}

if (batch.size() >= maxExportBatchSize || System.nanoTime() >= nextExportTime) {
exportCurrentBatch();
updateNextExportTime();
}
if (queue.isEmpty()) {
try {
long pollWaitTime = nextExportTime - System.nanoTime();
if (pollWaitTime > 0) {
signal.poll(pollWaitTime, TimeUnit.NANOSECONDS);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
}
@anuraaga (Contributor) commented Mar 17, 2021:

Does this produce the same result using drain? Then I think it lets us use queue.size() >= maxExportBatchSize in addSpan instead of maintaining a counter.

while (queue.size() >= maxExportBatchSize) {
  batch.clear();  // Can extract a method for these three lines.
  queue.drainTo(batch);
  export(batch);
}
if (!queue.isEmpty() && System.nanoTime() >= nextExportTime) {
  batch.clear();
  queue.drainTo(batch);
  export(batch);
  updateNextExportTime();
} else {
  long pollWaitTime = nextExportTime - System.nanoTime();
  if (pollWaitTime > 0) {
    signal.poll(pollWaitTime, TimeUnit.NANOSECONDS);
  }
}

Contributor:

I guess the race would go away anyway if we remove signal.clear() from that suggestion; the worst case is just that it evaluates the while and the if one more time before sleeping.

Contributor Author:

The logic looks really complicated, IMHO. Repeatedly copying the queue into the batch is a bit confusing, as is the race. Using a counter, we are essentially amortizing the cost of the signal; it is intuitive that a signal every maxExportBatchSize adds implies the exporter thread, on receiving the signal, is guaranteed to do the export.

Contributor Author:

In your suggestion, isn't export(batch) already clearing the batch? Also, in the if condition !queue.isEmpty() && System.nanoTime() >= nextExportTime, it exports the whole batch at once instead of in maxExportBatchSize chunks.

Contributor:

it is intuitive to think that a signal every maxExportBatchSize essentially implies the exporter thread when receiving the signal is guaranteed to do the export.

But can't the signal be missed, since it's already exporting or something like that?

In your suggestion, isn't export(batch) already clearing the batch?

Yeah probably, it's somewhat pseudocode

Also in the if condition !queue.isEmpty() && System.nanoTime() >= nextExportTime, it is exporting the whole batch at once instead of in maxExportBatchSize chunks.

We've already exported maxExportBatchSize chunks. However, every time the interval passes we are supposed to clear the queue fully, so this is where the remaining spans get exported. It won't export more than maxExportBatchSize even if the queue suddenly had many spans added, because drainTo won't fill more than the batch's capacity, so it's ok.

I think the loop-then-if pattern here actually captures the complexity more directly. We have two conditions:

  • Eagerly export as many spans as possible when over maxExportBatchSize
  • If the interval has passed, make sure to export the entire queue

So these are handled with two conditionals any time the thread wakes up. Does that make sense?

Let's not worry about the race: if we never clear the signal queue we're good, and I think that will work well.

Contributor:

The current code handles sending chunks with size less than maxExportBatchSize when the interval has passed. I am not sure what becomes unclear here.

If I'm not mistaken, if a chunk is sent that isn't maxExportBatchSize in size, then the % goes out of sync. The intention of the % is to send chunks of size maxExportBatchSize, but if any batch is sent with a different size due to the interval check, then when the signal fires we will have fewer or more than maxExportBatchSize spans in the queue, in an unclear way. For example, if the batch size is 5 and 4 spans are sent due to the interval check, then the signal will fire with only 1 span in the queue. I don't think that's our intention with that signal.
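A tiny simulation of that drift, just to make the arithmetic concrete (hypothetical names; `queue.clear()` stands in for the interval-triggered export):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Worked example: with maxExportBatchSize = 5, an interval export of 4 spans
// leaves the modulo counter out of sync, so the next signal fires with 1 span queued.
class CounterDriftDemo {
  static int spansQueuedAtSignal() {
    ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    int maxExportBatchSize = 5;
    int added = 0;
    int queuedAtSignal = -1;
    for (int i = 1; i <= 5; i++) {
      queue.offer("span-" + i);
      added++;
      if (i == 4) {
        // Interval check fires and exports the 4 queued spans...
        queue.clear();
        // ...but the add counter keeps counting toward the next multiple of 5.
      }
      if (added % maxExportBatchSize == 0) {
        queuedAtSignal = queue.size(); // signal fires with just 1 span in the queue
      }
    }
    return queuedAtSignal;
  }
}
```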

Contributor Author:

I updated the PR to just use an atomic boolean and the queue size. I think it makes things much clearer, though I didn't see any change in the benchmark results.

drainTo() holds the queue lock for a long time, which negatively impacts the writer threads.

Contributor:

I like this latest iteration quite a bit. Unfortunately, it looks like BatchSpanProcessorTest.forceExport() is quite flaky with this implementation: it failed about 1 in 5 times for me locally. I'm not sure what's going on.

Contributor Author:

@jkwatson could you paste the error? I ran the test 10 times and couldn't reproduce it locally.

Contributor:

It just fails the first assertion with a '0' rather than a '49'.

Description:
The batch span processor is currently aggressive in the sense that every new span is signaled to the exporter. This incurs a lot of signaling overhead under heavy load, and constant-polling overhead from the exporter thread under light load. This PR makes the exporter thread wait for maxExportBatchSize spans to avoid busy-polling the queue.

BatchSpanProcessorMultiThreadBenchmark.java result
![image](https://user-images.githubusercontent.com/62265954/111420486-893c7300-86a8-11eb-8f87-feb2f86f00fc.png)

BatchSpanProcessorCpuBenchmark.java result
![image](https://user-images.githubusercontent.com/62265954/111420492-8e012700-86a8-11eb-800e-7de1fbe2c2b1.png)
@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch from f47c1dc to 5ac7737 on March 18, 2021 17:50
@anuraaga (Contributor) left a comment:

This seems like a relatively small change with a significant improvement in signaling strength. Thanks!

@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch 2 times, most recently from 37d2863 to 556210e on March 19, 2021 00:27
@sbandadd sbandadd force-pushed the sbandadd-optimize-bsp-pr branch from 556210e to 216fa44 on March 19, 2021 00:28
@sbandadd (Contributor Author):

I made a minor update to use an AtomicInteger instead of an AtomicLong.

@jkwatson (Contributor) left a comment:

Thanks for the hard work on this!

@anuraaga anuraaga merged commit 085eb9d into open-telemetry:main Mar 19, 2021
@anuraaga (Contributor):

Thanks @sbandadd!
