Dropping a small number of spans on master #2042
Is there a race condition here?
If the goroutine is swapped off the thread between these lines, and another goroutine then consumes the queue item, it could subtract from the size counter before the producer adds to it.
Is this a correct analysis?
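For illustration, here is a minimal sketch of that ordering (assumed names and structure, not Jaeger's actual BoundedQueue code): the producer sends the item before incrementing the size counter, so a consumer scheduled in between decrements first. With an unsigned counter the transient "negative" value wraps to roughly 4.29 billion, and any concurrent capacity check momentarily sees a full queue and drops spans:

```go
package main

import (
	"fmt"
	"sync"

	"go.uber.org/atomic"
)

// boundedQueue is an illustrative stand-in for the queue described above,
// with an unsigned size counter (as on master).
type boundedQueue struct {
	capacity uint32
	size     *atomic.Uint32
	items    chan []byte
}

// Produce drops the item when the queue looks full. While the counter is
// wrapped (see StartConsumer), this check sees a huge value and drops the
// span even though the queue is nearly empty.
func (q *boundedQueue) Produce(item []byte) bool {
	if q.size.Load() >= q.capacity {
		return false // dropped
	}
	q.items <- item // the goroutine may be swapped off the thread here...
	q.size.Add(1)   // ...before this increment runs
	return true
}

// StartConsumer drains the queue, decrementing the counter after each receive.
func (q *boundedQueue) StartConsumer(handle func([]byte)) {
	go func() {
		for item := range q.items {
			// This can execute before the producer's Add(1), momentarily
			// wrapping the unsigned counter to ~4.29 billion.
			q.size.Sub(1)
			handle(item)
		}
	}()
}

func main() {
	q := &boundedQueue{
		capacity: 300000,
		size:     atomic.NewUint32(0),
		items:    make(chan []byte, 300000),
	}
	q.StartConsumer(func([]byte) {})

	dropped := atomic.NewInt64(0)
	var wg sync.WaitGroup
	for p := 0; p < 4; p++ { // several concurrent producers, as in a busy collector
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 250000; i++ {
				if !q.Produce([]byte("span")) {
					dropped.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("dropped:", dropped.Load()) // may be small but nonzero despite the 300K capacity
}
```

Run with multiple producers, this occasionally reports a small nonzero drop count even though the queue never comes close to its capacity, which matches the symptom described below.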
I talked with @gouthamve on Gitter about this. We definitely need some protection to never go negative, but he's going to do some debugging to try to find out why it's going negative.
@joe-elliott: thanks for the hint! It might be. Since you have both the code and the situation at hand, would you be able to give it a try and validate the theory?
@jpkrohling submitted a fix that's working in our environment: #2044. Let me know if you'd rather approach it a different way.
What is actually going negative, the queue size gauge? If so, (a) I wouldn't worry about it, or (b) just fix where the gauge value is recorded to clamp negative values to zero. That gauge is only important when it's a very large number; minor fluctuations are irrelevant and mostly random due to an arbitrary reporting interval.
Yes, the queue size gauge.
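As a minimal sketch of option (b) above (hypothetical names, not Jaeger's actual identifiers): keep the internal counter signed and clamp only where the gauge value is recorded, so a transient -1 never shows up in the metric:

```go
package main

import (
	"fmt"

	"go.uber.org/atomic"
)

// sizeReporter keeps the queue size as a signed value so a transient -1
// is representable instead of wrapping.
type sizeReporter struct {
	size *atomic.Int64
}

// gaugeValue clamps negative transient values to zero before the gauge is
// recorded, so brief races never surface in the metric.
func (r *sizeReporter) gaugeValue() int64 {
	if v := r.size.Load(); v > 0 {
		return v
	}
	return 0
}

func main() {
	r := &sizeReporter{size: atomic.NewInt64(0)}
	r.size.Sub(1) // consumer decremented before the matching increment
	fmt.Println("raw:", r.size.Load(), "reported:", r.gaugeValue()) // raw: -1 reported: 0
}
```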
This works at 10K spans/sec on a single collector :) |
Requirement - what kind of business use case are you trying to solve?
Trying to run master and checking if everything works.
Problem - what in Jaeger blocks you from solving the requirement?
Dropping a small number of spans
Any open questions to address
Here's more info:
We are currently running Jaeger at about 10K spans/sec on a host, but we noticed that
rate(jaeger_collector_spans_dropped_total[5m])
is ~1 after upgrading to master. That shouldn't happen, because we have a huge queue capacity (300K). So I put in some debug logging and saw this in the logs:
I see that the queue size handling recently moved to using Uber's atomic package, and the counter changed from int32 to uint32. I think the size previously used to dip to -1 briefly and things still worked, but now, with an unsigned type, it wraps around and breaks.
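For what it's worth, here is a tiny demonstration (assumed code, not Jaeger's) of why the int32 to uint32 change matters: decrementing an unsigned atomic below zero wraps to 4294967295 instead of -1, so a size >= capacity check briefly treats the queue as full:

```go
package main

import (
	"fmt"

	"go.uber.org/atomic"
)

func main() {
	size := atomic.NewUint32(0)
	size.Sub(1) // the transient decrement-before-increment from the race

	capacity := uint32(300000)
	fmt.Println(size.Load())             // 4294967295, not -1
	fmt.Println(size.Load() >= capacity) // true: the queue looks full, so spans get dropped
}
```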