Added 'resize' operation to BoundedQueue #1949

Merged: 2 commits into jaegertracing:master on Dec 19, 2019

Conversation

jpkrohling (Contributor)

Signed-off-by: Juraci Paixão Kröhling [email protected]

Which problem is this PR solving?

Short description of the changes

  • This adds a new Resize(int) method to the BoundedQueue
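
For orientation, this is the public surface the change implies, inferred from the test assertions quoted later in this conversation (a hedged sketch; the merged signatures may differ in detail):

// Sketch only, inferred from the tests below (q.Resize(4) and
// q.Produce("e") both return bool); not the authoritative API.
type resizableBoundedQueue interface {
	Produce(item interface{}) bool // false if the item was not accepted
	Resize(capacity int) bool      // false if nothing was changed
	Capacity() int
	Size() int
}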

codecov bot commented Nov 28, 2019

Codecov Report

Merging #1949 into master will decrease coverage by 0.05%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1949      +/-   ##
==========================================
- Coverage   96.99%   96.94%   -0.06%     
==========================================
  Files         203      203              
  Lines       10064    10082      +18     
==========================================
+ Hits         9762     9774      +12     
- Misses        264      269       +5     
- Partials       38       39       +1
Impacted Files Coverage Δ
pkg/queue/bounded_queue.go 94.44% <100%> (-5.56%) ⬇️
cmd/query/app/static_handler.go 86.84% <0%> (-1.76%) ⬇️


Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 46d7f17...bec2db4.

jpkrohling changed the title from "Added 'resize' operation to BoundedQueue" to "WIP - Added 'resize' operation to BoundedQueue" on Nov 28, 2019
jpkrohling changed the title from "WIP - Added 'resize' operation to BoundedQueue" to "Added 'resize' operation to BoundedQueue" on Nov 28, 2019
yurishkuro (Member)

how is it supposed to be used? who would call the Resize()? Isn't it possible to just pre-determine the queue size upfront?

jpkrohling (Contributor, Author)

> how is it supposed to be used? who would call the Resize()?

A change to the span processor will make it keep track of the number of spans and the total bytes it has seen during the process lifetime. Once enough spans have been recorded, the span processor starts calculating the ideal queue size based on the average span size and the available memory. It can then decide to trigger a Resize() operation.

I have a local branch that does the above, but I thought it would be cleaner to separate those two features into separate PRs.
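
To make the idea concrete, here is a rough sketch of such a calculation (hypothetical names; this is not the code from the local branch mentioned above):

// Hypothetical sketch: derive a target queue capacity from the average
// span size observed so far and a configured memory budget. None of
// these names come from the actual span-processor branch.
func targetQueueCapacity(spansSeen, bytesSeen, memoryBudgetBytes int64) int {
	if spansSeen == 0 {
		return 0 // not enough data yet to estimate an average span size
	}
	avgSpanBytes := bytesSeen / spansSeen
	if avgSpanBytes == 0 {
		return 0
	}
	return int(memoryBudgetBytes / avgSpanBytes)
}

The processor could then periodically compare this target with q.Capacity() and call q.Resize(target) when the two diverge by more than a few percentage points.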

> Isn't it possible to just pre-determine the queue size upfront?

When we know the expected average span size, yes. But I've not heard of any cases where people know the average span size in advance, which makes the queue size hard to calculate.

jpkrohling (Contributor, Author) commented Nov 29, 2019

I just added a benchmark to the tests, to make sure that I'm not introducing performance problems to the queue. Here are the results comparing master with this PR, which show no performance degradation for current users of the queue (unless I'm doing something really wrong with the benchmark):

I'm updating the benchmark results: I was hitting some data races when reading the len/cap of the channel and changed the code to use a RWMutex. I ran the benchmark a few times, as I'm getting a big standard deviation.

Master:

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	56784301	       136 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	7.813s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	57453411	        90.4 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	5.303s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	80754200	       275 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	26.512s

This PR, which seems more consistent:

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	18307092	       330 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	6.388s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	17270967	       337 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	6.175s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	17288391	       316 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	5.828s

yurishkuro (Member)

I am skeptical about relying on the average span size, because the actual size differences can be very large, 2-3 orders of magnitude, e.g. a basic span vs. a span with a stack trace vs. a span with the request payload added to the logs. I think if we want to go by memory size, we should go by the actual size of the spans in the queue. The proto/domain model already implements a Size() method IIRC, which could be used as an approximation of the actual memory size, and we would need to re-implement the queue using an explicit buffer rather than a bounded channel (might be worth looking around for an existing implementation).
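
For illustration, one way such a byte-bounded queue could gate producers, assuming the enqueued items expose the Size() method mentioned above (a sketch only, not code from this PR):

package queue

import "sync"

// sizer matches the Size() method the proto/domain model exposes.
type sizer interface {
	Size() int
}

// byteBoundedQueue is an illustrative sketch of the alternative described
// above: bound the queue by the total byte size of the enqueued items
// rather than by the item count.
type byteBoundedQueue struct {
	mu           sync.Mutex
	maxBytes     int64
	currentBytes int64
	buf          []sizer
}

// Produce rejects an item if admitting it would exceed the byte budget.
func (q *byteBoundedQueue) Produce(item sizer) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	sz := int64(item.Size())
	if q.currentBytes+sz > q.maxBytes {
		return false // caller can drop the item and report it as dropped
	}
	q.currentBytes += sz
	q.buf = append(q.buf, item)
	return true
}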

jpkrohling (Contributor, Author)

> the actual size differences can be very large, 2-3 orders of magnitude, e.g. a basic span vs. a span with a stack trace vs. a span with the request payload added to the logs

You are right, but in practice this wouldn't be a big problem. The queue sizes will typically be big enough to make outliers irrelevant. In my working branch for #943, spans generated by tracegen have an average size of 1.5KiB, which means a queue size of about 700k items at 1GiB. And even if spans suddenly start being bigger, the periodic background process will readjust the queue size as long as the difference is bigger than a few percentage points. Note that the "memory" aspect is an approximation: there shouldn't be a crash if we end up using more memory for a short period of time. And most of the time, the queue should be empty anyway ;-)
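
For reference, the arithmetic behind the ~700k figure: 1 GiB / 1.5 KiB = 1,073,741,824 B / 1,536 B ≈ 699,000 items, i.e. roughly 700k.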

I would give this simple approach a try first and see how it works in the real world.

jpkrohling (Contributor, Author)

@yurishkuro any feedback on this one?

q.resizeLock.Lock()
defer q.resizeLock.Unlock()

new := make(chan interface{}, newCap)
Member

  • what if newCap is less than currentCap and the current size? This will probably deadlock
  • please don't use the builtin new as a var name

Contributor Author

Funny, I thought I had added a test for the downscale, but looks like I haven't. Will add one.

yurishkuro (Member)

I wonder if it's possible to avoid the mutex by approaching this differently. Goroutines are cheap to start, so rather than copying the messages into another queue, we could start another set of workers reading from a new channel, close the original channel, and let those old workers drain it naturally and exit.
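
A condensed sketch of that idea (simplified and hypothetical; the code that was eventually merged into bounded_queue.go differs in detail):

package queue

// resizableQueue is a simplified, hypothetical sketch of the suggestion
// above, not the code that was merged: Resize swaps in a new channel,
// starts a fresh set of consumers on it, and closes the old channel so
// the old consumers drain it naturally and then exit. Synchronizing the
// producers' access to q.items is deliberately left out here; that is
// exactly what the review thread further down (CAS vs. atomic.Value vs.
// mutex) is about.
type resizableQueue struct {
	capacity int
	items    chan interface{}
	consume  func(item interface{})
}

func (q *resizableQueue) startConsumers(ch chan interface{}, num int) {
	for i := 0; i < num; i++ {
		go func() {
			// the range loop ends once ch is closed and fully drained
			for item := range ch {
				q.consume(item)
			}
		}()
	}
}

// Resize replaces the backing channel; the previous consumers keep
// reading from the old channel until it is empty, so queued items are
// not lost.
func (q *resizableQueue) Resize(capacity, numConsumers int) bool {
	if capacity == q.capacity {
		return false
	}
	previous := q.items
	q.items = make(chan interface{}, capacity)
	q.capacity = capacity
	q.startConsumers(q.items, numConsumers)
	close(previous) // old workers finish the leftover items, then stop
	return true
}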

jpkrohling (Contributor, Author)

That could work, let me give it a try.

jpkrohling (Contributor, Author)

PR updated to use a new channel for the resized queue.

jpkrohling (Contributor, Author)

@yurishkuro would you please re-review?

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
// we might have two concurrent backing queues at the moment
// their combined size is stored in q.size, and their combined capacity
// should match the capacity of the new queue
if atomic.LoadInt32(&q.size) >= int32(q.Capacity()) && q.Capacity() > 0 {
Member

s/atomic.LoadInt32(&q.size)/int32(q.Size()) to keep consistent with the capacity
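
Applying that substitution, the condition would read roughly like this (sketch):

if int32(q.Size()) >= int32(q.Capacity()) && q.Capacity() > 0 {
	// old and new channels combined are already at capacity; reject the item
}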

Member

I am not sure I understand the && q.Capacity() > 0 part. Isn't the capacity always bigger than 0? Does it matter in this if?

Contributor Author

There are consumers of this queue that are not setting the queue size. IIRC, one of them is in the span processor tests. Agree with your earlier suggestion though.

Member

I see there is a test which sets the size to 0


released, resized := false, false
q.StartConsumers(1, func(item interface{}) {
if !resized { // we'll have a second consumer once the queue is resized
Member

I didn't find the second consumer

Contributor Author

Around line 200:

resized = true
assert.True(t, q.Resize(4))
assert.True(t, q.Produce("e")) // in process by the second consumer
secondConsumer.Wait()

released := false
q.StartConsumers(1, func(item interface{}) {
// once we release the lock, we might end up with multiple calls to reach this
if !released {
Member

Released does not seem to be needed

Contributor Author

Why's that? Note that there are quite a few goroutines just waiting for this lock to be released before calling the consumer with other items. If we have not released, we count the wait group down one step. If we have already released the consumer, we do not want to count down further, as we'd reach a negative count.

Member

I mean you can remove this variable altogether. It's not changed during the test, only at the end, which does not affect the test, and the default value satisfies this if statement.

Member

Removal of this variable does not have any effect on the tests. It seems redundant

Contributor Author

It needs to be there, otherwise the test will intermittently fail due to "negative counts" in the wait group.


assert.EqualValues(t, 2, q.Capacity())
assert.EqualValues(t, 4, q.Size()) // the queue will eventually drain, but it will live for a while over capacity
assert.EqualValues(t, 0, len(*q.items)) // the new queue is empty, as the old queue is still full and over capacity
Member

Shouldn't we keep accepting here? There are free spots in the queue.

Contributor Author

There are free spots in the individual channel, but if the queue size was set by the user to 3000, it's wrong to have 6000 items in the two queues combined.

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
pavolloffay (Member) left a comment

LGTM. Let's see how this will be applied in the collector, but that is part of a separate issue.

jpkrohling (Contributor, Author)

Thanks!

jpkrohling merged commit 3fa8a08 into jaegertracing:master on Dec 19, 2019
size int32
onDroppedItem func(item interface{})
items chan interface{}
items *chan interface{}
Member

This is being updated with a CAS in one place but accessed directly in other places, which is technically not thread safe. I would rather we used a normal mutex or atomic.Value.
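
For illustration, the atomic.Value variant could look roughly like this (a sketch, not the code that eventually landed):

package queue

import "sync/atomic"

// channelHolder sketches the suggestion above: keep the backing channel
// in an atomic.Value so the writer (Resize) and every reader (Produce,
// the consumers) go through the same synchronized access path, instead
// of CAS-updating a *chan in one place and dereferencing it directly in
// others.
type channelHolder struct {
	items atomic.Value // always stores a chan interface{}
}

func (h *channelHolder) get() chan interface{} {
	return h.items.Load().(chan interface{})
}

func (h *channelHolder) set(ch chan interface{}) {
	h.items.Store(ch)
}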

jpkrohling deleted the Resize-bounded-queue branch on July 28, 2021, 19:21