Dropping a small number of spans on master #2042
Is there a race condition here?
If the goroutine is swapped off the thread between these lines, and another goroutine then consumes the queue item, it could subtract from the size counter before the producer adds to it.
Is this a correct analysis?
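For illustration, here is a minimal sketch of that ordering (assumed names and structure, not Jaeger's actual BoundedQueue code): the producer sends the item before incrementing the size counter, so a consumer scheduled in between decrements first. With an unsigned counter the transient "negative" value wraps to roughly 4.29 billion, and any concurrent capacity check momentarily sees a full queue and drops spans:

```go
package main

import (
	"fmt"
	"sync"

	"go.uber.org/atomic"
)

// boundedQueue is an illustrative stand-in for the queue described above,
// with an unsigned size counter (as on master).
type boundedQueue struct {
	capacity uint32
	size     *atomic.Uint32
	items    chan []byte
}

// Produce drops the item when the queue looks full. While the counter is
// wrapped (see StartConsumer), this check sees a huge value and drops the
// span even though the queue is nearly empty.
func (q *boundedQueue) Produce(item []byte) bool {
	if q.size.Load() >= q.capacity {
		return false // dropped
	}
	q.items <- item // the goroutine may be swapped off the thread here...
	q.size.Add(1)   // ...before this increment runs
	return true
}

// StartConsumer drains the queue, decrementing the counter after each receive.
func (q *boundedQueue) StartConsumer(handle func([]byte)) {
	go func() {
		for item := range q.items {
			// This can execute before the producer's Add(1), momentarily
			// wrapping the unsigned counter to ~4.29 billion.
			q.size.Sub(1)
			handle(item)
		}
	}()
}

func main() {
	q := &boundedQueue{
		capacity: 300000,
		size:     atomic.NewUint32(0),
		items:    make(chan []byte, 300000),
	}
	q.StartConsumer(func([]byte) {})

	dropped := atomic.NewInt64(0)
	var wg sync.WaitGroup
	for p := 0; p < 4; p++ { // several concurrent producers, as in a busy collector
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 250000; i++ {
				if !q.Produce([]byte("span")) {
					dropped.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("dropped:", dropped.Load()) // may be small but nonzero despite the 300K capacity
}
```

Run with multiple producers, this occasionally reports a small nonzero drop count even though the queue never comes close to its capacity, which matches the symptom described below.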
I talked with @gouthamve on Gitter about this. We definitely need some protection to never go negative, but he's going to do some debugging to try to find out why it's going negative.
@joe-elliott: thanks for the hint! It might be. Since you have both the code and the situation at hand, would you be able to give it a try and validate the theory?
@jpkrohling submitted a fix that's working in our environment: #2044. Let me know if you'd rather approach it a different way.
What is actually going negative, the queue size gauge? If so, (a) I wouldn't worry about it, or (b) just fix where the gauge value is recorded to clamp negative values to zero. That gauge is only important when it's a very large number; minor fluctuations are irrelevant and mostly random due to an arbitrary reporting interval.
Yes, the queue size gauge.
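As a minimal sketch of option (b) above (hypothetical names, not Jaeger's actual identifiers): keep the internal counter signed and clamp only where the gauge value is recorded, so a transient -1 never shows up in the metric:

```go
package main

import (
	"fmt"

	"go.uber.org/atomic"
)

// sizeReporter keeps the queue size as a signed value so a transient -1
// is representable instead of wrapping.
type sizeReporter struct {
	size *atomic.Int64
}

// gaugeValue clamps negative transient values to zero before the gauge is
// recorded, so brief races never surface in the metric.
func (r *sizeReporter) gaugeValue() int64 {
	if v := r.size.Load(); v > 0 {
		return v
	}
	return 0
}

func main() {
	r := &sizeReporter{size: atomic.NewInt64(0)}
	r.size.Sub(1) // consumer decremented before the matching increment
	fmt.Println("raw:", r.size.Load(), "reported:", r.gaugeValue()) // raw: -1 reported: 0
}
```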
This works at 10K spans/sec on a single collector :) |
Requirement - what kind of business use case are you trying to solve?
Trying to run master and checking if everything works.
Problem - what in Jaeger blocks you from solving the requirement?
Dropping a small number of spans
Any open questions to address
Here's more info:
We are currently running Jaeger at about 10K spans/sec on a host, but we noticed that
rate(jaeger_collector_spans_dropped_total[5m])
is ~1 after upgrading to master. That shouldn't happen, because we have a huge queue capacity (300K). So I put in some debug logging and saw this in the logs:
I see that the queue size handling recently moved to using Uber's atomic package, and the counter changed from int32 to uint32. I think the size previously used to dip to -1 briefly and things still worked, but now, with an unsigned type, it wraps around and breaks.
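For what it's worth, here is a tiny demonstration (assumed code, not Jaeger's) of why the int32 to uint32 change matters: decrementing an unsigned atomic below zero wraps to 4294967295 instead of -1, so a size >= capacity check briefly treats the queue as full:

```go
package main

import (
	"fmt"

	"go.uber.org/atomic"
)

func main() {
	size := atomic.NewUint32(0)
	size.Sub(1) // the transient decrement-before-increment from the race

	capacity := uint32(300000)
	fmt.Println(size.Load())             // 4294967295, not -1
	fmt.Println(size.Load() >= capacity) // true: the queue looks full, so spans get dropped
}
```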