Solve panic due to concurrent access to ExportSpans #3058
tonistiigi merged 1 commit into moby:master
Conversation
tonistiigi left a comment
I'm not sure if this is really correct.
If your claim that ExportSpans can only be called from a single thread is correct, then this is not the only place where it gets called. It will also get called by regular context traces from buildkitd (and maybe container forwarding as well).
In that case, the only way I can see is that all the exporter implementations in https://github.com/moby/buildkit/blob/master/util/tracing/detect/detect.go#L78 would need to be wrapped with something that makes ExportSpans safe to call, before the exporter is used to create a trace provider.
Tbh, if the description is correct I don't really understand the logic of this API design. The exporter is supposed to be the input to sdktrace.NewBatchSpanProcessor in the regular flow. But if everything breaks down when there are two span processor/trace provider instances sharing the same exporter, I don't understand the point of defining an exporter interface at all.
I've mentioned this thread in the original issue in the OpenTelemetry SDK repo; hopefully they will respond. Without knowing much about why this architecture was chosen, there is a benefit to the exporter interface: it allows for different implementations of the exporter. Going through all calls to ExportSpans in the codebase:
I see what you mean about that.
https://github.com/moby/buildkit/blob/master/util/tracing/detect/detect.go#L78 will return, in this case, for example the Jaeger exporter. Then this exporter is called, for example, in control, where you added the mutex. But it also goes to https://github.com/moby/buildkit/blob/master/util/tracing/detect/detect.go#L99, and from there all the local traces will end up in the exporter. Even if the control path has a mutex and
force-pushed from 1503aa6 to 8c7c731
Sorry about the delay, I've changed the implementation to create a thread-safe exporter instance and use that instead. Let me know if you'd like me to change anything! :)
@tonistiigi - can you re-review? Thanks!
Hi @tonistiigi - can you re-review this please? I'd like to stop having to maintain my fork :)
@gsaraf Could you rebase? GH doesn't show a conflict, but this code has actually changed in the latest version.
force-pushed from 3cae6c6 to fe649ce
Thanks for the review! Rebased and made the requested changes.
force-pushed from fe649ce to f81f662
Signed-off-by: Gahl Saraf <saraf.gahl@gmail.com>
force-pushed from f81f662 to afb01a7
Background:
After enabling OpenTelemetry export to Jaeger using the JAEGER_TRACE env var, we started seeing occasional panics across our fleet of buildkitd containers. A helpful pointer from the people at the OpenTelemetry Go repo indicated a possible concurrency problem. Wrapping the ExportSpans call with a mutex solved the problem completely.
Issue: #3004
Testing: Before this fix, we would get several panics a day. Since applying it, we have gone several weeks without any panics.
I've also run the suggested tests:
./hack/test integration gateway dockerfile.