-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dedupe process in grpc reporter #1181
Dedupe process in grpc reporter #1181
Conversation
return r.send(trace.Spans, process) | ||
} | ||
|
||
func (r *Reporter) send(spans []*model.Span, process model.Process) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that jaeger spans will have process nil whereas zipkin spans will have the process and process passed here is from the first span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should pass process=nil for zipkin path? I don't recall how collector handles batches, it should not apply batch-level Process to each span if spans already have a Process attached. This would be more compatible with, say, Istio use case where one client can emit spans for different processes.
Once we do that, we can kill this zipkin code path in the processor API and pull it higher in the chain, i.e. since you're already calling toDomain in the grpc package (but not in tchannel package), if we confirm that we don't cause a data loss we can call toDomain before invoking the processor.
NB: we may still need to keep the zipkin tchannel endpoint in the collector because some very old clients may be using direct tchannel connections to the collector, although I find it hard to believe since the agent was implemented long before we open sourced Jaeger backend (like 1yr before), so nobody outside of Uber should be using those old clients. If so, we can remove that endpoint and re-keep it in our internal build only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall how collector handles batches
// PostSpans implements gRPC CollectorService.
func (g *GRPCHandler) PostSpans(ctx context.Context, r *api_v2.PostSpansRequest) (*api_v2.PostSpansResponse, error) {
for _, span := range r.GetBatch().Spans {
if span.GetProcess() == nil {
span.Process = &r.Batch.Process
}
}
_, err := g.spanProcessor.ProcessSpans(r.GetBatch().Spans, JaegerFormatType)
if err != nil {
g.logger.Error("cannot process spans", zap.Error(err))
return nil, err
}
return &api_v2.PostSpansResponse{}, nil
}
It sets the process only when it is missing.
Another issue is that this zipkin path skips zipkin sanitizers which are used in collector handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we do that, we can kill this zipkin code path in the processor API and pull it higher in the chain, i.e. since you're already calling toDomain in the grpc package (but not in tchannel package), if we confirm that we don't cause a data loss we can call toDomain before invoking the processor.
I don't get you here. Once we do what pass the nill process? I have just added sanitizers directly to the agent.
so nobody outside of Uber should be using those old clients. If so, we can remove that endpoint and re-keep it in our internal build only.
tracked in #1186 for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we have this flow:
udp server jaeger.thrift -> reporter::SubmitBatch
udp server zipkin.thrift -> reporter::SubmitZipkinSpans
tchannel reporter::SubmitBatch -> pass through to collector
tchannel reporter::SubmitZipkinSpans -> pass through to collector
grpc reporter::SubmitBatch -> convert to domain and send to collector
grpc reporter::SubmitZipkinSpans -> convert to jaeger.thrift, then to domain and send to collector
What I am proposing (could be done in a different PR), is to eliminate reporter::SubmitZipkinSpans
completely and have this:
udp server jaeger.thrift -> reporter::SubmitBatch
tchannel reporter::SubmitBatch -> pass through to collector
grpc reporter::SubmitBatch -> convert to domain and send to collector
udp server zipking.thrift -> convert to jaeger.thrift -> reporter::SubmitBatch
This way the Reporter only has a single method for Jaeger spans (in Thrift, for now).
Once we remove tchannel completely, we can push thrift->domain conversion into UDP servers as well, and Reporter interface effectively becomes == Collector service from proto IDL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it sounds reasonable to me. I would prefer a separate PR to do it as it might be bigger than I think right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created to track it separately #1193
Codecov Report
@@ Coverage Diff @@
## master #1181 +/- ##
======================================
Coverage 100% 100%
======================================
Files 157 157
Lines 7107 7104 -3
======================================
- Hits 7107 7104 -3
Continue to review full report at Codecov.
|
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
6627b51
to
f981d30
Compare
I think I have addressed all comments. I have created #1193 to remove
|
Signed-off-by: Pavol Loffay <[email protected]>
@@ -28,6 +28,8 @@ const ( | |||
|
|||
var ( | |||
defaultDuration = int64(1) | |||
// StandardSanitizers is a list of standard zipkin sanitizers. | |||
StandardSanitizers = []Sanitizer{NewSpanStartTimeSanitizer(), NewSpanDurationSanitizer(), NewParentIDSanitizer(), NewErrorTagSanitizer()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: one per line
@@ -155,6 +155,6 @@ message Trace { | |||
message Batch { | |||
repeated Span spans = 1; | |||
Process process = 2 [ | |||
(gogoproto.nullable) = false | |||
(gogoproto.nullable) = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would remove the option altogether (true is the default afaik)
Related to #773 (comment) and #1165 (comment)
Signed-off-by: Pavol Loffay [email protected]