
Dedupe process in grpc reporter #1181

Merged (6 commits into jaegertracing:master, Nov 16, 2018)

Conversation

pavolloffay (Member)

Related to #773 (comment) and #1165 (comment)

Signed-off-by: Pavol Loffay [email protected]

return r.send(trace.Spans, process)
}

func (r *Reporter) send(spans []*model.Span, process model.Process) error {

pavolloffay (Member Author)

Note that Jaeger spans will have a nil process, whereas Zipkin spans will have the process set on each span, and the process passed here is taken from the first span.
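
For context, a minimal sketch of what send could look like on the wire side, assuming the reporter holds an api_v2.CollectorServiceClient (the field name and import wiring below are illustrative, not the PR's exact code) and assuming Batch.Process is the nullable pointer form this PR switches to:

package grpc // illustrative package name

import (
	"context"

	"github.com/jaegertracing/jaeger/model"
	"github.com/jaegertracing/jaeger/proto-gen/api_v2"
)

// Reporter is a reduced stand-in for the real struct in
// cmd/agent/app/reporter/grpc/reporter.go.
type Reporter struct {
	collector api_v2.CollectorServiceClient
}

// send wraps the spans and the batch-level process into a single
// api_v2.PostSpansRequest, so the Process travels once per batch
// instead of once per span.
func (r *Reporter) send(spans []*model.Span, process model.Process) error {
	batch := model.Batch{Spans: spans, Process: &process}
	_, err := r.collector.PostSpans(context.Background(),
		&api_v2.PostSpansRequest{Batch: batch})
	return err
}

Whether the batch-level Process or a span-level one wins is then decided collector-side, as in the PostSpans handler quoted later in this thread.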

yurishkuro (Member)

Maybe we should pass process=nil for the zipkin path? I don't recall how the collector handles batches; it should not apply the batch-level Process to each span if the spans already have a Process attached. This would be more compatible with, say, the Istio use case, where one client can emit spans for different processes.

Once we do that, we can kill this zipkin code path in the processor API and pull it higher in the chain: since you're already calling toDomain in the grpc package (but not in the tchannel package), and provided we confirm it doesn't cause data loss, we can call toDomain before invoking the processor.

NB: we may still need to keep the zipkin tchannel endpoint in the collector because some very old clients may be using direct tchannel connections to the collector. I find that hard to believe, though, since the agent was implemented long before we open sourced the Jaeger backend (about a year before), so nobody outside of Uber should be using those old clients. If so, we can remove that endpoint and keep it only in our internal build.

pavolloffay (Member Author)

> I don't recall how the collector handles batches

// PostSpans implements gRPC CollectorService.
func (g *GRPCHandler) PostSpans(ctx context.Context, r *api_v2.PostSpansRequest) (*api_v2.PostSpansResponse, error) {
	for _, span := range r.GetBatch().Spans {
		if span.GetProcess() == nil {
			span.Process = &r.Batch.Process
		}
	}
	_, err := g.spanProcessor.ProcessSpans(r.GetBatch().Spans, JaegerFormatType)
	if err != nil {
		g.logger.Error("cannot process spans", zap.Error(err))
		return nil, err
	}
	return &api_v2.PostSpansResponse{}, nil
}

So the collector sets the batch-level process on a span only when the span's own process is missing.

Another issue is that this zipkin path skips the zipkin sanitizers that are applied in the collector handler.
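
For illustration, a sketch of how the agent side could run the same zipkin sanitizers before converting to the domain model; the helper name and wiring are hypothetical, relying only on the Sanitizer interface and the StandardSanitizers list this PR exports from cmd/collector/app/sanitizer/zipkin (see the diff further down):

package grpc // illustrative placement in the agent's grpc reporter package

import (
	zipkinsanitizer "github.com/jaegertracing/jaeger/cmd/collector/app/sanitizer/zipkin"
	"github.com/jaegertracing/jaeger/thrift-gen/zipkincore"
)

// sanitizeZipkin is a hypothetical helper: it applies each standard zipkin
// sanitizer to every span, mirroring what the collector handler does,
// before the spans are converted with toDomain.
func sanitizeZipkin(spans []*zipkincore.Span) []*zipkincore.Span {
	out := make([]*zipkincore.Span, 0, len(spans))
	for _, span := range spans {
		for _, s := range zipkinsanitizer.StandardSanitizers {
			span = s.Sanitize(span)
		}
		out = append(out, span)
	}
	return out
}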

pavolloffay (Member Author), Nov 15, 2018

> Once we do that, we can kill this zipkin code path in the processor API and pull it higher in the chain: since you're already calling toDomain in the grpc package (but not in the tchannel package), and provided we confirm it doesn't cause data loss, we can call toDomain before invoking the processor.

I don't follow you here. Once we do what, pass the nil process? I have just added the sanitizers directly to the agent.

> so nobody outside of Uber should be using those old clients. If so, we can remove that endpoint and keep it only in our internal build.

Tracked in #1186.

yurishkuro (Member), Nov 16, 2018

Right now we have this flow:

udp server jaeger.thrift -> reporter::SubmitBatch
udp server zipkin.thrift -> reporter::SubmitZipkinSpans

tchannel reporter::SubmitBatch -> pass through to collector
tchannel reporter::SubmitZipkinSpans -> pass through to collector

grpc reporter::SubmitBatch -> convert to domain and send to collector
grpc reporter::SubmitZipkinSpans -> convert to jaeger.thrift, then to domain and send to collector

What I am proposing (this could be done in a different PR) is to eliminate reporter::SubmitZipkinSpans completely and have this:

udp server jaeger.thrift -> reporter::SubmitBatch
    tchannel reporter::SubmitBatch -> pass through to collector
    grpc reporter::SubmitBatch -> convert to domain and send to collector
udp server zipkin.thrift -> convert to jaeger.thrift -> reporter::SubmitBatch

This way the Reporter only has a single method for Jaeger spans (in Thrift, for now).

Once we remove tchannel completely, we can push the thrift->domain conversion into the UDP servers as well, and the Reporter interface effectively becomes the Collector service from the proto IDL.
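
To make the proposed end state concrete, the agent's Reporter interface (cmd/agent/app/reporter/reporter.go) would shrink to a single submission method; the sketch below illustrates the proposal, not the current code, which still carries a separate zipkin method:

package reporter

import "github.com/jaegertracing/jaeger/thrift-gen/jaeger"

// Reporter, reduced to the single method proposed above: the UDP server
// converts zipkin.thrift to jaeger.thrift before calling it, and once
// tchannel is gone the argument can become the domain/proto Batch instead.
type Reporter interface {
	// EmitBatch forwards a jaeger.thrift batch to the collector.
	EmitBatch(batch *jaeger.Batch) error
}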

pavolloffay (Member Author)

I see, this sounds reasonable to me. I would prefer to do it in a separate PR, as it might be bigger than I think right now.

pavolloffay (Member Author)

I have created #1193 to track it separately.

codecov bot commented Nov 14, 2018

Codecov Report

Merging #1181 into master will not change coverage.
The diff coverage is 100%.


@@          Coverage Diff           @@
##           master   #1181   +/-   ##
======================================
  Coverage     100%    100%           
======================================
  Files         157     157           
  Lines        7107    7104    -3     
======================================
- Hits         7107    7104    -3
Impacted Files Coverage Δ
...d/collector/app/sanitizer/zipkin/span_sanitizer.go 100% <ø> (ø) ⬆️
cmd/collector/app/builder/span_handler_builder.go 100% <100%> (ø) ⬆️
model/converter/thrift/jaeger/to_domain.go 100% <100%> (ø) ⬆️
cmd/collector/app/grpc_handler.go 100% <100%> (ø) ⬆️
cmd/agent/app/reporter/grpc/reporter.go 100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c7da0b4...90326eb.

cmd/agent/app/reporter/grpc/reporter.go: two review threads, now outdated and resolved.

pavolloffay (Member Author)

I think I have addressed all the comments. I have created #1193 to remove EmitBatch. If there are no other comments, I think this could go in, along with the metrics PR #1180. #1193 will also handle the removal of duplicated metrics, e.g.

jaeger_agent_tchannel_reporter_batches_submitted{format="jaeger"} 0
jaeger_agent_tchannel_reporter_batches_submitted{format="zipkin"} 0

Signed-off-by: Pavol Loffay <[email protected]>
@@ -28,6 +28,8 @@ const (

var (
defaultDuration = int64(1)
// StandardSanitizers is a list of standard zipkin sanitizers.
StandardSanitizers = []Sanitizer{NewSpanStartTimeSanitizer(), NewSpanDurationSanitizer(), NewParentIDSanitizer(), NewErrorTagSanitizer()}

yurishkuro (Member)

nit: one per line

@@ -155,6 +155,6 @@ message Trace {
message Batch {
repeated Span spans = 1;
Process process = 2 [
(gogoproto.nullable) = false
(gogoproto.nullable) = true

yurishkuro (Member)

nit: I would remove the option altogether (true is the default afaik)
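
For context on the nit: with gogoproto, nullable = false generates a value field and nullable = true (the gogoproto default) generates a pointer, so the change makes a batch without a top-level process representable. Roughly (types trimmed for illustration):

package model // trimmed illustration of the generated code

type Span struct{}    // trimmed
type Process struct{} // trimmed

// With (gogoproto.nullable) = true, or the option omitted, the generated
// field is a pointer and can be nil; with nullable = false it would be a
// value field (Process Process) that always carries a zero-value Process.
type Batch struct {
	Spans   []*Span
	Process *Process // nil when the batch has no top-level process
}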

yurishkuro merged commit 56ec296 into jaegertracing:master on Nov 16, 2018.
ghost removed the review label on Nov 16, 2018.