
Add storage metrics to OTEL, metrics by span service name #2431

Merged
@pavolloffay merged 5 commits into jaegertracing:master on Sep 3, 2020

Conversation

@pavolloffay (Member) commented on Aug 28, 2020:

Resolves #2165 (comment)

This PR adds storage metrics to the OTEL distribution. The metrics show how many spans were stored, per service and per exporter; there is a similar metric in the current Jaeger distributions. The new metric looks like:

otelcol_spans_stored{service="jaeger-query",service_instance_id="23e9c69b-1383-4fbd-8614-638339329a8e",storage="in-memory"}

It is similar to the core metric:

otelcol_exporter_sent_spans{exporter="jaeger_memory",service_instance_id="23e9c69b-1383-4fbd-8614-638339329a8e"} 9

however, the core metric is not split by the reported service name.
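
For illustration, here is a minimal sketch of how such a per-service counter can be defined and recorded with OpenCensus, the stats/tag API used in the code quoted below. The names (tagServiceName, statSpansStored, RegisterViews, recordStored) are placeholders rather than the PR's actual identifiers, and the real metric also carries the storage/exporter name as an extra tag.

package storagemetrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical tag key and measure; the PR defines its own in
// cmd/opentelemetry/app/exporter/storagemetrics.
var (
	tagServiceName  = tag.MustNewKey("service")
	statSpansStored = stats.Int64("spans_stored", "Number of spans stored by the storage exporter", stats.UnitDimensionless)
)

// RegisterViews exposes the measure as a cumulative sum keyed by service
// name, which renders roughly as otelcol_spans_stored{service="jaeger-query",...}.
func RegisterViews() error {
	return view.Register(&view.View{
		Name:        statSpansStored.Name(),
		Description: statSpansStored.Description(),
		Measure:     statSpansStored,
		Aggregation: view.Sum(),
		TagKeys:     []tag.Key{tagServiceName},
	})
}

// recordStored adds count stored spans for one service.
func recordStored(ctx context.Context, serviceName string, count int64) {
	ctx, _ = tag.New(ctx, tag.Insert(tagServiceName, serviceName))
	stats.Record(ctx, statSpansStored.M(count))
}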

@pavolloffay requested a review from a team as a code owner on August 28, 2020 16:02
@pavolloffay changed the title from "Add storage metrics to OTEL, metrics by span service name" to "WIP: Add storage metrics to OTEL, metrics by span service name" on Aug 28, 2020
@pavolloffay (Author) commented:

Changing to WIP, I will do some cleanup on Monday.

@pavolloffay changed the title from "WIP: Add storage metrics to OTEL, metrics by span service name" to "Add storage metrics to OTEL, metrics by span service name" on Aug 31, 2020
@pavolloffay (Author) commented:

@joe-elliott could you please review?

@joe-elliott (Member) left a comment:

Added some questions/comments.

It looks like all exporters except for ES and Jaeger are wrapped in the SpanWriterExporter. I suppose we don't count exporting to Jaeger as "storage" so it should be excluded from this metric?

For my understanding, why does ES not use the SpanWriterExporter like the others?

cmd/opentelemetry/app/exporter/storagemetrics/metrics.go (review thread outdated, resolved)
numErrors := 0
for i, d := range blk.Items {
bulkOp := operationToSpan[i]
@joe-elliott (Member):

how confident are we that len(operationToSpan) == len(blk.Items)?

@pavolloffay (Author):

The length should match; blk contains the responses for the objects in operationToSpan.

if !bulkOp.isService {
ctx, _ := tag.New(ctx,
tag.Insert(storagemetrics.TagServiceName(), bulkOp.span.Process.ServiceName), w.nameTag)
stats.Record(ctx, storagemetrics.StatSpansNotStoredCount().M(1))
@joe-elliott (Member):

it would probably be faster to aggregate the number of failed spans in a local int and then Record them all at once. I believe you'll need to change the view aggregation below to .Sum

if we did this we'd either have to keep a map of ServiceName => ErrorCount or drop the service name tag. Service name could be useful on a metric like this, but I don't think it's vital.

@pavolloffay (Author):

I have changed it to use two maps.

+1, the aggregation had to be changed to .Sum

if we did this we'd either have to keep a map of ServiceName => ErrorCount or drop the service name tag. Service name could be useful on a metric like this, but I don't think it's vital.

If we drop the service name we can drop this metric altogether, as there is already a generic metric for successful/failed exporter sends:

otelcol_exporter_sent_spans{exporter="jaeger_memory",service_instance_id="23e9c69b-1383-4fbd-8614-638339329a8e"} 9
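
For reference, a rough sketch of the aggregation the thread converges on, using hypothetical measure/tag definitions repeated here so the snippet is self-contained (mirroring the earlier sketch): failed spans are first counted per service in a local map, and then a single Record call is issued per service. With a view.Sum() aggregation the totals come out the same as recording each span individually, but with far fewer Record calls and their per-call locking and allocations.

package storagemetrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"

	"github.com/jaegertracing/jaeger/model"
)

// Hypothetical measure/tag, not the PR's actual identifiers.
var (
	tagServiceName     = tag.MustNewKey("service")
	statSpansNotStored = stats.Int64("spans_not_stored", "Number of spans that failed to be stored", stats.UnitDimensionless)
)

// recordNotStored counts failed spans per service locally, then records
// one measurement per service instead of one per span.
func recordNotStored(ctx context.Context, failedSpans []*model.Span) {
	notStoredByService := map[string]int64{}
	for _, span := range failedSpans {
		notStoredByService[span.Process.ServiceName]++
	}
	for serviceName, count := range notStoredByService {
		ctx, _ := tag.New(ctx, tag.Insert(tagServiceName, serviceName))
		stats.Record(ctx, statSpansNotStored.M(count))
	}
}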

err := s.Writer.WriteSpan(span)
if err != nil {
errs = append(errs, err)
dropped++
stats.Record(ctx, storagemetrics.StatSpansNotStoredCount().M(1))
@joe-elliott (Member):

similar thoughts here on aggregating and then storing the metrics all at once. concerned mainly about allocations and locking per Record call.

@pavolloffay (Author):

I thought about it; we would have to build a map[string]int (serviceName -> count) to do that. Not sure which one is better.

@pavolloffay (Author) commented:

For my understanding, why does ES not use the SpanWriterExporter like the others?

ES uses a "native" OTEL implementation. The SpanWriterExporter uses the storage factory, which creates a Jaeger span writer that stores data one span at a time, and it does two model translations (OTEL to the Jaeger model, then to the storage model).
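
To make the shape of that path concrete, a small illustrative sketch follows (not the PR's actual code; the writer interface is a simplified stand-in for spanstore.Writer). It shows the per-span write loop that runs after the first translation into the Jaeger model; the second translation, from the Jaeger model into the backend's storage model, happens inside each concrete span writer.

package exporterexample

import (
	"github.com/jaegertracing/jaeger/model"
)

// jaegerSpanWriter is a simplified stand-in for Jaeger's spanstore.Writer.
// A concrete implementation performs the second translation
// (Jaeger model -> storage-specific model) internally.
type jaegerSpanWriter interface {
	WriteSpan(span *model.Span) error
}

// writeBatches writes the already-translated Jaeger batches one span at a
// time, mirroring how the SpanWriterExporter path stores data per span.
func writeBatches(batches []*model.Batch, writer jaegerSpanWriter) (dropped int, errs []error) {
	for _, batch := range batches {
		for _, span := range batch.Spans {
			if span.Process == nil {
				// In the Jaeger model the Process may be attached to the
				// batch rather than to each span.
				span.Process = batch.Process
			}
			if err := writer.WriteSpan(span); err != nil {
				errs = append(errs, err)
				dropped++
			}
		}
	}
	return dropped, errs
}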

@pavolloffay (Author) commented:

It looks like all exporters except for ES and Jaeger are wrapped in the SpanWriterExporter. I suppose we don't count exporting to Jaeger as "storage" so it should be excluded from this metric?

Jaeger is a gRPC exporter, not storage. The aim of this PR is to provide a metric that shows the number of spans stored per service.

@joe-elliott (Member) left a comment:

Changes look good to me.

Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>

var _ spanstore.Writer = (*storageWrapper)(nil)

func (s storageWrapper) WriteSpan(ctx context.Context, span *model.Span) error {
@pavolloffay (Author):

Since the storage factory has a similar implementation, I have used that instead.

Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay (Author) commented:

codecov seems to be flaky :/

@pavolloffay merged commit 4120220 into jaegertracing:master on Sep 3, 2020

Successfully merging this pull request may close these issues.

Are the current set of collector metrics adequate?
2 participants