OTel Collector compatibility of the metrics-generator #2970

Open
schewara opened this issue Sep 29, 2023 · 6 comments
Labels
component/metrics-generator, keepalive

Comments

@schewara

Is your feature request related to a problem? Please describe.

While trying to test the integration of the Span Metrics Connector, I found some compatibility issues between the OTel spanmetricsconnector and the Tempo spanmetrics processor (I didn't have a look at the Grafana Agent).

  1. The metric names in Tempo seem to differ from the "OTel Semantic Conventions" (v1.21.0)
    [namespace_]duration_milliseconds_bucket vs. traces_spanmetrics_latency_bucket
    From inspecting the Semantic Conventions for HTTP Metrics as well as other Metrics (e.g. Promtail, Loki, OTel Automatic Instrumentations, ...) duration seems to be the correct and most commonly used name

  2. The OTel spanmetricsconnector allows defining a namespace for the generated metrics.
    Grafana and Tempo use a hardcoded traces_spanmetrics namespace/prefix which, it seems, cannot be changed.

Describe the solution you'd like

It would be really great if the OTel Connectors would also work seamlessly with Grafana and Tempo (including the OTel Service Graph Connector), to allow frictionless migration between different components, based on the individual use-case.

Namespace support would also be nice, but could be addressed alternatively with a default namespace in the Connector, or a note in the documentation.

Describe alternatives you've considered

To have Span Metrics and Service Graphs working with Grafana, the only viable options seem to be to use Tempo's metrics-generator, or to use a Processor in the Collector to rename the metrics so that they are compatible with what Grafana needs.

Additional context

@joe-elliott added the keepalive and component/metrics-generator labels Oct 2, 2023
@joe-elliott
Member

Thanks for the issue!

(I didn't have a look at the Grafana Agent)

Grafana agent vendors OTel components directly so it will be in line with the OTel Collector's behavior.

It would be really great if the OTel Connectors would also work seamlessly with Grafana and Tempo

Yes!

Currently we do have a lot of configuration options that allow the user to control the shape of the output metrics. I think a good path here is to make sure that Tempo span metrics can be configured to look like OTel Collector metrics by adding any required configuration. Then we can provide some example configurations to make the two equivalent.

It is unfortunately not simple to just change the default metric names since operators have built dashboards/alerts/etc. on top of them. This could be a very costly breaking change to some of our users.

Another issue in play is that Grafana has some custom experiences built around these metrics in the Tempo Explore pane. So even if the user were able to adjust their config so Tempo metrics looked like OTel there would still be this gap where it would break some functionality in Grafana. Going to cc @grafana/observability-traces-and-profiling for thoughts.

Also, @rlankfo has done some work in this area on our side and would love to have his input.

@aocenas
Member

aocenas commented Oct 3, 2023

Yeah, we sort of assume the names of the metrics in all the queries that use them on the Tempo side. We could make that configurable, which is not hard, although that is another layer of configuration, which makes things harder for the user.

I wonder if we could somehow autodetect the naming, if there is a reasonable pattern with just a few options like (namespace_)?(duration_milliseconds_bucket|traces_spanmetrics_latency_bucket) maybe we can just run some discovery query during configuration to set this up automatically.
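As a rough illustration of the autodetection idea, the pattern above could be applied to the metric names returned by a discovery query. This is only a sketch: the function name `detect_naming`, the "tempo"/"otel" return values, and the characters allowed in the namespace are assumptions made for the example, not anything Grafana or Tempo implements.

```python
import re

# Pattern from the comment above: an optional namespace prefix followed by one
# of the two known histogram bucket names. The namespace character set is an
# assumption based on what Prometheus accepts in metric names.
BUCKET_NAME = re.compile(
    r"^(?:(?P<namespace>[a-zA-Z_:][a-zA-Z0-9_:]*?)_)?"
    r"(?P<name>duration_milliseconds_bucket|traces_spanmetrics_latency_bucket)$"
)


def detect_naming(metric_names):
    """Return (convention, namespace) for the first matching name, else None."""
    for metric in metric_names:
        m = BUCKET_NAME.match(metric)
        if m is None:
            continue  # unrelated metric, keep looking
        if m.group("name") == "traces_spanmetrics_latency_bucket":
            return ("tempo", m.group("namespace"))
        return ("otel", m.group("namespace"))
    return None


if __name__ == "__main__":
    print(detect_naming(["up", "myns_duration_milliseconds_bucket"]))
```

The list of names would come from whatever metadata lookup the data source already supports; that part is left out here.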

@rlankfo
Member

rlankfo commented Oct 3, 2023

When generating metrics in Tempo, it's possible to use relabeling during remote write. This would allow you to do things like rename metrics, drop labels, etc. You should technically be able to align your metric names with semantic conventions in this way.

Here's an example of a rename and label drop:

metrics_generator:
  registry:
    external_labels:
      source: "tempo"
  storage:
    path: "/tmp/tempo/generator/wal"
    remote_write:
      - url: "${MIMIR_URL}/api/v1/push"
        send_exemplars: true
        write_relabel_configs:
          - source_labels: ["__name__", "connection_type"]
            target_label: "__name__"
            separator: "@"
            regex: "traces_service_graph_request_client_(.*)@database"
            replacement: 'db_client_duration_$1'
          - regex: "connection_type"
            action: "labeldrop"

In this example, I'm renaming traces_service_graph_request_client_(.*) metrics to db_client_duration_$1 if the connection_type is database. Additionally, the connection_type label is dropped.
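To make the mechanics of the two rules easier to follow, here is a small Python model of what they do to a single series. It is a simplified sketch and not Prometheus' actual implementation: the helper names `relabel_replace` and `labeldrop` and the sample series are invented for illustration. It relies on two documented relabeling behaviors: source label values are joined with the separator, and the regex is anchored at both ends.

```python
import re


def relabel_replace(labels, source_labels, separator, regex, target_label, replacement):
    """Simplified model of the default `replace` relabel action."""
    # Missing source labels are treated as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # The regex must match the whole joined value, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: the series is left unchanged
    # Convert $1-style references into Python's \g<1> form before expanding.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out = dict(labels)
    out[target_label] = match.expand(template)
    return out


def labeldrop(labels, regex):
    """Remove every label whose name fully matches the regex."""
    return {k: v for k, v in labels.items() if re.fullmatch(regex, k) is None}


if __name__ == "__main__":
    # Hypothetical series, chosen only to exercise the rules.
    sample = {
        "__name__": "traces_service_graph_request_client_seconds_bucket",
        "connection_type": "database",
        "client": "app",
    }
    renamed = relabel_replace(
        sample,
        ["__name__", "connection_type"],
        "@",
        "traces_service_graph_request_client_(.*)@database",
        "__name__",
        "db_client_duration_$1",
    )
    print(labeldrop(renamed, "connection_type"))
```

Because the regex only matches when the joined value ends in `@database`, series with any other connection_type keep their original name.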

This is a good article on how relabeling in Prometheus works: https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/
Here's the official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

I hope this helps!

@joe-elliott
Member

joe-elliott commented Oct 4, 2023

I wonder if we could somehow autodetect the naming, if there is a reasonable pattern with just a few options like (namespace_)?(duration_milliseconds_bucket|traces_spanmetrics_latency_bucket) maybe we can just run some discovery query during configuration to set this up automatically.

@aocenas This might be a nice feature to add to the list. If we can support Tempo and OTEL then we'd also get Grafana Agent (since it creates OTEL metrics). Honestly, you might not even need to autodetect anything. We might be able to write some clever PromQL queries that sum both values up.
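One possible shape for such a query, sketched as a small Python helper that builds the PromQL string: each counter's rate is summed and falls back to zero when the series does not exist, so the total works with either naming scheme. The helper name `combined_rate_query` and the default 5m window are assumptions; the two metric names in the usage line are the ones mentioned elsewhere in this thread.

```python
def combined_rate_query(metric_names, window="5m"):
    """Build a PromQL expression that adds up the per-second rate of several
    counter series, treating a series that does not exist as zero."""
    if not metric_names:
        raise ValueError("at least one metric name is required")
    terms = [
        # `or vector(0)` keeps the addition from returning nothing when one
        # of the naming schemes is absent.
        f"(sum(rate({name}[{window}])) or vector(0))"
        for name in metric_names
    ]
    return " + ".join(terms)


if __name__ == "__main__":
    print(combined_rate_query(
        ["traces_spanmetrics_calls_total", "duration_milliseconds_count"]
    ))
```

This only makes sense for call counts. The histogram bucket series would likely need more care, since the two schemes may not share units or bucket boundaries.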

Thanks for the example @rlankfo !

@schewara
Author

schewara commented Oct 4, 2023

After some further testing, I came across some more findings. They are not strictly about compatibility, since they also affect the OTel Connectors themselves, but I wanted to add them here for more context.

  • For services which support OTel Automatic Instrumentation including client libraries (like the Java Agent), or services which have their client calls instrumented, they most likely already provide the same metric as the one generated by the span metrics.

    http_client_duration_milliseconds_count vs traces_spanmetrics_calls_total vs [namespace_]duration_milliseconds_count
    (with dimensions matching the client metric)

    Tempo does allow filtering, but if you want to use these metrics in the Service Graph view you still need them around, which more or less duplicates already existing metrics.

  • In our case we also wanted to integrate all our external service/API calls, but realized that a node metric is only generated for incoming requests:

    traces_service_graph_request_total{client="user", connection_type="virtual_node", server="my_service"}

    For non-microservice setups, the ability to also display external virtual nodes would be super helpful (assuming the calls_total metric has the dimension available and the cardinality does not explode).

For these reasons I am currently testing an alternative approach: creating a Node Graph Panel in Grafana based on the existing client metrics, with some PromQL and Grafana transformation magic. I still have to wrap my head around it.

@cdaguerre

Renaming the Prometheus metrics during export would probably break the Tempo panel in Grafana (node graphs and span RPS, error rate, etc.).
Is it possible to "configure" the metric names used by the Tempo panel?


5 participants