Add kafka client metrics #6138

Merged
merged 17 commits into from
Jul 6, 2022
Conversation

jack-berg
Member

@jack-berg jack-berg commented Jun 4, 2022

Proposal to bridge metrics from the kafka client library into OpenTelemetry.

As this PR currently stands, only metrics that align with those already declared in the kafka metric semantic conventions are bridged. However, the semantic conventions focus on monitoring kafka brokers rather than clients. There's a lot more useful data available, and I'd like the semantic conventions to be extended to cover it.

I've added a utility that makes it easier to explore which metrics are available, and if / how they're mapped into OpenTelemetry metrics. The utility prints out all this info in a markdown table, which I've included in the README.

I could use some feedback about:

  • The best place to include this code from an artifact standpoint.

** Updated 6/8/22 **
I've updated the PR to be a general purpose bridge between kafka client metrics and OpenTelemetry metrics, rather than an allow list of known metrics. My reasoning is as follows:

  • There are 202 metrics exposed by kafka client metrics. I made it through mapping about 60% of them before I decided that manually mapping all the metrics is madness. It's error-prone, brittle against changes between kafka client versions, and adds little value.
  • The metrics kafka client exposes are conceptually similar to what we do in our micrometer shim: Kafka client has its own internal metrics system. Its authors have already done the analysis to decide what is important to monitor when using the kafka client, and they've provided a hook to access the metrics in a generic format. In bridging these metrics to opentelemetry we should use a light hand: map the metrics to the appropriate opentelemetry instrument types, but minimize changes to metric names, descriptions, etc. Trying to map these to semantic conventions is folly, because we don't control the instrumentation at its source and the authors could change the metrics drastically in the next version.
  • The lack of mapping to semantic conventions doesn't mean this data is not useful. It is very useful to folks operating applications that produce to and consume from kafka. We'd be doing a disservice to opentelemetry users, and hampering the adoption of opentelemetry, by rejecting good telemetry data for lack of semantic conventions. Furthermore, I don't think we should try to codify these metrics in the semantic conventions: 202 metrics is just too much data to review. Additionally, there are likely to be significant differences in the metrics exposed by kafka clients from language to language. Does that mean we should withhold this data that's available to java kafka client users? No. There's a place for semantic conventions, and there's a place for bridging in telemetry data that is available but not under our control.
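The "light hand" mapping described above can be sketched with a stdlib-only model. All names here are hypothetical: the actual bridge works on Kafka's `KafkaMetric` objects and the OpenTelemetry `Meter` API, so treat this as an illustration of the generic mapping idea rather than the PR's implementation.

```java
// Simplified, stdlib-only model of a generic kafka-metric-to-OpenTelemetry
// mapping. Hypothetical helper names; real code works on Kafka's
// KafkaMetric and the OpenTelemetry Meter API.
public class KafkaMetricMapping {
  enum InstrumentType { DOUBLE_OBSERVABLE_COUNTER, DOUBLE_OBSERVABLE_GAUGE }

  // Keep Kafka's group/name intact; only join them into one metric name.
  static String toOtelName(String group, String name) {
    return "kafka." + group.replace("-metrics", "") + "." + name;
  }

  // Derive the instrument type from the shape of the metric rather than an
  // allow list: cumulative "-total" metrics become counters, the rest gauges.
  static InstrumentType toInstrumentType(String name) {
    return name.endsWith("-total")
        ? InstrumentType.DOUBLE_OBSERVABLE_COUNTER
        : InstrumentType.DOUBLE_OBSERVABLE_GAUGE;
  }

  public static void main(String[] args) {
    System.out.println(
        toOtelName("consumer-fetch-manager-metrics", "records-consumed-total"));
    System.out.println(toInstrumentType("records-consumed-total"));
    System.out.println(toInstrumentType("records-lag"));
  }
}
```

Because the mapping is purely mechanical, it stays correct when a new kafka client version adds or renames metrics, which is the point of the generic approach.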

@brunobat brunobat left a comment

First review in the project

@jack-berg jack-berg marked this pull request as ready for review June 13, 2022 23:08
@jack-berg jack-berg requested a review from a team June 13, 2022 23:08
@mateuszrzeszutek
Member

Kafka client has its own internal metrics system. Its authors have already done the analysis to decide what is important to monitor when using the kafka client, and they've provided a hook to access the metrics in a generic format. In bridging these metrics to opentelemetry we should use a light hand. We should map the metrics to the appropriate opentelemetry instrument types, but should minimize changes to metric names, descriptions, etc. Trying to map these to semantic conventions is folly, because we don't control the instrumentation at its source and the authors could change the metrics drastically in the next version.

100% agree with that 👍

The only thing that I'm slightly worried about is that we won't expose metrics that are common to all messaging clients (like the HTTP client/server metrics; think http.server.duration, but across all messaging instrumentations) - but since these are not defined yet in the spec I suppose we'll have to put off adding them until later.

@jack-berg
Copy link
Member Author

Thanks for the review @mateuszrzeszutek!

The only thing that I'm slightly worried about is that we won't expose metrics that are common to all messaging clients (like the HTTP client/server metrics; think http.server.duration, but across all messaging instrumentations) - but since these are not defined yet in the spec I suppose we'll have to put off adding them until later.

👍 My thinking is that bridging these metrics does not prevent us from later adding additional metrics which have consistency with other messaging systems.

@mateuszrzeszutek mateuszrzeszutek linked an issue Jun 22, 2022 that may be closed by this pull request
@mateuszrzeszutek mateuszrzeszutek removed a link to an issue Jun 22, 2022
and `[client-id, topic]`. If you analyze the sum of records consumed, ignoring dimensions, backends
are likely to double count. To alleviate this, `OpenTelemetryKafkaMetrics` detects this
scenario and only records the most granular set of attributes available. In the case
of `records-consumed-total`, it reports `[client-id, topic]` and ignores `[client-id]`.
Member Author

I just added a commit with new logic that does what's described in this comment. I think it's a fairly important improvement.

it.remove();
} else if (curRegisteredObservable.getInstrumentDescriptor().equals(instrumentDescriptor)
&& attributeKeys.size() > curAttributeKeys.size()
&& attributeKeys.containsAll(curAttributeKeys)) {
Member Author

Here's how I detect whether a metric exists with a less granular set of attribute keys, as explained in this part of the readme.

In the process of adding support for this, I shuffled some stuff around for better organization, including moving the table print method to OpenTelemetryKafkaMetricTest. It turns out that all the information needed to print the table can be obtained without any additional surface area.
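The subset check quoted in the snippet above boils down to a comparison on attribute-key sets. A minimal stdlib-only sketch (class and method names are mine, not the PR's):

```java
import java.util.Set;

// Stdlib-only sketch of the granularity check: when the same instrument is
// reported under two attribute-key sets and one is a strict superset of the
// other, only the more granular set should be recorded, so backends don't
// double count (e.g. records-consumed-total reported under both [client-id]
// and [client-id, topic]).
public class GranularityCheck {
  // True when 'candidate' carries strictly more attribute keys than
  // 'existing' while still containing all of them.
  static boolean isMoreGranular(Set<String> candidate, Set<String> existing) {
    return candidate.size() > existing.size() && candidate.containsAll(existing);
  }

  public static void main(String[] args) {
    Set<String> coarse = Set.of("client-id");
    Set<String> fine = Set.of("client-id", "topic");
    System.out.println(isMoreGranular(fine, coarse)); // true: supersedes coarse
    System.out.println(isMoreGranular(coarse, fine)); // false
  }
}
```

With this predicate, the registration loop can drop the coarser observable whenever a strictly finer one for the same instrument descriptor shows up, which matches the `records-consumed-total` example in the readme excerpt.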

@trask trask mentioned this pull request Jun 24, 2022
Member

@trask trask left a comment

thx!

* <p>This class is internal and is hence not for public use. Its APIs are unstable and can change
* at any time.
*/
public class OpenTelemetryMetricsReporter implements MetricsReporter {
Member Author

@anuraaga unfortunately this has to be public, so I've moved it to an internal class.
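For context on why the class must be public: Kafka clients instantiate `MetricsReporter` implementations reflectively from the standard `metric.reporters` client property. A stdlib-only sketch of that wiring (the fully qualified class name below is an assumption about the package layout, not a verified coordinate):

```java
import java.util.Properties;

// Sketch of how a Kafka client would pick up the reporter. The
// "metric.reporters" key is standard Kafka client configuration; the class
// name is an assumed package location for OpenTelemetryMetricsReporter.
public class ReporterRegistration {
  static Properties otelReporterConfig() {
    Properties config = new Properties();
    config.put("bootstrap.servers", "localhost:9092");
    // Kafka instantiates reporters reflectively from this property, which is
    // why OpenTelemetryMetricsReporter must be a public class even though it
    // lives in an "internal" package.
    config.put("metric.reporters",
        "io.opentelemetry.instrumentation.kafkaclients.OpenTelemetryMetricsReporter");
    return config;
  }

  public static void main(String[] args) {
    // A real client would be built roughly as:
    //   new KafkaConsumer<>(otelReporterConfig(), keyDeserializer, valueDeserializer)
    System.out.println(otelReporterConfig().getProperty("metric.reporters"));
  }
}
```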

@trask trask merged commit 3e08f36 into open-telemetry:main Jul 6, 2022
@trask
Member

trask commented Jul 6, 2022

thx @jack-berg!

5 participants