
Lets make low(er) cardinality metrics #1064

Closed
breedx-splk wants to merge 15 commits into open-telemetry:main from breedx-splk:lets_make_low_cardinality_metrics

@breedx-splk
Contributor

There are some telemetry data (such as jank statistics) that some folks think make sense to send as metrics. I get it. Flinging numeric data into events or spans and then hoping a backend can make sense of it or aggregate it or whatever...that's a little misaligned.

The challenge is that client applications run on a very diverse set of devices and platforms, which quickly leads to high-cardinality metrics, and that makes most timeseries databases unhappy. Thousands or millions of devices can generate many more millions of launches across different OS versions, device manufacturers, and application versions, and the permutation space becomes huge.
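To make the permutation space concrete, here is a minimal sketch of the arithmetic. The distinct-value counts are made-up illustrative numbers, not measurements, and `worstCaseSeries` is a hypothetical helper. The product is an upper bound, since some dimensions (manufacturer and model, for example) are correlated in practice.

```kotlin
// Hypothetical distinct-value counts per dimension; illustrative numbers only.
val distinctValues: Map<String, Long> = mapOf(
    "device.manufacturer" to 50L,
    "device.model.identifier" to 2_000L,
    "os.version" to 12L,
    "service.version" to 30L,
)

// Upper bound on the number of time series for one metric: the product of
// the distinct-value counts of every dimension that is kept on it.
fun worstCaseSeries(dimensions: Map<String, Long>, keep: Set<String>): Long =
    dimensions.filterKeys { it in keep }.values.fold(1L) { acc, n -> acc * n }

fun main() {
    val all = worstCaseSeries(distinctValues, distinctValues.keys)
    val reduced = worstCaseSeries(distinctValues, setOf("os.version", "service.version"))
    println("all dimensions: $all")                 // all dimensions: 36000000
    println("os + service version only: $reduced")  // os + service version only: 360
}
```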

This is the main reason so far why we have avoided doing much with metrics. High-cardinality metrics are usually more harmful than helpful. Furthermore, when you start dropping certain dimensions (attributes, resource attributes) to lower the cardinality, you lose granularity and are essentially aggregating across that dimension. For instance, if a metric were measuring start time and we drop the application version string to lower cardinality, then users who look at a dashboard of start time might be seeing data for many different versions in the wild. This makes this kind of data largely unactionable.
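A small sketch of what that aggregation looks like, using invented start-time samples. `StartSample` and both helper functions are hypothetical names for illustration, not part of the PR.

```kotlin
// One start-time measurement; the type and field names are hypothetical.
data class StartSample(val appVersion: String, val startMillis: Double)

// Mean start time per app version (the version dimension is kept).
fun meanByVersion(samples: List<StartSample>): Map<String, Double> =
    samples.groupBy { it.appVersion }
        .mapValues { (_, group) -> group.map { it.startMillis }.average() }

// Mean start time with the version dimension dropped: all versions blend together.
fun meanOverall(samples: List<StartSample>): Double =
    samples.map { it.startMillis }.average()

fun main() {
    val samples = listOf(
        StartSample("1.0.0", 400.0),
        StartSample("1.0.0", 420.0),
        StartSample("2.0.0", 900.0),
        StartSample("2.0.0", 880.0),
    )
    println(meanByVersion(samples)) // {1.0.0=410.0, 2.0.0=890.0}
    println(meanOverall(samples))   // 650.0
}
```

With the version kept, the regression in 2.0.0 is obvious. With it dropped, the dashboard shows a single 650 ms figure that matches neither version.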

But maybe we can find a middle ground and start working toward a set of dimensions which are useful to most users without blowing up the permutation space. And maybe this PR is a start.

This adds a new MetricsConfig for use with the OpenTelemetryRumInitializer in android-agent. This config has two sets of keys to include in metrics -- one for Attributes on data points, and one for Resource Attributes. Both are user-configurable and have what I guessed to be sane, reasonable defaults.

By default, the Android resource looks something like this:

Resource attributes:
     -> device.manufacturer: Str(Google)
     -> device.model.identifier: Str(sdk_gphone64_arm64)
     -> device.model.name: Str(sdk_gphone64_arm64)
     -> os.description: Str(Android Version 16 (Build BP22.250325.006 API level 36))
     -> os.name: Str(Android)
     -> os.type: Str(linux)
     -> os.version: Str(16)
     -> rum.sdk.version: Str(0.13.0-alpha-SNAPSHOT)
     -> service.name: Str(OpenTelemetryDemoApp)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.51.0)

and with the defaults here, it reduces to:

Resource attributes:
     -> os.name: Str(Android)
     -> os.type: Str(linux)
     -> os.version: Str(16)
     -> service.name: Str(OpenTelemetryDemoApp)
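The reduction above amounts to an allow-list filter over resource attribute keys. The following is a simplified sketch only: it models a resource as a plain map rather than using the SDK's Resource type, and `MetricsConfigSketch` with its property names is a hypothetical stand-in, not the MetricsConfig added in this PR. The default resource keys are copied from the reduced output above; the data point default is a placeholder.

```kotlin
// Simplified stand-in for a resource: attribute key to value.
typealias Attrs = Map<String, String>

// Hypothetical config shape with two allow-lists, as described in the PR text.
data class MetricsConfigSketch(
    // Keys taken from the reduced resource shown above.
    val resourceAttributesToInclude: Set<String> =
        setOf("os.name", "os.type", "os.version", "service.name"),
    // Placeholder: the PR's actual data point defaults are not shown here.
    val attributesToInclude: Set<String> = emptySet(),
)

// Keep only the resource attributes whose keys are on the allow-list.
fun reduceForMetrics(resource: Attrs, config: MetricsConfigSketch): Attrs =
    resource.filterKeys { it in config.resourceAttributesToInclude }

fun main() {
    val full: Attrs = mapOf(
        "device.manufacturer" to "Google",
        "device.model.identifier" to "sdk_gphone64_arm64",
        "os.name" to "Android",
        "os.type" to "linux",
        "os.version" to "16",
        "service.name" to "OpenTelemetryDemoApp",
    )
    reduceForMetrics(full, MetricsConfigSketch()).forEach { (k, v) -> println("$k: $v") }
}
```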

I'm a little worried that this creates a foot-gun, and I'm a little hesitant to send metric data points whose resource doesn't match the resource on traces and logs...but maybe this is a start.

@breedx-splk breedx-splk requested a review from a team as a code owner July 11, 2025 21:39
@breedx-splk breedx-splk marked this pull request as draft July 15, 2025 15:38
@bidetofevil
Contributor

I don't think modelling jank as a metric is a good idea. While dropped frames and jank are numbers that resemble metrics, the way they are consumed requires one element that OTel metrics do not provide: time. Specifically, you want to pin the jank occurrence to a specific point in time so that you can see what happened before and after.

No matter what dimensions we keep, adding them up and consuming them as metrics will miss the crucial piece of data that allows you to relate jank to other things happening on the device. 100 dropped frames mean different things depending on when they happen, and munging them all together in a count, while doable in terms of the data, renders it pretty much useless.
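As a rough illustration of that point, the sketch below compares two invented sessions that drop the same number of frames. `JankEvent` and the helper functions are hypothetical names; the only claim is that a plain sum cannot distinguish the two, while timestamped events can.

```kotlin
// A jank occurrence pinned to a point in time; names are hypothetical.
data class JankEvent(val timestampMillis: Long, val droppedFrames: Int)

// What a summed counter would report: timing is gone.
fun totalDropped(events: List<JankEvent>): Int = events.sumOf { it.droppedFrames }

// Dropped frames inside [fromMillis, toMillis): only answerable with timestamps.
fun droppedDuring(events: List<JankEvent>, fromMillis: Long, toMillis: Long): Int =
    events.filter { it.timestampMillis in fromMillis until toMillis }
        .sumOf { it.droppedFrames }

fun main() {
    // Session A: one burst of 100 dropped frames half a second in.
    val burst = listOf(JankEvent(500L, 100))
    // Session B: one dropped frame per second for 100 seconds.
    val spread = (0 until 100).map { JankEvent(it * 1000L, 1) }

    println(totalDropped(burst) == totalDropped(spread)) // true
    println(droppedDuring(burst, 0L, 1000L))             // 100
    println(droppedDuring(spread, 0L, 1000L))            // 1
}
```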

Comment thread core/src/main/java/io/opentelemetry/android/AndroidResource.kt
Comment on lines +18 to +36
    fun createDefault(application: Application): Resource {
        val appName = readAppName(application)
        val appVersion = readAppVersion(application)
        val resourceBuilder =
            Resource.getDefault().toBuilder().put(ServiceAttributes.SERVICE_NAME, appName)
        if (appVersion != null) {
            resourceBuilder.put(ServiceAttributes.SERVICE_VERSION, appVersion)
        }

        return resourceBuilder
            .put(RumConstants.RUM_SDK_VERSION, BuildConfig.OTEL_ANDROID_VERSION)
            .put(DeviceIncubatingAttributes.DEVICE_MODEL_NAME, Build.MODEL)
            .put(DeviceIncubatingAttributes.DEVICE_MODEL_IDENTIFIER, Build.MODEL)
            .put(DeviceIncubatingAttributes.DEVICE_MANUFACTURER, Build.MANUFACTURER)
            .put(OsIncubatingAttributes.OS_NAME, "Android")
            .put(OsIncubatingAttributes.OS_TYPE, "linux")
            .put(OsIncubatingAttributes.OS_VERSION, Build.VERSION.RELEASE)
            .put(OsIncubatingAttributes.OS_DESCRIPTION, oSDescription)
            .build()
Contributor

What other frameworks do is have multiple resource detectors, usually one per resource as defined in semconv, so that developers can opt in. In fact, per the spec, Android should be producing a resource with the API version.

Contributor Author

Please link to that part of the spec.

Contributor

@thompson-tomo thompson-tomo Jul 24, 2025

https://opentelemetry.io/docs/specs/semconv/resource/android/

At the same time, by following other frameworks and having multiple resource detectors, you can add just one to the metrics while traces can have many more.
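A rough sketch of that idea, assuming a hypothetical `ResourceDetectorSketch` interface and plain maps in place of the SDK's Resource type. None of these names come from this repository or the OpenTelemetry SDK; the attribute values are copied from the resource dump in the PR description.

```kotlin
// Hypothetical detector interface: each detector contributes one group of attributes.
fun interface ResourceDetectorSketch {
    fun detect(): Map<String, String>
}

val serviceDetector = ResourceDetectorSketch {
    mapOf("service.name" to "OpenTelemetryDemoApp")
}
val osDetector = ResourceDetectorSketch {
    mapOf("os.name" to "Android", "os.type" to "linux", "os.version" to "16")
}
val deviceDetector = ResourceDetectorSketch {
    mapOf("device.manufacturer" to "Google", "device.model.identifier" to "sdk_gphone64_arm64")
}

// Merge the output of the chosen detectors; later detectors win on key conflicts.
fun buildResource(detectors: List<ResourceDetectorSketch>): Map<String, String> =
    detectors.fold(emptyMap<String, String>()) { acc, d -> acc + d.detect() }

fun main() {
    // Traces opt in to every detector; metrics opt in to only one.
    val traceResource = buildResource(listOf(serviceDetector, osDetector, deviceDetector))
    val metricResource = buildResource(listOf(serviceDetector))
    println("trace resource keys: ${traceResource.keys}")
    println("metric resource keys: ${metricResource.keys}")
}
```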

@breedx-splk
Contributor Author

Yeah, don't use metrics on mobile.
