Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosmos DB: Operation Level Metrics #1438

Open
wants to merge 51 commits into
base: main
Choose a base branch
from

Conversation

sourabh1007
Copy link
Contributor

@sourabh1007 sourabh1007 commented Sep 30, 2024

Introducing operation-level metrics in the .NET Cosmos DB SDK.
This PR includes the registration of semantic conventions in the OpenTelemetry community, ensuring full alignment with OpenTelemetry's established guidelines.

Dimensions

Tag/dimension name Sample value
db.system cosmodb
db.collection.name myCollectionName
db.namespace myDatabaseName
server.address myaccountname.documents.azure.com
server.port 443
db.operation.name query_items
db.cosmosdb.operation_type batch; create; delete
db.response.status_code 200 or 429 etc.
db.cosmos.sub_status_code 1002 etc.
db.cosmos.client_id Guid
db.cosmosdb.consistency_level Eventual, ConsistentPrefix, BoundedStaleness, Strong or Session
db.cosmosdb.machine_id Unique Id for a machine (Dependent of otel generated machine details)

Metrics

Name Unit Type Description
db.cosmosdb.operation.request_charge {request_units} histogram Total request units per operation (sum of RUs for all requested needed when processing an operation) This metric SHOULD be specified with [ExplicitBucketBoundaries] of [ 1, 5, 10, 25, 50, 100, 250, 500, 1000]
db.client.operation.duration s histogram Total end-to-end duration of the operation This metric SHOULD be specified with [ExplicitBucketBoundaries] of [0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000]
db.query.limit {count} histogram For feed operations (query, readAll, readMany, change feed) and batch operations, this meter capture the requested maxItemCount per page/request. This metric SHOULD be specified with [ExplicitBucketBoundaries] of [50, 100, 250, 500, 1000, 2000, 5000]
db.client.response.row_count {count} histogram For feed operations (query, readAll, readMany, change feed) batch operations this meter capture the actual item count in responses from the service. This metric SHOULD be specified with [ExplicitBucketBoundaries] of [10, 50, 100, 250, 500, 1000, 2000, 5000, 10000].
db.cosmosdb.operation.regions_contacted {count} histogram Number of regions contacted when executing an operation. This metric SHOULD be specified with [ExplicitBucketBoundaries] of [1, 2, 3, 5, 10, 20, 50].
db.cosmosdb.client.active_instances {count} histogram Number of active SDK client instances. This metric SHOULD be specified with [ExplicitBucketBoundaries] of [1, 2, 3, 5, 10, 50, 100].

Added db.cosmosdb.consistency_level in attribute list also, we are going to collect it is traces also
Removed db.cosmosdb.operation_type
Added db.cosmosdb.machine_id in attribute list also, we are going to collect it is traces also

Merge requirement checklist

model/database/common.yaml Outdated Show resolved Hide resolved
model/database/common.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
@sourabh1007 sourabh1007 force-pushed the users/sourabhjain/otelcosmosmetrics branch from 9fd4001 to 229b496 Compare September 30, 2024 16:07
Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add markdown for the metrics.

You need to add codegen placeholders like

and metrics will be generated between them.

Please take a look at the existing database metric definitions in https://github.com/open-telemetry/semantic-conventions/blob/v1.27.0/docs/database/database-metrics.md?plain=1 for the examples.

Note that a couple of properties are not yet defined in yaml, but need to be specified in markdown:

Please also document whether cosmos reports any of the standard DB metrics (I assume it does report db.client.operation.duration). If so, please document any difference from the generic metric. E.g. like it's done here - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/dotnet/dotnet-http-metrics.md#metric-httpserverrequestduration

model/database/registry.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/common.yaml Outdated Show resolved Hide resolved
model/database/common.yaml Outdated Show resolved Hide resolved
model/database/common.yaml Outdated Show resolved Hide resolved
model/database/common.yaml Show resolved Hide resolved
Copy link

@jcocchi jcocchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few nits, but looks good overall

model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/registry.yaml Outdated Show resolved Hide resolved
@sourabh1007 sourabh1007 force-pushed the users/sourabhjain/otelcosmosmetrics branch 3 times, most recently from 5db7514 to 01dc811 Compare October 3, 2024 02:15
model/database/registry.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
docs/database/database-metrics.md Outdated Show resolved Hide resolved
docs/database/database-metrics.md Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Show resolved Hide resolved
@sourabh1007 sourabh1007 force-pushed the users/sourabhjain/otelcosmosmetrics branch 3 times, most recently from 1a632ac to ee1d400 Compare October 10, 2024 06:29
Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor final comments, but it looks good otherwise!

model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/cosmosdb-metrics.yaml Outdated Show resolved Hide resolved
model/database/metrics.yaml Outdated Show resolved Hide resolved
model/database/registry.yaml Show resolved Hide resolved
model/database/registry.yaml Show resolved Hide resolved
model/database/registry.yaml Outdated Show resolved Hide resolved
docs/attributes-registry/db.md Outdated Show resolved Hide resolved
@sourabh1007 sourabh1007 force-pushed the users/sourabhjain/otelcosmosmetrics branch from e311a60 to bff773c Compare October 11, 2024 01:57
@sourabh1007 sourabh1007 force-pushed the users/sourabhjain/otelcosmosmetrics branch from 6ca8c5c to c5f7ce0 Compare October 14, 2024 18:29
type: metric
metric_name: db.cosmosdb.client.active_instances
brief: "Number of active client instances"
instrument: histogram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one should probably be updowncounter - you probably don't expect people to measure P95 for it. Also it does not change much in well-behaved application.

We have plenty of examples of current/active metrics in semconv - http.server.active_requests, db.client.connection.count, dotnet.thread_pool.thread.count, etc all of them are updowncounters.

Histogram will be more expensive but also harder to use and would require you to keep a current value around. With up-down-counter, you'd record +1 when client is created and -1 when it's disposed and won't need to maintain any state in the clients.


This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0/specification/metrics/api.md#instrument-advisory-parameters)
of `[ 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 ]`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to have boundaries different than the generic ones ([ 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10 ])?

@@ -23,6 +23,11 @@ linkTitle: Metrics
- [Metric: `db.client.connection.create_time`](#metric-dbclientconnectioncreate_time)
- [Metric: `db.client.connection.wait_time`](#metric-dbclientconnectionwait_time)
- [Metric: `db.client.connection.use_time`](#metric-dbclientconnectionuse_time)
- [Azure Cosmos DB](#azure-cosmos-db)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please move Cosmos DB metrics to cosmosdb.md ?

@@ -23,6 +23,11 @@ linkTitle: Metrics
- [Metric: `db.client.connection.create_time`](#metric-dbclientconnectioncreate_time)
- [Metric: `db.client.connection.wait_time`](#metric-dbclientconnectionwait_time)
- [Metric: `db.client.connection.use_time`](#metric-dbclientconnectionuse_time)
- [Azure Cosmos DB](#azure-cosmos-db)
- [Metric: `db.client.operation.duration`](#metric-dbclientoperationduration-1)
- [Metric: `db.query.response.row_count`](#metric-dbqueryresponserow_count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this metric should be added to the list of generic metrics above

2. 250, 500, 1000: These values represent typical workloads where moderate amounts of data are returned in each query.
3. 2000, 5000: These boundaries capture scenarios where the query returns large datasets. These larger page sizes can potentially increase memory or CPU usage and may lead to longer query execution times, making it important to track performance in these ranges.
4. 10000: This boundary is included to capture rare, very large response sizes. Such queries can put significant strain on system resources, including memory, CPU, and network bandwidth, and can often lead to performance issues such as high latency or even network drops.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please generate this metric in this file


This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0/specification/metrics/api.md#instrument-advisory-parameters)
of `[10, 50, 100, 250, 500, 1000, 2000, 5000, 10000]`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to start from 1? I assume many queries always return 1 item.

type: attribute_group
brief: 'Azure Cosmos DB attributes'
attributes:
- ref: db.system
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a good way to show it properly using tooling yet. Currently when it's rendered it shows 50+ db.system values table and is confusing.

We're doing something like

      # TODO: add db.system once https://github.com/open-telemetry/build-tools/issues/192 is possible
      # - ref: db.system

in other places and adding some text in markdown that system is populated to improve readability. Suggesting to do the same here.


### Metric: `db.client.operation.duration`

It captures the total time taken by an Azure Cosmos DB operation. This metric follows the common [db.client.operation.duration] (#metric-dbclientoperationduration)b definition.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each metric should have a requirement level - take a look at the examples of existing metrics in this file. I assume all of them should be required or recommended (when available) for cosmos?

stability: experimental
extends: attributes.db.cosmosdb.minimal

- id: attributes.db.cosmosdb.minimal
Copy link
Contributor

@lmolkova lmolkova Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're losing the clarifications made for cosmos DB span with this approach - check out briefs/notes on the attributes - they are no longer specific to cosmos.

suggestion:

  1. take the clarifications (levels, briefs, notes) of applicable attributes from db.cosmosdb group defined here
  2. put the resulting attributes.db.cosmosdb.minimal group to database/common.yaml
  3. extend attributes.db.cosmosdb.minimal group in cosmos db spans. You might need to play a bit with extension to make spans look exactly the same, but it should be doable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Discussions
Development

Successfully merging this pull request may close these issues.

5 participants