-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cosmos DB: Operation Level Metrics #1438
base: main
Are you sure you want to change the base?
Cosmos DB: Operation Level Metrics #1438
Conversation
8106332
to
0e22c02
Compare
9fd4001
to
229b496
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add markdown for the metrics.
You need to add codegen placeholders like
and metrics will be generated between them.
Please take a look at the existing database metric definitions in https://github.com/open-telemetry/semantic-conventions/blob/v1.27.0/docs/database/database-metrics.md?plain=1 for the examples.
Note that a couple of properties are not yet defined in yaml, but need to be specified in markdown:
- metric requirement level - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metric-requirement-level.md
- histogram boundaries - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advisory-parameter-explicitbucketboundaries
Please also document whether cosmos reports any of the standard DB metrics (I assume it does report db.client.operation.duration
). If so, please document any difference from the generic metric. E.g. like it's done here - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/dotnet/dotnet-http-metrics.md#metric-httpserverrequestduration
c0bd5ed
to
0e65012
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few nits, but looks good overall
5db7514
to
01dc811
Compare
1a632ac
to
ee1d400
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few minor final comments, but it looks good otherwise!
e311a60
to
bff773c
Compare
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
6ca8c5c
to
c5f7ce0
Compare
model/database/cosmosdb-metrics.yaml
Outdated
type: metric | ||
metric_name: db.cosmosdb.client.active_instances | ||
brief: "Number of active client instances" | ||
instrument: histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one should probably be updowncounter - you probably don't expect people to measure P95 for it. Also it does not change much in well-behaved application.
We have plenty of examples of current/active metrics in semconv - http.server.active_requests
, db.client.connection.count
, dotnet.thread_pool.thread.count
, etc all of them are updowncounters
.
Histogram will be more expensive but also harder to use and would require you to keep a current value around. With up-down-counter, you'd record +1 when client is created and -1 when it's disposed and won't need to maintain any state in the clients.
docs/database/database-metrics.md
Outdated
|
||
This metric SHOULD be specified with | ||
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0/specification/metrics/api.md#instrument-advisory-parameters) | ||
of `[ 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 ]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to have boundaries different than the generic ones ([ 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10 ]
)?
docs/database/database-metrics.md
Outdated
@@ -23,6 +23,11 @@ linkTitle: Metrics | |||
- [Metric: `db.client.connection.create_time`](#metric-dbclientconnectioncreate_time) | |||
- [Metric: `db.client.connection.wait_time`](#metric-dbclientconnectionwait_time) | |||
- [Metric: `db.client.connection.use_time`](#metric-dbclientconnectionuse_time) | |||
- [Azure Cosmos DB](#azure-cosmos-db) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please move Cosmos DB metrics to cosmosdb.md
?
docs/database/database-metrics.md
Outdated
@@ -23,6 +23,11 @@ linkTitle: Metrics | |||
- [Metric: `db.client.connection.create_time`](#metric-dbclientconnectioncreate_time) | |||
- [Metric: `db.client.connection.wait_time`](#metric-dbclientconnectionwait_time) | |||
- [Metric: `db.client.connection.use_time`](#metric-dbclientconnectionuse_time) | |||
- [Azure Cosmos DB](#azure-cosmos-db) | |||
- [Metric: `db.client.operation.duration`](#metric-dbclientoperationduration-1) | |||
- [Metric: `db.query.response.row_count`](#metric-dbqueryresponserow_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this metric should be added to the list of generic metrics above
docs/database/database-metrics.md
Outdated
2. 250, 500, 1000: These values represent typical workloads where moderate amounts of data are returned in each query. | ||
3. 2000, 5000: These boundaries capture scenarios where the query returns large datasets. These larger page sizes can potentially increase memory or CPU usage and may lead to longer query execution times, making it important to track performance in these ranges. | ||
4. 10000: This boundary is included to capture rare, very large response sizes. Such queries can put significant strain on system resources, including memory, CPU, and network bandwidth, and can often lead to performance issues such as high latency or even network drops. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please generate this metric in this file
docs/database/database-metrics.md
Outdated
|
||
This metric SHOULD be specified with | ||
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0/specification/metrics/api.md#instrument-advisory-parameters) | ||
of `[10, 50, 100, 250, 500, 1000, 2000, 5000, 10000]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to start from 1? I assume many queries always return 1 item.
model/database/cosmosdb-metrics.yaml
Outdated
type: attribute_group | ||
brief: 'Azure Cosmos DB attributes' | ||
attributes: | ||
- ref: db.system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a good way to show it properly using tooling yet. Currently when it's rendered it shows 50+ db.system values table and is confusing.
We're doing something like
# TODO: add db.system once https://github.com/open-telemetry/build-tools/issues/192 is possible
# - ref: db.system
in other places and adding some text in markdown that system is populated to improve readability. Suggesting to do the same here.
docs/database/database-metrics.md
Outdated
|
||
### Metric: `db.client.operation.duration` | ||
|
||
It captures the total time taken by an Azure Cosmos DB operation. This metric follows the common [db.client.operation.duration] (#metric-dbclientoperationduration)b definition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each metric should have a requirement level - take a look at the examples of existing metrics in this file. I assume all of them should be required or recommended (when available) for cosmos?
model/database/cosmosdb-metrics.yaml
Outdated
stability: experimental | ||
extends: attributes.db.cosmosdb.minimal | ||
|
||
- id: attributes.db.cosmosdb.minimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're losing the clarifications made for cosmos DB span with this approach - check out briefs/notes on the attributes - they are no longer specific to cosmos.
suggestion:
- take the clarifications (levels, briefs, notes) of applicable attributes from
db.cosmosdb
group defined here- id: db.cosmosdb - put the resulting
attributes.db.cosmosdb.minimal
group todatabase/common.yaml
- extend
attributes.db.cosmosdb.minimal
group in cosmos db spans. You might need to play a bit with extension to make spans look exactly the same, but it should be doable.
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Co-authored-by: Liudmila Molkova <[email protected]>
Introducing operation-level metrics in the .NET Cosmos DB SDK.
This PR includes the registration of semantic conventions in the OpenTelemetry community, ensuring full alignment with OpenTelemetry's established guidelines.
Dimensions
db.cosmosdb.operation_typebatch
;create
;delete
db.cosmos.client_idGuiddb.cosmosdb.machine_idUnique Id for a machine(Dependent of otel generated machine details)Metrics
{request_units}
[ExplicitBucketBoundaries] of [ 1, 5, 10, 25, 50, 100, 250, 500, 1000]
s
[ExplicitBucketBoundaries] of [0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000]
db.query.limit{count}
For feed operations (query, readAll, readMany, change feed) and batch operations, this meter capture the requested maxItemCount per page/request. This metric SHOULD be specified with[ExplicitBucketBoundaries] of [50, 100, 250, 500, 1000, 2000, 5000]
{count}
[ExplicitBucketBoundaries] of [10, 50, 100, 250, 500, 1000, 2000, 5000, 10000]
.db.cosmosdb.operation.regions_contacted{count}
Number of regions contacted when executing an operation. This metric SHOULD be specified with[ExplicitBucketBoundaries] of [1, 2, 3, 5, 10, 20, 50]
.{count}
[ExplicitBucketBoundaries] of [1, 2, 3, 5, 10, 50, 100]
.Added
db.cosmosdb.consistency_level
in attribute list also, we are going to collect it is traces alsoRemoved
db.cosmosdb.operation_type
Addeddb.cosmosdb.machine_id
in attribute list also, we are going to collect it is traces alsoMerge requirement checklist
[chore]