diff --git a/CHANGELOG.md b/CHANGELOG.md index 5ff898f002f..a72895f35b2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,8 +11,37 @@ release. ### Traces +### Metrics + +- Add optional min / max fields to histogram data model. + ([#1915](https://github.com/open-telemetry/opentelemetry-specification/pull/1915)) + +### Logs + +### Resource + +### Semantic Conventions + +### Compatibility + +### OpenTelemetry Protocol + +- Make OTLP/HTTP the recommended default transport ([#1969](https://github.com/open-telemetry/opentelemetry-specification/pull/1969)) + +### SDK Configuration + +## v1.7.0 (2021-09-30) + +### Context + +- No changes. + +### Traces + - Prefer global user defined limits over model-sepcific default values. ([#1893](https://github.com/open-telemetry/opentelemetry-specification/pull/1893)) +- Generalize the "message" event to apply to all RPC systems not just gRPC + ([#1914](https://github.com/open-telemetry/opentelemetry-specification/pull/1914)) ### Metrics @@ -26,10 +55,16 @@ release. [#1888](https://github.com/open-telemetry/opentelemetry-specification/pull/1888), [#1912](https://github.com/open-telemetry/opentelemetry-specification/pull/1912), [#1913](https://github.com/open-telemetry/opentelemetry-specification/pull/1913), - [#1938](https://github.com/open-telemetry/opentelemetry-specification/pull/1938)) + [#1938](https://github.com/open-telemetry/opentelemetry-specification/pull/1938), + [#1958](https://github.com/open-telemetry/opentelemetry-specification/pull/1958)) +- Add FaaS metrics semantic conventions ([#1736](https://github.com/open-telemetry/opentelemetry-specification/pull/1736)) +- Update env variable values to match other env variables + ([#1965](https://github.com/open-telemetry/opentelemetry-specification/pull/1965)) ### Logs +- No changes. + ### Resource - Exempt Resource from attribute limits. @@ -52,19 +87,31 @@ release. 
([#1890](https://github.com/open-telemetry/opentelemetry-specification/pull/1890)) - Add HTTP request and response headers semantic conventions. ([#1898](https://github.com/open-telemetry/opentelemetry-specification/pull/1898)) +- Add `k8s.container.restart_count` Resource attribute. + ([#1945](https://github.com/open-telemetry/opentelemetry-specification/pull/1945)) - Define http tracing attributes provided at span creation time ([#1919](https://github.com/open-telemetry/opentelemetry-specification/pull/1916)) ### Compatibility +- No changes. + ### OpenTelemetry Protocol - Add environment variables for configuring the OTLP exporter protocol (`grpc`, `http/protobuf`, `http/json`) ([#1880](https://github.com/open-telemetry/opentelemetry-specification/pull/1880)) +- Specify the behavior of the OTLP endpoint variables for OTLP/HTTP more strictly + ([#1975](https://github.com/open-telemetry/opentelemetry-specification/pull/1975)). +- Allow implementations to use their own default for OTLP compression, with `none` denoting no compression + ([#1923](https://github.com/open-telemetry/opentelemetry-specification/pull/1923)) +- Clarify OTLP server components MUST support none/gzip compression + ([#1955](https://github.com/open-telemetry/opentelemetry-specification/pull/1955)) +- Change OTLP/HTTP port from 4317 to 4318 ([#1970](https://github.com/open-telemetry/opentelemetry-specification/pull/1970)) ### SDK Configuration - Change default value for OTEL_EXPORTER_JAEGER_AGENT_PORT to 6831. ([#1812](https://github.com/open-telemetry/opentelemetry-specification/pull/1812)) +- See also the changes for OTLP configuration listed under "OpenTelemetry Protocol" above. ## v1.6.0 (2021-08-06) @@ -135,7 +182,6 @@ release. - Clarify the limit on the instrument unit. 
([#1762](https://github.com/open-telemetry/opentelemetry-specification/pull/1762)) -- Add FaaS metrics semantic conventions ([#1736](https://github.com/open-telemetry/opentelemetry-specification/pull/1736)) ### Logs diff --git a/schemas/1.7.0 b/schemas/1.7.0 new file mode 100644 index 00000000000..df9a9b8e003 --- /dev/null +++ b/schemas/1.7.0 @@ -0,0 +1,7 @@ +file_format: 1.0.0 +schema_url: https://opentelemetry.io/schemas/1.7.0 +versions: + 1.7.0: + 1.6.1: + 1.5.0: + 1.4.0: diff --git a/semantic_conventions/resource/k8s.yaml b/semantic_conventions/resource/k8s.yaml index d25bcda2c10..10cefc05ad4 100644 --- a/semantic_conventions/resource/k8s.yaml +++ b/semantic_conventions/resource/k8s.yaml @@ -63,6 +63,13 @@ groups: brief: > The name of the Container in a Pod template. examples: ['redis'] + - id: restart_count + type: int + brief: > + Number of times the container was restarted. This attribute can be + used to identify a particular container (running or stopped) within a + container spec. + examples: [0, 2] - id: k8s.replicaset prefix: k8s.replicaset diff --git a/semantic_conventions/trace/rpc.yaml b/semantic_conventions/trace/rpc.yaml index c4399519f23..6b37a1fa93b 100644 --- a/semantic_conventions/trace/rpc.yaml +++ b/semantic_conventions/trace/rpc.yaml @@ -2,7 +2,7 @@ groups: - id: rpc prefix: rpc brief: 'This document defines semantic conventions for remote procedure calls.' - events: [rpc.grpc.message] + events: [rpc.message] attributes: - id: system type: string @@ -142,10 +142,10 @@ groups: note: > This is always required for jsonrpc. See the note in the general RPC conventions for more information. - - id: rpc.grpc.message - prefix: "message" # TODO: Change the prefix to rpc.grpc.message? + - id: rpc.message + prefix: "message" # TODO: Change the prefix to rpc.message? type: event - brief: "gRPC received/sent message." + brief: "RPC received/sent message." 
attributes: - id: type type: diff --git a/spec-compliance-matrix.md b/spec-compliance-matrix.md index d98236516ed..485d8ff3991 100644 --- a/spec-compliance-matrix.md +++ b/spec-compliance-matrix.md @@ -163,7 +163,7 @@ Note: Support for environment variables is optional. | In-memory (mock exporter) | | + | + | + | + | + | + | - | - | + | + | + | | [OTLP](specification/protocol/otlp.md) | | | | | | | | | | | | | | OTLP/gRPC Exporter | * | + | + | + | + | | + | | + | + | + | + | -| OTLP/HTTP binary Protobuf Exporter | * | + | + | + | + | + | + | | | + | - | - | +| OTLP/HTTP binary Protobuf Exporter | * | + | + | + | + | + | + | | + | + | - | - | | OTLP/HTTP JSON Protobuf Exporter | | + | - | + | [-][py1003] | | - | | | + | - | - | | OTLP/HTTP gzip Content-Encoding support | X | + | + | + | + | + | - | | | - | - | - | | Concurrent sending | | - | + | + | [-][py1108] | | - | | + | - | - | - | diff --git a/specification/library-guidelines.md b/specification/library-guidelines.md index ceff34e568a..dd837c08d91 100644 --- a/specification/library-guidelines.md +++ b/specification/library-guidelines.md @@ -132,5 +132,6 @@ guidelines on the performance expectations that API implementations should meet, Please refer to individual API specification for guidelines on what concurrency safeties should API implementations provide and how they should be documented: -* [Metrics API](./metrics/api.md#concurrency) +* [Metrics API](./metrics/api.md#concurrency-requirements) +* [Metrics SDK](./metrics/sdk.md#concurrency-requirements) * [Tracing API](./trace/api.md#concurrency) diff --git a/specification/metrics/api.md b/specification/metrics/api.md index 1fefb5e4a25..d0305200467 100644 --- a/specification/metrics/api.md +++ b/specification/metrics/api.md @@ -32,6 +32,8 @@ Table of Contents * [Asynchronous UpDownCounter creation](#asynchronous-updowncounter-creation) * [Asynchronous UpDownCounter operations](#asynchronous-updowncounter-operations) * [Measurement](#measurement) +* 
[Compatibility requirements](#compatibility-requirements) +* [Concurrency requirements](#concurrency-requirements) @@ -408,7 +410,8 @@ function(s) independently. approach. Here are some examples: * Return a list (or tuple, generator, enumerator, etc.) of `Measurement`s. -* Use an observer argument to allow individual `Measurement`s to be reported. +* Use an observer result argument to allow individual `Measurement`s to be + reported. User code is recommended not to provide more than one `Measurement` with the same `attributes` in a single callback. If it happens, [OpenTelemetry @@ -896,7 +899,8 @@ function(s) independently. approach. Here are some examples: * Return a list (or tuple, generator, enumerator, etc.) of `Measurement`s. -* Use an observer argument to allow individual `Measurement`s to be reported. +* Use an observer result argument to allow individual `Measurement`s to be + reported. User code is recommended not to provide more than one `Measurement` with the same `attributes` in a single callback. If it happens, the @@ -978,15 +982,15 @@ for the interaction between the API and SDK. * A value * [`Attributes`](../common/common.md#attributes) -## Compatibility +## Compatibility requirements All the metrics components SHOULD allow new APIs to be added to existing components without introducing breaking changes. All the metrics APIs SHOULD allow optional parameter(s) to be added to existing -APIs without introducing breaking changes. +APIs without introducing breaking changes, if possible. -## Concurrency +## Concurrency requirements For languages which support concurrent execution the Metrics APIs provide specific guarantees and safeties. diff --git a/specification/metrics/datamodel.md b/specification/metrics/datamodel.md index ac1c4806b4e..a916909099b 100644 --- a/specification/metrics/datamodel.md +++ b/specification/metrics/datamodel.md @@ -386,6 +386,8 @@ Histograms consist of the following: (00:00:00 UTC on 1 January 1970). 
- A count (`count`) of the total population of points in the histogram. - A sum (`sum`) of all the values in the histogram. + - (optional) The min (`min`) of all values in the histogram. + - (optional) The max (`max`) of all values in the histogram. - (optional) A series of buckets with: - Explicit boundary values. These values denote the lower and upper bounds for buckets and whether not a given observation would be recorded in this @@ -398,6 +400,13 @@ denotes Delta temporality where accumulated event counts are reset to zero after and a new aggregation occurs. Cumulative, on the other hand, continues to aggregate events, resetting with the use of a new start time. +The aggregation temporality also has implications on the min and max fields. Min +and max are more useful for Delta temporality, since the values represented by +Cumulative min and max will stabilize as more events are recorded. Additionally, +it is possible to convert min and max from Delta to Cumulative, but not from +Cumulative to Delta. When converting from Cumulative to Delta, min and max can +be dropped, or captured in an alternative representation such as a gauge. + Bucket counts are optional. A Histogram without buckets conveys a population in terms of only the sum and count, and may be interpreted as a histogram with single bucket covering `(-Inf, +Inf)`. 
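Reviewer note (not part of the patch): the min/max paragraph added to `datamodel.md` says the fields convert from Delta to Cumulative but not back. A minimal Python sketch of why that may be so follows. `HistogramPoint` and `merge_delta_into_cumulative` are hypothetical names made up for this illustration, not an OpenTelemetry API. Buckets are omitted, and dropping min/max when either side lacks them is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistogramPoint:
    """Hypothetical bucket-less histogram point (illustration only)."""
    count: int
    sum: float
    min: Optional[float] = None  # optional, mirroring the data model text
    max: Optional[float] = None  # optional, mirroring the data model text


def merge_delta_into_cumulative(
    cumulative: Optional[HistogramPoint], delta: HistogramPoint
) -> HistogramPoint:
    """Fold one Delta point into a running Cumulative point."""
    if cumulative is None:
        # First interval: the cumulative state equals the delta.
        return HistogramPoint(delta.count, delta.sum, delta.min, delta.max)
    # Assumption: if any bound is missing, the merged bounds are dropped.
    have_bounds = None not in (cumulative.min, cumulative.max, delta.min, delta.max)
    return HistogramPoint(
        count=cumulative.count + delta.count,
        sum=cumulative.sum + delta.sum,
        min=min(cumulative.min, delta.min) if have_bounds else None,
        max=max(cumulative.max, delta.max) if have_bounds else None,
    )


# Two made-up delta intervals.
delta_1 = HistogramPoint(count=3, sum=12.0, min=1.0, max=8.0)
delta_2 = HistogramPoint(count=2, sum=9.0, min=4.0, max=5.0)

cumulative_1 = merge_delta_into_cumulative(None, delta_1)
cumulative_2 = merge_delta_into_cumulative(cumulative_1, delta_2)

# count and sum of the second interval can be recovered by subtraction ...
recovered_count = cumulative_2.count - cumulative_1.count
recovered_sum = cumulative_2.sum - cumulative_1.sum
# ... but both cumulative points carry min=1.0 and max=8.0, so the second
# interval's own bounds (4.0 and 5.0) cannot be derived from them.
```

In this sketch the cumulative bounds stay at 1.0 and 8.0 after the second interval, which matches the remark that Cumulative min and max stabilize as more events are recorded.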
diff --git a/specification/metrics/img/model-delta-histogram.png b/specification/metrics/img/model-delta-histogram.png index f2ec06a270d..e3cc13e46a7 100644 Binary files a/specification/metrics/img/model-delta-histogram.png and b/specification/metrics/img/model-delta-histogram.png differ diff --git a/specification/metrics/sdk.md b/specification/metrics/sdk.md index 2862383297e..b67a478517b 100644 --- a/specification/metrics/sdk.md +++ b/specification/metrics/sdk.md @@ -8,17 +8,19 @@ Table of Contents * [MeterProvider](#meterprovider) -* [Attribute Limits](#attribute-limits) +* [Attribute limits](#attribute-limits) * [Exemplar](#exemplar) * [ExemplarFilter](#exemplarfilter) * [ExemplarReservoir](#exemplarreservoir) - * [Exemplar Defaults](#exemplar-defaults) + * [Exemplar defaults](#exemplar-defaults) * [MetricReader](#metricreader) * [Periodic exporting MetricReader](#periodic-exporting-metricreader) * [MetricExporter](#metricexporter) * [Push Metric Exporter](#push-metric-exporter) * [Pull Metric Exporter](#pull-metric-exporter) -* [Defaults and Configuration](#defaults-and-configuration) +* [Defaults and configuration](#defaults-and-configuration) +* [Compatibility requirements](#compatibility-requirements) +* [Concurrency requirements](#concurrency-requirements) @@ -393,7 +395,7 @@ This Aggregation informs the SDK to collect: - Count of `Measurement` values falling within explicit bucket boundaries. - Arithmetic sum of `Measurement` values in population. -## Attribute Limits +## Attribute limits Attributes which belong to Metrics are exempt from the [common rules of attribute limits](../common/common.md#attribute-limits) at this @@ -477,7 +479,7 @@ from the original sample measurement. The `ExemplarReservoir` SHOULD avoid allocations when sampling exemplars. -### Exemplar Defaults +### Exemplar defaults The SDK will come with two types of built-in exemplar reservoirs: @@ -574,6 +576,22 @@ SDK](../overview.md#sdk) authors MAY choose to add parameters (e.g. 
callback, filter, timeout). [OpenTelemetry SDK](../overview.md#sdk) authors MAY choose the return value type, or do not return anything. +### Shutdown + +This method provides a way for the `MetricReader` to do any cleanup required. + +`Shutdown` MUST be called only once for each `MetricReader` instance. After the +call to `Shutdown`, subsequent invocations to `Collect` are not allowed. SDKs +SHOULD return some failure for these calls, if possible. + +`Shutdown` SHOULD provide a way to let the caller know whether it succeeded, +failed or timed out. + +`Shutdown` SHOULD complete or abort within some timeout. `Shutdown` CAN be +implemented as a blocking API or an asynchronous API which notifies the caller +via a callback or an event. [OpenTelemetry SDK](../overview.md#sdk) authors CAN +decide if they want to make the shutdown timeout configurable. + ### Periodic exporting MetricReader This is an implementation of the `MetricReader` which collects metrics based on @@ -752,6 +770,32 @@ modeled to interact with other components in the SDK: +-----------------------------+ ``` -## Defaults and Configuration +## Defaults and configuration + +The SDK MUST provide configuration according to the [SDK environment +variables](../sdk-environment-variables.md) specification. + +## Compatibility requirements + +All the metrics components SHOULD allow new methods to be added to existing +components without introducing breaking changes. + +All the metrics SDK methods SHOULD allow optional parameter(s) to be added to +existing methods without introducing breaking changes, if possible. + +## Concurrency requirements + +For languages which support concurrent execution the Metrics SDKs provide +specific guarantees and safeties. + +**MeterProvider** - Meter creation, `ForceFlush` and `Shutdown` are safe to be +called concurrently. + +**ExemplarFilter** - all methods are safe to be called concurrently. + +**ExemplarReservoir** - all methods are safe to be called concurrently. 
+ +**MetricReader** - `Collect` and `Shutdown` are safe to be called concurrently. -The SDK MUST provide configuration according to the [SDK environment variables](../sdk-environment-variables.md) specification. +**MetricExporter** - `ForceFlush` and `Shutdown` are safe to be called +concurrently. diff --git a/specification/metrics/semantic_conventions/http-metrics.md b/specification/metrics/semantic_conventions/http-metrics.md index 38d05416a93..a24e2bbb5cd 100644 --- a/specification/metrics/semantic_conventions/http-metrics.md +++ b/specification/metrics/semantic_conventions/http-metrics.md @@ -17,18 +17,18 @@ type and units. Below is a table of HTTP server metric instruments. -| Name | Instrument | Units | Description | -|-------------------------------|-------------------|--------------|-------------| -| `http.server.duration` | ValueRecorder | milliseconds | measures the duration of the inbound HTTP request | -| `http.server.active_requests` | UpDownSumObserver | requests | measures the number of concurrent HTTP requests that are currently in-flight | +| Name | Instrument | Units | Description | +|-------------------------------|----------------------------|--------------|-------------| +| `http.server.duration` | Histogram | milliseconds | measures the duration of the inbound HTTP request | +| `http.server.active_requests` | Asynchronous UpDownCounter | requests | measures the number of concurrent HTTP requests that are currently in-flight | ### HTTP Client Below is a table of HTTP client metric instruments. 
-| Name | Instrument | Units | Description | -|------------------------|---------------|--------------|-------------| -| `http.client.duration` | ValueRecorder | milliseconds | measure the duration of the outbound HTTP request | +| Name | Instrument | Units | Description | +|------------------------|------------|--------------|-------------| +| `http.client.duration` | Histogram | milliseconds | measure the duration of the outbound HTTP request | ## Attributes diff --git a/specification/metrics/semantic_conventions/rpc.md b/specification/metrics/semantic_conventions/rpc.md index 61cd34fc501..12463cbef3f 100644 --- a/specification/metrics/semantic_conventions/rpc.md +++ b/specification/metrics/semantic_conventions/rpc.md @@ -31,26 +31,26 @@ MUST be of the specified type and units. Below is a table of RPC server metric instruments. -| Name | Instrument | Units | Description | Status | Streaming | -|----------------------------|---------------|--------------|-------------|--------|-----------| -| `rpc.server.duration` | ValueRecorder | milliseconds | measures duration of inbound RPC | Recommended | N/A. While streaming RPCs may record this metric as start-of-batch to end-of-batch, it's hard to interpret in practice. | -| `rpc.server.request.size` | ValueRecorder | bytes | measures size of RPC request messages (uncompressed) | Optional | Recorded per message in a streaming batch | -| `rpc.server.response.size` | ValueRecorder | bytes | measures size of RPC response messages (uncompressed) | Optional | Recorded per response in a streaming batch | -| `rpc.server.requests_per_rpc` | ValueRecorder | count | measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | -| `rpc.server.responses_per_rpc` | ValueRecorder | count | measures the number of messages sent per RPC. 
Should be 1 for all non-streaming RPCs | Optional | Required | +| Name | Instrument | Units | Description | Status | Streaming | +|------|------------|-------|-------------|--------|-----------| +| `rpc.server.duration` | Histogram | milliseconds | measures duration of inbound RPC | Recommended | N/A. While streaming RPCs may record this metric as start-of-batch to end-of-batch, it's hard to interpret in practice. | +| `rpc.server.request.size` | Histogram | bytes | measures size of RPC request messages (uncompressed) | Optional | Recorded per message in a streaming batch | +| `rpc.server.response.size` | Histogram | bytes | measures size of RPC response messages (uncompressed) | Optional | Recorded per response in a streaming batch | +| `rpc.server.requests_per_rpc` | Histogram | count | measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | +| `rpc.server.responses_per_rpc` | Histogram | count | measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | ### RPC Client Below is a table of RPC client metric instruments. These apply to traditional RPC usage, not streaming RPCs. -| Name | Instrument | Units | Description | Status | Streaming | -|----------------------------|---------------|--------------|-------------|--------|-----------| -| `rpc.client.duration` | ValueRecorder | milliseconds | measures duration of outbound RPC | Recommended | N/A. While streaming RPCs may record this metric as start-of-batch to end-of-batch, it's hard to interpret in practice. 
| -| `rpc.client.request.size` | ValueRecorder | bytes | measures size of RPC request messages (uncompressed) | Optional | Recorded per message in a streaming batch | -| `rpc.client.response.size` | ValueRecorder | bytes | measures size of RPC response messages (uncompressed) | Optional | Recorded per message in a streaming batch | -| `rpc.client.requests_per_rpc` | ValueRecorder | count | measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | -| `rpc.client.responses_per_rpc` | ValueRecorder | count | measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | +| Name | Instrument | Units | Description | Status | Streaming | +|------|------------|-------|-------------|--------|-----------| +| `rpc.client.duration` | Histogram | milliseconds | measures duration of outbound RPC | Recommended | N/A. While streaming RPCs may record this metric as start-of-batch to end-of-batch, it's hard to interpret in practice. | +| `rpc.client.request.size` | Histogram | bytes | measures size of RPC request messages (uncompressed) | Optional | Recorded per message in a streaming batch | +| `rpc.client.response.size` | Histogram | bytes | measures size of RPC response messages (uncompressed) | Optional | Recorded per message in a streaming batch | +| `rpc.client.requests_per_rpc` | Histogram | count | measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs | Optional | Required | +| `rpc.client.responses_per_rpc` | Histogram | count | measures the number of messages sent per RPC. 
Should be 1 for all non-streaming RPCs | Optional | Required | ## Attributes diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 02b69ad24e2..11b97442e6d 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -29,12 +29,12 @@ instruments not explicitly defined in the specification. **Description:** System level processor metrics. -| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values | -| ---------------------- | ----------- | ----- | --------------- | ---------- | ---------------- | ----------------------------------- | -| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | | cpu | CPU number [0..n-1] | -| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | | cpu | CPU number (0..n) | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values | +| ---------------------- | ----------- | ----- | ---------------------| ---------- | ---------------- | ----------------------------------- | +| system.cpu.time | | s | Asynchronous Counter | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number [0..n-1] | +| system.cpu.utilization | | 1 | Asynchronous Gauge | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number (0..n) | ### `system.memory.` - Memory metrics @@ -43,34 +43,34 @@ memory](#systempaging---pagingswap-metrics). | Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | | ------------------------- | ----------- | ----- | ----------------- | ---------- | ------------- | ------------------------ | -| system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. 
| -| system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. | +| system.memory.usage | | By | Asynchronous UpDownCounter | Int64 | state | used, free, cached, etc. | +| system.memory.utilization | | 1 | Asynchronous Gauge | Double | state | used, free, cached, etc. | ### `system.paging.` - Paging/swap metrics **Description:** System level paging/swap memory metrics. -| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | -| ------------------------- | ----------------------------------- | ------------ | ----------------- | ---------- | ------------- | ---------------- | -| system.paging.usage | Unix swap or windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free | -| system.paging.utilization | | 1 | ValueObserver | Double | state | used, free | -| system.paging.faults | | {faults} | SumObserver | Int64 | type | major, minor | -| system.paging.operations | | {operations} | SumObserver | Int64 | type | major, minor | -| | | | | | direction | in, out | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | +| ------------------------- | ----------------------------------- | ------------ | -------------------------- | ---------- | ------------- | ---------------- | +| system.paging.usage | Unix swap or windows pagefile usage | By | Asynchronous UpDownCounter | Int64 | state | used, free | +| system.paging.utilization | | 1 | Asynchronous Gauge | Double | state | used, free | +| system.paging.faults | | {faults} | Asynchronous Counter | Int64 | type | major, minor | +| system.paging.operations | | {operations} | Asynchronous Counter | Int64 | type | major, minor | +| | | | | | direction | in, out | ### `system.disk.` - Disk controller metrics **Description:** System level disk performance metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | -| --------------------------------------------------------- | ----------------------------------------------- | ------------ | --------------- | ---------- | ------------- | ---------------- | -| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.io_time[1](#io_time) | Time disk spent activated | s | SumObserver | Double | device | (identifier) | -| system.disk.operation_time[2](#operation_time) | Sum of the time each operation took to complete | s | SumObserver | Double | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | +| --------------------------------------------------------- | ----------------------------------------------- | ------------ | ------------------------ | ---------- | ------------- | ---------------- | +| system.disk.io | | By | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.operations | | {operations} | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.io_time[1](#io_time) | Time disk spent activated | s | Asynchronous Counter | Double | device | (identifier) | +| system.disk.operation_time[2](#operation_time) | Sum of the time each operation took to complete | s | Asynchronous Counter | Double | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.merged | | {operations} | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | read, write | 1 The 
real elapsed time ("wall clock") used in the I/O path (time from operations running in parallel are not @@ -94,35 +94,35 @@ perf counter (similar for Writes) ### `system.filesystem.` - Filesystem metrics **Description:** System level filesystem metrics. -| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | -| ----------------------------- | ----------- | ----- | ----------------- | ---------- | -------------- | -------------------- | -| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | | state | used, free, reserved | -| | | | | | type | ext4, tmpfs, etc. | -| | | | | | mode | rw, ro, etc. | -| | | | | | mountpoint | (path) | -| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) | -| | | | | | state | used, free, reserved | -| | | | | | type | ext4, tmpfs, etc. | -| | | | | | mode | rw, ro, etc. | -| | | | | | mountpoint | (path) | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | +| ----------------------------- | ----------- | ----- | -------------------------- | ---------- | -------------- | -------------------- | +| system.filesystem.usage | | By | Asynchronous UpDownCounter | Int64 | device | (identifier) | +| | | | | | state | used, free, reserved | +| | | | | | type | ext4, tmpfs, etc. | +| | | | | | mode | rw, ro, etc. | +| | | | | | mountpoint | (path) | +| system.filesystem.utilization | | 1 | Asynchronous Gauge | Double | device | (identifier) | +| | | | | | state | used, free, reserved | +| | | | | | type | ext4, tmpfs, etc. | +| | | | | | mode | rw, ro, etc. | +| | | | | | mountpoint | (path) | ### `system.network.` - Network metrics **Description:** System level network metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | -| ---------------------------------------------- | ----------------------------------------------------------------------------- | ------------- | ----------------- | ---------- | ------------- | ---------------------------------------------------------------------------------------------- | -| system.network.dropped[1](#dropped) | Count of packets that are dropped or discarded even though there was no error | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.errors[2](#errors) | Count of network errors detected | {errors} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | -| | | | | | state | [e.g. 
for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | +| ---------------------------------------------- | ----------------------------------------------------------------------------- | ------------- | -------------------------- | ---------- | ------------- | ---------------------------------------------------------------------------------------------- | +| system.network.dropped[1](#dropped) | Count of packets that are dropped or discarded even though there was no error | {packets} | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.packets | | {packets} | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.errors[2](#errors) | Count of network errors detected | {errors} | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.io | | By | Asynchronous Counter | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.connections | | {connections} | Asynchronous UpDownCounter | Int64 | device | (identifier) | +| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | 1 Measured as: @@ -146,10 +146,10 @@ from **Description:** System level aggregate process metrics. For metrics at the individual process level, see [process metrics](process-metrics.md). 
-| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | -| ------------------------ | --------------------------------------------------------- | ----------- | ----------------- | ---------- | ------------- | ---------------------------------------------------------------------------------------------- | -| system.processes.count | Total number of processes in each state | {processes} | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | -| system.processes.created | Total number of processes created over uptime of the host | {processes} | SumObserver | Int64 | - | - | +| Name | Description | Units | Instrument Type | Value Type | Attribute Key | Attribute Values | +| ------------------------ | --------------------------------------------------------- | ----------- | -------------------------- | ---------- | ------------- | ---------------------------------------------------------------------------------------------- | +| system.processes.count | Total number of processes in each state | {processes} | Asynchronous UpDownCounter | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | +| system.processes.created | Total number of processes created over uptime of the host | {processes} | Asynchronous Counter | Int64 | - | - | ### `system.{os}.` - OS Specific System Metrics diff --git a/specification/metrics/supplementary-guidelines.md b/specification/metrics/supplementary-guidelines.md new file mode 100644 index 00000000000..6aeaaed7ebb --- /dev/null +++ b/specification/metrics/supplementary-guidelines.md @@ -0,0 +1,305 @@ +# Supplementary Guidelines + +Note: this document is NOT a spec, it is provided to support the Metrics +[API](./api.md) and [SDK](./sdk.md) specifications, it does NOT add any extra +requirements to the existing specifications. 
+
+Table of Contents:
+
+* [Guidelines for instrumentation library
+  authors](#guidelines-for-instrumentation-library-authors)
+  * [Instrument selection](#instrument-selection)
+  * [Semantic convention](#semantic-convention)
+* [Guidelines for SDK authors](#guidelines-for-sdk-authors)
+  * [Aggregation temporality](#aggregation-temporality)
+  * [Memory management](#memory-management)
+
+## Guidelines for instrumentation library authors
+
+### Instrument selection
+
+The [Instruments](./api.md#instrument) are part of the [Metrics API](./api.md).
+They allow [Measurements](./api.md#measurement) to be recorded
+[synchronously](./api.md#synchronous-instrument) or
+[asynchronously](./api.md#asynchronous-instrument).
+
+Choosing the correct instrument is important, because:
+
+* It helps the library to achieve better efficiency. For example, if we want to
+  report room temperature to [Prometheus](https://prometheus.io), we want to
+  consider using an [Asynchronous Gauge](./api.md#asynchronous-gauge) rather
+  than periodically polling the sensor, so that we only access the sensor when
+  a scrape happens.
+* It makes consumption easier for the user of the library. For example, if
+  we want to report HTTP server request latency, we want to consider a
+  [Histogram](./api.md#histogram), so most of the users can get a reasonable
+  experience (e.g. default buckets, min/max) by simply enabling the metrics
+  stream, rather than doing extra configuration.
+* It clarifies the semantics of the metrics stream, so the consumers
+  have a better understanding of the results. For example, if we want to report
+  the process heap size, by using an [Asynchronous
+  UpDownCounter](./api.md#asynchronous-updowncounter) rather than an
+  [Asynchronous Gauge](./api.md#asynchronous-gauge), we've made it explicit that
+  the consumer can add up the numbers across all processes to get the "total
+  heap size".
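To make the last bullet concrete, here is a minimal Python sketch contrasting an additive measurement with a non-additive one. The process IDs, room names and values are made up for illustration and are not taken from the specification.

```python
# Hypothetical per-process heap sizes in bytes, as an Asynchronous
# UpDownCounter might report them.
heap_bytes_by_pid = {1001: 120_000_000, 1002: 80_000_000}

# Adding up across processes yields a meaningful "total heap size".
total_heap_bytes = sum(heap_bytes_by_pid.values())

# Hypothetical room temperatures in degrees Celsius, as an Asynchronous
# Gauge might report them.
temperature_by_room = {"kitchen": 21.5, "garage": 14.0}

# The sum of the temperatures has no physical meaning, so a consumer would
# pick a different aggregation, for example the average or the last value.
average_temperature = sum(temperature_by_room.values()) / len(temperature_by_room)

print(total_heap_bytes)
print(average_temperature)
```

The instrument type is what tells the consumer which of the two treatments applies, without having to know what the metric measures.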
+
+Here is one way of choosing the correct instrument:
+
+* I want to **count** something (by recording a delta value):
+  * If the value is monotonically increasing (the delta value is always
+    non-negative) - use a [Counter](./api.md#counter).
+  * If the value is NOT monotonically increasing (the delta value can be
+    positive, negative or zero) - use an
+    [UpDownCounter](./api.md#updowncounter).
+* I want to **record** or **time** something, and the **statistics** about this
+  thing are likely to be meaningful - use a [Histogram](./api.md#histogram).
+* I want to **measure** something (by reporting an absolute value):
+  * If it makes NO sense to add up the values across different dimensions, use
+    an [Asynchronous Gauge](./api.md#asynchronous-gauge).
+  * If it makes sense to add up the values across different dimensions:
+    * If the value is monotonically increasing - use an [Asynchronous
+      Counter](./api.md#asynchronous-counter).
+    * If the value is NOT monotonically increasing - use an [Asynchronous
+      UpDownCounter](./api.md#asynchronous-updowncounter).
+
+### Semantic convention
+
+Once you have decided [which instrument(s) to use](#instrument-selection), you
+will need to decide the names for the instruments and attributes.
+
+It is highly recommended that you align with the `OpenTelemetry Semantic
+Conventions`, rather than inventing your own semantics.
+
+## Guidelines for SDK authors
+
+### Aggregation temporality
+
+The OpenTelemetry Metrics [Data Model](./datamodel.md) and [SDK](./sdk.md) are
+designed to support both Cumulative and Delta
+[Temporality](./datamodel.md#temporality). It is important to understand that
+temporality will impact how the SDK could manage memory usage.
Let's take the +following HTTP requests example: + +* During the time range (T0, T1]: + * verb = `GET`, status = `200`, duration = `50 (ms)` + * verb = `GET`, status = `200`, duration = `100 (ms)` + * verb = `GET`, status = `500`, duration = `1 (ms)` +* During the time range (T1, T2]: + * no HTTP request has been received +* During the time range (T2, T3] + * verb = `GET`, status = `500`, duration = `5 (ms)` + * verb = `GET`, status = `500`, duration = `2 (ms)` +* During the time range (T3, T4]: + * verb = `GET`, status = `200`, duration = `100 (ms)` +* During the time range (T4, T5]: + * verb = `GET`, status = `200`, duration = `100 (ms)` + * verb = `GET`, status = `200`, duration = `30 (ms)` + * verb = `GET`, status = `200`, duration = `50 (ms)` + +Let's imagine we export the metrics as [Histogram](./datamodel.md#histogram), +and to simplify the story we will only have one histogram bucket `(-Inf, +Inf)`: + +If we export the metrics using **Delta Temporality**: + +* (T0, T1] + * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: + `1 (ms)` +* (T1, T2] + * nothing since we don't have any Measurement received +* (T2, T3] + * dimensions: {verb = `GET`, status = `500`}, count: `2`, min: `2 (ms)`, max: + `5 (ms)` +* (T3, T4] + * dimensions: {verb = `GET`, status = `200`}, count: `1`, min: `100 (ms)`, + max: `100 (ms)` +* (T4, T5] + * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `30 (ms)`, max: + `100 (ms)` + +You can see that the SDK **only needs to track what has happened after the +latest collection/export cycle**. For example, when the SDK started to process +measurements in (T1, T2], it can completely forget about +what has happened during (T0, T1]. 
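The Delta Temporality export above can be reproduced with a small Python sketch. The class `DeltaHistogram` and its methods are hypothetical names chosen for this illustration, not part of any OpenTelemetry SDK, and the single `(-Inf, +Inf)` bucket is reduced to a count plus min/max.

```python
class DeltaHistogram:
    """Single-bucket histogram whose state is dropped on every collection."""

    def __init__(self):
        self._points = {}  # (verb, status) -> (count, min, max)

    def record(self, verb, status, duration_ms):
        key = (verb, status)
        count, low, high = self._points.get(key, (0, duration_ms, duration_ms))
        self._points[key] = (count + 1, min(low, duration_ms), max(high, duration_ms))

    def collect(self):
        # Hand over the current points and start the next interval empty.
        points, self._points = self._points, {}
        return points


# The HTTP requests from the example, grouped by collection interval.
INTERVALS = [
    [("GET", 200, 50), ("GET", 200, 100), ("GET", 500, 1)],   # (T0, T1]
    [],                                                       # (T1, T2]
    [("GET", 500, 5), ("GET", 500, 2)],                       # (T2, T3]
    [("GET", 200, 100)],                                      # (T3, T4]
    [("GET", 200, 100), ("GET", 200, 30), ("GET", 200, 50)],  # (T4, T5]
]

histogram = DeltaHistogram()
delta_exports = []
for requests in INTERVALS:
    for verb, status, duration_ms in requests:
        histogram.record(verb, status, duration_ms)
    delta_exports.append(histogram.collect())

for export in delta_exports:
    print(export)
```

Because `collect` swaps in an empty dictionary, the memory held by the sketch is bounded by the number of attribute combinations seen within a single interval.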
+ +If we export the metrics using **Cumulative Temporality**: + +* (T0, T1] + * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: + `1 (ms)` +* (T0, T2] + * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: + `1 (ms)` +* (T0, T3] + * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: + `5 (ms)` +* (T0, T4] + * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `50 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: + `5 (ms)` +* (T0, T5] + * dimensions: {verb = `GET`, status = `200`}, count: `6`, min: `30 (ms)`, max: + `100 (ms)` + * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: + `5 (ms)` + +You can see that we are performing Delta->Cumulative conversion, and the SDK +**has to track what has happened prior to the latest collection/export cycle**, +in the worst case, the SDK **will have to remember what has happened since the +beginning of the process**. + +Imagine if we have a long running service and we collect metrics with 7 +dimensions and each dimension can have 30 different values. We might eventually +end up having to remember the complete set of all `21,870,000,000` permutations! +This **cardinality explosion** is a well-known challenge in the metrics space. + +Making it even worse, if we export the permutations even if there are no recent +updates, the export batch could become huge and will be very costly. For +example, do we really need/want to export the same thing for (T0, +T2] in the above case? 
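For comparison, the Cumulative Temporality export can be sketched in the same way. Again, `CumulativeHistogram` is a hypothetical name used only for this illustration. The only difference from a delta aggregator is that nothing is forgotten on collection, which is where the memory growth comes from.

```python
class CumulativeHistogram:
    """Single-bucket histogram that keeps every point since process start."""

    def __init__(self):
        self._points = {}  # (verb, status) -> (count, min, max)

    def record(self, verb, status, duration_ms):
        key = (verb, status)
        count, low, high = self._points.get(key, (0, duration_ms, duration_ms))
        self._points[key] = (count + 1, min(low, duration_ms), max(high, duration_ms))

    def collect(self):
        # Export a snapshot but keep the state: it is never reset.
        return dict(self._points)


# The HTTP requests from the example, grouped by collection interval.
INTERVALS = [
    [("GET", 200, 50), ("GET", 200, 100), ("GET", 500, 1)],   # (T0, T1]
    [],                                                       # (T1, T2]
    [("GET", 500, 5), ("GET", 500, 2)],                       # (T2, T3]
    [("GET", 200, 100)],                                      # (T3, T4]
    [("GET", 200, 100), ("GET", 200, 30), ("GET", 200, 50)],  # (T4, T5]
]

histogram = CumulativeHistogram()
cumulative_exports = []
for requests in INTERVALS:
    for verb, status, duration_ms in requests:
        histogram.record(verb, status, duration_ms)
    cumulative_exports.append(histogram.collect())

# Worst case from the text: 7 dimensions with 30 possible values each.
worst_case_streams = 30 ** 7

print(cumulative_exports[-1])
print(worst_case_streams)
```

Note that the export for (T0, T2] is identical to the one for (T0, T1], which is exactly the redundant batch the text asks about.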
+
+So here are some suggestions that we encourage SDK implementers to consider:
+
+* You want to control the memory usage rather than allow it to grow indefinitely
+  / unbounded - regardless of what aggregation temporality is being used.
+* You want to improve the memory efficiency by being able to **forget about
+  things that are no longer needed**.
+* You probably don't want to keep exporting the same thing over and over again,
+  if there are no updates. You might want to consider [Resets and
+  Gaps](./datamodel.md#resets-and-gaps). For example, if a Cumulative metrics
+  stream hasn't received any updates for a long period of time, would it be okay
+  to reset the start time?
+
+In the above case, we have Measurements reported by a [Histogram
+Instrument](./api.md#histogram). What if we collect measurements from an
+[Asynchronous Counter](./api.md#asynchronous-counter)?
+
+The following example shows the number of [page
+faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
+thread started:
+
+* During the time range (T0, T1]:
+  * pid = `1001`, tid = `1`, #PF = `50`
+  * pid = `1001`, tid = `2`, #PF = `30`
+* During the time range (T1, T2]:
+  * pid = `1001`, tid = `1`, #PF = `53`
+  * pid = `1001`, tid = `2`, #PF = `38`
+* During the time range (T2, T3]:
+  * pid = `1001`, tid = `1`, #PF = `56`
+  * pid = `1001`, tid = `2`, #PF = `42`
+* During the time range (T3, T4]:
+  * pid = `1001`, tid = `1`, #PF = `60`
+  * pid = `1001`, tid = `2`, #PF = `47`
+* During the time range (T4, T5]:
+  * thread 1 died, thread 3 started
+  * pid = `1001`, tid = `2`, #PF = `53`
+  * pid = `1001`, tid = `3`, #PF = `5`
+
+If we export the metrics using **Cumulative Temporality**:
+
+* (T0, T1]
+  * dimensions: {pid = `1001`, tid = `1`}, sum: `50`
+  * dimensions: {pid = `1001`, tid = `2`}, sum: `30`
+* (T0, T2]
+  * dimensions: {pid = `1001`, tid = `1`}, sum: `53`
+  * dimensions: {pid = `1001`, tid = `2`}, sum: `38`
+* (T0, T3]
+  * dimensions: {pid = `1001`, tid = `1`}, sum: `56`
+  * dimensions: {pid = `1001`, tid = `2`}, sum: `42`
+* (T0, T4]
+  * dimensions: {pid = `1001`, tid = `1`}, sum: `60`
+  * dimensions: {pid = `1001`, tid = `2`}, sum: `47`
+* (T0, T5]
+  * dimensions: {pid = `1001`, tid = `2`}, sum: `53`
+  * dimensions: {pid = `1001`, tid = `3`}, sum: `5`
+
+It is quite straightforward - we just take the data being reported from the
+asynchronous instruments and send it. We might want to consider if [Resets and
+Gaps](./datamodel.md#resets-and-gaps) should be used to denote the end of a
+metric stream - e.g. thread 1 died, the thread ID might be reused by the
+operating system, and we probably don't want to confuse the metrics backend.
+
+If we export the metrics using **Delta Temporality**:
+
+* (T0, T1]
+  * dimensions: {pid = `1001`, tid = `1`}, delta: `50`
+  * dimensions: {pid = `1001`, tid = `2`}, delta: `30`
+* (T1, T2]
+  * dimensions: {pid = `1001`, tid = `1`}, delta: `3`
+  * dimensions: {pid = `1001`, tid = `2`}, delta: `8`
+* (T2, T3]
+  * dimensions: {pid = `1001`, tid = `1`}, delta: `3`
+  * dimensions: {pid = `1001`, tid = `2`}, delta: `4`
+* (T3, T4]
+  * dimensions: {pid = `1001`, tid = `1`}, delta: `4`
+  * dimensions: {pid = `1001`, tid = `2`}, delta: `5`
+* (T4, T5]
+  * dimensions: {pid = `1001`, tid = `2`}, delta: `6`
+  * dimensions: {pid = `1001`, tid = `3`}, delta: `5`
+
+You can see that we are performing Cumulative->Delta conversion, and it requires
+us to remember the last value of **every single permutation we've encountered so
+far**, because if we don't, we won't be able to calculate the delta value using
+`current value - last value`. And as you can tell, this is super expensive.
+
+Making it more interesting, if we have min/max values, it is **mathematically
+impossible** to reliably deduce the Delta temporality from Cumulative
+temporality.
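The Cumulative->Delta conversion for sums described above can be sketched as follows. `CumulativeToDelta` is a hypothetical helper written for this illustration, and it assumes that a stream seen for the first time starts from zero, which matches the (T0, T1] row of the example.

```python
class CumulativeToDelta:
    """Turns cumulative sums into deltas by remembering the last value per stream."""

    def __init__(self):
        self.last_values = {}  # (pid, tid) -> last cumulative sum seen

    def convert(self, observations):
        deltas = {}
        for key, value in observations.items():
            # Assumption: a stream that has not been seen before starts at 0.
            deltas[key] = value - self.last_values.get(key, 0)
            self.last_values[key] = value
        return deltas


# Page fault counts per (pid, tid) as observed at each collection.
OBSERVATIONS = [
    {(1001, 1): 50, (1001, 2): 30},  # (T0, T1]
    {(1001, 1): 53, (1001, 2): 38},  # (T1, T2]
    {(1001, 1): 56, (1001, 2): 42},  # (T2, T3]
    {(1001, 1): 60, (1001, 2): 47},  # (T3, T4]
    {(1001, 2): 53, (1001, 3): 5},   # (T4, T5]: thread 1 died, thread 3 started
]

converter = CumulativeToDelta()
page_fault_deltas = [converter.convert(observed) for observed in OBSERVATIONS]

for deltas in page_fault_deltas:
    print(deltas)

# The entry for the dead thread 1 is still held in memory.
print(sorted(converter.last_values))
```

The sketch deliberately never evicts entries, so `last_values` still contains thread 1 after it has died; a real implementation would need an eviction policy. The same subtraction has no counterpart for min/max, as the cases that follow illustrate.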
For example:
+
+* If the maximum value is 10 during (T0, T2] and the maximum value is 20
+  during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
+* If the maximum value is 20 during (T0, T2] and the maximum value is also 20
+  during (T0, T3], we wouldn't know what the maximum value is during (T2, T3],
+  unless we know that there is no value (count = 0).
+
+So here are some suggestions that we encourage SDK implementers to consider:
+
+* You probably don't want to encourage your users to do Cumulative->Delta
+  conversion. Actually, you might want to discourage them from doing this.
+* If you have to do Cumulative->Delta conversion, and you encounter min/max,
+  rather than drop the data on the floor, you might want to convert them to
+  something useful - e.g. [Gauge](./datamodel.md#gauge).
+
+### Memory management
+
+Memory management is a broad topic; here we will only cover some of the most
+important things for the OpenTelemetry SDK.
+
+**Choose a better design so the SDK has fewer things to remember**, and avoid
+keeping things in memory unless there is a real need. One good example is the
+[aggregation temporality](#aggregation-temporality).
+
+**Design a better memory layout**, so the storage is efficient and accessing the
+storage can be fast. This is normally specific to the target programming
+language and platform. For example, aligning the memory to the CPU cache line,
+keeping hot memory regions close to each other, keeping the memory close to the
+hardware (e.g. non-paged pool,
+[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access)).
+
+**Pre-allocate and pool the memory**, so the SDK doesn't have to allocate memory
+on-the-fly. This is especially useful to language runtimes that have garbage
+collectors, as it ensures the hot path in the code won't trigger garbage
+collection.
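As a rough illustration of the pre-allocation idea, the following Python sketch keeps a fixed pool of reusable data points. `HistogramPoint` and `PointPool` are hypothetical names, and Python is used only for readability; the benefit is most pronounced in runtimes where allocation on the hot path is costly.

```python
class HistogramPoint:
    """Reusable accumulator; __slots__ avoids a per-instance dictionary."""

    __slots__ = ("count", "total", "minimum", "maximum")

    def __init__(self):
        self.reset()

    def reset(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")


class PointPool:
    """Fixed-size pool: all points are allocated once, up front."""

    def __init__(self, size):
        self._free = [HistogramPoint() for _ in range(size)]

    def acquire(self):
        # Returns None when the pool is exhausted, so the caller can decide
        # to combine or drop data instead of allocating more memory.
        return self._free.pop() if self._free else None

    def release(self, point):
        point.reset()
        self._free.append(point)


pool = PointPool(size=2)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()   # nothing left in the pool
pool.release(first)
reused = pool.acquire()      # the same object is handed out again
```

Returning `None` on exhaustion is one possible design choice; it gives the caller an explicit point at which to apply a memory limit.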
+ +**Limit the memory usage, and handle critical memory condition.** The general +expectation is that a telemetry SDK should not fail the application. This can be +done via some dimension-capping algorithm - e.g. start to combine/drop some data +points when the SDK hits the memory limit, and provide a mechanism to report the +data loss. + +**Provide configurations to the application owner.** The answer to _"what is an +efficient memory usage"_ is ultimately depending on the goal of the application +owner. For example, the application owners might want to spend more memory in +order to keep more permutations of metrics dimensions, or they might want to use +memory aggressively for certain dimensions that are important, and keep a +conservative limit for dimensions that are less important. diff --git a/specification/protocol/exporter.md b/specification/protocol/exporter.md index 5a3268f2f54..429be0af077 100644 --- a/specification/protocol/exporter.md +++ b/specification/protocol/exporter.md @@ -10,38 +10,83 @@ The following configuration options MUST be available to configure the OTLP expo | Configuration Option | Description | Default | Env variable | | -------------------- | ------------------------------------------------------------ | ----------------- | ------------------------------------------------------------ | -| Endpoint (OTLP/HTTP) | Target to which the exporter is going to send spans or metrics. The endpoint MUST be a valid URL with scheme (http or https) and host, and MAY contain a port and path. A scheme of https indicates a secure connection. When using `OTEL_EXPORTER_OTLP_ENDPOINT`, exporters SHOULD follow the collector convention of appending the version and signal to the path (e.g. `v1/traces` or `v1/metrics`), if not present already. The per-signal endpoint configuration options take precedence and can be used to override this behavior. See the [OTLP Specification][otlphttp-req] for more details. 
| `https://localhost:4317` | `OTEL_EXPORTER_OTLP_ENDPOINT` `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | +| Endpoint (OTLP/HTTP) | Target URL to which the exporter is going to send spans or metrics. The endpoint MUST be a valid URL with scheme (http or https) and host, MAY contain a port, SHOULD contain a path and MUST NOT contain other parts (such as query string or fragment). A scheme of https indicates a secure connection. When using `OTEL_EXPORTER_OTLP_ENDPOINT`, exporters MUST construct per-signal URLs as [described below](#per-signal-urls). The per-signal endpoint configuration options take precedence and can be used to override this behavior (the URL is used as-is for them, without any modifications). See the [OTLP Specification][otlphttp-req] for more details. | `https://localhost:4318` | `OTEL_EXPORTER_OTLP_ENDPOINT` `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | | Endpoint (OTLP/gRPC) | Target to which the exporter is going to send spans or metrics. The endpoint SHOULD accept any form allowed by the underlying gRPC client implementation. Additionally, the endpoint MUST accept a URL with a scheme of either `http` or `https`. A scheme of `https` indicates a secure connection and takes precedence over the `insecure` configuration setting. If the gRPC client implementation does not support an endpoint with a scheme of `http` or `https` then the endpoint SHOULD be transformed to the most sensible format for that implementation. | `https://localhost:4317` | `OTEL_EXPORTER_OTLP_ENDPOINT` `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | | Insecure | Whether to enable client transport security for the exporter's gRPC connection. This option only applies to OTLP/gRPC - OTLP/HTTP always uses the scheme provided for the `endpoint`. Implementations MAY choose to not implement the `insecure` option if it is not required or supported by the underlying gRPC client implementation. 
| `false` | `OTEL_EXPORTER_OTLP_INSECURE` `OTEL_EXPORTER_OTLP_SPAN_INSECURE` `OTEL_EXPORTER_OTLP_METRIC_INSECURE` | | Certificate File | The trusted certificate to use when verifying a server's TLS credentials. Should only be used for a secure connection. | n/a | `OTEL_EXPORTER_OTLP_CERTIFICATE` `OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE` `OTEL_EXPORTER_OTLP_METRICS_CERTIFICATE` | | Headers | Key-value pairs to be used as headers associated with gRPC or HTTP requests. See [Specifying headers](./exporter.md#specifying-headers-via-environment-variables) for more details. | n/a | `OTEL_EXPORTER_OTLP_HEADERS` `OTEL_EXPORTER_OTLP_TRACES_HEADERS` `OTEL_EXPORTER_OTLP_METRICS_HEADERS` | -| Compression | Compression key for supported compression types. Supported compression: `gzip`| No value | `OTEL_EXPORTER_OTLP_COMPRESSION` `OTEL_EXPORTER_OTLP_TRACES_COMPRESSION` `OTEL_EXPORTER_OTLP_METRICS_COMPRESSION` | +| Compression | Compression key for supported compression types. Supported compression: `gzip`| No value [1] | `OTEL_EXPORTER_OTLP_COMPRESSION` `OTEL_EXPORTER_OTLP_TRACES_COMPRESSION` `OTEL_EXPORTER_OTLP_METRICS_COMPRESSION` | | Timeout | Maximum time the OTLP exporter will wait for each batch export. | 10s | `OTEL_EXPORTER_OTLP_TIMEOUT` `OTEL_EXPORTER_OTLP_TRACES_TIMEOUT` `OTEL_EXPORTER_OTLP_METRICS_TIMEOUT` | -| Protocol | The transport protocol. Options MAY include `grpc`, `http/protobuf`, and `http/json`. See [Specify Protocol](./exporter.md#specify-protocol) for more details. | n/a | `OTEL_EXPORTER_OTLP_PROTOCOL` `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` | +| Protocol | The transport protocol. Options MAY include `grpc`, `http/protobuf`, and `http/json`. See [Specify Protocol](./exporter.md#specify-protocol) for more details. 
| `http/protobuf` [2] | `OTEL_EXPORTER_OTLP_PROTOCOL` `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` |
+
+**[1]**: If no compression value is explicitly specified, SIGs can default to the value they deem
+most useful among the supported options. This is especially important in the presence of technical constraints,
+e.g. directly sending telemetry data from mobile devices to backend servers.

 Supported values for `OTEL_EXPORTER_OTLP_*COMPRESSION` options:

-- If the value is missing, then compression is disabled.
-- `gzip` is the only specified compression method for now. Other options MAY be supported by language SDKs and should be documented for each particular language.
+- `none` if compression is disabled.
+- `gzip` is the only specified compression method for now.
+
+
+
+### Endpoint URLs for OTLP/HTTP
+
+Based on the environment variables above, the OTLP/HTTP exporter MUST construct URLs
+for each signal as follows:

-Example 1
+1. For the per-signal variables (`OTEL_EXPORTER_OTLP_<signal>_ENDPOINT`), the URL
+   MUST be used as-is without any modification. The only exception is that if a
+   URL contains no path part, the root path `/` MUST be used (see [Example 2](#example-2)).
+2. If signals are sent that have no per-signal configuration from the previous point,
+   `OTEL_EXPORTER_OTLP_ENDPOINT` is used as a base URL and the signals are sent
+   to these paths relative to that:
+
+   * Traces: `v1/traces`
+   * Metrics: `v1/metrics`.
+
+   Non-normatively, this could be implemented by ensuring that the base URL ends with
+   a slash and then appending the relative URLs as strings.
+
+#### Example 1

 The following configuration sends all signals to the same collector:

 ```bash
-export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
+export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318
 ```

-Example 2
+Traces are sent to `http://collector:4318/v1/traces` and metrics to
+`http://collector:4318/v1/metrics`.
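The two URL construction rules can be sketched in Python as follows. This is a non-normative illustration: the function name `otlp_http_url` and the `env` dictionary parameter are assumptions made for the sketch, while the variable names, relative paths and default endpoint come from the text above.

```python
from urllib.parse import urlsplit, urlunsplit

# Relative paths per signal, as listed in rule 2.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics"}
DEFAULT_ENDPOINT = "https://localhost:4318"


def otlp_http_url(signal, env):
    """Return the URL an OTLP/HTTP exporter would use for the given signal."""
    per_signal = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if per_signal:
        # Rule 1: use the URL as-is, except that an empty path becomes "/".
        parts = urlsplit(per_signal)
        if parts.path == "":
            return urlunsplit(parts._replace(path="/"))
        return per_signal
    # Rule 2: treat the generic endpoint as a base URL, make sure it ends
    # with a slash, then append the relative path as a string.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT)
    if not base.endswith("/"):
        base += "/"
    return base + SIGNAL_PATHS[signal]


same_collector = {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318"}
per_signal_only = {
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://collector:4318",
    "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": "https://collector.example.com/v1/metrics",
}
mixed = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318/mycollector/",
    "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": "https://collector.example.com/v1/metrics/",
}

print(otlp_http_url("traces", same_collector))
print(otlp_http_url("traces", per_signal_only))
print(otlp_http_url("traces", mixed))
print(otlp_http_url("metrics", mixed))
```

The three dictionaries mirror the configurations used in the examples of this section, so the printed URLs can be compared against the ones stated there.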
-Traces and metrics are sent to different collectors: +#### Example 2 -```bash -export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://collector:4317 +Traces and metrics are sent to different collectors and paths: +```bash +export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://collector:4318 export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://collector.example.com/v1/metrics ``` +This will send traces directly to the root path `http://collector:4318/` +(`/v1/traces` is only automatically added when using the non-signal-specific +environment variable) and metrics +to `https://collector.example.com/v1/metrics`. + +#### Example 3 + +The following configuration sends all signals except for metrics to the same collector: + +```bash +export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318/mycollector/ +export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://collector.example.com/v1/metrics/ +``` + +Traces are sent to `http://collector:4318/mycollector/v1/traces` +and metrics to `https://collector.example.com/v1/metrics/` +(other signals, would they be defined, would be sent to their specific paths +relative to `http://collector:4318/mycollector/`). + ### Specify Protocol The `OTEL_EXPORTER_OTLP_PROTOCOL`, `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`, and `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` environment variables specify the OTLP transport protocol. Supported values: @@ -50,9 +95,14 @@ The `OTEL_EXPORTER_OTLP_PROTOCOL`, `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`, and `OT - `http/protobuf` for protobuf-encoded data over HTTP connection - `http/json` for JSON-encoded data over HTTP connection -SDKs MUST support either `grpc` or `http/protobuf` and SHOULD support both. They also MAY support `http/json`. +**[2]**: SDKs SHOULD support both `grpc` and `http/protobuf` transports and MUST +support at least one of them. If they support only one, it SHOULD be +`http/protobuf`. They also MAY support `http/json`. -SDKs have an unspecified default, if no configuration is provided. 
+If no configuration is provided the default transport SHOULD be `http/protobuf`
+unless SDKs have good reasons to choose `grpc` as the default (e.g. for backward
+compatibility reasons when `grpc` was already the default in a stable SDK
+release).

 ### Specifying headers via environment variables

diff --git a/specification/protocol/otlp.md b/specification/protocol/otlp.md
index 6c83e2dc571..ff5f5679a99 100644
--- a/specification/protocol/otlp.md
+++ b/specification/protocol/otlp.md
@@ -70,6 +70,11 @@ OTLP is a request/response style protocols: the clients send requests, the
 server replies with corresponding responses. This document defines one
 requests and response type: `Export`.

+All server components MUST support the following transport compression options:
+
+* No compression, denoted by `none`.
+* Gzip compression, denoted by `gzip`.
+
 ### OTLP/gRPC

 **Status**: [Stable](../document-status.md)
@@ -473,7 +478,7 @@ connections SHOULD be configurable.

 #### OTLP/HTTP Default Port

-The default network port for OTLP/HTTP is 4317.
+The default network port for OTLP/HTTP is 4318.

 ## Implementation Recommendations

diff --git a/specification/resource/semantic_conventions/k8s.md b/specification/resource/semantic_conventions/k8s.md
index 0ed9a477b4c..8414fbe2462 100644
--- a/specification/resource/semantic_conventions/k8s.md
+++ b/specification/resource/semantic_conventions/k8s.md
@@ -88,6 +88,7 @@ to a running container.
 | Attribute | Type | Description | Examples | Required |
 |---|---|---|---|---|
 | `k8s.container.name` | string | The name of the Container in a Pod template. | `redis` | No |
+| `k8s.container.restart_count` | int | Number of times the container was restarted. This attribute can be used to identify a particular container (running or stopped) within a container spec.
| `0`; `2` | No | ## ReplicaSet diff --git a/specification/trace/semantic_conventions/rpc.md b/specification/trace/semantic_conventions/rpc.md index 87879313a6b..49078143be8 100644 --- a/specification/trace/semantic_conventions/rpc.md +++ b/specification/trace/semantic_conventions/rpc.md @@ -14,11 +14,11 @@ This document defines how to describe remote procedure calls - [Span name](#span-name) - [Attributes](#attributes) - [Service name](#service-name) + - [Events](#events) - [Distinction from HTTP spans](#distinction-from-http-spans) - [gRPC](#grpc) - [gRPC Attributes](#grpc-attributes) - [gRPC Status](#grpc-status) - - [Events](#events) - [JSON RPC](#json-rpc) - [JSON RPC Attributes](#json-rpc-attributes) @@ -95,6 +95,32 @@ Generally, a user SHOULD NOT set `peer.service` to a fully qualified RPC service [`service.name`]: ../../resource/semantic_conventions/README.md#service [`peer.service`]: span-general.md#general-remote-service-attributes +### Events + +In the lifetime of an RPC stream, an event for each message sent/received on +client and server spans SHOULD be created. In case of unary calls only one sent +and one received message will be recorded for both client and server spans. + +The event name MUST be `"message"`. + + +| Attribute | Type | Description | Examples | Required | +|---|---|---|---|---| +| `message.type` | string | Whether this is a received or sent message. | `SENT` | No | +| `message.id` | int | MUST be calculated as two different counters starting from `1` one for sent messages and one for received message. [1] | | No | +| `message.compressed_size` | int | Compressed size of the message in bytes. | | No | +| `message.uncompressed_size` | int | Uncompressed size of the message in bytes. | | No | + +**[1]:** This way we guarantee that the values will be consistent between different implementations. 
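The `message.id` rule (two independent counters, each starting from `1`) can be sketched in Python as follows. `MessageEvents` is a hypothetical helper written for this illustration; in an actual instrumentation the resulting attributes would be attached to the client or server span as events named `"message"`.

```python
class MessageEvents:
    """Collects "message" events with separate ID counters per direction."""

    def __init__(self):
        # One counter for sent messages and one for received messages.
        self._next_id = {"SENT": 1, "RECEIVED": 1}
        self.events = []  # list of (event name, attributes) pairs

    def add(self, message_type, uncompressed_size=None, compressed_size=None):
        if message_type not in self._next_id:
            raise ValueError("message.type must be SENT or RECEIVED")
        attributes = {
            "message.type": message_type,
            "message.id": self._next_id[message_type],
        }
        self._next_id[message_type] += 1
        # The size attributes are optional and only set when known.
        if uncompressed_size is not None:
            attributes["message.uncompressed_size"] = uncompressed_size
        if compressed_size is not None:
            attributes["message.compressed_size"] = compressed_size
        self.events.append(("message", attributes))
        return attributes


stream = MessageEvents()
first_sent = stream.add("SENT", uncompressed_size=120)
first_received = stream.add("RECEIVED")
second_sent = stream.add("SENT")

for name, attributes in stream.events:
    print(name, attributes)
```

Keeping the two counters separate means the first received message gets ID `1` regardless of how many messages were sent before it.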
+ +`message.type` MUST be one of the following: + +| Value | Description | +|---|---| +| `SENT` | sent | +| `RECEIVED` | received | + + ### Distinction from HTTP spans HTTP calls can generally be represented using just [HTTP spans](./http.md). @@ -143,33 +169,6 @@ For remote procedure calls via [gRPC][], additional conventions are described in The [Span Status](../api.md#set-status) MUST be left unset for an `OK` gRPC status code, and set to `Error` for all others. -### Events - -In the lifetime of a gRPC stream, an event for each message sent/received on -client and server spans SHOULD be created. In case of -unary calls only one sent and one received message will be recorded for both -client and server spans. - -The event name MUST be `"message"`. - - -| Attribute | Type | Description | Examples | Required | -|---|---|---|---|---| -| `message.type` | string | Whether this is a received or sent message. | `SENT` | No | -| `message.id` | int | MUST be calculated as two different counters starting from `1` one for sent messages and one for received message. [1] | | No | -| `message.compressed_size` | int | Compressed size of the message in bytes. | | No | -| `message.uncompressed_size` | int | Uncompressed size of the message in bytes. | | No | - -**[1]:** This way we guarantee that the values will be consistent between different implementations. - -`message.type` MUST be one of the following: - -| Value | Description | -|---|---| -| `SENT` | sent | -| `RECEIVED` | received | - - ## JSON RPC Conventions specific to [JSON RPC](https://www.jsonrpc.org/).