diff --git a/content/en/about/faq/metrics-and-logs/metric-expiry.md b/content/en/about/faq/metrics-and-logs/metric-expiry.md index 04d4bf3b3354f..ee4a8d74e90bf 100644 --- a/content/en/about/faq/metrics-and-logs/metric-expiry.md +++ b/content/en/about/faq/metrics-and-logs/metric-expiry.md @@ -4,8 +4,18 @@ weight: 20 --- Short-lived metrics can hamper the performance of Prometheus, as they often are a large source of label cardinality. Cardinality is a measure of the number of unique values for a label. To manage the impact of your short-lived metrics on Prometheus, you must first identify the high cardinality metrics and labels. Prometheus provides cardinality information at its `/status` page. Additional information can be retrieved [via PromQL](https://www.robustperception.io/which-are-my-biggest-metrics). + There are several ways to reduce the cardinality of Istio metrics: +* On Istio 1.28.0 and above, add the + [`sidecar.istio.io/statsEvictionInterval`](/docs/reference/config/annotations/) + annotation to workload pods to expire metrics for inactive peers. This will + help prevent endless growth of the metric scrape responses from the Istio + proxy and the resulting large `scrape_samples_scraped` and + `scrape_response_size_bytes` for job instances. This will not prevent + Prometheus TSDB index bloat and label churn because Prometheus must still + record all the unique values. But it will help with excessive scrape-time + memory use. * Disable host header fallback. The `destination_service` label is one potential source of high-cardinality. The values for `destination_service` default to the host header if the Istio proxy is not able to determine the destination service from other request metadata. @@ -13,6 +23,19 @@ There are several ways to reduce the cardinality of Istio metrics: In this case, follow the [metric customization](/docs/tasks/observability/metrics/customize-metrics/) guide to disable host header fallback mesh wide. 
To disable host header fallback for a particular workload or namespace, you need to copy the stats `EnvoyFilter` configuration, update it to have host header fallback disabled, and apply it with a more specific selector. [This issue](https://github.com/istio/istio/issues/25963#issuecomment-666037411) has more detail on how to achieve this. -* Drop unnecessary labels from collection. If the label with high cardinality is not needed, you can drop it from metric collection via [metric customization](/docs/tasks/observability/metrics/customize-metrics/) using `tags_to_remove`. -* Normalize label values, either through federation or classification. - If the information provided by the label is desired, you can use [Prometheus federation](/docs/ops/best-practices/observability/#using-prometheus-for-production-scale-monitoring) or [request classification](/docs/tasks/observability/metrics/classify-metrics/) to normalize the label. +* Disable unnecessary labels or entire metric series. If the label or metric + with high cardinality is not needed, you can drop it from metric generation + via + [metric customization](/docs/tasks/observability/metrics/customize-metrics/) + using a `Telemetry` resource's + [`metricsOverrides`](/docs/reference/config/telemetry/#MetricsOverrides). + See [Telemetry API](/docs/tasks/observability/telemetry/) for examples. +* Normalize label values through federation or classification. + If the information provided by the label is desired, you can use [Prometheus federation](/docs/ops/best-practices/observability/#using-prometheus-for-production-scale-monitoring), Istio workload k8s labels like [`service.istio.io/workload-name`](/docs/reference/config/labels/index.html), or [request classification](/docs/tasks/observability/metrics/classify-metrics/) to normalize the label. + +It is not recommended to use Prometheus scrape-time label rewriting to reduce +cardinality by dropping unwanted labels. 
Prometheus does not perform +aggregation during label rewriting, so dropping labels may create conflicting +series where two or more series have the same labels but different values. Use +Istio's `Telemetry` configuration to suppress the unwanted dimension(s) +instead. diff --git a/content/en/about/faq/metrics-and-logs/telemetry-v1-vs-v2.md b/content/en/about/faq/metrics-and-logs/telemetry-v1-vs-v2.md index 2f4bee86f2a85..b1b23ab8792fd 100644 --- a/content/en/about/faq/metrics-and-logs/telemetry-v1-vs-v2.md +++ b/content/en/about/faq/metrics-and-logs/telemetry-v1-vs-v2.md @@ -41,11 +41,15 @@ v2 which are listed below: in Mixer-based telemetry. However, more buckets are available by default in in-proxy telemetry for latency metrics at the lower latency levels. -* **No metric expiration for short-lived metrics** +* **No metric expiration for short-lived metrics by default** Mixer-based telemetry supported metric expiration whereby metrics which were not generated for a configurable amount of time were de-registered for - collection by Prometheus. This is useful in scenarios, such as one-off jobs, that generate short-lived metrics. De-registering + collection by Prometheus. This is useful in scenarios, such as one-off jobs, + that generate short-lived metrics. De-registering the metrics prevents reporting of metrics which would no longer change in the future, thereby reducing network traffic and storage in Prometheus. - This expiration mechanism is not available in in-proxy telemetry. - The workaround for this can be found [here](/about/faq/#metric-expiry). + + The [`sidecar.istio.io/statsEvictionInterval` annotation](/docs/reference/config/annotations/) + provides equivalent functionality in Istio 1.28.0 and above, but metric expiry + is not enabled by default. See also + [FAQ: metric expiry](/about/faq/#metric-expiry). 
diff --git a/content/en/docs/reference/config/metrics/index.md b/content/en/docs/reference/config/metrics/index.md index 0d721cc0b2700..4ca999aea9154 100644 --- a/content/en/docs/reference/config/metrics/index.md +++ b/content/en/docs/reference/config/metrics/index.md @@ -45,65 +45,81 @@ For TCP traffic, Istio generates the following metrics: * **Tcp Connections Closed** (`istio_tcp_connections_closed_total`): This is a `COUNTER` incremented for every closed connection. +The metrics Istio emits can be overridden with [the `Telemetry` resource's `metricsOverrides` field](/docs/reference/config/telemetry/#MetricsOverrides); see [Telemetry API](/docs/tasks/observability/telemetry/). + ## Labels -* **Reporter**: This identifies the reporter of the request. It is set to `destination` +Labels are added to metrics to identify unique series or provide auxiliary +information. + +The label name exposed in Prometheus scrapes and used when referring to the +label in configuration is shown in parentheses below. + +* **Reporter** (`reporter`): This identifies the reporter of the request. It is set to `destination` if report is from a server Istio proxy and `source` if report is from a client Istio proxy or a gateway. -* **Source Workload**: This identifies the name of source workload which +* **Source Workload** (`source_workload`): This identifies the name of source workload which controls the source, or `unknown` if the source information is missing. -* **Source Workload Namespace**: This identifies the namespace of the source + See also workload label + [`service.istio.io/workload-name`](/docs/reference/config/labels/index.html) + and proxy env-var `ISTIO_META_WORKLOAD_NAME`. + +* **Source Workload Namespace** (`source_workload_namespace`): This identifies the namespace of the source workload, or `unknown` if the source information is missing. -* **Source Principal**: This identifies the peer principal of the traffic source. 
+* **Source Principal** (`source_principal`): This identifies the peer principal of the traffic source. It is set when peer authentication is used. -* **Source App**: This identifies the source application based on `app` label +* **Source App** (`source_app`): This identifies the source application based on `app` label of the source workload, or `unknown` if the source information is missing. -* **Source Version**: This identifies the version of the source workload, or +* **Source Version** (`source_version`): This identifies the version of the source workload, or `unknown` if the source information is missing. -* **Destination Workload**: This identifies the name of destination workload, +* **Destination Workload** (`destination_workload`): This identifies the name of destination workload, or `unknown` if the destination information is missing. -* **Destination Workload Namespace**: This identifies the namespace of the + See also workload label + [`service.istio.io/workload-name`](/docs/reference/config/labels/index.html) + and proxy env-var `ISTIO_META_WORKLOAD_NAME`. + +* **Destination Workload Namespace** (`destination_workload_namespace`): This identifies the namespace of the destination workload, or `unknown` if the destination information is missing. -* **Destination Principal**: This identifies the peer principal of the traffic destination. +* **Destination Principal** (`destination_principal`): This identifies the peer principal of the traffic destination. It is set when peer authentication is used. -* **Destination App**: This identifies the destination application based on +* **Destination App** (`destination_app`): This identifies the destination application based on `app` label of the destination workload, or `unknown` if the destination information is missing. 
-* **Destination Version**: This identifies the version of the destination workload, +* **Destination Version** (`destination_version`): This identifies the version of the destination workload, or `unknown` if the destination information is missing. -* **Destination Service**: This identifies destination service host responsible +* **Destination Service** (`destination_service`): This identifies destination service host responsible for an incoming request. Ex: `details.default.svc.cluster.local`. -* **Destination Service Name**: This identifies the destination service name. +* **Destination Service Name** (`destination_service_name`): This identifies the destination service name. Ex: `details`. -* **Destination Service Namespace**: This identifies the namespace of +* **Destination Service Namespace** (`destination_service_namespace`): This identifies the namespace of destination service. -* **Request Protocol**: This identifies the protocol of the request. It is set +* **Request Protocol** (`request_protocol`): This identifies the protocol of the request. It is set to request or connection protocol. -* **Response Code**: This identifies the response code of the request. This +* **Response Code** (`response_code`): This identifies the response code of the request. This label is present only on HTTP metrics. -* **Connection Security Policy**: This identifies the service authentication policy of +* **Connection Security Policy** (`connection_security_policy`): This identifies the service authentication policy of the request. It is set to `mutual_tls` when Istio is used to make communication secure and report is from destination. It is set to `unknown` when report is from source since security policy cannot be properly populated. -* **Response Flags**: Additional details about the response or connection from proxy. +* **Response Flags** (`response_flags`): Additional details about the response or connection from proxy. 
In case of Envoy, see `%RESPONSE_FLAGS%` in [Envoy Access Log](https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#config-access-log-format-response-flags) for more detail. @@ -117,11 +133,18 @@ For TCP traffic, Istio generates the following metrics: destination_canonical_revision {{< /text >}} -* **Destination Cluster**: This identifies the cluster of the destination workload. + See also labels + [`service.istio.io/canonical-name`](/docs/reference/config/labels/#ServiceCanonicalName) + and + [`service.istio.io/canonical-revision`](/docs/reference/config/labels/#ServiceCanonicalRevision). + +* **Destination Cluster** (`destination_cluster`): This identifies the cluster of the destination workload. This is set by: `global.multiCluster.clusterName` at cluster install time. -* **Source Cluster**: This identifies the cluster of the source workload. +* **Source Cluster** (`source_cluster`): This identifies the cluster of the source workload. This is set by: `global.multiCluster.clusterName` at cluster install time. -* **gRPC Response Status**: This identifies the response status of the gRPC. This +* **gRPC Response Status** (`grpc_response_status`): This identifies the response status of the gRPC. This label is present only on gRPC metrics. + +Metric dimensions can be suppressed with [the `Telemetry` resource's `metricsOverrides.tagOverride` field](/docs/reference/config/telemetry/#MetricsOverrides); see [Telemetry API](/docs/tasks/observability/telemetry/). Labels may also be added or modified using [metric classification](/docs/tasks/observability/metrics/classify-metrics/) filters.
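
Note on the change above: the new text points readers at the `Telemetry` resource for suppressing a metric dimension but shows no manifest. A minimal sketch of what that could look like follows. The resource name, the choice of `destination_service` as the label to drop, and the use of `istio-system` as the root namespace are illustrative assumptions, not part of this diff. The field names (`overrides`, `tagOverrides`, `operation: REMOVE`) are written as they appear in the Telemetry API reference as best recalled, and should be verified against the reference for the Istio version in use.

```yaml
# Sketch only: names marked "hypothetical" are not taken from the diff above.
apiVersion: telemetry.istio.io/v1   # older releases may only serve v1alpha1
kind: Telemetry
metadata:
  name: drop-destination-service    # hypothetical resource name
  namespace: istio-system           # assumes istio-system is the mesh root namespace
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: ALL_METRICS      # apply to every standard Istio metric
            mode: CLIENT_AND_SERVER  # both source and destination reporters
          tagOverrides:
            destination_service:     # label name as listed in the metrics reference
              operation: REMOVE      # stop emitting this dimension at the proxy
```

Because the label is removed where the metric is generated, the proxy reports one already-aggregated series, which avoids the conflicting-series problem described for Prometheus scrape-time label rewriting.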