diff --git a/docs/router/configuration.mdx b/docs/router/configuration.mdx index 13d53a30..8d7a060e 100644 --- a/docs/router/configuration.mdx +++ b/docs/router/configuration.mdx @@ -602,6 +602,7 @@ This option may change or be removed in future versions as the OpenTelemetry SDK | METRICS_OTLP_EXCLUDE_METRIC_LABELS | exclude_metric_labels | | The metric labels to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] | | METRICS_OTLP_CONNECTION_STATS | connection_stats | | Enable connection metrics. | false | | METRICS_OTLP_CIRCUIT_BREAKER | circuit_breaker | | Ensure that circuit breaker metrics are enabled for OTEL. | false | +| METRICS_OTLP_STREAM | streams | | Enable EDFS stream metrics. | false | ### Attributes @@ -627,6 +628,7 @@ telemetry: graphql_cache: true exclude_metrics: [] exclude_metric_labels: [] + streams: true attributes: - key: "x-new-attribute" default: "foo" @@ -651,6 +653,7 @@ telemetry: | PROMETHEUS_EXCLUDE_METRIC_LABELS | exclude_metric_labels | | | | | PROMETHEUS_EXCLUDE_SCOPE_INFO | exclude_scope_info | | Exclude scope info from Prometheus metrics. | false | | PROMETHEUS_CIRCUIT_BREAKER | circuit_breaker | | Enable the circuit breaker metrics for prometheus metric collection. | false | +| PROMETHEUS_OTLP_STREAM | streams | | Enable EDFS stream metrics. | false | ### Example YAML config: @@ -668,6 +671,7 @@ telemetry: graphql_cache: true exclude_metrics: [] exclude_metric_labels: [] + streams: true exclude_scope_info: false ``` diff --git a/docs/router/metrics-and-monitoring.mdx b/docs/router/metrics-and-monitoring.mdx index 1a01434b..3d8c3e5e 100644 --- a/docs/router/metrics-and-monitoring.mdx +++ b/docs/router/metrics-and-monitoring.mdx @@ -214,6 +214,48 @@ telemetry: * `router.engine.messages.sent`: The number of total messages for subscriptions sent over from the subgraph to the router. 
+### EDFS Streams Metrics
+We expose metrics for EDFS streams. These statistics are collected at the point where a message is sent to the messaging backend or received directly from the messaging backend.
+
+```yaml config.yaml
+telemetry:
+  metrics:
+    otlp:
+      streams: true
+    prometheus:
+      streams: true
+```
+
+#### Metrics
+
+* `router.streams.sent.messages`: Counter of messages produced/sent to the messaging backend.
+* `router.streams.received.messages`: Counter of messages consumed/received directly by the router from the messaging backend.
+
+The following attributes are attached to both metrics:
+
+* `wg.stream.operation.name`:
+The operation used to send or receive a message. This is useful to differentiate between operations when an EDFS adapter has multiple ways of sending messages, as in the case of NATS with `publish` and `request`.
+The following values are possible, depending on the messaging backend:
+
+  - nats: `publish`, `request`, `receive`
+
+  - kafka: `produce`, `receive`
+
+  - redis: `publish`, `receive`
+
+* `wg.provider.type`:
+One of the supported EDFS provider types: `kafka`, `nats`, or `redis`.
+
+* `wg.destination.name`:
+The name of the destination on the messaging backend (topic, queue, etc.).
+
+* `wg.provider.id`:
+The provider ID as specified in the router configuration.
+
+* `wg.error.type`:
+A generic error type indicating that an error occurred. This is currently available only for `router.streams.sent.messages`, as we do not catch message receive errors.
+
+
 ### Connection Metrics

 We also provide lower level metrics which helps track connection and pool metrics. By utilizing these metrics users can figure out when the router's connection pool is full and when connections are misbehaving by for example observing spikes in time to acquire connections.
This can be enabled for Open Telemetry or Prometheus via

@@ -395,6 +437,28 @@ Here you can see a few example queries to query useful information about your cl

   router_http_client_connection_acquire_duration_bucket{wg_http_client_reused_connection="true",le="25.0"}
   ```
+
+
+  There are only two EDFS stream metrics, so to make sense of your data you need to filter by the attributes. The following examples give you a basic idea of how to use these two metrics.
+
+  #### Get failed publishes for a message broker
+  Let's say we want to see any failed publishes to our Kafka broker. We can use the following query:
+  ```
+  router_streams_sent_messages_total{wg_provider_type="kafka",wg_provider_id="provider-id",wg_destination_name="employeeUpdated",wg_error_type!=""}
+  ```
+  If we are using a provider like NATS, which can publish to the same destination through different methods, we can also filter by the stream operation name:
+  ```
+  router_streams_sent_messages_total{wg_provider_type="nats",wg_destination_name="employeeUpdatedMyNats.1",wg_error_type!="",wg_stream_operation_name="publish"}
+  ```
+
+  #### Look at received message throughput
+  You can use the `rate` function to see the per-second average rate of messages received over the last 60 minutes:
+  ```
+  rate(router_streams_received_messages_total{wg_provider_type="kafka"}[60m])
+  ```
+  With proper attribute filtering, as demonstrated above, you can narrow this down to a specific messaging backend or destination.
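+
+  #### Compare publish throughput across destinations
+  As a further sketch (the provider type, time window, and label values here are illustrative, not prescriptive), you can aggregate the sent counter by `wg_destination_name` to compare publish rates per destination:
+  ```
+  sum by (wg_destination_name) (rate(router_streams_sent_messages_total{wg_provider_type="kafka"}[5m]))
+  ```
+  The same aggregation works for `router_streams_received_messages_total` if you want to compare consumption instead of production.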
+
+
 ## Summary

diff --git a/docs/router/metrics-and-monitoring/prometheus-metric-reference.mdx b/docs/router/metrics-and-monitoring/prometheus-metric-reference.mdx
index 94725751..76424243 100644
--- a/docs/router/metrics-and-monitoring/prometheus-metric-reference.mdx
+++ b/docs/router/metrics-and-monitoring/prometheus-metric-reference.mdx
@@ -122,6 +122,50 @@ telemetry:

 * [`router_engine_messages_sent_total`](#router-engine-messages-sent-total): The number of total messages for subscriptions sent over from the subgraph to the router.

+### EDFS Streams Metrics
+We expose metrics for EDFS streams. These statistics are collected at the point where a message is sent to the messaging backend or received directly from the messaging backend.
+
+```yaml config.yaml
+telemetry:
+  metrics:
+    prometheus:
+      streams: true
+```
+
+#### Metrics
+
+* `router_streams_sent_messages_total`: Counter of messages produced/sent to the messaging backend.
+* `router_streams_received_messages_total`: Counter of messages consumed/received directly by the router from the messaging backend.
+
+The following attributes are attached to both metrics:
+
+* `wg_stream_operation_name`:
+The operation used to send or receive a message. This is useful to differentiate between operations when an EDFS adapter has multiple ways of sending messages, as in the case of NATS with `publish` and `request`.
+The following values are possible, depending on the messaging backend:
+
+  - nats: `publish`, `request`, `receive`
+
+  - kafka: `produce`, `receive`
+
+  - redis: `publish`, `receive`
+
+* `wg_provider_type`:
+One of the supported EDFS provider types: `kafka`, `nats`, or `redis`.
+
+* `wg_destination_name`:
+The name of the destination on the messaging backend (topic, queue, etc.).
+
+* `wg_provider_id`:
+The provider ID as specified in the router configuration.
+
+* `wg_error_type`:
+A generic error type indicating that an error occurred.
This is currently available only for `router_streams_sent_messages_total`, as we do not catch message receive errors.
+
+#### Examples
+
+You can find more examples under [`Streams`](/router/metrics-and-monitoring#example-prometheus-queries).
+
+
 ### Connection Metrics

 These metrics provide lower level metrics which helps track connection and pool metrics. By utilizing these metrics users can figure out when the router's connection pool is full and when connections are misbehaving by for example observing spikes in time to acquire connections.

@@ -139,9 +183,9 @@ telemetry:

 #### Metrics

-* `router_http_client_connection_max`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.
+* `router_http_client_max_connections`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.

-* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.
+* `router_http_client_active_connections`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.

 * `router_http_client_connection_acquire_duration`: The duration in ms that a connection took to be initialized, which includes all of DNS, TLS Handshakes, and Dialing the host.

@@ -159,7 +203,7 @@ These metrics help monitor application memory usage, concurrency, and garbage co

 * **Inefficient Allocation:** A mismatch between allocated and used memory may indicate suboptimal memory usage patterns.
-* [`go_memstats_heap_alloc_bytes`](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_heap_alloc_bytes): Number of heap bytes allocated and still in use across all instances. Focuses on heap memory usage for efficient memory management. The value is same as [**go\_memstats\_alloc\_bytes**](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_alloc_bytes).
+* [`go_memstats_heap_alloc_bytes`](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_heap_alloc_bytes): Number of heap bytes allocated and still in use across all instances. Focuses on heap memory usage for efficient memory management. The value is the same as [**go_memstats_alloc_bytes**](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_alloc_bytes).

 * **Heap Saturation Risk:** High heap memory usage can lead to increased garbage collection frequency and performance degradation.