4 changes: 4 additions & 0 deletions docs/router/configuration.mdx
@@ -602,6 +602,7 @@ This option may change or be removed in future versions as the OpenTelemetry SDK
| METRICS_OTLP_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | The metric labels to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] |
| METRICS_OTLP_CONNECTION_STATS | connection_stats | <Icon icon="square" /> | Enable connection metrics. | false |
| METRICS_OTLP_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Ensure that circuit breaker metrics are enabled for OTEL. | false |
| METRICS_OTLP_STREAM | streams | <Icon icon="square" /> | Enable EDFS stream metrics. | false |

### Attributes

@@ -627,6 +628,7 @@ telemetry:
graphql_cache: true
exclude_metrics: []
exclude_metric_labels: []
streams: true
attributes:
- key: "x-new-attribute"
default: "foo"
@@ -651,6 +653,7 @@ telemetry:
| PROMETHEUS_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | | |
| PROMETHEUS_EXCLUDE_SCOPE_INFO | exclude_scope_info | <Icon icon="square" /> | Exclude scope info from Prometheus metrics. | false |
| PROMETHEUS_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Enable the circuit breaker metrics for prometheus metric collection. | false |
| PROMETHEUS_OTLP_STREAM | streams | <Icon icon="square" /> | Enable EDFS stream metrics. | false |

### Example YAML config:

@@ -668,6 +671,7 @@ telemetry:
graphql_cache: true
exclude_metrics: []
exclude_metric_labels: []
streams: true
exclude_scope_info: false
```

64 changes: 64 additions & 0 deletions docs/router/metrics-and-monitoring.mdx
@@ -214,6 +214,48 @@ telemetry:
* `router.engine.messages.sent`: The total number of subscription messages sent from the subgraph to the router.


### EDFS Streams Metrics
We expose metrics for EDFS streams. These statistics are collected at the point where a message is sent to the messaging backend or received directly from it.

```yaml config.yaml
telemetry:
metrics:
otlp:
streams: true
prometheus:
streams: true
```

#### Metrics

* `router.streams.sent.messages`: Counter of messages produced/sent to the messaging backend.
* `router.streams.received.messages`: Counter of messages consumed/received directly by the router from the messaging backend.

The following attributes are attached to both metrics:

* `wg.stream.operation.name`:
The operation type used to send a message to the messaging backend. This is useful for differentiating when an EDFS adapter has multiple ways of sending messages, as NATS does with `publish` and `request`.
The following values are possible, depending on the messaging backend:

- nats: `publish`, `request`, `receive`

- kafka: `produce`, `receive`

- redis: `publish`, `receive`

* `wg.provider.type`:
One of the supported EDFS provider types: `kafka`, `nats`, or `redis`.

* `wg.destination.name`:
The name of the destination on the messaging backend (topic, queue, etc.).

* `wg.provider.id`:
The provider ID as specified in the router configuration.

* `wg.error.type`:
A generic error type indicating that an error occurred. At the moment this is available only for `router.streams.sent.messages`, as message receive errors are not captured.


### Connection Metrics

We also provide lower-level metrics that help track connection and pool behavior. By utilizing these metrics, users can figure out when the router's connection pool is full and when connections are misbehaving, for example by observing spikes in the time to acquire connections. This can be enabled for OpenTelemetry or Prometheus via
@@ -395,6 +437,28 @@ Here you can see a few example queries to query useful information about your cl
router_http_client_connection_acquire_duration_bucket{wg_http_client_reused_connection="true",le="25.0"}
```
</Tab>

<Tab title="EDFS Streams">
EDFS streams expose only two metrics, so to make sense of your data you need to filter by their attributes. The following examples give you a basic idea of how to use them.

#### Get failed publishes for a message broker
Let's say we want to see any failed publishes to our Kafka broker. We can use the following query:
```
router_streams_sent_messages_total{wg_provider_type="kafka",wg_provider_id="provider-id",wg_destination_name="employeeUpdated",wg_error_type!=""}
```
If we are using a provider like NATS, which can publish to the same destination through different methods, we can also filter by the stream operation name:
```
router_streams_sent_messages_total{wg_provider_type="nats",wg_destination_name="employeeUpdatedMyNats.1",wg_error_type!="",wg_stream_operation_name="publish"}
```

#### Look at message receive throughput
You can use the `rate` function to understand the per-second average rate of messages received over the last 60 minutes:
```
rate(router_streams_received_messages_total{wg_provider_type="kafka"}[60m])
```
With proper attribute filtering, as demonstrated above, you can scope this to a specific messaging backend or destination.
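
As a minimal sketch, the same rate query can be scoped to a single destination; the provider ID and destination name below are placeholders, so substitute the values from your own router configuration.
```
rate(router_streams_received_messages_total{wg_provider_type="nats",wg_provider_id="my-nats",wg_destination_name="employeeUpdated"}[5m])
```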

</Tab>
</Tabs>

## Summary
@@ -122,6 +122,50 @@ telemetry:

* [`router_engine_messages_sent_total`](#router-engine-messages-sent-total): The total number of subscription messages sent from the subgraph to the router.

### EDFS Streams Metrics
We expose metrics for EDFS streams. These statistics are collected at the point where a message is sent to the messaging backend or received directly from it.

```yaml config.yaml
telemetry:
metrics:
prometheus:
streams: true
```

#### Metrics

* `router_streams_sent_messages_total`: Counter of messages produced/sent to the messaging backend.
* `router_streams_received_messages_total`: Counter of messages consumed/received directly by the router from the messaging backend.

The following attributes are attached to both metrics:

* `wg_stream_operation_name`:
The operation type used to send a message to the messaging backend. This is useful for differentiating when an EDFS adapter has multiple ways of sending messages, as NATS does with `publish` and `request`.
The following values are possible, depending on the messaging backend:

- nats: `publish`, `request`, `receive`

- kafka: `produce`, `receive`

- redis: `publish`, `receive`

* `wg_provider_type`:
One of the supported EDFS provider types: `kafka`, `nats`, or `redis`.

* `wg_destination_name`:
The name of the destination on the messaging backend (topic, queue, etc.).

* `wg_provider_id`:
The provider ID as specified in the router configuration.

* `wg_error_type`:
A generic error type indicating that an error occurred. At the moment this is available only for `router_streams_sent_messages_total`, as message receive errors are not captured.

#### Examples

You can find more examples under [`Streams`](/router/metrics-and-monitoring#example-prometheus-queries).
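
As a quick sketch, a query along the following lines (the Kafka provider and five-minute window are assumptions; adjust the label filters to your setup) counts failed publishes per destination:
```
sum by (wg_destination_name) (increase(router_streams_sent_messages_total{wg_provider_type="kafka",wg_error_type!=""}[5m]))
```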


### Connection Metrics

These lower-level metrics help track connection and pool behavior. By utilizing them, users can figure out when the router's connection pool is full and when connections are misbehaving, for example by observing spikes in the time to acquire connections.
@@ -139,9 +183,9 @@ telemetry:

#### Metrics

* `router_http_client_connection_max`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.
* `router_http_client_max_connections`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.

* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.
* `router_http_client_active_connections`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.

* `router_http_client_connection_acquire_duration`: The duration in ms that a connection took to be initialized, which includes DNS resolution, the TLS handshake, and dialing the host.
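
As a minimal sketch, the standard Prometheus `_bucket` series of this histogram can be turned into a 95th-percentile acquire time; the five-minute rate window is an assumption, so adjust it to your scrape interval.
```
histogram_quantile(0.95, sum by (le) (rate(router_http_client_connection_acquire_duration_bucket[5m])))
```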

@@ -159,7 +203,7 @@ These metrics help monitor application memory usage, concurrency, and garbage co

* **Inefficient Allocation:** A mismatch between allocated and used memory may indicate suboptimal memory usage patterns.

* [`go_memstats_heap_alloc_bytes`](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_heap_alloc_bytes): Number of heap bytes allocated and still in use across all instances. Focuses on heap memory usage for efficient memory management. The value is same as [**go\_memstats\_alloc\_bytes**](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_alloc_bytes).
* [`go_memstats_heap_alloc_bytes`](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_heap_alloc_bytes): Number of heap bytes allocated and still in use across all instances. Focuses on heap memory usage for efficient memory management. The value is same as [**go_memstats_alloc_bytes**](/router/metrics-and-monitoring/prometheus-metric-reference#go_memstats_alloc_bytes).

* **Heap Saturation Risk:** High heap memory usage can lead to increased garbage collection frequency and performance degradation.
