Merged
3 changes: 2 additions & 1 deletion docs/docs.json
@@ -461,7 +461,8 @@
"pages": [
"router/traffic-shaping",
"router/traffic-shaping/retry",
"router/traffic-shaping/timeout"
"router/traffic-shaping/timeout",
"router/traffic-shaping/circuit-breaker"
]
},
"router/storage-providers",
38 changes: 38 additions & 0 deletions docs/router/configuration.mdx
@@ -532,6 +532,8 @@ telemetry:
| METRICS_OTLP_GRAPHQL_CACHE | graphql_cache | <Icon icon="square" /> | Enable the collection of metrics for the GraphQL operation router caches. | false |
| METRICS_OTLP_EXCLUDE_METRICS | exclude_metrics | <Icon icon="square" /> | The metrics to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] |
| METRICS_OTLP_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | The metric labels to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] |
| METRICS_OTLP_CONNECTION_STATS | connection_stats | <Icon icon="square" /> | Enable connection metrics. | false |
| METRICS_OTLP_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Enable circuit breaker metrics for OTEL. | false |

### Attributes

@@ -575,9 +577,11 @@ telemetry:
| PROMETHEUS_HTTP_PATH | path | <Icon icon="square" /> | The HTTP path where metrics are exposed. | "/metrics" |
| PROMETHEUS_LISTEN_ADDR | listen_addr | <Icon icon="square" /> | The address to listen on for the prometheus metrics endpoint. | "127.0.0.1:8088" |
| PROMETHEUS_GRAPHQL_CACHE | graphql_cache | <Icon icon="square" /> | Enable the collection of metrics for the GraphQL operation router caches. | false |
| PROMETHEUS_CONNECTION_STATS | connection_stats | <Icon icon="square" /> | Enable connection metrics. | false |
| PROMETHEUS_EXCLUDE_METRICS | exclude_metrics | <Icon icon="square" /> | | |
| PROMETHEUS_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | | |
| PROMETHEUS_EXCLUDE_SCOPE_INFO | exclude_scope_info | <Icon icon="square" /> | Exclude scope info from Prometheus metrics. | false |
| PROMETHEUS_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Enable circuit breaker metrics for Prometheus. | false |

### Example YAML config:

@@ -1127,6 +1131,19 @@ traffic_shaping:
      max_attempts: 5
      interval: 3s
      max_duration: 10s
    # Circuit Breaker
    circuit_breaker:
      enabled: true
      request_threshold: 20
      error_threshold_percentage: 50
      sleep_window: 30s
      half_open_attempts: 5
      required_successful: 3
      rolling_duration: 60s
      num_buckets: 10
      execution_timeout: 60s
      max_concurrent_requests: -1

  subgraphs: # allows you to create subgraph specific traffic shaping rules
    products: # Will only affect this subgraph, and override the options in "all" for that subgraph
      request_timeout: 60s
@@ -1139,6 +1156,9 @@ traffic_shaping:
      max_idle_conns: 1024
      max_conns_per_host: 100
      max_idle_conns_per_host: 20
      # Circuit breakers can also be configured per subgraph, using the same options as above
      circuit_breaker:
        enabled: false
```

### Subgraph Request Rules
@@ -1148,6 +1168,7 @@ These rules apply to requests being made from the Router to all Subgraphs.
| Environment Variable | YAML | Required | Description | Default Value |
| -------------------- | ------------------------- | --------------------------------------------- | ----------------------------------------------------------------------------------- | ------------- |
| | retry | <Icon icon="square" /> | [#traffic-shaping-jitter-retry](/router/configuration#traffic-shaping-jitter-retry) | |
| | circuit_breaker | <Icon icon="square" /> | [#circuit-breaker](/router/configuration#circuit-breaker) | |
| | request_timeout | <Icon icon="square-check" iconType="solid" /> | | 60s |
| | dial_timeout | <Icon icon="square" /> | | 30s |
| | response_header_timeout | <Icon icon="square" /> | | 0s |
@@ -1175,6 +1196,23 @@ In addition to the general traffic shaping rules, we also allow users to set sub
| | max_conns_per_host | <Icon icon="square" /> | | 100 |
| | max_idle_conns_per_host | <Icon icon="square" /> | | 20 |

### Circuit Breaker

Configure the circuit breaker either for all subgraphs or for a specific subgraph. For more information, see the [circuit breaker documentation](/router/traffic-shaping/circuit-breaker).

| Environment Variable | YAML | Required | Description | Default Value |
| -------------------- | ----------------------------| --------------------------------------------- | -------------- | -------------- |
| | enabled | <Icon icon="square" /> | Enable the circuit breaker for the target (all subgraphs or a specific subgraph). | false |
| | error_threshold_percentage | <Icon icon="square" /> | Percentage of failed requests (in the rolling window) required to open the circuit. | 50 |
| | request_threshold | <Icon icon="square" /> | Minimum number of requests (in the rolling window) before the circuit breaker evaluates the error rate. | 20 |
| | sleep_window | <Icon icon="square" /> | How long the circuit remains open before allowing test requests (e.g., `"30s"`). | 5s |
| | half_open_attempts | <Icon icon="square" /> | Number of test requests allowed in the half-open state. | 1 |
| | required_successful | <Icon icon="square" /> | Number of successful test requests required to close the circuit. | 1 |
| | rolling_duration | <Icon icon="square" /> | Time window for measuring requests and errors (e.g., `"60s"`). | 10s |
| | num_buckets | <Icon icon="square" /> | Number of buckets for statistics in the rolling window (higher = finer granularity). | 10 |
| | execution_timeout | <Icon icon="square" /> | Maximum execution duration for a request; exceeding it records an error for the circuit breaker. | 60s |
| | max_concurrent_requests | <Icon icon="square" /> | The maximum number of concurrent requests the circuit breaker can handle (`-1` disables the limit). | -1 |
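
To make the interaction between `rolling_duration`, `num_buckets`, `request_threshold`, and `error_threshold_percentage` concrete, here is a minimal Python sketch. It is not the router's implementation: the `RollingWindow` class, the `should_open` function, and the choice of `>=` for the percentage comparison are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class RollingWindow:
    """Bucketed request/error counters (hypothetical, for illustration only)."""

    rolling_duration: float = 60.0  # seconds
    num_buckets: int = 10
    # Maps bucket index -> [request_count, error_count].
    buckets: dict = field(default_factory=dict)

    @property
    def bucket_width(self) -> float:
        # With the values above, each bucket covers 6 seconds.
        return self.rolling_duration / self.num_buckets

    def record(self, now: float, failed: bool) -> None:
        idx = int(now // self.bucket_width)
        counts = self.buckets.setdefault(idx, [0, 0])
        counts[0] += 1
        counts[1] += int(failed)

    def totals(self, now: float) -> tuple:
        newest = int(now // self.bucket_width)
        oldest = newest - self.num_buckets + 1
        # Drop buckets that have fallen out of the rolling window.
        self.buckets = {i: c for i, c in self.buckets.items() if i >= oldest}
        requests = sum(c[0] for c in self.buckets.values())
        errors = sum(c[1] for c in self.buckets.values())
        return requests, errors


def should_open(window: RollingWindow, now: float,
                request_threshold: int = 20,
                error_threshold_percentage: int = 50) -> bool:
    """Return True when the window holds enough traffic and enough failures."""
    requests, errors = window.totals(now)
    if requests < request_threshold:
        # Too few requests to judge the error rate.
        return False
    # Assumption: the circuit opens when the error rate reaches the threshold.
    return errors * 100 >= error_threshold_percentage * requests
```

In this sketch, more buckets mean old requests expire from the window in smaller steps, which is the finer granularity that `num_buckets` refers to.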

### Jitter Retry

| Environment Variable | YAML | Required | Description | Default Value |
18 changes: 18 additions & 0 deletions docs/router/metrics-and-monitoring.mdx
@@ -49,6 +49,24 @@ Additionally, we expose the following metric:
* `wg_router_version`: The version of the router that is running
* `wg_feature_flag`: (Optional) The name of the feature flag if this is a feature flag configuration

#### Circuit Breaker specific metrics

We currently support two metrics for monitoring circuit breakers. To enable them, set at least one of the following:
```yaml
telemetry:
  metrics:
    otlp:
      circuit_breaker: true
    prometheus:
      circuit_breaker: true
```

All of the metrics below carry the `wg.subgraph.name` dimension. Because a circuit breaker can be shared across subgraphs that have the same routing URL, the dimension is a string slice rather than a string.

* `router.circuit_breaker.state`: This indicates the current state of a circuit: `0` represents closed, and `1` represents open.
* `router.circuit_breaker.short_circuits`: This indicates how many requests for this circuit have failed without even being processed, because the circuit was open.
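
To illustrate how these two metrics relate, the following Python sketch models a breaker that exposes a `state` value and a `short_circuits` counter. It is a simplified model, not the router's code. The `CircuitBreakerSketch` class and its methods are hypothetical, and two behaviors are assumptions: that the state stays `1` while test requests are in flight, and that a failed test request reopens the circuit and restarts the sleep window.

```python
class CircuitBreakerSketch:
    """Hypothetical breaker model showing when each metric changes."""

    def __init__(self, sleep_window=30.0, half_open_attempts=5,
                 required_successful=3):
        self.sleep_window = sleep_window
        self.half_open_attempts = half_open_attempts
        self.required_successful = required_successful
        self.opened_at = None    # None means the circuit is closed
        self.probes_started = 0  # test requests let through while open
        self.probe_successes = 0
        self.short_circuits = 0  # mirrors router.circuit_breaker.short_circuits

    @property
    def state(self) -> int:
        # Mirrors router.circuit_breaker.state: 0 = closed, 1 = open.
        return 0 if self.opened_at is None else 1

    def trip(self, now: float) -> None:
        """Open the circuit and reset the test request bookkeeping."""
        self.opened_at = now
        self.probes_started = 0
        self.probe_successes = 0

    def allow(self, now: float) -> bool:
        """Decide whether a request may be sent to the subgraph."""
        if self.opened_at is None:
            return True
        slept = now - self.opened_at >= self.sleep_window
        if slept and self.probes_started < self.half_open_attempts:
            self.probes_started += 1
            return True
        # Rejected without being processed: count a short circuit.
        self.short_circuits += 1
        return False

    def record_probe(self, now: float, ok: bool) -> None:
        """Record the outcome of a test request made while the circuit is open."""
        if self.opened_at is None:
            return
        if not ok:
            # Assumption: a failed test request restarts the sleep window.
            self.trip(now)
            return
        self.probe_successes += 1
        if self.probe_successes >= self.required_successful:
            self.opened_at = None  # close the circuit
```

In this model the counter only grows while `state` is `1`, so a rising `short_circuits` value alongside a long-lived open state points at a subgraph that is not recovering.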


#### GraphQL specific metrics

`router.graphql.operation.planning_time`: Time taken to plan the operation. An additional attribute `wg.engine.plan_cache_hit` indicates if the plan was served from the cache.
@@ -43,6 +43,15 @@ These metrics ensure efficient request handling, operation planning, and system

* The value will always be 1.

#### Circuit Breaker Metrics

* `router_circuit_breaker_state`: Info metric that provides information on the current state of the circuit breaker.

* This indicates the current state of a circuit: `0` represents closed, and `1` represents open.

* `router_circuit_breaker_short_circuits`: Metric for the number of short-circuited requests.

* This indicates how many requests for this circuit have failed without even being processed because the circuit was open.

### GraphQL Operation Cache Metrics

@@ -106,7 +115,7 @@ telemetry:

* `router_http_client_connection_max`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.

* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While its less common, multiple subgraphs can share the same host, which is why both dimensions are included.
* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.

* `router_http_client_connection_acquire_duration`: The duration in ms that a connection took to be initialized, which includes all of DNS, TLS Handshakes, and Dialing the host.

@@ -383,9 +392,51 @@ This means that we can assume for the last N (in this case 20) seconds that ther
**Reason for Monitoring:**

1. **Router Change Detection:** Monitoring whenever a new router execution configuration was pushed to the router.

2. **Uptime Detection:** Whenever the router is running the `router_info` metric will be available. This can be used to detect whenever the router is down.

## `router_circuit_breaker_state`

**Description:**

Indicates the current state of the circuit breaker for a subgraph. `0` means the circuit is closed (requests are allowed), and `1` means the circuit is open (requests are blocked). Includes the `wg_subgraph` and `wg_feature_flag` dimensions for granular monitoring.

**Example PromQL Query:**

```bash
max by(wg_subgraph) (router_circuit_breaker_state)
```

**Reason for Monitoring:**

- Alert when a subgraph's circuit breaker is open (value is `1`).
- Track the health of subgraphs and feature-flagged variants.

**Error Cases Addressed:**

* Subgraph is unhealthy or experiencing repeated failures.
* Circuit breaker is open for extended periods.

## `router_circuit_breaker_short_circuits`

**Description:**

Counts how many requests have been immediately failed (short-circuited) because the circuit was open. Includes the `wg_subgraph` and `wg_feature_flag` dimensions for granular monitoring.

**Example PromQL Query:**

```bash
increase(router_circuit_breaker_short_circuits[5m])
```

**Reason for Monitoring:**

- Alert if many requests are being short-circuited, indicating persistent subgraph issues.
- Understand the impact of circuit breaker activity on request flow over time.

**Error Cases Addressed:**

* High rate of short-circuited requests due to subgraph instability.
* Circuit breaker is frequently protecting the system from cascading failures.

## `go_memstats_alloc_bytes`
