Added flag to explicitly enable zone-awareness replication and added store-gateway support (#3200)
* Added flag to explicitly enable zone-awareness replication and added store-gateway support
Signed-off-by: Marco Pracucci <[email protected]>
* Update docs/blocks-storage/store-gateway.template
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update docs/guides/zone-replication.md
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Addressed review comments
Signed-off-by: Marco Pracucci <[email protected]>
* Improved error message when there are not enough healthy instances for the replication set
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
CHANGELOG.md (2 additions & 0 deletions)
@@ -7,11 +7,13 @@
   * `-experimental.distributor.user-subring-size` flag renamed to `-distributor.ingestion-tenant-shard-size`
   * `user_subring_size` limit YAML config option renamed to `ingestion_tenant_shard_size`
 * [CHANGE] Dropped "blank Alertmanager configuration; using fallback" message from Info to Debug level. #3205
+* [CHANGE] Zone-awareness replication for time-series now should be explicitly enabled in the distributor via the `-distributor.zone-awareness-enabled` CLI flag (or its respective YAML config option). Before, zone-aware replication was implicitly enabled if a zone was set on ingesters. #3200
 * [FEATURE] Added support for shuffle-sharding queriers in the query-frontend. When configured (`-frontend.max-queriers-per-user` globally, or using per-user limit `max_queriers_per_user`), each user's requests will be handled by different set of queriers. #3113
 * [ENHANCEMENT] Added `cortex_query_frontend_connected_clients` metric to show the number of workers currently connected to the frontend. #3207
 * [ENHANCEMENT] Shuffle sharding: improved shuffle sharding in the write path. Shuffle sharding now should be explicitly enabled via `-distributor.sharding-strategy` CLI flag (or its respective YAML config option) and guarantees stability, consistency, shuffling and balanced zone-awareness properties. #3090
 * [ENHANCEMENT] Ingester: added new metric `cortex_ingester_active_series` to track active series more accurately. Also added options to control whether active series tracking is enabled (`-ingester.active-series-enabled`, defaults to false), and how often this metric is updated (`-ingester.active-series-update-period`) and max idle time for series to be considered inactive (`-ingester.active-series-idle-timeout`). #3153
 * [ENHANCEMENT] Blocksconvert – Builder: download plan file locally before processing it. #3209
+* [ENHANCEMENT] Store-gateway: added zone-aware replication support to blocks replication in the store-gateway. #3200
 * [BUGFIX] No-longer-needed ingester operations for queries triggered by queriers and rulers are now canceled. #3178
 * [BUGFIX] Ruler: directories in the configured `rules-path` will be removed on startup and shutdown in order to ensure they don't persist between runs. #3195
 * [BUGFIX] Handle hash-collisions in the query path. #3192
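The [CHANGE] entry for #3200 moves zone-awareness from implicit to explicit enablement. The following Go sketch contrasts the two behaviours. It is illustrative only: the `Ingester` type and both functions are hypothetical, and reading "a zone was set on ingesters" as "at least one ingester has a non-empty zone" is an assumption, not something the changelog states.

```go
package main

import "fmt"

// Ingester is a hypothetical, simplified view of a ring member.
type Ingester struct {
	ID   string
	Zone string // empty when no availability zone is configured
}

// implicitZoneAwareness models the behaviour before #3200 as described in the
// changelog: zone-awareness is on when a zone is set on ingesters.
// Assumption: "a zone is set" means at least one ingester has a non-empty zone.
func implicitZoneAwareness(ingesters []Ingester) bool {
	for _, ing := range ingesters {
		if ing.Zone != "" {
			return true
		}
	}
	return false
}

// explicitZoneAwareness models the behaviour after #3200: only the value of
// the operator-provided flag matters, whatever zones the ingesters report.
func explicitZoneAwareness(flagEnabled bool, _ []Ingester) bool {
	return flagEnabled
}

func main() {
	ingesters := []Ingester{
		{ID: "ingester-1", Zone: "zone-a"},
		{ID: "ingester-2", Zone: "zone-b"},
	}
	fmt.Println("implicit (before):", implicitZoneAwareness(ingesters))
	fmt.Println("explicit, flag unset:", explicitZoneAwareness(false, ingesters))
	fmt.Println("explicit, flag set:", explicitZoneAwareness(true, ingesters))
}
```

Under this reading, a cluster that already had zones set on its ingesters would stop replicating zone-aware after upgrading unless `-distributor.zone-awareness-enabled` is also set.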
docs/blocks-storage/store-gateway.md (20 additions & 0 deletions)
@@ -56,6 +56,16 @@ To protect from this, when an healthy store-gateway instance finds another insta

 This feature is called **auto-forget** and is built into the store-gateway.

+### Zone-awareness
+
+The store-gateway replication optionally supports [zone-awareness](../guides/zone-replication.md). When zone-aware replication is enabled and the blocks replication factor is > 1, each block is guaranteed to be replicated across store-gateway instances running in different availability zones.
+
+**To enable** the zone-aware replication for the store-gateways you should:
+
+1. Configure the availability zone for each store-gateway via the `-store-gateway.sharding-ring.instance-availability-zone` CLI flag (or its respective YAML config option)
+2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers.
+3. Rollout store-gateways, queriers and rulers to apply the new configuration
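To make the guarantee in the added section concrete, here is a minimal Go sketch of zone-aware replica selection. It is not the Cortex ring implementation: the `Instance` type, the `selectReplicas` function, and the use of plain slice order as a stand-in for ring order are all assumptions made for illustration.

```go
package main

import "fmt"

// Instance is a hypothetical store-gateway entry with its configured zone.
type Instance struct {
	ID   string
	Zone string
}

// selectReplicas walks candidates in order (a stand-in for ring order) and
// picks rf instances. When zoneAware is true, an instance is skipped if its
// zone already holds a replica, so the result spans rf distinct zones.
func selectReplicas(candidates []Instance, rf int, zoneAware bool) ([]Instance, error) {
	selected := make([]Instance, 0, rf)
	usedZones := make(map[string]bool)
	for _, c := range candidates {
		if len(selected) == rf {
			break
		}
		if zoneAware {
			if usedZones[c.Zone] {
				continue
			}
			usedZones[c.Zone] = true
		}
		selected = append(selected, c)
	}
	if len(selected) < rf {
		return nil, fmt.Errorf("found %d eligible instances, need %d", len(selected), rf)
	}
	return selected, nil
}

func main() {
	gateways := []Instance{
		{ID: "store-gateway-1", Zone: "zone-a"},
		{ID: "store-gateway-2", Zone: "zone-a"},
		{ID: "store-gateway-3", Zone: "zone-b"},
		{ID: "store-gateway-4", Zone: "zone-c"},
	}
	replicas, err := selectReplicas(gateways, 3, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range replicas {
		fmt.Println(r.ID, r.Zone)
	}
}
```

With zone-awareness off, the same call would simply take the first three candidates, two of which share `zone-a`.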
docs/blocks-storage/store-gateway.template (10 additions & 0 deletions)
@@ -56,6 +56,16 @@ To protect from this, when an healthy store-gateway instance finds another insta

 This feature is called **auto-forget** and is built into the store-gateway.

+### Zone-awareness
+
+The store-gateway replication optionally supports [zone-awareness](../guides/zone-replication.md). When zone-aware replication is enabled and the blocks replication factor is > 1, each block is guaranteed to be replicated across store-gateway instances running in different availability zones.
+
+**To enable** the zone-aware replication for the store-gateways you should:
+
+1. Configure the availability zone for each store-gateway via the `-store-gateway.sharding-ring.instance-availability-zone` CLI flag (or its respective YAML config option)
+2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers.
+3. Rollout store-gateways, queriers and rulers to apply the new configuration
docs/guides/zone-replication.md (27 additions & 14 deletions)
@@ -5,26 +5,39 @@ weight: 5
 slug: zone-aware-replication
 ---

-In a default configuration, time-series written to ingesters are replicated based on the container/pod name of the ingester instances. It is completely possible that all the replicas for the given time-series are held with in the same availability zone, even if the cortex infrastructure spans multiple zones within the region. Storing multiple replicas for a given time-series poses a risk for data loss if there is an outage affecting various nodes within a zone or a total outage.
+Cortex supports data replication for different services. By default, data is transparently replicated across the whole pool of service instances, regardless of whether these instances are all running within the same availability zone (or data center, or rack) or in different ones.

-## Configuration
+It is completely possible that all the replicas for the given data are held within the same availability zone, even if the Cortex cluster spans multiple zones. Storing multiple replicas for a given data within the same availability zone poses a risk for data loss if there is an outage affecting various nodes within a zone or a full zone outage.

-Cortex can be configured to consider an availability zone value in its replication system. Doing so mitigates risks associated with losing multiple nodes within the same availability zone. The availability zone for an ingester can be defined on the command line of the ingester using the `ingester.availability-zone` flag or using the yaml configuration:
+For this reason, Cortex optionally supports zone-aware replication. When zone-aware replication is **enabled**, replicas for the given data are guaranteed to span across different availability zones. This requires Cortex cluster to run at least in a number of zones equal to the configured replication factor.

-```yaml
-ingester:
-  lifecycler:
-    availability_zone: "zone-3"
-```
+The Cortex services supporting **zone-aware replication** are:

-## Zone Replication Considerations
+- **[Distributors and Ingesters](#distributors-and-ingesters-time-series-replication)**
-Enabling availability zone awareness helps mitigate risks regarding data loss within a single zone, some items need consideration by an operator if they are thinking of enabling this feature.
+The Cortex time-series replication is used to hold multiple (typically 3) replicas of each time series in the **ingesters**.

-For cortex to function correctly, there must be at least the same number of availability zones as there is replica count. So by default, a cortex cluster should be spread over 3 zones as the default replica count is 3. It is safe to have more zones than the replica count, but it cannot be less. Having fewer availability zones than replica count causes a replica write to be missed, and in some cases, the write fails if the availability zone count is too low.
+**To enable** the zone-aware replication for the ingesters you should:

-### Cost
+1. Configure the availability zone for each ingester via the `-ingester.availability-zone` CLI flag (or its respective YAML config option)
+2. Rollout ingesters to apply the configured zone
+3. Enable time-series zone-aware replication via the `-distributor.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to distributors, queriers and rulers.

-Depending on the existing cortex infrastructure being used, this may cause an increase in running costs as most cloud providers charge for cross availability zone traffic. The most significant change would be for a cortex cluster currently running in a singular zone.
+## Store-gateways: blocks replication
+
+The Cortex [store-gateway](../blocks-storage/store-gateway.md) (used only when Cortex is running with the [blocks storage](../blocks-storage/_index.md)) supports blocks sharding, used to horizontally scale blocks in a large cluster without hitting any vertical scalability limit.
+
+To enable the zone-aware replication for the store-gateways, please refer to the [store-gateway](../blocks-storage/store-gateway.md#zone-awareness) documentation.
+
+## Minimum number of zones
+
+For Cortex to function correctly, there must be at least the same number of availability zones as the replication factor. For example, if the replication factor is configured to 3 (default for time-series replication), the Cortex cluster should be spread at least over 3 availability zones.
+
+It is safe to have more zones than the replication factor, but it cannot be less. Having fewer availability zones than replication factor causes a replica write to be missed, and in some cases, the write fails if the availability zones count is too low.
+
+## Impact on costs
+
+Depending on the underlying infrastructure being used, deploying Cortex across multiple availability zones may cause an increase in running costs as most cloud providers charge for inter availability zone networking. The most significant change would be for a Cortex cluster currently running in a single zone.
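The "Minimum number of zones" rule in the guide above can be expressed as a small pre-flight check. The Go sketch below is one possible way to do that. `distinctZones` and `checkZoneCount` are hypothetical helpers written for this example and are not part of Cortex.

```go
package main

import (
	"fmt"
	"sort"
)

// distinctZones returns the sorted set of non-empty zones in instanceZones.
func distinctZones(instanceZones []string) []string {
	seen := make(map[string]struct{})
	for _, z := range instanceZones {
		if z != "" {
			seen[z] = struct{}{}
		}
	}
	zones := make([]string, 0, len(seen))
	for z := range seen {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	return zones
}

// checkZoneCount returns an error when the instances span fewer distinct
// zones than the replication factor, which is the condition the guide warns
// about. More zones than the replication factor is accepted.
func checkZoneCount(instanceZones []string, replicationFactor int) error {
	zones := distinctZones(instanceZones)
	if len(zones) < replicationFactor {
		return fmt.Errorf("zone-aware replication needs at least %d zones, found %d: %v",
			replicationFactor, len(zones), zones)
	}
	return nil
}

func main() {
	// Three zones with a replication factor of 3: the check passes.
	fmt.Println(checkZoneCount([]string{"zone-a", "zone-b", "zone-c", "zone-a"}, 3))
	// Two zones with a replication factor of 3: the check fails.
	fmt.Println(checkZoneCount([]string{"zone-a", "zone-b"}, 3))
}
```

A check like this could run before rollout, using the zone each instance is configured with as input.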
 	f.StringVar(&cfg.Addr, prefix+"lifecycler.addr", "", "IP address to advertise in consul.")
 	f.IntVar(&cfg.Port, prefix+"lifecycler.port", 0, "port to advertise in consul (defaults to server.grpc-listen-port).")
 	f.StringVar(&cfg.ID, prefix+"lifecycler.ID", hostname, "ID to register into consul.")
-	f.StringVar(&cfg.Zone, prefix+"availability-zone", "", "The availability zone of the host, this instance is running on. Default is an empty string, which disables zone awareness for writes.")
+	f.StringVar(&cfg.Zone, prefix+"availability-zone", "", "The availability zone where this instance is running.")
 }

 // Lifecycler is responsible for managing the lifecycle of entries in the ring.
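For readers unfamiliar with the prefix-based flag registration used in the hunk above, here is a self-contained sketch built on Go's standard `flag` package. The `zoneConfig` type, the `registerFlagsWithPrefix` method, and `parseZone` are hypothetical stand-ins written for this example. Only the flag name suffix and help text are taken from the diff, and the `ingester.` prefix is assumed from the `-ingester.availability-zone` flag mentioned in the docs.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// zoneConfig is a hypothetical, trimmed-down config holding only the zone.
type zoneConfig struct {
	Zone string
}

// registerFlagsWithPrefix registers the zone flag under the given prefix,
// mirroring the pattern shown in the diff. An empty default means "no zone".
func (cfg *zoneConfig) registerFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.StringVar(&cfg.Zone, prefix+"availability-zone", "", "The availability zone where this instance is running.")
}

// parseZone parses args with the assumed "ingester." prefix and returns the
// configured zone.
func parseZone(args []string) (string, error) {
	var cfg zoneConfig
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	cfg.registerFlagsWithPrefix("ingester.", fs)
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return cfg.Zone, nil
}

func main() {
	zone, err := parseZone([]string{"-ingester.availability-zone=zone-a"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("configured zone:", zone)
}
```

After this change the flag only records where the instance runs. Whether replication honours the zone is decided separately by the zone-awareness flags described in the changelog.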