Merged
Changes from 12 commits
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,11 @@
* [CHANGE] Blocks storage: compactor is now required when running a Cortex cluster with the blocks storage, because it also keeps the bucket index updated. #3583
* [CHANGE] Blocks storage: block deletion marks are now also stored in a per-tenant global markers/ location, in addition to the block location. The compactor, at startup, will copy deletion marks from the block location to the global location. This migration is required only once, so you can safely disable it via `-compactor.block-deletion-marks-migration-enabled=false` once the new compactor has successfully started at least once in your cluster. #3583
* [ENHANCEMENT] Blocks storage: introduced a per-tenant bucket index, periodically updated by the compactor, used to avoid full bucket scanning done by queriers and store-gateways. The bucket index is updated by the compactor during blocks cleanup, on every `-compactor.cleanup-interval`. #3553 #3555 #3561 #3583
* [ENHANCEMENT] Blocks storage: introduced an option `-blocks-storage.bucket-store.bucket-index.enabled` to enable the usage of the bucket index in the querier. When enabled, the querier will use the bucket index to find a tenant's blocks instead of running the periodic bucket scan. The following new metrics have been added: #3614
* `cortex_bucket_index_loads_total`
* `cortex_bucket_index_load_failures_total`
* `cortex_bucket_index_load_duration_seconds`
* `cortex_bucket_index_loaded`
* [ENHANCEMENT] Compactor: exported the following metrics. #3583
* `cortex_bucket_blocks_count`: Total number of blocks per tenant in the bucket. Includes blocks marked for deletion.
* `cortex_bucket_blocks_marked_for_deletion_count`: Total number of blocks per tenant marked for deletion in the bucket.
2 changes: 1 addition & 1 deletion docs/blocks-storage/_index.md
@@ -29,7 +29,7 @@ When running the Cortex blocks storage, the Cortex architecture doesn't signific

The **[store-gateway](./store-gateway.md)** is responsible to query blocks and is used by the [querier](./querier.md) at query time. The store-gateway is required when running the blocks storage.

The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the bucket index updated and, for this reason, it's a required component.
The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the [bucket index](./bucket-index.md) updated and, for this reason, it's a required component.
Contributor

Is this going to cause issues for users running with compactor now (not sure if there are such people)?

Contributor

Also, bucket-index.md says the bucket index is optional which seems like it contradicts this statement.

Contributor Author

Is this going to cause issues for users running with compactor now (not sure if there are such people)?

Did you mean "with compactor" or "without compactor"? I'm not sure I understand the question.

Also, bucket-index.md says the bucket index is optional which seems like it contradicts this statement.

The compactor always writes the bucket index. The flag to enable the bucket index only controls whether it should be used in queriers (and in the store-gateway in the upcoming PR too). The point is that, before enabling the bucket index in the queriers and store-gateway, you have to roll out the compactor first, so that the bucket index for all tenants is created before you enable it in querier and store-gateway. To simplify it (at least I thought it would have simplified), I've kept it always enabled in the compactor.


Finally, the [**table-manager**](../chunks-storage/table-manager.md) and the [**schema config**](../chunks-storage/schema-config.md) are **not used** by the blocks storage.

42 changes: 42 additions & 0 deletions docs/blocks-storage/bucket-index.md
@@ -0,0 +1,42 @@
---
title: "Bucket Index"
linkTitle: "Bucket Index"
weight: 5
slug: bucket-index
---

The bucket index is a **per-tenant file containing the list of blocks and block deletion marks** in the storage. The bucket index itself is stored in the backend object storage, is periodically updated by the compactor and used by queriers to discover blocks in the storage.

The bucket index usage is **optional** and can be enabled via `-blocks-storage.bucket-store.bucket-index.enabled=true` (or its respective YAML config option).
Contributor

Explain why it is useful?

Contributor Author

Done. WDYT?


## Structure of the index

The `bucket-index.json.gz` contains:

- **`blocks`**<br />
List of complete blocks of a tenant, including blocks marked for deletion (partial blocks are excluded from the index).
- **`block_deletion_marks`**<br />
List of block deletion marks.
- **`updated_at`**<br />
Unix timestamp (seconds precision) of when the index was last updated (written to the storage).
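
As a rough illustration, a file with this structure could be decoded with a few lines of Python. This is a sketch, not Cortex code: only the three top-level keys come from the list above, while the fields inside each entry (`block_id`, `min_time`, `max_time`) and the helper names are hypothetical placeholders.

```python
import gzip
import json


def decode_bucket_index(raw: bytes) -> dict:
    """Decompress and parse the content of a bucket-index.json.gz file."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))


def index_age_seconds(index: dict, now: float) -> float:
    """Age of the index, based on its `updated_at` Unix timestamp (seconds)."""
    return now - index["updated_at"]


# Hypothetical sample content: the inner fields of each entry are assumptions.
sample = {
    "blocks": [
        {"block_id": "block-a", "min_time": 0, "max_time": 7200},
        {"block_id": "block-b", "min_time": 7200, "max_time": 14400},
    ],
    "block_deletion_marks": [{"block_id": "block-a"}],
    "updated_at": 1600000000,
}
raw = gzip.compress(json.dumps(sample).encode("utf-8"))

index = decode_bucket_index(raw)
# Blocks marked for deletion are still listed under `blocks`.
marked = {mark["block_id"] for mark in index["block_deletion_marks"]}
print(len(index["blocks"]), sorted(marked))
```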

## How it gets updated

The [compactor](./compactor.md) periodically scans the bucket and uploads an updated bucket index to the storage. The frequency at which the bucket index is updated can be configured via `-compactor.cleanup-interval`.

The bucket index is built and updated by the compactor even if `-blocks-storage.bucket-store.bucket-index.enabled` has **not** been enabled. This is intentional, and the overhead introduced by keeping the bucket index updated is not significant.
Contributor

This mostly answers my earlier question about non-optional compactor to create optional bucket index, but it still may be a bit unclear to someone new.

Contributor Author

I've updated the doc. Is it more clear now?


## How it's used by the querier

The [querier](./querier.md), at query time, checks whether the bucket index for the tenant has already been loaded in memory. If not, the querier downloads it from the storage and caches it in memory. Given it's a small file, downloading it lazily doesn't significantly impact the performance of the first query, but it allows a querier to get up and running without pre-downloading every tenant's bucket index.

While in-memory, a background process will keep it **updated at periodic intervals**, so that subsequent queries from the same tenant to the same querier instance will use the cached (and periodically updated) bucket index. There are two config options involved:

- `-blocks-storage.bucket-store.bucket-index.update-on-stale-interval`<br />
This option configures how frequently a cached bucket index should be refreshed.
- `-blocks-storage.bucket-store.bucket-index.update-on-error-interval`<br />
If downloading a bucket index fails, the failure is cached for a short time in order to avoid hammering the backend storage. This option configures how frequently a bucket index, which previously failed to load, should be tried to load again.

If a bucket index is unused for a long time (configurable via `-blocks-storage.bucket-store.bucket-index.idle-timeout`), e.g. because that querier instance is not receiving any query from the tenant, the querier will offload it and stop keeping it updated at regular intervals. This is particularly useful for tenants which are resharded to different queriers when [shuffle sharding](../guides/shuffle-sharding.md) is enabled.
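
The interplay between lazy loading and the three intervals above can be sketched as a small in-memory cache. This is a simplified model under stated assumptions, not the querier's actual implementation: the `BucketIndexCache` class, its method names and the `fetch` callback are hypothetical, time is passed in explicitly to keep the example deterministic, and a failed refresh simply replaces the cached entry with a cached failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class _Entry:
    index: Optional[dict]  # None when the last load attempt failed
    loaded_at: float       # time of the last load attempt
    used_at: float         # time of the last query using this entry


class BucketIndexCache:
    """Hypothetical per-tenant cache mirroring the intervals described above."""

    def __init__(self, fetch: Callable[[str], dict], stale_interval: float,
                 error_interval: float, idle_timeout: float):
        self._fetch = fetch
        self._stale_interval = stale_interval
        self._error_interval = error_interval
        self._idle_timeout = idle_timeout
        self._entries: Dict[str, _Entry] = {}

    def _load(self, tenant: str, now: float, used_at: float) -> _Entry:
        try:
            index = self._fetch(tenant)
        except Exception:
            index = None  # cache the failure to avoid hammering the storage
        entry = _Entry(index, now, used_at)
        self._entries[tenant] = entry
        return entry

    def get(self, tenant: str, now: float) -> Optional[dict]:
        """Called at query time: lazily loads the index on first use."""
        entry = self._entries.get(tenant)
        if entry is None:
            entry = self._load(tenant, now, now)
        entry.used_at = now
        return entry.index

    def refresh(self, now: float) -> None:
        """Called periodically by a background loop."""
        for tenant, entry in list(self._entries.items()):
            if now - entry.used_at >= self._idle_timeout:
                del self._entries[tenant]  # offload an unused index
                continue
            interval = (self._error_interval if entry.index is None
                        else self._stale_interval)
            if now - entry.loaded_at >= interval:
                self._load(tenant, now, entry.used_at)


calls = []


def fake_fetch(tenant: str) -> dict:
    calls.append(tenant)
    return {"blocks": [], "block_deletion_marks": [], "updated_at": 0}


cache = BucketIndexCache(fake_fetch, stale_interval=900,
                         error_interval=60, idle_timeout=3600)
cache.get("tenant-a", now=0)   # first query: downloads the index
cache.get("tenant-a", now=10)  # served from memory
cache.refresh(now=900)         # stale: downloads again
cache.refresh(now=4000)        # unused for longer than the timeout: offloaded
```

The values used in the example (900, 60 and 3600 seconds) match the documented defaults of 15m, 1m and 1h.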

Finally, the querier, at query time, checks how old a bucket index is (based on its `updated_at`) and fails the query if the index is older than `-blocks-storage.bucket-store.bucket-index.max-stale-period`. This circuit breaker is used to ensure queriers will not return any partial query results due to a stale view over the long-term storage.
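
The staleness check itself amounts to a simple comparison. A minimal sketch, assuming a hypothetical `check_index_freshness` helper and error type (neither name comes from Cortex), and assuming an index exactly as old as the limit is still accepted:

```python
class BucketIndexTooOldError(Exception):
    """Hypothetical error raised when the index exceeds the max stale period."""


def check_index_freshness(updated_at: int, now: float,
                          max_stale_period: float) -> None:
    """Fail the query if the bucket index is older than the allowed period.

    `updated_at` is the Unix timestamp (seconds) stored in the bucket index,
    `max_stale_period` is expressed in seconds.
    """
    age = now - updated_at
    if age > max_stale_period:
        raise BucketIndexTooOldError(
            f"bucket index is {age:.0f}s old, max allowed is {max_stale_period:.0f}s"
        )


# With the default max stale period of 1h, an index updated 30 minutes ago passes.
check_index_freshness(updated_at=1600000000, now=1600001800, max_stale_period=3600)
```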
Contributor

You have also added caching of index into caching-bucket. Is that worth mentioning as well?

Contributor Author

Yes, sure, done. WDYT?

Contributor

Isn't it possible to fall back to the behavior used when the bucket index is not enabled? Or is this undesirable for some reason?

Contributor Author

I don't think falling back is a viable solution. The "bucket scan" logic requires scanning the bucket upfront, which we don't do when the bucket index is enabled. Scanning the bucket lazily would be too slow at query time. Moreover, infrequently exercised fallback logic brings the further risk that it doesn't work as expected when it's needed.

In my opinion, when you run Cortex with bucket index, the bucket index is an essential part of the system and it's required to be kept updated. The compactor already exports a metric with the timestamp of the last time the bucket index of each tenant has been updated, so that we can alert on it before the max-stale-period is reached.

Contributor Author

What do you think?

2 changes: 1 addition & 1 deletion docs/blocks-storage/compactor.md
@@ -10,7 +10,7 @@ slug: compactor
The **compactor** is a service which is responsible to:

- Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.

The compactor is **stateless**.

2 changes: 1 addition & 1 deletion docs/blocks-storage/compactor.template
@@ -10,7 +10,7 @@ slug: compactor
The **compactor** is a service which is responsible to:

- Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.

The compactor is **stateless**.

58 changes: 55 additions & 3 deletions docs/blocks-storage/querier.md
@@ -13,12 +13,28 @@ The querier is **stateless**.

## How it works

At startup **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
The querier needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:

1. Periodically scanning the bucket (default)
2. Periodically downloading the [bucket index](./bucket-index.md)

### Bucket index disabled (default)

At startup, **queriers** iterate over the entire storage bucket to discover all tenants' blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.

While running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks. Queriers do **not** download any content from blocks except a small `meta.json` file containing the block's metadata (including the minimum and maximum timestamp of samples within the block).

Queriers use the metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the [store-gateway](./store-gateway.md) instances holding the required blocks.
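
To make the block selection concrete, here is a minimal sketch of filtering blocks by the time range of a query. It is an illustration only: the `BlockMeta` type and `blocks_for_query` function are hypothetical names, and treating the block's maximum timestamp as exclusive is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockMeta:
    block_id: str
    min_time: int  # minimum sample timestamp in the block (milliseconds)
    max_time: int  # maximum timestamp, treated as exclusive here (assumption)


def blocks_for_query(blocks: List[BlockMeta], start: int, end: int) -> List[BlockMeta]:
    """Return the blocks whose time range overlaps the query range [start, end]."""
    return [b for b in blocks if b.min_time <= end and b.max_time > start]


blocks = [
    BlockMeta("block-a", 0, 7_200_000),
    BlockMeta("block-b", 7_200_000, 14_400_000),
    BlockMeta("block-c", 14_400_000, 21_600_000),
]

# A query spanning the boundary between the first two blocks needs both of them.
selected = blocks_for_query(blocks, start=7_000_000, end=8_000_000)
print([b.block_id for b in selected])
```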

### Bucket index enabled

When [bucket index](./bucket-index.md) is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during the query execution to find the set of blocks that need to be queried for the given query.

Given the bucket index removes the need to scan the bucket, it brings a few benefits:

1. The querier is expected to be ready shortly after startup.
2. Lower volume of API calls to object storage.

### Anatomy of a query request

When a querier receives a query range request, it contains the following parameters:
@@ -60,6 +76,7 @@ Caching is optional, but **highly recommended** in a production environment. Ple
- List of blocks per tenant
- Block's `meta.json` content
- Block's `deletion-mark.json` existence and content
- Tenant's `bucket-index.json.gz` content

Using the metadata cache can significantly reduce the number of API calls to object storage and prevents the number of these API calls from scaling linearly with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).

@@ -341,8 +358,8 @@ blocks_storage:
# CLI flag: -blocks-storage.filesystem.dir
[dir: <string> | default = ""]

# This configures how the store-gateway synchronizes blocks stored in the
# bucket.
# This configures how the querier and store-gateway discover and synchronize
# blocks stored in the bucket.
bucket_store:
# Directory to store synchronized TSDB index headers.
# CLI flag: -blocks-storage.bucket-store.sync-dir
@@ -587,6 +604,14 @@ blocks_storage:
# CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-attributes-ttl
[metafile_attributes_ttl: <duration> | default = 168h]

# How long to cache content of the bucket index.
# CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl
[bucket_index_content_ttl: <duration> | default = 5m]

# Maximum size of bucket index content to cache in bytes.
# CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-max-size-bytes
[bucket_index_max_size_bytes: <int> | default = 1048576]

# Duration after which the blocks marked for deletion will be filtered out
# while fetching blocks. The idea of ignore-deletion-marks-delay is to
# ignore blocks that are marked for deletion with some delay. This ensures
@@ -596,6 +621,33 @@ blocks_storage:
# CLI flag: -blocks-storage.bucket-store.ignore-deletion-marks-delay
[ignore_deletion_mark_delay: <duration> | default = 6h]

bucket_index:
# True to enable querier to discover blocks in the storage via bucket
# index instead of bucket scanning.
# CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
[enabled: <boolean> | default = false]

# How frequently a cached bucket index should be refreshed.
# CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-stale-interval
[update_on_stale_interval: <duration> | default = 15m]

# How frequently a bucket index, which previously failed to load, should
# be tried to load again.
# CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-error-interval
[update_on_error_interval: <duration> | default = 1m]

# How long an unused bucket index should be cached. Once this timeout
# expires, the unused bucket index is removed from the in-memory cache.
# CLI flag: -blocks-storage.bucket-store.bucket-index.idle-timeout
[idle_timeout: <duration> | default = 1h]

# The maximum allowed age of a bucket index (last updated) before queries
# start failing because the bucket index is too old. The bucket index is
# periodically updated by the compactor, while this check is enforced in
# the querier (at query time).
# CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
[max_stale_period: <duration> | default = 1h]

tsdb:
# Local directory to store TSDBs in the ingesters.
# CLI flag: -blocks-storage.tsdb.dir
19 changes: 18 additions & 1 deletion docs/blocks-storage/querier.template
@@ -13,12 +13,28 @@ The querier is **stateless**.

## How it works

At startup **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
The querier needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:

1. Periodically scanning the bucket (default)
2. Periodically downloading the [bucket index](./bucket-index.md)

### Bucket index disabled (default)

At startup, **queriers** iterate over the entire storage bucket to discover all tenants' blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.

While running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks. Queriers do **not** download any content from blocks except a small `meta.json` file containing the block's metadata (including the minimum and maximum timestamp of samples within the block).

Queriers use the metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the [store-gateway](./store-gateway.md) instances holding the required blocks.

### Bucket index enabled

When [bucket index](./bucket-index.md) is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during the query execution to find the set of blocks that need to be queried for the given query.

Given the bucket index removes the need to scan the bucket, it brings a few benefits:

1. The querier is expected to be ready shortly after startup.
2. Lower volume of API calls to object storage.

### Anatomy of a query request

When a querier receives a query range request, it contains the following parameters:
@@ -60,6 +76,7 @@ Caching is optional, but **highly recommended** in a production environment. Ple
- List of blocks per tenant
- Block's `meta.json` content
- Block's `deletion-mark.json` existence and content
- Tenant's `bucket-index.json.gz` content

Using the metadata cache can significantly reduce the number of API calls to object storage and prevents the number of these API calls from scaling linearly with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).

40 changes: 38 additions & 2 deletions docs/blocks-storage/store-gateway.md
@@ -125,6 +125,7 @@ Store-gateway and [querier](./querier.md) can use memcached for caching bucket m
- List of blocks per tenant
- Block's `meta.json` content
- Block's `deletion-mark.json` existence and content
- Tenant's `bucket-index.json.gz` content

Using the metadata cache can significantly reduce the number of API calls to object storage and prevents the number of these API calls from scaling linearly with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).

@@ -391,8 +392,8 @@ blocks_storage:
# CLI flag: -blocks-storage.filesystem.dir
[dir: <string> | default = ""]

# This configures how the store-gateway synchronizes blocks stored in the
# bucket.
# This configures how the querier and store-gateway discover and synchronize
# blocks stored in the bucket.
bucket_store:
# Directory to store synchronized TSDB index headers.
# CLI flag: -blocks-storage.bucket-store.sync-dir
@@ -637,6 +638,14 @@ blocks_storage:
# CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-attributes-ttl
[metafile_attributes_ttl: <duration> | default = 168h]

# How long to cache content of the bucket index.
# CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl
[bucket_index_content_ttl: <duration> | default = 5m]

# Maximum size of bucket index content to cache in bytes.
# CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-max-size-bytes
[bucket_index_max_size_bytes: <int> | default = 1048576]

# Duration after which the blocks marked for deletion will be filtered out
# while fetching blocks. The idea of ignore-deletion-marks-delay is to
# ignore blocks that are marked for deletion with some delay. This ensures
@@ -646,6 +655,33 @@ blocks_storage:
# CLI flag: -blocks-storage.bucket-store.ignore-deletion-marks-delay
[ignore_deletion_mark_delay: <duration> | default = 6h]

bucket_index:
# True to enable querier to discover blocks in the storage via bucket
# index instead of bucket scanning.
# CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
[enabled: <boolean> | default = false]

# How frequently a cached bucket index should be refreshed.
# CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-stale-interval
[update_on_stale_interval: <duration> | default = 15m]

# How frequently a bucket index, which previously failed to load, should
# be tried to load again.
# CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-error-interval
[update_on_error_interval: <duration> | default = 1m]

# How long an unused bucket index should be cached. Once this timeout
# expires, the unused bucket index is removed from the in-memory cache.
# CLI flag: -blocks-storage.bucket-store.bucket-index.idle-timeout
[idle_timeout: <duration> | default = 1h]

# The maximum allowed age of a bucket index (last updated) before queries
# start failing because the bucket index is too old. The bucket index is
# periodically updated by the compactor, while this check is enforced in
# the querier (at query time).
# CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
[max_stale_period: <duration> | default = 1h]

tsdb:
# Local directory to store TSDBs in the ingesters.
# CLI flag: -blocks-storage.tsdb.dir