cortexproject · pracucci · Dec 21, 2020 · Dec 16, 2020 · Dec 17, 2020 · Dec 17, 2020
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,11 @@
 * [CHANGE] Blocks storage: compactor is now required when running a Cortex cluster with the blocks storage, because it also keeps the bucket index updated. #3583
 * [CHANGE] Blocks storage: block deletion marks are now stored in a per-tenant global markers/ location too, other than within the block location. The compactor, at startup, will copy deletion marks from the block location to the global location. This migration is required only once, so you can safely disable it via `-compactor.block-deletion-marks-migration-enabled=false` once new compactor has successfully started once in your cluster. #3583
 * [ENHANCEMENT] Blocks storage: introduced a per-tenant bucket index, periodically updated by the compactor, used to avoid full bucket scanning done by queriers and store-gateways. The bucket index is updated by the compactor during blocks cleanup, on every `-compactor.cleanup-interval`. #3553 #3555 #3561 #3583
+* [ENHANCEMENT] Blocks storage: introduced an option `-blocks-storage.bucket-store.bucket-index.enabled` to enable the usage of the bucket index in the querier. When enabled, the querier will use the bucket index to find a tenant's blocks instead of running the periodic bucket scan. The following new metrics have been added: #3614
+  * `cortex_bucket_index_loads_total`
+  * `cortex_bucket_index_load_failures_total`
+  * `cortex_bucket_index_load_duration_seconds`
+  * `cortex_bucket_index_loaded`
 * [ENHANCEMENT] Compactor: exported the following metrics. #3583
   * `cortex_bucket_blocks_count`: Total number of blocks per tenant in the bucket. Includes blocks marked for deletion.
   * `cortex_bucket_blocks_marked_for_deletion_count`: Total number of blocks per tenant marked for deletion in the bucket.

diff --git a/docs/blocks-storage/_index.md b/docs/blocks-storage/_index.md
@@ -29,7 +29,7 @@ When running the Cortex blocks storage, the Cortex architecture doesn't signific
 
 The **[store-gateway](./store-gateway.md)** is responsible to query blocks and is used by the [querier](./querier.md) at query time. The store-gateway is required when running the blocks storage.
 
-The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the bucket index updated and, for this reason, it's a required component.
+The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the [bucket index](./bucket-index.md) updated and, for this reason, it's a required component.
 
 Finally, the [**table-manager**](../chunks-storage/table-manager.md) and the [**schema config**](../chunks-storage/schema-config.md) are **not used** by the blocks storage.
 

diff --git a/docs/blocks-storage/bucket-index.md b/docs/blocks-storage/bucket-index.md
@@ -0,0 +1,42 @@
+---
+title: "Bucket Index"
+linkTitle: "Bucket Index"
+weight: 5
+slug: bucket-index
+---
+
+The bucket index is a **per-tenant file containing the list of blocks and block deletion marks** in the storage. The bucket index itself is stored in the backend object storage, is periodically updated by the compactor and used by queriers to discover blocks in the storage.
+
+The bucket index usage is **optional** and can be enabled via `-blocks-storage.bucket-store.bucket-index.enabled=true` (or its respective YAML config option).
+
+## Structure of the index
+
+The `bucket-index.json.gz` contains:
+
+- **`blocks`**<br />
+  List of complete blocks of a tenant, including blocks marked for deletion (partial blocks are excluded from the index).
+- **`block_deletion_marks`**<br />
+  List of block deletion marks.
+- **`updated_at`**<br />
+  Unix timestamp (seconds precision) of when the index was last updated (written to the storage).
+
+## How it gets updated
+
+The [compactor](./compactor.md) periodically scans the bucket and uploads an updated bucket index to the storage. The frequency at which the bucket index is updated can be configured via `-compactor.cleanup-interval`.
+
+The bucket index is built and updated by the compactor even if `-blocks-storage.bucket-store.bucket-index.enabled` has **not** been enabled. This is intentional, and the overhead of keeping the bucket index updated is not significant.
+
+## How it's used by the querier
+
+The [querier](./querier.md), at query time, checks whether the bucket index for the tenant has already been loaded in memory. If not, the querier downloads it from the storage and caches it in memory. Given it's a small file, downloading it lazily doesn't significantly impact the performance of the first query, and it allows a querier to get up and running without pre-downloading every tenant's bucket index.
+
+While the bucket index is in memory, a background process keeps it **updated at periodic intervals**, so that subsequent queries from the same tenant to the same querier instance use the cached (and periodically updated) bucket index. There are two config options involved:
+
+- `-blocks-storage.bucket-store.bucket-index.update-on-stale-interval`<br />
+  This option configures how frequently a cached bucket index should be refreshed.
+- `-blocks-storage.bucket-store.bucket-index.update-on-error-interval`<br />
+  If downloading a bucket index fails, the failure is cached for a short time in order to avoid hammering the backend storage. This option configures how frequently the querier retries loading a bucket index that previously failed to load.
+
+If a bucket index is unused for a long time (configurable via `-blocks-storage.bucket-store.bucket-index.idle-timeout`), e.g. because that querier instance is not receiving any query from the tenant, the querier offloads it and stops keeping it updated at regular intervals. This is particularly useful for tenants which are resharded to different queriers when [shuffle sharding](../guides/shuffle-sharding.md) is enabled.
+
+Finally, the querier, at query time, checks how old the bucket index is (based on its `updated_at`) and fails the query if the index is older than `-blocks-storage.bucket-store.bucket-index.max-stale-period`. This circuit breaker ensures queriers do not return partial query results due to a stale view over the long-term storage.
diff --git a/docs/blocks-storage/compactor.md b/docs/blocks-storage/compactor.md
@@ -10,7 +10,7 @@ slug: compactor
 The **compactor** is an service which is responsible to:
 
 - Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
-- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
+- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
 
 The compactor is **stateless**.
 

diff --git a/docs/blocks-storage/compactor.template b/docs/blocks-storage/compactor.template
@@ -10,7 +10,7 @@ slug: compactor
 The **compactor** is an service which is responsible to:
 
 - Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
-- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
+- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
 
 The compactor is **stateless**.
 

diff --git a/docs/blocks-storage/querier.md b/docs/blocks-storage/querier.md
@@ -13,12 +13,28 @@ The querier is **stateless**.
 
 ## How it works
 
-At startup **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
+The querier needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:
+
+1. Periodically scanning the bucket (default)
+2. Periodically downloading the [bucket index](./bucket-index.md)
+
+### Bucket index disabled (default)
+
+At startup, **queriers** iterate over the entire storage bucket to discover all tenants' blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
 
 While running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks. Queriers do **not** download any content from blocks except a small `meta.json` file containing the block's metadata (including the minimum and maximum timestamp of samples within the block).
 
 Queriers use the metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the [store-gateway](./store-gateway.md) instances holding the required blocks.
 
+### Bucket index enabled
+
+When the [bucket index](./bucket-index.md) is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during query execution to find the set of blocks that need to be queried for the given query.
+
+Given the bucket index removes the need to scan the bucket, it brings a few benefits:
+
+1. The querier is expected to be ready shortly after startup.
+2. Lower volume of API calls to object storage.
+
 ### Anatomy of a query request
 
 When a querier receives a query range request, it contains the following parameters:
@@ -60,6 +76,7 @@ Caching is optional, but **highly recommended** in a production environment. Ple
 - List of blocks per tenant
 - Block's `meta.json` content
 - Block's `deletion-mark.json` existence and content
+- Tenant's `bucket-index.json.gz` content
 
 Using the metadata cache can significantly reduce the number of API calls to object storage and protects from linearly scale the number of these API calls with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).
 
@@ -341,8 +358,8 @@ blocks_storage:
     # CLI flag: -blocks-storage.filesystem.dir
     [dir: <string> | default = ""]
 
-  # This configures how the store-gateway synchronizes blocks stored in the
-  # bucket.
+  # This configures how the querier and store-gateway discover and synchronize
+  # blocks stored in the bucket.
   bucket_store:
     # Directory to store synchronized TSDB index headers.
     # CLI flag: -blocks-storage.bucket-store.sync-dir
@@ -587,6 +604,14 @@ blocks_storage:
       # CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-attributes-ttl
       [metafile_attributes_ttl: <duration> | default = 168h]
 
+      # How long to cache content of the bucket index.
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl
+      [bucket_index_content_ttl: <duration> | default = 5m]
+
+      # Maximum size of bucket index content to cache in bytes.
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-max-size-bytes
+      [bucket_index_max_size_bytes: <int> | default = 1048576]
+
     # Duration after which the blocks marked for deletion will be filtered out
     # while fetching blocks. The idea of ignore-deletion-marks-delay is to
     # ignore blocks that are marked for deletion with some delay. This ensures
@@ -596,6 +621,33 @@ blocks_storage:
     # CLI flag: -blocks-storage.bucket-store.ignore-deletion-marks-delay
     [ignore_deletion_mark_delay: <duration> | default = 6h]
 
+    bucket_index:
+      # True to enable querier to discover blocks in the storage via bucket
+      # index instead of bucket scanning.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
+      [enabled: <boolean> | default = false]
+
+      # How frequently a cached bucket index should be refreshed.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-stale-interval
+      [update_on_stale_interval: <duration> | default = 15m]
+
+      # How frequently a bucket index, which previously failed to load, should
+      # be tried to load again.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-error-interval
+      [update_on_error_interval: <duration> | default = 1m]
+
+      # How long an unused bucket index should be cached. Once this timeout
+      # expires, the unused bucket index is removed from the in-memory cache.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.idle-timeout
+      [idle_timeout: <duration> | default = 1h]
+
+      # The maximum allowed age of a bucket index (last updated) before queries
+      # start failing because the bucket index is too old. The bucket index is
+      # periodically updated by the compactor, while this check is enforced in
+      # the querier (at query time).
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
+      [max_stale_period: <duration> | default = 1h]
+
   tsdb:
     # Local directory to store TSDBs in the ingesters.
     # CLI flag: -blocks-storage.tsdb.dir

diff --git a/docs/blocks-storage/querier.template b/docs/blocks-storage/querier.template
@@ -13,12 +13,28 @@ The querier is **stateless**.
 
 ## How it works
 
-At startup **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
+The querier needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:
+
+1. Periodically scanning the bucket (default)
+2. Periodically downloading the [bucket index](./bucket-index.md)
+
+### Bucket index disabled (default)
+
+At startup, **queriers** iterate over the entire storage bucket to discover all tenants' blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
 
 While running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks. Queriers do **not** download any content from blocks except a small `meta.json` file containing the block's metadata (including the minimum and maximum timestamp of samples within the block).
 
 Queriers use the metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the [store-gateway](./store-gateway.md) instances holding the required blocks.
 
+### Bucket index enabled
+
+When the [bucket index](./bucket-index.md) is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during query execution to find the set of blocks that need to be queried for the given query.
+
+Given the bucket index removes the need to scan the bucket, it brings a few benefits:
+
+1. The querier is expected to be ready shortly after startup.
+2. Lower volume of API calls to object storage.
+
 ### Anatomy of a query request
 
 When a querier receives a query range request, it contains the following parameters:
@@ -60,6 +76,7 @@ Caching is optional, but **highly recommended** in a production environment. Ple
 - List of blocks per tenant
 - Block's `meta.json` content
 - Block's `deletion-mark.json` existence and content
+- Tenant's `bucket-index.json.gz` content
 
 Using the metadata cache can significantly reduce the number of API calls to object storage and protects from linearly scale the number of these API calls with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).
 

diff --git a/docs/blocks-storage/store-gateway.md b/docs/blocks-storage/store-gateway.md
@@ -125,6 +125,7 @@ Store-gateway and [querier](./querier.md) can use memcached for caching bucket m
 - List of blocks per tenant
 - Block's `meta.json` content
 - Block's `deletion-mark.json` existence and content
+- Tenant's `bucket-index.json.gz` content
 
 Using the metadata cache can significantly reduce the number of API calls to object storage and protects from linearly scale the number of these API calls with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).
 
@@ -391,8 +392,8 @@ blocks_storage:
     # CLI flag: -blocks-storage.filesystem.dir
     [dir: <string> | default = ""]
 
-  # This configures how the store-gateway synchronizes blocks stored in the
-  # bucket.
+  # This configures how the querier and store-gateway discover and synchronize
+  # blocks stored in the bucket.
   bucket_store:
     # Directory to store synchronized TSDB index headers.
     # CLI flag: -blocks-storage.bucket-store.sync-dir
@@ -637,6 +638,14 @@ blocks_storage:
       # CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-attributes-ttl
       [metafile_attributes_ttl: <duration> | default = 168h]
 
+      # How long to cache content of the bucket index.
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl
+      [bucket_index_content_ttl: <duration> | default = 5m]
+
+      # Maximum size of bucket index content to cache in bytes.
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-max-size-bytes
+      [bucket_index_max_size_bytes: <int> | default = 1048576]
+
     # Duration after which the blocks marked for deletion will be filtered out
     # while fetching blocks. The idea of ignore-deletion-marks-delay is to
     # ignore blocks that are marked for deletion with some delay. This ensures
@@ -646,6 +655,33 @@ blocks_storage:
     # CLI flag: -blocks-storage.bucket-store.ignore-deletion-marks-delay
     [ignore_deletion_mark_delay: <duration> | default = 6h]
 
+    bucket_index:
+      # True to enable querier to discover blocks in the storage via bucket
+      # index instead of bucket scanning.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
+      [enabled: <boolean> | default = false]
+
+      # How frequently a cached bucket index should be refreshed.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-stale-interval
+      [update_on_stale_interval: <duration> | default = 15m]
+
+      # How frequently a bucket index, which previously failed to load, should
+      # be tried to load again.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-error-interval
+      [update_on_error_interval: <duration> | default = 1m]
+
+      # How long an unused bucket index should be cached. Once this timeout
+      # expires, the unused bucket index is removed from the in-memory cache.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.idle-timeout
+      [idle_timeout: <duration> | default = 1h]
+
+      # The maximum allowed age of a bucket index (last updated) before queries
+      # start failing because the bucket index is too old. The bucket index is
+      # periodically updated by the compactor, while this check is enforced in
+      # the querier (at query time).
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
+      [max_stale_period: <duration> | default = 1h]
+
   tsdb:
     # Local directory to store TSDBs in the ingesters.
     # CLI flag: -blocks-storage.tsdb.dir