Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck #341
Changes from 1 commit
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -408,11 +408,31 @@ _TODO: this playbook has not been written yet._ | |||||
|
|
||||||
| ### CortexFrontendQueriesStuck | ||||||
|
|
||||||
| _TODO: this playbook has not been written yet._ | ||||||
| This alert fires if Cortex is running without query-scheduler and queries are piling up in the query-frontend queue. | ||||||
|
|
||||||
| The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see the other playbook for more details. | ||||||
|
|
||||||
| ### CortexSchedulerQueriesStuck | ||||||
|
|
||||||
| _TODO: this playbook has not been written yet._ | ||||||
| This alert fires if queries are piling up in the query-scheduler. | ||||||
|
|
||||||
| How it **works**: | ||||||
| - A query-frontend API endpoint is called to execute a query | ||||||
| - The query-frontend enqueues the request to the query-scheduler | ||||||
| - The query-scheduler is responsible for dispatching enqueued queries to idle querier workers | ||||||
|
||||||
| - The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler | ||||||
|
||||||
|
|
||||||
| How to **investigate**: | ||||||
| - Are queriers in a crash loop (e.g. OOMKilled)? | ||||||
| - `OOMKilled`: temporarily increase the queriers' memory request/limit | ||||||
| - `panic`: look for the stack trace in the logs and investigate from there | ||||||
| - Has QPS increased? | ||||||
| - Scale up queriers to satisfy the increased workload | ||||||
| - Has query latency increased? | ||||||
| - Increased latency reduces the number of queries we can run per second: once all workers are busy, new queries will pile up in the queue | ||||||
| - Temporarily scale up queriers to try to stop the bleed | ||||||
| - Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries | ||||||
|
||||||
| - If it's a multi-tenant Cortex cluster and shuffle-sharding is disabled for queriers, consider enabling it only for that specific tenant to reduce its blast radius. To enable querier shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be the number of queriers assigned to the tenant). | ||||||
|
||||||
| - On a multi-tenant Cortex cluster with shuffle-sharding disabled for queriers, consider enabling it for that specific tenant to reduce its blast radius. To enable querier shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be the number of queriers assigned to the tenant). |
If the user is already configured with querier sharding and is sending too many slow queries, then none of this will help. Arguably, the operator could increase the shard size, but that may affect other users too much. Depending on the situation, the operator may choose to do nothing and let Cortex return errors for that user. (This assumes that only a single user is affected.)
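
To make the latency bullet in the playbook concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Cortex. The function names are hypothetical, and it assumes a simplified model in which each worker slot runs one query at a time and `workers` is the total number of querier worker slots across all queriers. Under that assumption, the most the cluster can complete is `workers / avg_latency` queries per second, and anything arriving above that rate accumulates in the queue.

```python
import math


def max_throughput_qps(workers: int, avg_latency_s: float) -> float:
    """Upper bound on completed queries/sec when every worker slot is busy.

    Simplified model (an assumption of this sketch): one query per worker
    slot at a time, each taking avg_latency_s seconds on average.
    """
    if workers <= 0 or avg_latency_s <= 0:
        raise ValueError("workers and avg_latency_s must be positive")
    return workers / avg_latency_s


def queue_growth_qps(arrival_qps: float, workers: int, avg_latency_s: float) -> float:
    """Rate at which queries pile up in the queue (0 if capacity is sufficient)."""
    return max(0.0, arrival_qps - max_throughput_qps(workers, avg_latency_s))


def workers_needed(arrival_qps: float, avg_latency_s: float) -> int:
    """Minimum worker slots needed to keep up with the arrival rate."""
    if arrival_qps < 0 or avg_latency_s <= 0:
        raise ValueError("arrival_qps must be >= 0 and avg_latency_s positive")
    return math.ceil(arrival_qps * avg_latency_s)


if __name__ == "__main__":
    # Same QPS and same worker count; only the average latency changes.
    for latency in (0.5, 2.0):
        growth = queue_growth_qps(arrival_qps=150, workers=100, avg_latency_s=latency)
        needed = workers_needed(arrival_qps=150, avg_latency_s=latency)
        print(f"latency={latency}s queue_growth={growth} qps workers_needed={needed}")
```

In this model a latency increase alone, with QPS unchanged, can be enough to start the queue growing. It also suggests why scaling up queriers only stops the bleed temporarily if a single tenant keeps sending heavier queries.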