diff --git a/troubleshoot/elasticsearch/fix-watermark-errors.md b/troubleshoot/elasticsearch/fix-watermark-errors.md
index fe133f9253..d3a8a3420c 100644
--- a/troubleshoot/elasticsearch/fix-watermark-errors.md
+++ b/troubleshoot/elasticsearch/fix-watermark-errors.md
@@ -23,9 +23,18 @@ To prevent a full disk, when a node reaches this watermark, {{es}} [blocks write
 :::
 
 
+## Context
+
+Elasticsearch uses [disk-based shard allocation watermarks](elasticsearch://reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings.md) to prevent disk overuse and protect against data loss. Until a node reaches the flood-stage watermark, indexing is not blocked and shards can continue to grow on disk. Default watermark thresholds and their effects:
+- **75% (`none`)** – In the Cloud UI (ECE and ECH), the disk bar appears red. Elasticsearch takes no action.
+- **85% (`low`)** – Stops allocating new primary or replica shards to the affected node(s).
+- **90% (`high`)** – Moves shards away from the affected node(s).
+- **95% (`flood-stage`)** – Sets all indices on the affected node(s) to read-only, so indexing on affected nodes stops. This is automatically reverted once the node’s usage drops below the high watermark.
+
+
 ## Monitor rebalancing [fix-watermark-errors-rebalance]
 
-To verify that shards are moving off the affected node until it falls below high watermark., use the [cat shards API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-shards) and [cat recovery API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-recovery):
+To verify that shards are moving off the affected node until it falls below the high watermark, use the [cat shards API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-shards) and [cat recovery API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-recovery):
 
 ```console
 GET _cat/shards?v=true
@@ -45,6 +54,16 @@ GET _cluster/allocation/explain
 ```
 
 
+## Common causes of watermark errors
+
+Watermark errors occur when a node’s disk usage exceeds the configured thresholds (`low`, `high`, or `flood-stage`). While these thresholds protect cluster stability, they can be triggered by several underlying factors, including:
+
+* Sudden ingestion of large volumes of data, often referred to as large indexing bursts, can quickly consume disk space, especially if the cluster is not sized for peak loads. Refer to [Indexing performance considerations](/deploy-manage/production-guidance/optimize-performance/indexing-speed.md) for guidance.
+* Inefficient index settings, unnecessary stored fields, and suboptimal document structures can increase disk consumption. See [Tune for disk usage](/deploy-manage/production-guidance/optimize-performance/disk-usage.md) for guidance on reducing storage requirements.
+* A high number of replicas can quickly multiply storage requirements, as each replica consumes the same disk space as the primary shard. Refer to [Index settings](elasticsearch://reference/elasticsearch/index-settings/index-modules.md) for details.
+* Oversized shards can make disk usage spikes more likely and slow down recovery and rebalancing. Learn more in [Size your shards](/deploy-manage/production-guidance/optimize-performance/size-shards.md).
+
+
 ## Temporary relief [fix-watermark-errors-temporary]
 
 To immediately restore write operations, you can temporarily increase [disk watermarks](elasticsearch://reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings.md#disk-based-shard-allocation) and remove the [write block](elasticsearch://reference/elasticsearch/index-settings/index-block.md).
@@ -98,6 +117,16 @@ To resolve watermark errors permanently, perform one of the following actions:
 * Delete indices using the [delete index API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-delete), either permanently if the index isn’t needed, or temporarily to later [restore](../../deploy-manage/tools/snapshot-and-restore/restore-snapshot.md).
 * update related [ILM policy](../../manage-data/lifecycle/index-lifecycle-management.md) to push indices through to later [data tiers](../../manage-data/lifecycle/data-tiers.md)
 
+
+## Preventing watermark errors
+
+To reduce the likelihood of watermark errors:
+
+* Implement more restrictive ILM policies to delete or move data sooner, helping keep disk usage under control. Refer to [Index lifecycle management](/manage-data/lifecycle/index-lifecycle-management.md).
+* Enable [Autoscaling](/deploy-manage/autoscaling.md) to automatically adjust resources based on storage and performance needs.
+* Configure [Stack monitoring](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md) and enable [disk usage monitoring alerts](/solutions/observability/incident-management/alerting.md) to track disk usage trends and identify increases before watermark thresholds are exceeded.
+* Optimize shard sizes to balance disk usage and performance, avoiding a mix of overly large and small shards. Refer to [Size your shards](/deploy-manage/production-guidance/optimize-performance/size-shards.md).
+
 ::::{tip}
 On {{ech}} and {{ece}}, indices may need to be temporarily deleted using the its [{{es}} API Console](cloud://reference/cloud-hosted/ec-api-console.md) to later [snapshot restore](../../deploy-manage/tools/snapshot-and-restore/restore-snapshot.md) to resolve [cluster health](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-health) `status:red` which blocks [attempted changes](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md). If you experience issues with this resolution flow, reach out to [Elastic Support](https://support.elastic.co) for assistance.
 ::::
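
The default thresholds listed in the added Context section can be illustrated with a minimal sketch. It is not part of the patch and is not how {{es}} evaluates watermarks internally. The function name `watermark_stage` and the `DEFAULT_WATERMARKS` table are hypothetical, the percentages are the defaults quoted in the Context section (a cluster may override them), and treating a value exactly equal to a threshold as "reached" is a simplifying assumption.

```python
# Hypothetical helper: maps a node's disk usage to the default watermark
# stage described in the Context section. The thresholds are the defaults
# quoted there and may be overridden by cluster settings.
DEFAULT_WATERMARKS = [
    (95.0, "flood-stage"),  # indices on the node become read-only
    (90.0, "high"),         # shards are moved away from the node
    (85.0, "low"),          # no new shards are allocated to the node
]


def watermark_stage(disk_used_percent: float) -> str:
    """Return the highest default watermark reached, or "none"."""
    if not 0.0 <= disk_used_percent <= 100.0:
        raise ValueError("disk_used_percent must be between 0 and 100")
    # Thresholds are ordered from highest to lowest, so the first match wins.
    for threshold, name in DEFAULT_WATERMARKS:
        if disk_used_percent >= threshold:  # assumption: equal counts as reached
            return name
    return "none"


if __name__ == "__main__":
    for used in (80.0, 87.0, 92.0, 97.0):
        print(f"{used:.0f}% used -> {watermark_stage(used)}")
```

A value such as the `disk.percent` column of the cat allocation API could be passed to this helper to see which stage a node is likely in under default settings.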