-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Description
Today if a node exceeds the disk flood stage watermark, the disk threshold monitor will apply a special read-only index block to any indices that have a shard allocated to the node that exceeded the watermark. This block carries with it a forbidden status code so that if an attempt is made to index into such an index, the client receives a HTTP 403 status code.
Clients assume that a 403 status code is not retryable and they drop data.
This situation is retryable though, as once the disk threshold monitor observes the free disk space go above the appropriate threshold, the index block is automatically removed.
Rather than expecting our clients to all account for this situation (by inspecting the specifics of the exception that led to the 403 status code), we should indicate retryability by using HTTP status code 429. While 429 is often translated as "too many requests", the HTTP specification is liberal about what this means:
Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers.
By making this change, all of our clients can start retrying when faced with an index that was marked read-only due to a flood stage watermark exceeded event.
Similarly, the status codes of other cluster blocks should be reexamined in this context.