Adds cluster manager task throttling documentation #1826
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@dhwanilpatel As discussed, could you review for technical accuracy, please?
# Cluster manager task throttling

For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
A task can land on the cluster manager node directly or be routed via some other node.
For many cluster state updates
The cluster manager maintains a pending task queue for these tasks and runs them in a single-
and executes them in ...

Sometimes data nodes may flood the cluster manager with too many tasks at the same time.
In the past, put-mapping or snapshot tasks have caused pending tasks to pile up on the cluster manager.

The ideal solution is to prevent the caller from submitting too many tasks and to fix the underlying issue that caused the flood of pending tasks. However, this can take longer and leaves the cluster manager vulnerable to such bugs or issues in the meantime.

There is a need to build a protection mechanism into the cluster manager itself.

Hi @shwetathareja. Thanks for your suggestions. The word "executes" is on the list of words to avoid in our style guide. The style guide suggests replacing it with the word "run".
To avoid task overload on the cluster manager, you can specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject tasks from the data nodes. It rejects a task if the total number of tasks of the same type in the pending task queue exceeds the threshold. Since the cluster manager throttles tasks based on the task type, rejecting one task does not affect any other tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task. If the cluster manager rejects a task, the data node performs retries with exponential backoff to resubmit the task to the cluster manager. If retries are unsuccessful within the timeout period, OpenSearch returns a cluster timeout error.
Task submission can come from any node, including the cluster manager itself, right?
@shwetathareja: I have implemented the comments. Please take a look when you get a chance. Thanks!
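To illustrate the retry behavior described in the paragraph under review (a rejected task is resubmitted with exponential backoff, and a cluster timeout error is returned if retries do not succeed in time), here is a minimal Python sketch. It is not the OpenSearch implementation: `TaskRejected`, `ClusterTimeout`, `submit_with_backoff`, and all delay and timeout values are hypothetical names and numbers chosen for this example.

```python
import time


class TaskRejected(Exception):
    """Hypothetical error: the cluster manager throttled (rejected) the task."""


class ClusterTimeout(Exception):
    """Hypothetical error: retries did not succeed within the timeout period."""


def submit_with_backoff(submit, timeout_s=30.0, base_delay_s=0.1, max_delay_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call submit() until it succeeds or the timeout period would be exceeded.

    The wait doubles after every rejection (capped at max_delay_s).
    All default values are illustrative assumptions, not OpenSearch defaults.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        try:
            return submit()
        except TaskRejected:
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Give up if the next wait would end after the deadline.
            if clock() + delay > deadline:
                raise ClusterTimeout("task was still throttled when the timeout expired")
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters exist only so that the sketch can be exercised without real waiting; a caller would normally leave them at their defaults.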
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
The first line of defense is to implement mechanisms in the caller nodes to avoid task overload on the cluster manager. However, even with those mechanisms in place, the cluster manager needs a built-in way to protect itself---cluster manager task throttling.

To turn on cluster manager task throttling, you need to specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject a task.
specify throttling tasks?
The cluster manager rejects tasks on the task type basis. For any incoming task, the cluster manager evaluates the total number of tasks of the same type in the pending task queue. If this number exceeds the threshold for this task type, the cluster manager rejects the incoming task. Rejecting a task does not affect tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task.
rejects tasks on the basis of task types?
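The per-type rejection logic in the paragraph under review can be sketched as a small in-memory model. This is an illustration only, not the cluster manager's code: `ThrottledQueue` and its methods are hypothetical, and the sketch assumes one reading of the threshold, namely that the queue never holds more than the configured number of pending tasks of a given type. A missing or negative threshold is treated as "no throttling", matching the `-1` default mentioned in the table.

```python
from collections import Counter, deque


class ThrottledQueue:
    """Hypothetical model of a pending task queue with per-type thresholds."""

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)  # task type -> max pending tasks
        self.pending = deque()              # tasks run one at a time, in order
        self.counts = Counter()             # pending tasks per task type

    def submit(self, task_type):
        """Return True if the task is queued, False if it is rejected."""
        limit = self.thresholds.get(task_type, -1)
        # Assumption: reject when accepting would push this type past its limit.
        if limit >= 0 and self.counts[task_type] >= limit:
            return False
        self.pending.append(task_type)
        self.counts[task_type] += 1
        return True

    def run_next(self):
        """Remove the oldest pending task and return its type (None if empty)."""
        if not self.pending:
            return None
        task_type = self.pending.popleft()
        self.counts[task_type] -= 1
        return task_type


queue = ThrottledQueue({"put-mapping": 2})
queue.submit("put-mapping")
queue.submit("put-mapping")
third_put_mapping = queue.submit("put-mapping")  # over the put-mapping limit
create_index = queue.submit("create-index")      # different type, no limit set
```

In this run, the third `put-mapping` submission is rejected while the `create-index` submission is still accepted, because the counts are kept separately for each task type.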
## Setting throttling limits

You can set the throttling limits by specifying them in the `cluster_manager.throttling.thresholds` object and updating the [OpenSearch cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-settings). The setting is dynamic, so you can change the behavior of this feature without restarting your cluster.

The following table describes the `cluster_manager.throttling.thresholds` object.

Suggested change: `| Field name | Description |` to `| Field Name | Description |`

| Field name | Description |
| :--- | :--- |
| task-type | The task type. See [supported task types](#supported-task-types) for a list of valid values. |
| value | The maximum number of tasks of the type specified by the `task-type` in the cluster manager's pending task queue. Default is `-1` (no task throttling). |
tasks of the task-type type specified by?
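For readers who want to see what the settings object might look like, the following Python sketch builds a request body for the cluster settings API from the table above. The nesting (each task type mapping to an object with a `value` field) is inferred from the table rather than quoted from the PR, the helper name `throttling_settings_body` is hypothetical, and the limit of 100 is an arbitrary example value. `persistent` is one of the standard scopes accepted by the cluster settings API.

```python
import json


def throttling_settings_body(limits, scope="persistent"):
    """Build a cluster settings body from {task_type: max_pending_tasks}.

    Assumed shape: each task type maps to {"value": <limit>} inside the
    `cluster_manager.throttling.thresholds` object.
    """
    thresholds = {task_type: {"value": value} for task_type, value in limits.items()}
    return {scope: {"cluster_manager.throttling.thresholds": thresholds}}


body = throttling_settings_body({"put-mapping": 100})
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a `PUT _cluster/settings` request. Because the setting is dynamic, no restart should be needed for a new limit to take effect.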
ariamarble left a comment
Looks good other than my comments.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
For many cluster state updates, such as defining a mapping or creating an index, nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. When nodes send tens of thousands of resource-intensive tasks, like `put-mapping` or snapshot tasks, these tasks pile up in the queue, and the cluster manager is flooded. This affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
Suggestion only:
"When nodes send tens of thousands of resource-intensive tasks, like put-mapping or snapshot tasks, these tasks can pile up in the queue and flood the cluster manager."
"This affects cluster manager performance..."
or
"This affects the cluster manager's performance..."

This is good. I'll change. Thanks!
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
natebower left a comment
@kolchfa-aws Only two minor changes. Thanks!
Co-authored-by: Nate Bower <nbower@amazon.com>
* Adds cluster manager task throttling documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update cluster-manager-task-throttling.md
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Reworded for clarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Incorporated doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)

* Adds cluster manager task throttling documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update cluster-manager-task-throttling.md
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Reworded for clarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Incorporated doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Fixes #1792
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.