Adds cluster manager task throttling documentation by kolchfa-aws · Pull Request #1826 · opensearch-project/documentation-website

kolchfa-aws · 2022-11-06T23:24:03Z

Fixes #1792

Checklist

[x] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws · 2022-11-08T16:53:34Z

@dhwanilpatel As discussed, could you review for technical accuracy please?

shwetathareja · 2022-11-09T08:25:00Z

_opensearch/cluster-manager-task-throttling.md

+
+# Cluster manager task throttling
+
+For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.


The task can land on the cluster manager node directly or be routed via some other node.

For many cluster state updates

shwetathareja · 2022-11-09T08:25:45Z

_opensearch/cluster-manager-task-throttling.md

+
+# Cluster manager task throttling
+
+For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.


The cluster manager maintains a pending task queue for these tasks and runs them in a single-

and executes them in ...

Sometimes data nodes may flood the cluster manager with too many tasks at the same time.

In the past, `put-mapping` or snapshot tasks have caused a large pileup of pending tasks on the cluster manager.

The ideal solution is to prevent the caller from submitting too many tasks and to fix the underlying issue that caused the flood of pending tasks. However, this can take a long time and leaves the cluster manager vulnerable to such bugs or issues in the meantime.

There is a need to build a protection mechanism into the cluster manager itself.

Hi @shwetathareja. Thanks for your suggestions. The word "executes" is on the list of words to avoid in our style guide. The style guide suggests replacing it with the word "run".

shwetathareja · 2022-11-09T08:34:15Z

_opensearch/cluster-manager-task-throttling.md

+
+For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
+
+To avoid task overload on the cluster manager, you can specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject tasks from the data nodes. It rejects a task if the total number of tasks of the same type in the pending task queue exceeds the threshold. Since the cluster manager throttles tasks based on the task type, rejecting one task does not affect any other tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task. If the cluster manager rejects a task, the data node performs retries with exponential backoff to resubmit the task to the cluster manager. If retries are unsuccessful within the timeout period, OpenSearch returns a cluster timeout error.


Task submission can come from any node, including the cluster manager itself, right?

@shwetathareja: I have implemented the comments. Please take a look when you get a chance. Thanks!
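
For readers of this thread, the retry behavior described in the hunk above (retries with exponential backoff until the timeout period expires) could be sketched roughly as follows. This is only an illustration, not the OpenSearch implementation: `TaskRejected`, `submit_with_backoff`, and all delay and timeout values are hypothetical names and numbers chosen for the example.

```python
import time


class TaskRejected(Exception):
    """Hypothetical signal that the cluster manager throttled the task."""


def submit_with_backoff(submit, timeout_s=30.0, base_delay_s=0.05,
                        max_delay_s=5.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Call submit() until it succeeds or the timeout period expires.

    The delay doubles after every rejection (0.05 s, 0.1 s, 0.2 s, ...)
    and is capped at max_delay_s. All values here are illustrative.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        try:
            return submit()
        except TaskRejected:
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Give up if waiting for the next retry would pass the deadline.
            if clock() + delay > deadline:
                raise TimeoutError("task was still throttled at timeout")
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters are injectable only so that the sketch can be exercised without real waiting.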

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

ariamarble · 2022-11-10T16:24:12Z

_opensearch/cluster-manager-task-throttling.md

+
+The first line of defense is to implement mechanisms in the caller nodes to avoid task overload on the cluster manager. However, even with those mechanisms in place, the cluster manager needs a built-in way to protect itself---cluster manager task throttling.
+
+To turn on cluster manager task throttling, you need to specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject a task. 


specify throttling tasks?

ariamarble · 2022-11-10T16:25:09Z

_opensearch/cluster-manager-task-throttling.md

+
+To turn on cluster manager task throttling, you need to specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject a task. 
+
+The cluster manager rejects tasks on the task type basis. For any incoming task, the cluster manager evaluates the total number of tasks of the same type in the pending task queue. If this number exceeds the threshold for this task type, the cluster manager rejects the incoming task. Rejecting a task does not affect tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task. 


rejects tasks on the basis of task types?
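
To make the per-type rejection in this hunk concrete, here is a minimal sketch of the bookkeeping it describes. It assumes that "exceeds the threshold" means accepting the incoming task would push the pending count for that type past the limit, and that `-1` disables throttling, as the field table in this PR states. `PendingTaskThrottler` and its methods are hypothetical names, not OpenSearch classes.

```python
from collections import Counter

NO_THROTTLING = -1  # default from the field table in this PR: no throttling


class PendingTaskThrottler:
    """Hypothetical per-type admission check for a pending task queue."""

    def __init__(self, thresholds):
        # thresholds maps a task type (for example "put-mapping") to a limit.
        self.thresholds = dict(thresholds)
        self.pending = Counter()

    def try_submit(self, task_type):
        """Return True and count the task if accepted, False if rejected."""
        limit = self.thresholds.get(task_type, NO_THROTTLING)
        if limit != NO_THROTTLING and self.pending[task_type] >= limit:
            return False  # only this task type is affected
        self.pending[task_type] += 1
        return True

    def complete(self, task_type):
        """Mark one pending task of this type as finished."""
        if self.pending[task_type] > 0:
            self.pending[task_type] -= 1
```

Because the counters are kept per type, a full `put-mapping` quota does not stop a `create-index` task from being accepted, which matches the example in the hunk.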

ariamarble · 2022-11-10T16:26:34Z

_opensearch/cluster-manager-task-throttling.md

+
+## Setting throttling limits
+
+You can set the throttling limits by specifying them in the `cluster_manager.throttling.thresholds` object and updating the [OpenSearch cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-settings). The setting is dynamic, so you can change the behavior of this feature without restarting your cluster.


set throttling limits?

ariamarble · 2022-11-10T16:27:14Z

_opensearch/cluster-manager-task-throttling.md

+
+The following table describes the `cluster_manager.throttling.thresholds` object.
+
+Field name | Description


Suggested change

Field name | Description

Field Name | Description

ariamarble · 2022-11-10T16:28:30Z

_opensearch/cluster-manager-task-throttling.md

+Field name | Description
+:--- | :---
+task-type | The task type. See [supported task types](#supported-task-types) for a list of valid values.
+value | The maximum number of tasks of the type specified by the `task-type` in the cluster manager's pending task queue. Default is `-1` (no task throttling).  


tasks of the task-type type specified by?
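
As an aid for reviewing the table, a request body for the cluster settings API could be built along these lines. The nesting shown here (task type as the key, with a `value` field inside it, under a `persistent` wrapper) is an assumption inferred from the table and from the usual shape of cluster settings requests; it is not confirmed by this PR, and the limit of 100 is purely illustrative. `build_threshold_settings` is a hypothetical helper.

```python
import json


def build_threshold_settings(thresholds, scope="persistent"):
    """Build a cluster settings body from a {task_type: limit} mapping.

    Assumed layout: one entry per task type, each holding a "value" field.
    """
    return {
        scope: {
            "cluster_manager.throttling.thresholds": {
                task_type: {"value": value}
                for task_type, value in thresholds.items()
            }
        }
    }


# Illustrative only: limit pending put-mapping tasks to 100.
body = build_threshold_settings({"put-mapping": 100})
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent with `PUT _cluster/settings`. Because the setting is dynamic, no restart would be needed.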

ariamarble

looks good other than my comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

cwillum · 2022-11-10T16:55:22Z

_opensearch/cluster-manager-task-throttling.md

+
+# Cluster manager task throttling
+
+For many cluster state updates, such as defining a mapping or creating an index, nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. When nodes send tens of thousands of resource-intensive tasks, like `put-mapping` or snapshot tasks, these tasks pile up in the queue, and the cluster manager is flooded. This affects the cluster manager performance, and may in turn affect the availability of the whole cluster. 


suggestion only:
"When nodes send tens of thousands of resource-intensive tasks, like put-mapping or snapshot tasks, these tasks can pile up in the queue and flood the cluster manager."
"This affects cluster manager performance..."
or
"This affects the cluster manager's performance..."

This is good. I'll change. Thanks!

cwillum

Thumbs up.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

natebower

@kolchfa-aws Only two minor changes. Thanks!

_opensearch/cluster-manager-task-throttling.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Adds cluster manager task throttling documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update cluster-manager-task-throttling.md
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Reworded for clarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Incorporated doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)


Adds cluster manager task throttling documentation

d639ba7

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws requested a review from a team as a code owner November 6, 2022 23:24

kolchfa-aws self-assigned this Nov 6, 2022

kolchfa-aws added Tech review PR: Tech review in progress v2.4.0 'Issues and PRs related to version v2.4.0' labels Nov 6, 2022

shwetathareja reviewed Nov 9, 2022


kolchfa-aws and others added 4 commits November 10, 2022 08:53

Update cluster-manager-task-throttling.md

19d416f

Rewording

4895a22

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

More rewording

280c12d

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Reworded for clarity

94f3ede

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

ariamarble reviewed Nov 10, 2022


ariamarble approved these changes Nov 10, 2022


Incorporated doc review comments

41a90cb

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

cwillum reviewed Nov 10, 2022


cwillum approved these changes Nov 10, 2022


More doc review comments

1543e14

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

natebower reviewed Nov 11, 2022


kolchfa-aws and others added 2 commits November 11, 2022 12:06

Update _opensearch/cluster-manager-task-throttling.md

02fd114

Co-authored-by: Nate Bower <nbower@amazon.com>

Update _opensearch/cluster-manager-task-throttling.md

0429040

Co-authored-by: Nate Bower <nbower@amazon.com>

kolchfa-aws added Done but waiting to merge PR: The work is done and ready to merge and removed Tech review PR: Tech review in progress labels Nov 11, 2022

kolchfa-aws merged commit 99bc98a into main Nov 15, 2022

Naarcha-AWS deleted the Fix1792-task-throttling branch December 13, 2022 19:57

kolchfa-aws added v2.5.0 'Issues and PRs related to version v2.5.0' and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Jan 9, 2023

hdhalter added the release-notes PR: Include this PR in the automated release notes label Jan 13, 2023

kolchfa-aws added the backport 2.5 PR: Backport label for 2.5 label Jan 24, 2023

opensearch-trigger-bot bot mentioned this pull request Jan 24, 2023

[Backport 2.5] Adds cluster manager task throttling documentation #2474

Merged




6 participants